RubyGems - swarm_sdk - Versions diffs - 2.5.2 → 2.5.4 - Mend

swarm_sdk 2.5.2 → 2.5.4

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (8)

checksums.yaml +4 -4
data/lib/swarm_sdk/agent/RETRY_LOGIC.md +77 -29
data/lib/swarm_sdk/agent/chat.rb +280 -33
data/lib/swarm_sdk/agent/definition.rb +16 -1
data/lib/swarm_sdk/models.json +4315 -4210
data/lib/swarm_sdk/version.rb +1 -1
data/lib/swarm_sdk.rb +2 -3
metadata +6 -6

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: d2d30192f340ca7aca46ae7aee4538753a28a9815ff45be7260f1ceb8415a805
-  data.tar.gz: 551313de97c99e736c826248fc9d3ef67afb980d97c43fa44bb20566b9c3c26d
+  metadata.gz: 4729a555c9f839d1c507a4353c74d522cfe21b8fdf50a7727d6ee078c89609e6
+  data.tar.gz: f21e5971305b0011f924861afc3738e30f3517e25da1ba2e1b26ad3b9052ccca
 SHA512:
-  metadata.gz: 05572dc6a8e17492bd9df3d4d1bf1ff4af5e4fccc35d3309b1cef11f7561adcbb1b5f07ab35e723c1780e96fc88f44dd737cc9d25f2b2298a2ec5b578a65e154
-  data.tar.gz: 7e55b40a56b7bf861f205308490f3a37036375f93465b54ffeb10032154af84ea1b7110ce31c353bb1cb344d1bcf95f7e5383cfb6c6d9c12caad65e74e137171
+  metadata.gz: 3117334f14af1d526b949b9a21d20b5bd2098a34b06aaf1dfde2499a98b94ec8e284269eeff2030856d4a6fa1ad3fe37d3cd05f6121c57495c7a11baf217d804
+  data.tar.gz: e076f6ccde790b5a5b209cd2c46e7d534e7c45a703338f306c191a131cb8e40eb5c40e064f3410e5a897e6d654f586bb89a641404910604c90f40e48d0b2472f
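
The SHA256 entries above can be compared against locally computed digests. The following is a minimal sketch, assuming `metadata.gz` and `data.tar.gz` have already been extracted from the downloaded `.gem` archive. The helper name `sha256_matches?` is hypothetical, and the demonstration hashes a throwaway file rather than a real gem archive.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: true when the file's SHA256 hex digest equals `expected`.
def sha256_matches?(path, expected)
  Digest::SHA256.file(path).hexdigest == expected.strip.downcase
end

# Demonstration on a temporary file instead of a real gem archive.
Tempfile.create("example") do |f|
  f.write("hello")
  f.flush
  expected = Digest::SHA256.hexdigest("hello")
  puts sha256_matches?(f.path, expected)
end
```

For a real check, `expected` would be the `data.tar.gz` or `metadata.gz` value listed under `SHA256:` for the version being verified.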

data/lib/swarm_sdk/agent/RETRY_LOGIC.md CHANGED

@@ -13,24 +13,12 @@ SwarmSDK automatically retries failed LLM API calls to handle transient failures
 ## Implementation
-**Location:** `lib/swarm_sdk/agent/chat.rb:768-801`
+**Location:** `lib/swarm_sdk/agent/chat.rb`
-```ruby
-def call_llm_with_retry(max_retries: 10, delay: 10, &block)
-  attempts = 0
-  loop do
-    attempts += 1
-    begin
-      return yield
-    rescue StandardError => e
-      raise if attempts >= max_retries
-      RubyLLM.logger.warn("SwarmSDK: LLM call failed (attempt #{attempts}/#{max_retries})")
-      sleep(delay)
-    end
-  end
-end
-```
+The retry logic handles two categories of errors:
+1. **Transient failures** - Network issues, timeouts, rate limits
+2. **Orphan tool call errors** - Special recovery for malformed conversation history
 ## Error Types Handled
@@ -38,9 +26,68 @@ end
 - `Faraday::TimeoutError` - Request timeouts
 - `RubyLLM::APIError` - API errors (500s, etc.)
 - `RubyLLM::RateLimitError` - Rate limit errors
-- `RubyLLM::BadRequestError` - Usually not transient, but retries anyway
+- `RubyLLM::BadRequestError` - With special handling for orphan tool calls
 - Any other `StandardError` - Catches proxy issues, DNS failures, etc.
+## Orphan Tool Call Recovery
+**What are orphan tool calls?**
+Orphan tool calls occur when an assistant message contains `tool_use` blocks but the conversation lacks corresponding `tool_result` messages. This can happen when:
+- Tool execution is interrupted mid-stream
+- Session state restoration is incomplete
+- Network issues cause partial message delivery
+**How recovery works:**
+When a `RubyLLM::BadRequestError` (400) is received with tool-related error messages:
+1. Clears stale ephemeral content from the failed call
+2. The system scans message history for orphan tool calls
+3. For each assistant message with `tool_calls`:
+   - Checks if all `tool_call_id`s have matching `tool_result` messages
+   - Any missing results indicate orphan tool calls
+4. Orphan tool calls are pruned:
+   - If assistant message has content, keeps content but removes `tool_calls`
+   - If assistant message is empty, removes the entire message
+5. **System reminder is added** to inform the agent:
+   - Lists which tool calls were interrupted
+   - Tells agent they were never executed
+   - Suggests re-running them if still needed
+6. Retries the LLM call immediately (doesn't count as a retry)
+7. If no orphans found, falls through to normal retry logic
+**Tool-related error patterns detected:**
+- `tool_use`, `tool_result`, `tool_use_id`
+- `corresponding tool_result`
+- `must immediately follow`
+**Logging:**
+When orphan pruning occurs, emits `orphan_tool_calls_pruned` event:
+```json
+{
+  "type": "orphan_tool_calls_pruned",
+  "agent": "agent_name",
+  "pruned_count": 1,
+  "original_error": "tool_use block must have corresponding tool_result"
+}
+```
+**System Reminder Format:**
+The agent receives a system reminder about the pruned tool calls:
+```
+<system-reminder>
+The following tool calls were interrupted and removed from conversation history:
+- Read(file_path: "/important/file.rb")
+- Write(file_path: "/output.txt", content: "Hello...")
+These tools were never executed. If you still need their results, please run them again.
+</system-reminder>
+```
 ## Usage
 **Automatic - No Configuration Needed:**
@@ -53,7 +100,7 @@ swarm = SwarmSDK.build do
   end
 end
-# Automatically retries on failure
+# Automatically retries on failure and recovers from orphan tool calls
 response = swarm.execute("Do something")
 ```
@@ -70,15 +117,6 @@ WARN: SwarmSDK: Retrying in 10 seconds...
 ERROR: SwarmSDK: LLM call failed after 10 attempts: Faraday::ConnectionFailed: Connection failed
 ```
-## Testing
-Retry logic has been verified through:
-- ✅ All 728 SwarmSDK tests passing
-- ✅ Manual testing with failing proxies
-- ✅ Evaluation harnesses (assistant/retrieval modes)
-**Note:** Direct unit tests would require reflection (`instance_variable_set`) which violates security policy. The retry logic is tested implicitly through integration tests and real usage.
 ## Behavior
 **Scenario 1: Transient failure**
@@ -103,6 +141,16 @@ Attempt 1: Success
   → Returns response (no retry needed)
 ```
+**Scenario 4: Orphan tool call recovery**
+```
+Attempt 1: BadRequestError (tool_use without tool_result)
+  → Detect orphan tool calls
+  → Prune orphan tool calls from message history
+  → Retry immediately (doesn't count as retry)
+Attempt 1: Success
+  → Returns response
+```
 ## Why No Exponential Backoff
 **Design Decision:** Fixed 10-second delay
@@ -124,4 +172,4 @@ Attempt 1: Success
 - [ ] Exponential backoff option
 - [ ] Circuit breaker pattern
-**Current State:** Production-ready with sensible defaults for proxy/network resilience.
+**Current State:** Production-ready with sensible defaults for proxy/network resilience and automatic orphan tool call recovery.
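
The detection and pruning steps described in the updated RETRY_LOGIC.md can be illustrated on plain data. This is a rough sketch, not the SDK's code: `Msg` is a hypothetical stand-in for `RubyLLM::Message`, `find_orphans` and `prune_orphans` are illustrative names, and tool calls are assumed to be a hash keyed by tool call id.

```ruby
require "set"

# Hypothetical stand-in for a chat message; the real SDK uses RubyLLM::Message.
Msg = Struct.new(:role, :content, :tool_calls, :tool_call_id, keyword_init: true)

# Map each assistant message index to the tool call ids that never got a result.
def find_orphans(messages)
  orphans = {}
  messages.each_with_index do |msg, idx|
    next unless msg.role == :assistant && msg.tool_calls && !msg.tool_calls.empty?

    # Collect tool results up to the next user or assistant message.
    answered = Set.new
    messages[(idx + 1)..].each do |later|
      break if %i[user assistant].include?(later.role)

      answered << later.tool_call_id if later.role == :tool && later.tool_call_id
    end

    missing = msg.tool_calls.keys.reject { |id| answered.include?(id) }
    orphans[idx] = missing unless missing.empty?
  end
  orphans
end

# Drop orphaned calls; drop the whole message when no calls and no content remain.
def prune_orphans(messages, orphans)
  messages.each_with_index.filter_map do |msg, idx|
    ids = orphans[idx]
    next msg unless ids

    remaining = msg.tool_calls.reject { |id, _| ids.include?(id) }
    next nil if remaining.empty? && msg.content.to_s.strip.empty?

    Msg.new(role: msg.role, content: msg.content,
            tool_calls: remaining.empty? ? nil : remaining)
  end
end

history = [
  Msg.new(role: :user, content: "Read both files"),
  Msg.new(role: :assistant, content: "Reading.",
          tool_calls: { "call_1" => "Read a.rb", "call_2" => "Read b.rb" }),
  Msg.new(role: :tool, content: "contents of a.rb", tool_call_id: "call_1"),
]

orphans = find_orphans(history)
pruned = prune_orphans(history, orphans)
puts orphans.inspect
puts pruned[1].tool_calls.inspect
```

In this toy history, `call_2` has no matching tool result, so it is the only call removed; the assistant message survives because it still has content and one answered call.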

data/lib/swarm_sdk/agent/chat.rb CHANGED

@@ -644,13 +644,14 @@ module SwarmSDK
       # - Clear ephemeral content after each LLM call
       # - Add retry logic for transient failures
       def setup_llm_request_hook
-        @llm_chat.around_llm_request do |messages, &send_request|
-          # Inject ephemeral content (system reminders, etc.)
-          # These are sent to LLM but NOT persisted in message history
-          prepared_messages = @context_manager.prepare_for_llm(messages)
+        @llm_chat.around_llm_request do |_messages, &send_request|
           # Make the actual LLM API call with retry logic
+          # NOTE: prepare_for_llm must be called INSIDE the retry block so that
+          # ephemeral content is recalculated after orphan tool call pruning
           response = call_llm_with_retry do
+            # Inject ephemeral content fresh for each attempt
+            # Use @llm_chat.messages to get current state (may have been modified by pruning)
+            prepared_messages = @context_manager.prepare_for_llm(@llm_chat.messages)
             send_request.call(prepared_messages)
           end
@@ -713,51 +714,297 @@ module SwarmSDK
       # Call LLM provider with retry logic for transient failures
       #
+      # Includes special handling for 400 Bad Request errors:
+      # - Attempts to prune orphan tool calls (tool_use without tool_result)
+      # - If pruning succeeds, retries immediately without counting as retry
+      #
       # @param max_retries [Integer] Maximum retry attempts
       # @param delay [Integer] Delay between retries in seconds
       # @yield Block that performs the LLM call
       # @return [Object] Result from block
       def call_llm_with_retry(max_retries: 10, delay: 10, &block)
         attempts = 0
+        pruning_attempted = false
         loop do
           attempts += 1
           begin
             return yield
-          rescue StandardError => e
-            if attempts >= max_retries
-              LogStream.emit(
-                type: "llm_retry_exhausted",
-                agent: @agent_name,
-                swarm_id: @agent_context&.swarm_id,
-                parent_swarm_id: @agent_context&.parent_swarm_id,
-                model: model_id,
-                attempts: attempts,
-                error_class: e.class.name,
-                error_message: e.message,
-                error_backtrace: e.backtrace,
-              )
-              raise
+          rescue RubyLLM::BadRequestError => e
+            # Try to recover from 400 Bad Request by pruning orphan tool calls
+            # This can happen when tool execution is interrupted mid-stream
+            unless pruning_attempted
+              pruned = recover_from_orphan_tool_calls(e)
+              if pruned > 0
+                pruning_attempted = true
+                # Don't count this as a regular retry, try again immediately
+                attempts -= 1
+                next
+              end
             end
-            LogStream.emit(
-              type: "llm_retry_attempt",
-              agent: @agent_name,
-              swarm_id: @agent_context&.swarm_id,
-              parent_swarm_id: @agent_context&.parent_swarm_id,
-              model: model_id,
-              attempt: attempts,
-              max_retries: max_retries,
-              error_class: e.class.name,
-              error_message: e.message,
-              error_backtrace: e.backtrace,
-              retry_delay: delay,
-            )
+            # Fall through to standard retry logic
+            handle_retry_or_raise(e, attempts, max_retries, delay)
+          rescue StandardError => e
+            handle_retry_or_raise(e, attempts, max_retries, delay)
+          end
+        end
+      end
+      # Handle retry decision or re-raise error
+      #
+      # @param error [StandardError] The error that occurred
+      # @param attempts [Integer] Current attempt count
+      # @param max_retries [Integer] Maximum retry attempts
+      # @param delay [Integer] Delay between retries in seconds
+      # @raise [StandardError] Re-raises error if max retries exceeded
+      def handle_retry_or_raise(error, attempts, max_retries, delay)
+        if attempts >= max_retries
+          LogStream.emit(
+            type: "llm_retry_exhausted",
+            agent: @agent_name,
+            swarm_id: @agent_context&.swarm_id,
+            parent_swarm_id: @agent_context&.parent_swarm_id,
+            model: model_id,
+            attempts: attempts,
+            error_class: error.class.name,
+            error_message: error.message,
+            error_backtrace: error.backtrace,
+          )
+          raise
+        end
+        LogStream.emit(
+          type: "llm_retry_attempt",
+          agent: @agent_name,
+          swarm_id: @agent_context&.swarm_id,
+          parent_swarm_id: @agent_context&.parent_swarm_id,
+          model: model_id,
+          attempt: attempts,
+          max_retries: max_retries,
+          error_class: error.class.name,
+          error_message: error.message,
+          error_backtrace: error.backtrace,
+          retry_delay: delay,
+        )
+        sleep(delay)
+      end
+      # Recover from 400 Bad Request by pruning orphan tool calls
+      #
+      # @param error [RubyLLM::BadRequestError] The error that occurred
+      # @return [Integer] Number of orphan tool calls pruned (0 if none or not applicable)
+      def recover_from_orphan_tool_calls(error)
+        # Only attempt recovery for tool-related errors
+        error_message = error.message.to_s.downcase
+        tool_error_patterns = [
+          "tool_use",
+          "tool_result",
+          "tool_use_id",
+          "tool use",
+          "tool result",
+          "corresponding tool_result",
+          "must immediately follow",
+        ]
+        return 0 unless tool_error_patterns.any? { |pattern| error_message.include?(pattern) }
+        # Clear stale ephemeral content from the failed LLM call
+        # This is important because message indices changed after pruning
+        @context_manager&.clear_ephemeral
+        # Attempt to prune orphan tool calls
+        result = prune_orphan_tool_calls
+        pruned_count = result[:count]
+        if pruned_count > 0
+          LogStream.emit(
+            type: "orphan_tool_calls_pruned",
+            agent: @agent_name,
+            swarm_id: @agent_context&.swarm_id,
+            parent_swarm_id: @agent_context&.parent_swarm_id,
+            model: model_id,
+            pruned_count: pruned_count,
+            original_error: error.message,
+          )
+          # Add system reminder about pruned tool calls
+          add_orphan_tool_calls_reminder(result[:pruned_tools])
+        end
+        pruned_count
+      end
+      # Prune orphan tool calls from message history
+      #
+      # An orphan tool call is a tool_use in an assistant message that doesn't
+      # have a corresponding tool_result before the next user/assistant message.
+      #
+      # @return [Hash] { count: Integer, pruned_tools: Array<Hash> }
+      def prune_orphan_tool_calls
+        messages = @llm_chat.messages
+        return { count: 0, pruned_tools: [] } if messages.empty?
+        orphans = find_orphan_tool_calls(messages)
+        return { count: 0, pruned_tools: [] } if orphans.empty?
+        # Collect details about pruned tool calls
+        pruned_tools = collect_orphan_tool_details(messages, orphans)
+        # Build new message array with orphans removed
+        new_messages = remove_orphan_tool_calls(messages, orphans)
+        # Replace messages atomically
+        replace_messages(new_messages)
+        {
+          count: orphans.values.flatten.size,
+          pruned_tools: pruned_tools,
+        }
+      end
+      # Collect details about orphan tool calls for system reminder
+      #
+      # @param messages [Array<RubyLLM::Message>] Original messages
+      # @param orphans [Hash<Integer, Array<String>>] Map of message index to orphan tool_call_ids
+      # @return [Array<Hash>] Array of { name:, arguments: } hashes
+      def collect_orphan_tool_details(messages, orphans)
+        pruned_tools = []
+        orphans.each do |msg_idx, orphan_ids|
+          msg = messages[msg_idx]
+          next unless msg.tool_calls
+          orphan_ids.each do |tool_call_id|
+            tool_call = msg.tool_calls[tool_call_id]
+            next unless tool_call
+            pruned_tools << {
+              name: tool_call.name,
+              arguments: tool_call.arguments,
+            }
+          end
+        end
+        pruned_tools
+      end
+      # Add system reminder about pruned orphan tool calls
+      #
+      # @param pruned_tools [Array<Hash>] Array of { name:, arguments: } hashes
+      # @return [void]
+      def add_orphan_tool_calls_reminder(pruned_tools)
+        return if pruned_tools.empty?
+        # Format tool calls for the reminder
+        tool_list = pruned_tools.map do |tool|
+          args_str = format_tool_arguments(tool[:arguments])
+          "- #{tool[:name]}(#{args_str})"
+        end.join("\n")
+        reminder = <<~REMINDER
+          <system-reminder>
+          The following tool calls were interrupted and removed from conversation history:
+          #{tool_list}
+          These tools were never executed. If you still need their results, please run them again.
+          </system-reminder>
+        REMINDER
+        add_ephemeral_reminder(reminder.strip)
+      end
+      # Format tool arguments for display in reminder
+      #
+      # @param arguments [Hash] Tool call arguments
+      # @return [String] Formatted arguments
+      def format_tool_arguments(arguments)
+        return "" if arguments.nil? || arguments.empty?
+        # Format key-value pairs, truncating long values
+        args = arguments.map do |key, value|
+          formatted_value = if value.is_a?(String) && value.length > 50
+            "#{value[0...47]}..."
+          else
+            value.inspect
+          end
+          "#{key}: #{formatted_value}"
+        end
+        args.join(", ")
+      end
+      # Find all orphan tool calls in message history
+      #
+      # @param messages [Array<RubyLLM::Message>] Message array to scan
+      # @return [Hash<Integer, Array<String>>] Map of message index to orphan tool_call_ids
+      def find_orphan_tool_calls(messages)
+        orphans = {}
+        messages.each_with_index do |msg, idx|
+          next unless msg.role == :assistant && msg.tool_calls && !msg.tool_calls.empty?
-            sleep(delay)
+          # Get all tool_call_ids from this assistant message
+          expected_tool_call_ids = msg.tool_calls.keys.to_set
+          # Find tool results between this message and the next user/assistant message
+          found_tool_call_ids = Set.new
+          (idx + 1...messages.size).each do |subsequent_idx|
+            subsequent_msg = messages[subsequent_idx]
+            # Stop at next user or assistant message
+            break if [:user, :assistant].include?(subsequent_msg.role)
+            # Collect tool result IDs
+            if subsequent_msg.role == :tool && subsequent_msg.tool_call_id
+              found_tool_call_ids << subsequent_msg.tool_call_id
+            end
           end
+          # Identify orphan tool_call_ids (expected but not found)
+          orphan_ids = (expected_tool_call_ids - found_tool_call_ids).to_a
+          orphans[idx] = orphan_ids unless orphan_ids.empty?
         end
+        orphans
+      end
+      # Remove orphan tool calls from messages
+      #
+      # @param messages [Array<RubyLLM::Message>] Original messages
+      # @param orphans [Hash<Integer, Array<String>>] Map of message index to orphan tool_call_ids
+      # @return [Array<RubyLLM::Message>] New message array with orphans removed
+      def remove_orphan_tool_calls(messages, orphans)
+        messages.map.with_index do |msg, idx|
+          orphan_ids = orphans[idx]
+          # No orphans in this message - keep as-is
+          next msg unless orphan_ids
+          # Remove orphan tool_calls from this assistant message
+          remaining_tool_calls = msg.tool_calls.reject { |id, _| orphan_ids.include?(id) }
+          # If no tool_calls remain and no content, skip this message entirely
+          if remaining_tool_calls.empty? && (msg.content.nil? || msg.content.to_s.strip.empty?)
+            next nil
+          end
+          # Create new message with remaining tool_calls
+          RubyLLM::Message.new(
+            role: msg.role,
+            content: msg.content,
+            tool_calls: remaining_tool_calls.empty? ? nil : remaining_tool_calls,
+            model_id: msg.model_id,
+            input_tokens: msg.input_tokens,
+            output_tokens: msg.output_tokens,
+            cached_tokens: msg.cached_tokens,
+            cache_creation_tokens: msg.cache_creation_tokens,
+          )
+        end.compact
       end
       # Check if a tool call is a delegation tool
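
The control flow added to `call_llm_with_retry` (one recovery that does not consume an attempt, then ordinary retries) can be shown in isolation. This is a simplified sketch under stated assumptions: `FakeBadRequest` is a hypothetical stand-in for `RubyLLM::BadRequestError`, the `recover` lambda stands in for `recover_from_orphan_tool_calls`, logging is omitted, and the delay defaults to 0 so the example runs instantly.

```ruby
# Hypothetical error class standing in for RubyLLM::BadRequestError.
class FakeBadRequest < StandardError; end

# Retry a block up to max_retries times. A bad-request failure that `recover`
# repairs (returns a positive count) is retried once without using an attempt.
def with_retry(max_retries: 3, delay: 0, recover: ->(_e) { 0 })
  attempts = 0
  recovered = false
  loop do
    attempts += 1
    begin
      return yield
    rescue FakeBadRequest => e
      if !recovered && recover.call(e).positive?
        recovered = true
        attempts -= 1 # the recovery pass is free
        next
      end
      raise if attempts >= max_retries

      sleep(delay)
    rescue StandardError
      raise if attempts >= max_retries

      sleep(delay)
    end
  end
end

calls = 0
result = with_retry(recover: ->(_e) { 1 }) do
  calls += 1
  raise FakeBadRequest, "tool_use without tool_result" if calls == 1

  "ok after #{calls} calls"
end
puts result
```

Because `recovered` is set after the first successful repair, a second bad request falls through to the normal attempt counting, which mirrors the `pruning_attempted` flag in the diff.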

data/lib/swarm_sdk/agent/definition.rb CHANGED

@@ -71,7 +71,7 @@ module SwarmSDK
         @provider = config[:provider] || SwarmSDK.config.default_provider
         @base_url = config[:base_url]
         @api_version = config[:api_version]
-        @context_window = config[:context_window] # Explicit context window override
+        @context_window = coerce_to_integer(config[:context_window]) # Explicit context window override
         @parameters = config[:parameters] || {}
         @headers = Utils.stringify_keys(config[:headers] || {})
         @timeout = config[:timeout] || SwarmSDK.config.agent_request_timeout
@@ -447,6 +447,21 @@ module SwarmSDK
         end
       end
+      # Coerce value to integer if it's a numeric string
+      #
+      # YAML sometimes parses numbers as strings (especially when quoted).
+      # This ensures numeric values are properly converted.
+      #
+      # @param value [String, Integer, nil] Value to coerce
+      # @return [Integer, nil] Coerced integer or nil
+      def coerce_to_integer(value)
+        return if value.nil?
+        return value if value.is_a?(Integer)
+        return value.to_i if value.is_a?(String) && value.match?(/\A\d+\z/)
+        value
+      end
       def validate!
         raise ConfigurationError, "Agent '#{@name}' missing required 'description' field" unless @description
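
The effect of the new `coerce_to_integer` call on `context_window` can be seen with a small YAML document. This sketch copies the method body from the diff so it runs standalone; the keys `quoted`, `bare`, and `text` are made up for illustration and are not SwarmSDK configuration keys.

```ruby
require "yaml"

# Same rule as coerce_to_integer in the diff, copied so the sketch is standalone.
def coerce_to_integer(value)
  return if value.nil?
  return value if value.is_a?(Integer)
  return value.to_i if value.is_a?(String) && value.match?(/\A\d+\z/)

  value
end

# Illustrative keys only: a quoted number, a bare number, a non-numeric string.
config = YAML.safe_load(<<~YAML)
  quoted: "200000"
  bare: 200000
  text: "200k"
YAML

config.each { |key, raw| puts "#{key}: #{coerce_to_integer(raw).inspect}" }
```

Note that a string which is not purely digits, such as `"200k"`, is returned unchanged rather than converted or rejected, so the method can return a non-integer value in that case.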