swarm_sdk 2.5.2 → 2.5.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: d2d30192f340ca7aca46ae7aee4538753a28a9815ff45be7260f1ceb8415a805
-   data.tar.gz: 551313de97c99e736c826248fc9d3ef67afb980d97c43fa44bb20566b9c3c26d
+   metadata.gz: 4729a555c9f839d1c507a4353c74d522cfe21b8fdf50a7727d6ee078c89609e6
+   data.tar.gz: f21e5971305b0011f924861afc3738e30f3517e25da1ba2e1b26ad3b9052ccca
  SHA512:
-   metadata.gz: 05572dc6a8e17492bd9df3d4d1bf1ff4af5e4fccc35d3309b1cef11f7561adcbb1b5f07ab35e723c1780e96fc88f44dd737cc9d25f2b2298a2ec5b578a65e154
-   data.tar.gz: 7e55b40a56b7bf861f205308490f3a37036375f93465b54ffeb10032154af84ea1b7110ce31c353bb1cb344d1bcf95f7e5383cfb6c6d9c12caad65e74e137171
+   metadata.gz: 3117334f14af1d526b949b9a21d20b5bd2098a34b06aaf1dfde2499a98b94ec8e284269eeff2030856d4a6fa1ad3fe37d3cd05f6121c57495c7a11baf217d804
+   data.tar.gz: e076f6ccde790b5a5b209cd2c46e7d534e7c45a703338f306c191a131cb8e40eb5c40e064f3410e5a897e6d654f586bb89a641404910604c90f40e48d0b2472f
@@ -13,24 +13,12 @@ SwarmSDK automatically retries failed LLM API calls to handle transient failures

  ## Implementation

- **Location:** `lib/swarm_sdk/agent/chat.rb:768-801`
+ **Location:** `lib/swarm_sdk/agent/chat.rb`

- ```ruby
- def call_llm_with_retry(max_retries: 10, delay: 10, &block)
-   attempts = 0
-   loop do
-     attempts += 1
-     begin
-       return yield
-     rescue StandardError => e
-       raise if attempts >= max_retries
-
-       RubyLLM.logger.warn("SwarmSDK: LLM call failed (attempt #{attempts}/#{max_retries})")
-       sleep(delay)
-     end
-   end
- end
- ```
+ The retry logic handles two categories of errors:
+
+ 1. **Transient failures** - Network issues, timeouts, rate limits
+ 2. **Orphan tool call errors** - Special recovery for malformed conversation history

  ## Error Types Handled

@@ -38,9 +26,68 @@ end
  - `Faraday::TimeoutError` - Request timeouts
  - `RubyLLM::APIError` - API errors (500s, etc.)
  - `RubyLLM::RateLimitError` - Rate limit errors
- - `RubyLLM::BadRequestError` - Usually not transient, but retries anyway
+ - `RubyLLM::BadRequestError` - With special handling for orphan tool calls
  - Any other `StandardError` - Catches proxy issues, DNS failures, etc.

+ ## Orphan Tool Call Recovery
+
+ **What are orphan tool calls?**
+
+ Orphan tool calls occur when an assistant message contains `tool_use` blocks but the conversation lacks corresponding `tool_result` messages. This can happen when:
+ - Tool execution is interrupted mid-stream
+ - Session state restoration is incomplete
+ - Network issues cause partial message delivery
+
+ **How recovery works:**
+
+ When a `RubyLLM::BadRequestError` (400) is received with tool-related error messages:
+
+ 1. Clears stale ephemeral content from the failed call
+ 2. The system scans message history for orphan tool calls
+ 3. For each assistant message with `tool_calls`:
+    - Checks if all `tool_call_id`s have matching `tool_result` messages
+    - Any missing results indicate orphan tool calls
+ 4. Orphan tool calls are pruned:
+    - If assistant message has content, keeps content but removes `tool_calls`
+    - If assistant message is empty, removes the entire message
+ 5. **System reminder is added** to inform the agent:
+    - Lists which tool calls were interrupted
+    - Tells agent they were never executed
+    - Suggests re-running them if still needed
+ 6. Retries the LLM call immediately (doesn't count as a retry)
+ 7. If no orphans found, falls through to normal retry logic
+
+ **Tool-related error patterns detected:**
+ - `tool_use`, `tool_result`, `tool_use_id`
+ - `corresponding tool_result`
+ - `must immediately follow`
+
+ **Logging:**
+
+ When orphan pruning occurs, emits `orphan_tool_calls_pruned` event:
+ ```json
+ {
+   "type": "orphan_tool_calls_pruned",
+   "agent": "agent_name",
+   "pruned_count": 1,
+   "original_error": "tool_use block must have corresponding tool_result"
+ }
+ ```
+
+ **System Reminder Format:**
+
+ The agent receives a system reminder about the pruned tool calls:
+ ```
+ <system-reminder>
+ The following tool calls were interrupted and removed from conversation history:
+
+ - Read(file_path: "/important/file.rb")
+ - Write(file_path: "/output.txt", content: "Hello...")
+
+ These tools were never executed. If you still need their results, please run them again.
+ </system-reminder>
+ ```
+
  ## Usage

  **Automatic - No Configuration Needed:**
@@ -53,7 +100,7 @@ swarm = SwarmSDK.build do
    end
  end

- # Automatically retries on failure
+ # Automatically retries on failure and recovers from orphan tool calls
  response = swarm.execute("Do something")
  ```

@@ -70,15 +117,6 @@ WARN: SwarmSDK: Retrying in 10 seconds...
  ERROR: SwarmSDK: LLM call failed after 10 attempts: Faraday::ConnectionFailed: Connection failed
  ```

- ## Testing
-
- Retry logic has been verified through:
- - ✅ All 728 SwarmSDK tests passing
- - ✅ Manual testing with failing proxies
- - ✅ Evaluation harnesses (assistant/retrieval modes)
-
- **Note:** Direct unit tests would require reflection (`instance_variable_set`) which violates security policy. The retry logic is tested implicitly through integration tests and real usage.
-
  ## Behavior

  **Scenario 1: Transient failure**
@@ -103,6 +141,16 @@ Attempt 1: Success
    → Returns response (no retry needed)
  ```

+ **Scenario 4: Orphan tool call recovery**
+ ```
+ Attempt 1: BadRequestError (tool_use without tool_result)
+   → Detect orphan tool calls
+   → Prune orphan tool calls from message history
+   → Retry immediately (doesn't count as retry)
+ Attempt 1: Success
+   → Returns response
+ ```
+
  ## Why No Exponential Backoff

  **Design Decision:** Fixed 10-second delay
@@ -124,4 +172,4 @@ Attempt 1: Success
  - [ ] Exponential backoff option
  - [ ] Circuit breaker pattern

- **Current State:** Production-ready with sensible defaults for proxy/network resilience.
+ **Current State:** Production-ready with sensible defaults for proxy/network resilience and automatic orphan tool call recovery.
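
Reviewer note: the recovery steps documented above (scan for assistant messages whose tool calls have no matching tool result, then prune them) can be sketched on plain hashes. This is a minimal illustration and not the SDK's code: the hash-based message shape and the helper names `find_orphans` and `prune_orphans` are assumptions made for this example.

```ruby
require "set"

# Hypothetical plain-hash messages standing in for RubyLLM::Message objects.
# :tool_calls maps tool_call_id => { name:, arguments: }.
def find_orphans(messages)
  orphans = {}
  messages.each_with_index do |msg, idx|
    next unless msg[:role] == :assistant && msg[:tool_calls] && !msg[:tool_calls].empty?

    expected = msg[:tool_calls].keys.to_set
    found = Set.new
    # Only tool results that directly follow the assistant message count.
    messages[(idx + 1)..].each do |later|
      break if [:user, :assistant].include?(later[:role])

      found << later[:tool_call_id] if later[:role] == :tool && later[:tool_call_id]
    end

    missing = (expected - found).to_a
    orphans[idx] = missing unless missing.empty?
  end
  orphans
end

# Drop orphan tool calls; drop the whole message if nothing else remains.
def prune_orphans(messages, orphans)
  messages.each_with_index.map do |msg, idx|
    ids = orphans[idx]
    next msg unless ids

    remaining = msg[:tool_calls].reject { |id, _| ids.include?(id) }
    next nil if remaining.empty? && msg[:content].to_s.strip.empty?

    msg.merge(tool_calls: remaining.empty? ? nil : remaining)
  end.compact
end

history = [
  { role: :user, content: "Read the config" },
  { role: :assistant, content: "", tool_calls: { "call_1" => { name: "Read", arguments: { file_path: "/a.rb" } } } },
  { role: :user, content: "Are you still there?" },
]

orphans = find_orphans(history)
pruned = prune_orphans(history, orphans)
```

Here the assistant message has no content and its only tool call is orphaned, so the whole message is dropped and only the two user messages remain.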
@@ -644,13 +644,14 @@ module SwarmSDK
  # - Clear ephemeral content after each LLM call
  # - Add retry logic for transient failures
  def setup_llm_request_hook
-   @llm_chat.around_llm_request do |messages, &send_request|
-     # Inject ephemeral content (system reminders, etc.)
-     # These are sent to LLM but NOT persisted in message history
-     prepared_messages = @context_manager.prepare_for_llm(messages)
-
+   @llm_chat.around_llm_request do |_messages, &send_request|
      # Make the actual LLM API call with retry logic
+     # NOTE: prepare_for_llm must be called INSIDE the retry block so that
+     # ephemeral content is recalculated after orphan tool call pruning
      response = call_llm_with_retry do
+       # Inject ephemeral content fresh for each attempt
+       # Use @llm_chat.messages to get current state (may have been modified by pruning)
+       prepared_messages = @context_manager.prepare_for_llm(@llm_chat.messages)
        send_request.call(prepared_messages)
      end
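
Reviewer note: the hunk above moves message preparation inside the retried block. The sketch below shows why that matters, assuming pruning replaces the history array rather than mutating it in place (the diff does not show `replace_messages`, so that is an assumption). `FakeChat` is a hypothetical stand-in, not an SDK class.

```ruby
# Hypothetical stand-in for a chat object whose history can change between attempts.
class FakeChat
  attr_accessor :messages

  def initialize(messages)
    @messages = messages
  end
end

chat = FakeChat.new(["user: hi", "assistant: <orphan tool call>"])
sent = []

# Old shape: the history is captured once, outside the retried block.
snapshot = chat.messages
stale_attempt = -> { sent << snapshot.dup }

# New shape: the history is re-read on every attempt.
fresh_attempt = -> { sent << chat.messages.dup }

# Simulate pruning between attempts by swapping in a new array.
chat.messages = ["user: hi"]

stale_attempt.call
fresh_attempt.call
```

The stale attempt still sends the orphaned assistant message, while the fresh attempt sends the pruned history.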
 
@@ -713,51 +714,297 @@ module SwarmSDK

  # Call LLM provider with retry logic for transient failures
  #
+ # Includes special handling for 400 Bad Request errors:
+ # - Attempts to prune orphan tool calls (tool_use without tool_result)
+ # - If pruning succeeds, retries immediately without counting as retry
+ #
  # @param max_retries [Integer] Maximum retry attempts
  # @param delay [Integer] Delay between retries in seconds
  # @yield Block that performs the LLM call
  # @return [Object] Result from block
  def call_llm_with_retry(max_retries: 10, delay: 10, &block)
    attempts = 0
+   pruning_attempted = false

    loop do
      attempts += 1

      begin
        return yield
-     rescue StandardError => e
-       if attempts >= max_retries
-         LogStream.emit(
-           type: "llm_retry_exhausted",
-           agent: @agent_name,
-           swarm_id: @agent_context&.swarm_id,
-           parent_swarm_id: @agent_context&.parent_swarm_id,
-           model: model_id,
-           attempts: attempts,
-           error_class: e.class.name,
-           error_message: e.message,
-           error_backtrace: e.backtrace,
-         )
-         raise
+     rescue RubyLLM::BadRequestError => e
+       # Try to recover from 400 Bad Request by pruning orphan tool calls
+       # This can happen when tool execution is interrupted mid-stream
+       unless pruning_attempted
+         pruned = recover_from_orphan_tool_calls(e)
+         if pruned > 0
+           pruning_attempted = true
+           # Don't count this as a regular retry, try again immediately
+           attempts -= 1
+           next
+         end
        end

-       LogStream.emit(
-         type: "llm_retry_attempt",
-         agent: @agent_name,
-         swarm_id: @agent_context&.swarm_id,
-         parent_swarm_id: @agent_context&.parent_swarm_id,
-         model: model_id,
-         attempt: attempts,
-         max_retries: max_retries,
-         error_class: e.class.name,
-         error_message: e.message,
-         error_backtrace: e.backtrace,
-         retry_delay: delay,
-       )
+       # Fall through to standard retry logic
+       handle_retry_or_raise(e, attempts, max_retries, delay)
+     rescue StandardError => e
+       handle_retry_or_raise(e, attempts, max_retries, delay)
+     end
+   end
+ end
+
+ # Handle retry decision or re-raise error
+ #
+ # @param error [StandardError] The error that occurred
+ # @param attempts [Integer] Current attempt count
+ # @param max_retries [Integer] Maximum retry attempts
+ # @param delay [Integer] Delay between retries in seconds
+ # @raise [StandardError] Re-raises error if max retries exceeded
+ def handle_retry_or_raise(error, attempts, max_retries, delay)
+   if attempts >= max_retries
+     LogStream.emit(
+       type: "llm_retry_exhausted",
+       agent: @agent_name,
+       swarm_id: @agent_context&.swarm_id,
+       parent_swarm_id: @agent_context&.parent_swarm_id,
+       model: model_id,
+       attempts: attempts,
+       error_class: error.class.name,
+       error_message: error.message,
+       error_backtrace: error.backtrace,
+     )
+     raise
+   end
+
+   LogStream.emit(
+     type: "llm_retry_attempt",
+     agent: @agent_name,
+     swarm_id: @agent_context&.swarm_id,
+     parent_swarm_id: @agent_context&.parent_swarm_id,
+     model: model_id,
+     attempt: attempts,
+     max_retries: max_retries,
+     error_class: error.class.name,
+     error_message: error.message,
+     error_backtrace: error.backtrace,
+     retry_delay: delay,
+   )
+
+   sleep(delay)
+ end
+
+ # Recover from 400 Bad Request by pruning orphan tool calls
+ #
+ # @param error [RubyLLM::BadRequestError] The error that occurred
+ # @return [Integer] Number of orphan tool calls pruned (0 if none or not applicable)
+ def recover_from_orphan_tool_calls(error)
+   # Only attempt recovery for tool-related errors
+   error_message = error.message.to_s.downcase
+   tool_error_patterns = [
+     "tool_use",
+     "tool_result",
+     "tool_use_id",
+     "tool use",
+     "tool result",
+     "corresponding tool_result",
+     "must immediately follow",
+   ]
+
+   return 0 unless tool_error_patterns.any? { |pattern| error_message.include?(pattern) }
+
+   # Clear stale ephemeral content from the failed LLM call
+   # This is important because message indices changed after pruning
+   @context_manager&.clear_ephemeral
+
+   # Attempt to prune orphan tool calls
+   result = prune_orphan_tool_calls
+   pruned_count = result[:count]
+
+   if pruned_count > 0
+     LogStream.emit(
+       type: "orphan_tool_calls_pruned",
+       agent: @agent_name,
+       swarm_id: @agent_context&.swarm_id,
+       parent_swarm_id: @agent_context&.parent_swarm_id,
+       model: model_id,
+       pruned_count: pruned_count,
+       original_error: error.message,
+     )
+
+     # Add system reminder about pruned tool calls
+     add_orphan_tool_calls_reminder(result[:pruned_tools])
+   end
+
+   pruned_count
+ end
+
+ # Prune orphan tool calls from message history
+ #
+ # An orphan tool call is a tool_use in an assistant message that doesn't
+ # have a corresponding tool_result before the next user/assistant message.
+ #
+ # @return [Hash] { count: Integer, pruned_tools: Array<Hash> }
+ def prune_orphan_tool_calls
+   messages = @llm_chat.messages
+   return { count: 0, pruned_tools: [] } if messages.empty?
+
+   orphans = find_orphan_tool_calls(messages)
+   return { count: 0, pruned_tools: [] } if orphans.empty?
+
+   # Collect details about pruned tool calls
+   pruned_tools = collect_orphan_tool_details(messages, orphans)
+
+   # Build new message array with orphans removed
+   new_messages = remove_orphan_tool_calls(messages, orphans)
+
+   # Replace messages atomically
+   replace_messages(new_messages)
+
+   {
+     count: orphans.values.flatten.size,
+     pruned_tools: pruned_tools,
+   }
+ end
+
+ # Collect details about orphan tool calls for system reminder
+ #
+ # @param messages [Array<RubyLLM::Message>] Original messages
+ # @param orphans [Hash<Integer, Array<String>>] Map of message index to orphan tool_call_ids
+ # @return [Array<Hash>] Array of { name:, arguments: } hashes
+ def collect_orphan_tool_details(messages, orphans)
+   pruned_tools = []
+
+   orphans.each do |msg_idx, orphan_ids|
+     msg = messages[msg_idx]
+     next unless msg.tool_calls
+
+     orphan_ids.each do |tool_call_id|
+       tool_call = msg.tool_calls[tool_call_id]
+       next unless tool_call
+
+       pruned_tools << {
+         name: tool_call.name,
+         arguments: tool_call.arguments,
+       }
+     end
+   end
+
+   pruned_tools
+ end
+
+ # Add system reminder about pruned orphan tool calls
+ #
+ # @param pruned_tools [Array<Hash>] Array of { name:, arguments: } hashes
+ # @return [void]
+ def add_orphan_tool_calls_reminder(pruned_tools)
+   return if pruned_tools.empty?
+
+   # Format tool calls for the reminder
+   tool_list = pruned_tools.map do |tool|
+     args_str = format_tool_arguments(tool[:arguments])
+     "- #{tool[:name]}(#{args_str})"
+   end.join("\n")
+
+   reminder = <<~REMINDER
+     <system-reminder>
+     The following tool calls were interrupted and removed from conversation history:
+
+     #{tool_list}
+
+     These tools were never executed. If you still need their results, please run them again.
+     </system-reminder>
+   REMINDER
+
+   add_ephemeral_reminder(reminder.strip)
+ end
+
+ # Format tool arguments for display in reminder
+ #
+ # @param arguments [Hash] Tool call arguments
+ # @return [String] Formatted arguments
+ def format_tool_arguments(arguments)
+   return "" if arguments.nil? || arguments.empty?
+
+   # Format key-value pairs, truncating long values
+   args = arguments.map do |key, value|
+     formatted_value = if value.is_a?(String) && value.length > 50
+       "#{value[0...47]}..."
+     else
+       value.inspect
+     end
+     "#{key}: #{formatted_value}"
+   end
+
+   args.join(", ")
+ end
+
+ # Find all orphan tool calls in message history
+ #
+ # @param messages [Array<RubyLLM::Message>] Message array to scan
+ # @return [Hash<Integer, Array<String>>] Map of message index to orphan tool_call_ids
+ def find_orphan_tool_calls(messages)
+   orphans = {}
+
+   messages.each_with_index do |msg, idx|
+     next unless msg.role == :assistant && msg.tool_calls && !msg.tool_calls.empty?

-       sleep(delay)
+     # Get all tool_call_ids from this assistant message
+     expected_tool_call_ids = msg.tool_calls.keys.to_set
+
+     # Find tool results between this message and the next user/assistant message
+     found_tool_call_ids = Set.new
+
+     (idx + 1...messages.size).each do |subsequent_idx|
+       subsequent_msg = messages[subsequent_idx]
+
+       # Stop at next user or assistant message
+       break if [:user, :assistant].include?(subsequent_msg.role)
+
+       # Collect tool result IDs
+       if subsequent_msg.role == :tool && subsequent_msg.tool_call_id
+         found_tool_call_ids << subsequent_msg.tool_call_id
+       end
      end
+
+     # Identify orphan tool_call_ids (expected but not found)
+     orphan_ids = (expected_tool_call_ids - found_tool_call_ids).to_a
+     orphans[idx] = orphan_ids unless orphan_ids.empty?
    end
+
+   orphans
+ end
+
+ # Remove orphan tool calls from messages
+ #
+ # @param messages [Array<RubyLLM::Message>] Original messages
+ # @param orphans [Hash<Integer, Array<String>>] Map of message index to orphan tool_call_ids
+ # @return [Array<RubyLLM::Message>] New message array with orphans removed
+ def remove_orphan_tool_calls(messages, orphans)
+   messages.map.with_index do |msg, idx|
+     orphan_ids = orphans[idx]
+
+     # No orphans in this message - keep as-is
+     next msg unless orphan_ids
+
+     # Remove orphan tool_calls from this assistant message
+     remaining_tool_calls = msg.tool_calls.reject { |id, _| orphan_ids.include?(id) }
+
+     # If no tool_calls remain and no content, skip this message entirely
+     if remaining_tool_calls.empty? && (msg.content.nil? || msg.content.to_s.strip.empty?)
+       next nil
+     end
+
+     # Create new message with remaining tool_calls
+     RubyLLM::Message.new(
+       role: msg.role,
+       content: msg.content,
+       tool_calls: remaining_tool_calls.empty? ? nil : remaining_tool_calls,
+       model_id: msg.model_id,
+       input_tokens: msg.input_tokens,
+       output_tokens: msg.output_tokens,
+       cached_tokens: msg.cached_tokens,
+       cache_creation_tokens: msg.cache_creation_tokens,
+     )
+   end.compact
  end

  # Check if a tool call is a delegation tool
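
Reviewer note: the control flow of the new `call_llm_with_retry` (one uncounted, immediate retry after a successful recovery, then ordinary counted retries) can be sketched in isolation. This is a simplified illustration under assumptions: `FakeBadRequest`, `with_retry`, and the `recover:` callable are hypothetical names for this example, and logging is omitted.

```ruby
# Hypothetical error class standing in for RubyLLM::BadRequestError.
class FakeBadRequest < StandardError; end

# Retry a block up to max_retries times. A FakeBadRequest gets one "free"
# recovery: if the recover callable reports that it fixed something, the
# loop retries immediately without consuming an attempt or sleeping.
def with_retry(max_retries: 3, delay: 0, recover: -> { 0 })
  attempts = 0
  recovery_done = false

  loop do
    attempts += 1
    begin
      return yield(attempts)
    rescue FakeBadRequest
      if !recovery_done && recover.call > 0
        recovery_done = true
        attempts -= 1 # the recovered call does not count as a retry
        next
      end
      raise if attempts >= max_retries

      sleep(delay)
    rescue StandardError
      raise if attempts >= max_retries

      sleep(delay)
    end
  end
end

calls = []
history_fixed = false
result = with_retry(recover: -> { history_fixed = true; 1 }) do |attempt|
  calls << attempt
  raise FakeBadRequest, "tool_use without tool_result" unless history_fixed

  "ok"
end
```

Both invocations of the block see attempt number 1, which matches the "Attempt 1 ... Attempt 1: Success" trace in Scenario 4 of the documentation hunk.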
@@ -71,7 +71,7 @@ module SwarmSDK
  @provider = config[:provider] || SwarmSDK.config.default_provider
  @base_url = config[:base_url]
  @api_version = config[:api_version]
- @context_window = config[:context_window] # Explicit context window override
+ @context_window = coerce_to_integer(config[:context_window]) # Explicit context window override
  @parameters = config[:parameters] || {}
  @headers = Utils.stringify_keys(config[:headers] || {})
  @timeout = config[:timeout] || SwarmSDK.config.agent_request_timeout
@@ -447,6 +447,21 @@ module SwarmSDK
    end
  end

+ # Coerce value to integer if it's a numeric string
+ #
+ # YAML sometimes parses numbers as strings (especially when quoted).
+ # This ensures numeric values are properly converted.
+ #
+ # @param value [String, Integer, nil] Value to coerce
+ # @return [Integer, nil] Coerced integer or nil
+ def coerce_to_integer(value)
+   return if value.nil?
+   return value if value.is_a?(Integer)
+   return value.to_i if value.is_a?(String) && value.match?(/\A\d+\z/)
+
+   value
+ end
+
  def validate!
    raise ConfigurationError, "Agent '#{@name}' missing required 'description' field" unless @description
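
Reviewer note: the behavior of the added `coerce_to_integer` on YAML input can be checked with a small standalone script. The method body is copied from the hunk above; the YAML keys (`quoted`, `bare`, `odd`) are made up for this example.

```ruby
require "yaml"

# Same logic as the coerce_to_integer method added in this diff, copied as a
# standalone method for illustration.
def coerce_to_integer(value)
  return if value.nil?
  return value if value.is_a?(Integer)
  return value.to_i if value.is_a?(String) && value.match?(/\A\d+\z/)

  value
end

# Hypothetical config: a quoted number, a bare number, and a non-numeric string.
config = YAML.safe_load(<<~YAML)
  quoted: "128000"
  bare: 128000
  odd: "128k"
YAML

coerced = config.transform_values { |v| coerce_to_integer(v) }
```

Only strings made up entirely of digits are converted. A value such as `"128k"` is returned unchanged as a String, so despite the `[Integer, nil]` return annotation, callers may still receive a non-integer.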