llm_conductor 1.2.0 → 1.3.0

This diff shows the changes between publicly available package versions that have been released to one of the supported registries. It is provided for informational purposes only and reflects the packages as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: c6ed179bb9142839bcc6feab8d06d61c27ff8279406bc7839f6d09ba14cb573f
4
- data.tar.gz: a8ca32fecd9ac81326f7cefcf482f1b6a110b78ca2168c1c8ccbde5e034becb3
3
+ metadata.gz: bce592da24b8bb09f9702361a8d2de5051092290dd3b263f0026ddb877a8717b
4
+ data.tar.gz: 364a233ac3b1490010d949e15f83a3c45a5750ed117674ae2498508884cc365a
5
5
  SHA512:
6
- metadata.gz: 581da83914c51a3966010d03491c3f57be4ed393bb572f2fdc9d0205f8680f4891f2b058ecf7642ea7bf26bea452a976946b6198d0419afb2e771de3bc112aea
7
- data.tar.gz: 00eb70033cb739b7236b759a30219eb5eb6b72db7bba6c7ee519b98cf186e799cbf4f8696acf237d19a8fbfcca97dd9a189ce4f3b4f8f3d8a7d9ff1729d7eb86
6
+ metadata.gz: 3ea0a7fc5d5fe1f729e6eb76b9b81eb5b24aaad96ba59ef954637e00184eded4f6fd44c591ee3921f86dd3131403fc496a77b355bd59e60158849c2e3af44511
7
+ data.tar.gz: 322cfca7d9e8917761af1b5de1033d9c11f58fceaa8d79aa18feee6b65050c4ab123c340479d44adeafa42f85b25c9e406e17ffda0af0ed6f871cb3d4d7d682f
data/.rubocop.yml CHANGED
@@ -112,6 +112,8 @@ Metrics/PerceivedComplexity:
112
112
 
113
113
  Layout/LineLength:
114
114
  Max: 125
115
+ Exclude:
116
+ - 'examples/*.rb'
115
117
 
116
118
  # Performance cops (from .rubocop_todo.yml)
117
119
  Performance/RedundantEqualityComparisonBlock:
data/VISION_USAGE.md CHANGED
@@ -1,9 +1,55 @@
1
1
  # Vision/Multimodal Usage Guide
2
2
 
3
- This guide explains how to use vision/multimodal capabilities with the OpenRouter and Z.ai clients in LLM Conductor.
3
+ This guide explains how to use vision/multimodal capabilities with LLM Conductor. Vision support is available for Claude (Anthropic), GPT (OpenAI), OpenRouter, and Z.ai clients.
4
4
 
5
5
  ## Quick Start
6
6
 
7
+ ### Using Claude (Anthropic)
8
+
9
+ ```ruby
10
+ require 'llm_conductor'
11
+
12
+ # Configure
13
+ LlmConductor.configure do |config|
14
+ config.anthropic(api_key: ENV['ANTHROPIC_API_KEY'])
15
+ end
16
+
17
+ # Analyze an image
18
+ response = LlmConductor.generate(
19
+ model: 'claude-sonnet-4-20250514',
20
+ vendor: :anthropic,
21
+ prompt: {
22
+ text: 'What is in this image?',
23
+ images: 'https://example.com/image.jpg'
24
+ }
25
+ )
26
+
27
+ puts response.output
28
+ ```
29
+
30
+ ### Using GPT (OpenAI)
31
+
32
+ ```ruby
33
+ require 'llm_conductor'
34
+
35
+ # Configure
36
+ LlmConductor.configure do |config|
37
+ config.openai(api_key: ENV['OPENAI_API_KEY'])
38
+ end
39
+
40
+ # Analyze an image
41
+ response = LlmConductor.generate(
42
+ model: 'gpt-4o',
43
+ vendor: :openai,
44
+ prompt: {
45
+ text: 'What is in this image?',
46
+ images: 'https://example.com/image.jpg'
47
+ }
48
+ )
49
+
50
+ puts response.output
51
+ ```
52
+
7
53
  ### Using OpenRouter
8
54
 
9
55
  ```ruby
@@ -52,6 +98,23 @@ puts response.output
52
98
 
53
99
  ## Recommended Models
54
100
 
101
+ ### Claude Models (Anthropic)
102
+
103
+ For vision tasks via the Anthropic API:
104
+
105
+ - **`claude-sonnet-4-20250514`** - Claude Sonnet 4 (recommended default for vision) ✅
106
+ - **`claude-opus-4-20250514`** - Claude Opus 4 (maximum quality)
107
+ - **`claude-opus-4-1-20250805`** - Claude Opus 4.1 (newest flagship model)
108
+
109
+ ### GPT Models (OpenAI)
110
+
111
+ For vision tasks via the OpenAI API:
112
+
113
+ - **`gpt-4o`** - Latest GPT-4 Omni with advanced vision capabilities ✅
114
+ - **`gpt-4o-mini`** - Fast, cost-effective vision model
115
+ - **`gpt-4-turbo`** - Previous generation with vision support
116
+ - **`gpt-4-vision-preview`** - Legacy vision model (deprecated)
117
+
55
118
  ### OpenRouter Models
56
119
 
57
120
  For vision tasks via OpenRouter, these models work reliably:
@@ -103,12 +166,12 @@ response = LlmConductor.generate(
103
166
 
104
167
  ### 3. Image with Detail Level
105
168
 
106
- For high-resolution images, specify the detail level:
169
+ For high-resolution images, specify the detail level (supported by GPT and OpenRouter):
107
170
 
108
171
  ```ruby
109
172
  response = LlmConductor.generate(
110
- model: 'openai/gpt-4o-mini',
111
- vendor: :openrouter,
173
+ model: 'gpt-4o',
174
+ vendor: :openai,
112
175
  prompt: {
113
176
  text: 'Analyze this image in detail',
114
177
  images: [
@@ -118,19 +181,22 @@ response = LlmConductor.generate(
118
181
  )
119
182
  ```
120
183
 
121
- Detail levels:
184
+ Detail levels (GPT and OpenRouter only):
122
185
  - `'high'` - Better for detailed analysis (uses more tokens)
123
186
  - `'low'` - Faster, cheaper
124
187
  - `'auto'` - Let the model decide (provider default if not specified)
125
188
 
189
+ **Note:** Claude (Anthropic) and Z.ai don't support the `detail` parameter.
190
+
126
191
  ### 4. Raw Format (Advanced)
127
192
 
128
- For maximum control, use the OpenAI-compatible array format:
193
+ For maximum control, use provider-specific array formats:
129
194
 
195
+ **GPT/OpenRouter Format:**
130
196
  ```ruby
131
197
  response = LlmConductor.generate(
132
- model: 'openai/gpt-4o-mini',
133
- vendor: :openrouter,
198
+ model: 'gpt-4o',
199
+ vendor: :openai,
134
200
  prompt: [
135
201
  { type: 'text', text: 'What is in this image?' },
136
202
  { type: 'image_url', image_url: { url: 'https://example.com/image.jpg' } },
@@ -139,6 +205,18 @@ response = LlmConductor.generate(
139
205
  )
140
206
  ```
141
207
 
208
+ **Claude Format:**
209
+ ```ruby
210
+ response = LlmConductor.generate(
211
+ model: 'claude-sonnet-4-20250514',
212
+ vendor: :anthropic,
213
+ prompt: [
214
+ { type: 'image', source: { type: 'url', url: 'https://example.com/image.jpg' } },
215
+ { type: 'text', text: 'What is in this image? Describe it in detail.' }
216
+ ]
217
+ )
218
+ ```
219
+
142
220
  ## Text-Only Requests (Backward Compatible)
143
221
 
144
222
  The client still supports regular text-only requests:
@@ -204,6 +282,18 @@ response = LlmConductor.generate(
204
282
 
205
283
  ### Run Examples
206
284
 
285
+ For Claude:
286
+ ```bash
287
+ export ANTHROPIC_API_KEY='your-key'
288
+ ruby examples/claude_vision_usage.rb
289
+ ```
290
+
291
+ For GPT:
292
+ ```bash
293
+ export OPENAI_API_KEY='your-key'
294
+ ruby examples/gpt_vision_usage.rb
295
+ ```
296
+
207
297
  For OpenRouter:
208
298
  ```bash
209
299
  export OPENROUTER_API_KEY='your-key'
@@ -265,6 +355,8 @@ For production:
265
355
 
266
356
  ## Examples
267
357
 
358
+ - `examples/claude_vision_usage.rb` - Complete Claude vision examples with Claude Sonnet 4
359
+ - `examples/gpt_vision_usage.rb` - Complete GPT vision examples with GPT-4o
268
360
  - `examples/openrouter_vision_usage.rb` - Complete OpenRouter vision examples
269
361
  - `examples/zai_usage.rb` - Complete Z.ai GLM-4.5V examples including vision and text
270
362
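As a compact reference for the usage patterns this guide documents, the sketch below (not part of the released files) shows the three prompt shapes `LlmConductor.generate` accepts for vision-capable vendors, using the Claude Sonnet 4 model and `:anthropic` vendor from the examples above.

```ruby
require 'llm_conductor'

LlmConductor.configure do |config|
  config.anthropic(api_key: ENV['ANTHROPIC_API_KEY'])
end

model  = 'claude-sonnet-4-20250514'
vendor = :anthropic

# 1. Plain string prompt (text-only, backward compatible)
LlmConductor.generate(model:, vendor:, prompt: 'What is the capital of France?')

# 2. Hash prompt with :text and :images (a single URL or an array of URLs)
LlmConductor.generate(
  model:, vendor:,
  prompt: { text: 'What is in this image?', images: 'https://example.com/image.jpg' }
)

# 3. Raw array of provider-specific content parts (Claude format shown)
LlmConductor.generate(
  model:, vendor:,
  prompt: [
    { type: 'image', source: { type: 'url', url: 'https://example.com/image.jpg' } },
    { type: 'text', text: 'What is in this image?' }
  ]
)
```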
 
data/examples/claude_vision_usage.rb ADDED
@@ -0,0 +1,138 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require_relative '../lib/llm_conductor'
5
+
6
+ # This example demonstrates using Claude Sonnet 4 vision capabilities
7
+ # Set your Anthropic API key: export ANTHROPIC_API_KEY='your-key-here'
8
+
9
+ puts '=' * 80
10
+ puts 'Claude Sonnet 4 Vision Usage Examples'
11
+ puts '=' * 80
12
+ puts
13
+
14
+ # Check for API key
15
+ api_key = ENV['ANTHROPIC_API_KEY']
16
+ if api_key.nil? || api_key.empty?
17
+ puts 'ERROR: ANTHROPIC_API_KEY environment variable is not set!'
18
+ puts
19
+ puts 'Please set your Anthropic API key:'
20
+ puts ' export ANTHROPIC_API_KEY="your-key-here"'
21
+ puts
22
+ puts 'You can get an API key from: https://console.anthropic.com/'
23
+ exit 1
24
+ end
25
+
26
+ # Configure the client
27
+ LlmConductor.configure do |config|
28
+ config.anthropic(api_key:)
29
+ end
30
+
31
+ # Example 1: Single Image Analysis
32
+ puts "\n1. Single Image Analysis"
33
+ puts '-' * 80
34
+
35
+ begin
36
+ response = LlmConductor.generate(
37
+ model: 'claude-sonnet-4-20250514',
38
+ vendor: :anthropic,
39
+ prompt: {
40
+ text: 'What is in this image? Please describe it in detail.',
41
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
42
+ }
43
+ )
44
+
45
+ puts "Response: #{response.output}"
46
+ puts "Success: #{response.success?}"
47
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
48
+ puts "Metadata: #{response.metadata.inspect}" if response.metadata && !response.metadata.empty?
49
+ rescue StandardError => e
50
+ puts "ERROR: #{e.message}"
51
+ puts "Backtrace: #{e.backtrace.first(5).join("\n")}"
52
+ end
53
+
54
+ # Example 2: Multiple Images Comparison
55
+ puts "\n2. Multiple Images Comparison"
56
+ puts '-' * 80
57
+
58
+ response = LlmConductor.generate(
59
+ model: 'claude-sonnet-4-20250514',
60
+ vendor: :anthropic,
61
+ prompt: {
62
+ text: 'Compare these two images. What are the main differences?',
63
+ images: [
64
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg',
65
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Placeholder_view_vector.svg/1024px-Placeholder_view_vector.svg.png'
66
+ ]
67
+ }
68
+ )
69
+
70
+ puts "Response: #{response.output}"
71
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
72
+
73
+ # Example 3: Image with Specific Question
74
+ puts "\n3. Image with Specific Question"
75
+ puts '-' * 80
76
+
77
+ response = LlmConductor.generate(
78
+ model: 'claude-sonnet-4-20250514',
79
+ vendor: :anthropic,
80
+ prompt: {
81
+ text: 'Is there a wooden boardwalk visible in this image? If yes, describe its condition.',
82
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
83
+ }
84
+ )
85
+
86
+ puts "Response: #{response.output}"
87
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
88
+
89
+ # Example 4: Raw Format (Advanced)
90
+ puts "\n4. Raw Format (Advanced)"
91
+ puts '-' * 80
92
+
93
+ response = LlmConductor.generate(
94
+ model: 'claude-sonnet-4-20250514',
95
+ vendor: :anthropic,
96
+ prompt: [
97
+ { type: 'image',
98
+ source: { type: 'url',
99
+ url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg' } },
100
+ { type: 'text', text: 'Describe the weather conditions in this image.' }
101
+ ]
102
+ )
103
+
104
+ puts "Response: #{response.output}"
105
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
106
+
107
+ # Example 5: Text-Only Request (Backward Compatible)
108
+ puts "\n5. Text-Only Request (Backward Compatible)"
109
+ puts '-' * 80
110
+
111
+ response = LlmConductor.generate(
112
+ model: 'claude-sonnet-4-20250514',
113
+ vendor: :anthropic,
114
+ prompt: 'What is the capital of France?'
115
+ )
116
+
117
+ puts "Response: #{response.output}"
118
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
119
+
120
+ # Example 6: Image Analysis with Detailed Instructions
121
+ puts "\n6. Image Analysis with Detailed Instructions"
122
+ puts '-' * 80
123
+
124
+ response = LlmConductor.generate(
125
+ model: 'claude-sonnet-4-20250514',
126
+ vendor: :anthropic,
127
+ prompt: {
128
+ text: 'Analyze this image and provide: 1) Main subjects, 2) Colors and lighting, 3) Mood or atmosphere, 4) Any notable details',
129
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
130
+ }
131
+ )
132
+
133
+ puts "Response: #{response.output}"
134
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
135
+
136
+ puts "\n#{'=' * 80}"
137
+ puts 'All examples completed successfully!'
138
+ puts '=' * 80
data/examples/gpt_vision_usage.rb ADDED
@@ -0,0 +1,156 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require_relative '../lib/llm_conductor'
5
+
6
+ # This example demonstrates using GPT-4o vision capabilities
7
+ # Set your OpenAI API key: export OPENAI_API_KEY='your-key-here'
8
+
9
+ puts '=' * 80
10
+ puts 'GPT-4o Vision Usage Examples'
11
+ puts '=' * 80
12
+ puts
13
+
14
+ # Check for API key
15
+ api_key = ENV['OPENAI_API_KEY']
16
+ if api_key.nil? || api_key.empty?
17
+ puts 'ERROR: OPENAI_API_KEY environment variable is not set!'
18
+ puts
19
+ puts 'Please set your OpenAI API key:'
20
+ puts ' export OPENAI_API_KEY="your-key-here"'
21
+ puts
22
+ puts 'You can get an API key from: https://platform.openai.com/api-keys'
23
+ exit 1
24
+ end
25
+
26
+ # Configure the client
27
+ LlmConductor.configure do |config|
28
+ config.openai(api_key:)
29
+ end
30
+
31
+ # Example 1: Single Image Analysis
32
+ puts "\n1. Single Image Analysis"
33
+ puts '-' * 80
34
+
35
+ response = LlmConductor.generate(
36
+ model: 'gpt-4o',
37
+ vendor: :openai,
38
+ prompt: {
39
+ text: 'What is in this image? Please describe it in detail.',
40
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
41
+ }
42
+ )
43
+
44
+ puts "Response: #{response.output}"
45
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
46
+
47
+ # Example 2: Multiple Images Comparison
48
+ puts "\n2. Multiple Images Comparison"
49
+ puts '-' * 80
50
+
51
+ response = LlmConductor.generate(
52
+ model: 'gpt-4o',
53
+ vendor: :openai,
54
+ prompt: {
55
+ text: 'Compare these two images. What are the main differences?',
56
+ images: [
57
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg',
58
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Placeholder_view_vector.svg/1024px-Placeholder_view_vector.svg.png'
59
+ ]
60
+ }
61
+ )
62
+
63
+ puts "Response: #{response.output}"
64
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
65
+
66
+ # Example 3: Image with Detail Level - High Resolution
67
+ puts "\n3. Image with Detail Level - High Resolution"
68
+ puts '-' * 80
69
+
70
+ response = LlmConductor.generate(
71
+ model: 'gpt-4o',
72
+ vendor: :openai,
73
+ prompt: {
74
+ text: 'Analyze this high-resolution image in detail. What are all the elements you can see?',
75
+ images: [
76
+ { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg', detail: 'high' }
77
+ ]
78
+ }
79
+ )
80
+
81
+ puts "Response: #{response.output}"
82
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
83
+
84
+ # Example 4: Image with Detail Level - Low (Faster, Cheaper)
85
+ puts "\n4. Image with Detail Level - Low (Faster, Cheaper)"
86
+ puts '-' * 80
87
+
88
+ response = LlmConductor.generate(
89
+ model: 'gpt-4o',
90
+ vendor: :openai,
91
+ prompt: {
92
+ text: 'Give me a quick description of this image.',
93
+ images: [
94
+ { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg', detail: 'low' }
95
+ ]
96
+ }
97
+ )
98
+
99
+ puts "Response: #{response.output}"
100
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
101
+
102
+ # Example 5: Raw Format (Advanced)
103
+ puts "\n5. Raw Format (Advanced)"
104
+ puts '-' * 80
105
+
106
+ response = LlmConductor.generate(
107
+ model: 'gpt-4o',
108
+ vendor: :openai,
109
+ prompt: [
110
+ { type: 'text', text: 'What is in this image?' },
111
+ { type: 'image_url',
112
+ image_url: { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg' } },
113
+ { type: 'text', text: 'Describe the weather conditions.' }
114
+ ]
115
+ )
116
+
117
+ puts "Response: #{response.output}"
118
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
119
+
120
+ # Example 6: Text-Only Request (Backward Compatible)
121
+ puts "\n6. Text-Only Request (Backward Compatible)"
122
+ puts '-' * 80
123
+
124
+ response = LlmConductor.generate(
125
+ model: 'gpt-4o',
126
+ vendor: :openai,
127
+ prompt: 'What is the capital of France?'
128
+ )
129
+
130
+ puts "Response: #{response.output}"
131
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
132
+
133
+ # Example 7: Multiple Images with Mixed Detail Levels
134
+ puts "\n7. Multiple Images with Mixed Detail Levels"
135
+ puts '-' * 80
136
+
137
+ response = LlmConductor.generate(
138
+ model: 'gpt-4o',
139
+ vendor: :openai,
140
+ prompt: {
141
+ text: 'Compare these images at different detail levels.',
142
+ images: [
143
+ {
144
+ url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg', detail: 'high'
145
+ },
146
+ { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Placeholder_view_vector.svg/1024px-Placeholder_view_vector.svg.png', detail: 'low' }
147
+ ]
148
+ }
149
+ )
150
+
151
+ puts "Response: #{response.output}"
152
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
153
+
154
+ puts "\n#{'=' * 80}"
155
+ puts 'All examples completed successfully!'
156
+ puts '=' * 80
data/lib/llm_conductor/clients/anthropic_client.rb CHANGED
@@ -1,18 +1,23 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require 'anthropic'
4
+ require_relative 'concerns/vision_support'
4
5
 
5
6
  module LlmConductor
6
7
  module Clients
7
8
  # Anthropic Claude client implementation for accessing Claude models via Anthropic API
9
+ # Supports both text-only and multimodal (vision) requests
8
10
  class AnthropicClient < BaseClient
11
+ include Concerns::VisionSupport
12
+
9
13
  private
10
14
 
11
15
  def generate_content(prompt)
16
+ content = format_content(prompt)
12
17
  response = client.messages.create(
13
18
  model:,
14
19
  max_tokens: 4096,
15
- messages: [{ role: 'user', content: prompt }]
20
+ messages: [{ role: 'user', content: }]
16
21
  )
17
22
 
18
23
  response.content.first.text
@@ -20,6 +25,28 @@ module LlmConductor
20
25
  raise StandardError, "Anthropic API error: #{e.message}"
21
26
  end
22
27
 
28
+ # Anthropic uses a different image format than OpenAI
29
+ # Format: { type: 'image', source: { type: 'url', url: '...' } }
30
+ def format_image_url(url)
31
+ { type: 'image', source: { type: 'url', url: } }
32
+ end
33
+
34
+ def format_image_hash(image_hash)
35
+ # Anthropic doesn't have a 'detail' parameter like OpenAI
36
+ {
37
+ type: 'image',
38
+ source: {
39
+ type: 'url',
40
+ url: image_hash[:url] || image_hash['url']
41
+ }
42
+ }
43
+ end
44
+
45
+ # Anthropic recommends placing images before text
46
+ def images_before_text?
47
+ true
48
+ end
49
+
23
50
  def client
24
51
  @client ||= begin
25
52
  config = LlmConductor.configuration.provider_config(:anthropic)
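To make the effect of these Anthropic-specific hooks concrete, here is a short sketch (not part of the gem) of the content array that `format_content` from the shared concern should produce for a hash prompt, based on `format_image_url` and `images_before_text?` as added in this diff.

```ruby
# Input accepted by AnthropicClient#generate_content after this change:
prompt = { text: 'What is in this image?', images: 'https://example.com/image.jpg' }

# Expected result of format_content(prompt): images first (images_before_text? is true),
# each image in Anthropic's { type: 'image', source: { type: 'url', url: } } shape.
content = [
  { type: 'image', source: { type: 'url', url: 'https://example.com/image.jpg' } },
  { type: 'text', text: 'What is in this image?' }
]

# The client then sends it unchanged:
#   client.messages.create(model:, max_tokens: 4096,
#                          messages: [{ role: 'user', content: }])
```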
data/lib/llm_conductor/clients/concerns/vision_support.rb ADDED
@@ -0,0 +1,159 @@
1
+ # frozen_string_literal: true
2
+
3
+ module LlmConductor
4
+ module Clients
5
+ module Concerns
6
+ # Shared module for vision/multimodal support across different LLM clients
7
+ # Provides common functionality for formatting images and text content
8
+ module VisionSupport
9
+ private
10
+
11
+ # Override token calculation to handle multimodal content
12
+ def calculate_tokens(content)
13
+ case content
14
+ when String then super(content)
15
+ when Hash then calculate_tokens_from_hash(content)
16
+ when Array then calculate_tokens_from_array(content)
17
+ else super(content.to_s)
18
+ end
19
+ end
20
+
21
+ # Calculate tokens from a hash containing text and/or images
22
+ # @param content_hash [Hash] Hash with :text and/or :images keys
23
+ # @return [Integer] Token count for text portion
24
+ def calculate_tokens_from_hash(content_hash)
25
+ text = content_hash[:text] || content_hash['text'] || ''
26
+ # Call the parent class's calculate_tokens with the extracted text
27
+ method(:calculate_tokens).super_method.call(text)
28
+ end
29
+
30
+ # Calculate tokens from an array of content parts
31
+ # @param content_array [Array] Array of content parts with type and text
32
+ # @return [Integer] Token count for all text parts
33
+ def calculate_tokens_from_array(content_array)
34
+ text_parts = extract_text_from_array(content_array)
35
+ # Call the parent class's calculate_tokens with the joined text
36
+ method(:calculate_tokens).super_method.call(text_parts)
37
+ end
38
+
39
+ # Extract and join text from array of content parts
40
+ # @param content_array [Array] Array of content parts
41
+ # @return [String] Joined text from all text parts
42
+ def extract_text_from_array(content_array)
43
+ content_array
44
+ .select { |part| text_part?(part) }
45
+ .map { |part| extract_text_from_part(part) }
46
+ .join(' ')
47
+ end
48
+
49
+ # Check if a content part is a text part
50
+ # @param part [Hash] Content part
51
+ # @return [Boolean] true if part is a text type
52
+ def text_part?(part)
53
+ part[:type] == 'text' || part['type'] == 'text'
54
+ end
55
+
56
+ # Extract text from a content part
57
+ # @param part [Hash] Content part with text
58
+ # @return [String] Text content
59
+ def extract_text_from_part(part)
60
+ part[:text] || part['text'] || ''
61
+ end
62
+
63
+ # Format content based on whether it's a simple string or multimodal content
64
+ # @param prompt [String, Hash, Array] The prompt content
65
+ # @return [String, Array] Formatted content for the API
66
+ def format_content(prompt)
67
+ case prompt
68
+ when Hash
69
+ # Handle hash with text and/or images
70
+ format_multimodal_hash(prompt)
71
+ when Array
72
+ # Already formatted as array of content parts
73
+ prompt
74
+ else
75
+ # Simple string prompt
76
+ prompt.to_s
77
+ end
78
+ end
79
+
80
+ # Format a hash containing text and/or images into multimodal content array
81
+ # @param prompt_hash [Hash] Hash with :text and/or :images keys
82
+ # @return [Array] Array of content parts for the API
83
+ def format_multimodal_hash(prompt_hash)
84
+ content_parts = []
85
+
86
+ # Add image parts (order depends on provider)
87
+ images = prompt_hash[:images] || prompt_hash['images'] || []
88
+ images = [images] unless images.is_a?(Array)
89
+
90
+ if images_before_text?
91
+ # Anthropic recommends images before text
92
+ images.each { |image| content_parts << format_image_part(image) }
93
+ add_text_part(content_parts, prompt_hash)
94
+ else
95
+ # OpenAI/most others: text before images
96
+ add_text_part(content_parts, prompt_hash)
97
+ images.each { |image| content_parts << format_image_part(image) }
98
+ end
99
+
100
+ content_parts
101
+ end
102
+
103
+ # Add text part to content array if present
104
+ # @param content_parts [Array] The content parts array
105
+ # @param prompt_hash [Hash] Hash with :text key
106
+ def add_text_part(content_parts, prompt_hash)
107
+ return unless prompt_hash[:text] || prompt_hash['text']
108
+
109
+ text = prompt_hash[:text] || prompt_hash['text']
110
+ content_parts << { type: 'text', text: }
111
+ end
112
+
113
+ # Format an image into the appropriate API structure
114
+ # This method should be overridden by clients that need different formats
115
+ # @param image [String, Hash] Image URL or hash with url/detail keys
116
+ # @return [Hash] Formatted image part for the API
117
+ def format_image_part(image)
118
+ case image
119
+ when String
120
+ format_image_url(image)
121
+ when Hash
122
+ format_image_hash(image)
123
+ end
124
+ end
125
+
126
+ # Format a simple image URL string
127
+ # Override this in subclasses for provider-specific format
128
+ # @param url [String] Image URL
129
+ # @return [Hash] Formatted image part
130
+ def format_image_url(url)
131
+ # Default: OpenAI format
132
+ { type: 'image_url', image_url: { url: } }
133
+ end
134
+
135
+ # Format an image hash with url and optional detail
136
+ # Override this in subclasses for provider-specific format
137
+ # @param image_hash [Hash] Hash with url and optional detail keys
138
+ # @return [Hash] Formatted image part
139
+ def format_image_hash(image_hash)
140
+ # Default: OpenAI format with detail support
141
+ {
142
+ type: 'image_url',
143
+ image_url: {
144
+ url: image_hash[:url] || image_hash['url'],
145
+ detail: image_hash[:detail] || image_hash['detail']
146
+ }.compact
147
+ }
148
+ end
149
+
150
+ # Whether to place images before text in the content array
151
+ # Override this in subclasses if needed (e.g., Anthropic recommends images first)
152
+ # @return [Boolean] true if images should come before text
153
+ def images_before_text?
154
+ false
155
+ end
156
+ end
157
+ end
158
+ end
159
+ end
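The concern is written as a template: `format_image_url`, `format_image_hash`, and `images_before_text?` are the documented override points, with OpenAI-style defaults. As an illustration only (this class does not exist in the gem), a new provider client could reuse the shared formatting roughly like this:

```ruby
# frozen_string_literal: true

require_relative 'concerns/vision_support'

module LlmConductor
  module Clients
    # Hypothetical client, shown only to illustrate the VisionSupport hooks.
    class ExampleVisionClient < BaseClient
      include Concerns::VisionSupport

      private

      def generate_content(prompt)
        # format_content returns the prompt string unchanged, or an array of
        # content parts (text plus formatted images) for hash/array prompts.
        content = format_content(prompt)
        # ... send `content` to the provider's chat/messages endpoint here ...
      end

      # Override when the provider expects a different image part shape than
      # the OpenAI-style default { type: 'image_url', image_url: { url: } }.
      def format_image_url(url)
        { type: 'image_url', image_url: { url: } }
      end

      # Override to emit images before text, as the Anthropic client does.
      def images_before_text?
        false
      end
    end
  end
end
```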
data/lib/llm_conductor/clients/gpt_client.rb CHANGED
@@ -1,13 +1,19 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require_relative 'concerns/vision_support'
4
+
3
5
  module LlmConductor
4
6
  module Clients
5
7
  # OpenAI GPT client implementation for accessing GPT models via OpenAI API
8
+ # Supports both text-only and multimodal (vision) requests
6
9
  class GptClient < BaseClient
10
+ include Concerns::VisionSupport
11
+
7
12
  private
8
13
 
9
14
  def generate_content(prompt)
10
- client.chat(parameters: { model:, messages: [{ role: 'user', content: prompt }] })
15
+ content = format_content(prompt)
16
+ client.chat(parameters: { model:, messages: [{ role: 'user', content: }] })
11
17
  .dig('choices', 0, 'message', 'content')
12
18
  end
13
19
 
data/lib/llm_conductor/clients/openrouter_client.rb CHANGED
@@ -1,32 +1,15 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require_relative 'concerns/vision_support'
4
+
3
5
  module LlmConductor
4
6
  module Clients
5
7
  # OpenRouter client implementation for accessing various LLM providers through OpenRouter API
6
8
  # Supports both text-only and multimodal (vision) requests
7
9
  class OpenrouterClient < BaseClient
8
- private
10
+ include Concerns::VisionSupport
9
11
 
10
- # Override token calculation to handle multimodal content
11
- def calculate_tokens(content)
12
- case content
13
- when String
14
- super(content)
15
- when Hash
16
- # For multimodal content, count tokens only for text part
17
- # Note: This is an approximation as images have variable token counts
18
- text = content[:text] || content['text'] || ''
19
- super(text)
20
- when Array
21
- # For pre-formatted arrays, extract and count text parts
22
- text_parts = content.select { |part| part[:type] == 'text' || part['type'] == 'text' }
23
- .map { |part| part[:text] || part['text'] || '' }
24
- .join(' ')
25
- super(text_parts)
26
- else
27
- super(content.to_s)
28
- end
29
- end
12
+ private
30
13
 
31
14
  def generate_content(prompt)
32
15
  content = format_content(prompt)
@@ -61,66 +44,6 @@ module LlmConductor
61
44
  end
62
45
  end
63
46
 
64
- # Format content based on whether it's a simple string or multimodal content
65
- # @param prompt [String, Hash, Array] The prompt content
66
- # @return [String, Array] Formatted content for the API
67
- def format_content(prompt)
68
- case prompt
69
- when Hash
70
- # Handle hash with text and/or images
71
- format_multimodal_hash(prompt)
72
- when Array
73
- # Already formatted as array of content parts
74
- prompt
75
- else
76
- # Simple string prompt
77
- prompt.to_s
78
- end
79
- end
80
-
81
- # Format a hash containing text and/or images into multimodal content array
82
- # @param prompt_hash [Hash] Hash with :text and/or :images keys
83
- # @return [Array] Array of content parts for the API
84
- def format_multimodal_hash(prompt_hash)
85
- content_parts = []
86
-
87
- # Add text part if present
88
- if prompt_hash[:text] || prompt_hash['text']
89
- text = prompt_hash[:text] || prompt_hash['text']
90
- content_parts << { type: 'text', text: }
91
- end
92
-
93
- # Add image parts if present
94
- images = prompt_hash[:images] || prompt_hash['images'] || []
95
- images = [images] unless images.is_a?(Array)
96
-
97
- images.each do |image|
98
- content_parts << format_image_part(image)
99
- end
100
-
101
- content_parts
102
- end
103
-
104
- # Format an image into the appropriate API structure
105
- # @param image [String, Hash] Image URL or hash with url/detail keys
106
- # @return [Hash] Formatted image part for the API
107
- def format_image_part(image)
108
- case image
109
- when String
110
- # Simple URL string
111
- { type: 'image_url', image_url: { url: image } }
112
- when Hash
113
- # Hash with url and optional detail level
114
- {
115
- type: 'image_url',
116
- image_url: {
117
- url: image[:url] || image['url'],
118
- detail: image[:detail] || image['detail']
119
- }.compact
120
- }
121
- end
122
- end
123
-
124
47
  def client
125
48
  @client ||= begin
126
49
  config = LlmConductor.configuration.provider_config(:openrouter)
data/lib/llm_conductor/clients/zai_client.rb CHANGED
@@ -1,5 +1,7 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require_relative 'concerns/vision_support'
4
+
3
5
  module LlmConductor
4
6
  module Clients
5
7
  # Z.ai client implementation for accessing GLM models including GLM-4.5V
@@ -8,28 +10,9 @@ module LlmConductor
8
10
  # Note: Z.ai uses OpenAI-compatible API format but with /v4/ path instead of /v1/
9
11
  # We use Faraday directly instead of the ruby-openai gem to properly handle the API path
10
12
  class ZaiClient < BaseClient
11
- private
13
+ include Concerns::VisionSupport
12
14
 
13
- # Override token calculation to handle multimodal content
14
- def calculate_tokens(content)
15
- case content
16
- when String
17
- super(content)
18
- when Hash
19
- # For multimodal content, count tokens only for text part
20
- # Note: This is an approximation as images have variable token counts
21
- text = content[:text] || content['text'] || ''
22
- super(text)
23
- when Array
24
- # For pre-formatted arrays, extract and count text parts
25
- text_parts = content.select { |part| part[:type] == 'text' || part['type'] == 'text' }
26
- .map { |part| part[:text] || part['text'] || '' }
27
- .join(' ')
28
- super(text_parts)
29
- else
30
- super(content.to_s)
31
- end
32
- end
15
+ private
33
16
 
34
17
  def generate_content(prompt)
35
18
  content = format_content(prompt)
@@ -67,66 +50,6 @@ module LlmConductor
67
50
  end
68
51
  end
69
52
 
70
- # Format content based on whether it's a simple string or multimodal content
71
- # @param prompt [String, Hash, Array] The prompt content
72
- # @return [String, Array] Formatted content for the API
73
- def format_content(prompt)
74
- case prompt
75
- when Hash
76
- # Handle hash with text and/or images
77
- format_multimodal_hash(prompt)
78
- when Array
79
- # Already formatted as array of content parts
80
- prompt
81
- else
82
- # Simple string prompt
83
- prompt.to_s
84
- end
85
- end
86
-
87
- # Format a hash containing text and/or images into multimodal content array
88
- # @param prompt_hash [Hash] Hash with :text and/or :images keys
89
- # @return [Array] Array of content parts for the API
90
- def format_multimodal_hash(prompt_hash)
91
- content_parts = []
92
-
93
- # Add text part if present
94
- if prompt_hash[:text] || prompt_hash['text']
95
- text = prompt_hash[:text] || prompt_hash['text']
96
- content_parts << { type: 'text', text: }
97
- end
98
-
99
- # Add image parts if present
100
- images = prompt_hash[:images] || prompt_hash['images'] || []
101
- images = [images] unless images.is_a?(Array)
102
-
103
- images.each do |image|
104
- content_parts << format_image_part(image)
105
- end
106
-
107
- content_parts
108
- end
109
-
110
- # Format an image into the appropriate API structure
111
- # @param image [String, Hash] Image URL or hash with url/detail keys
112
- # @return [Hash] Formatted image part for the API
113
- def format_image_part(image)
114
- case image
115
- when String
116
- # Simple URL string or base64 data
117
- { type: 'image_url', image_url: { url: image } }
118
- when Hash
119
- # Hash with url and optional detail level
120
- {
121
- type: 'image_url',
122
- image_url: {
123
- url: image[:url] || image['url'],
124
- detail: image[:detail] || image['detail']
125
- }.compact
126
- }
127
- end
128
- end
129
-
130
53
  # HTTP client for making requests to Z.ai API
131
54
  # Z.ai uses /v4/ in their path, not /v1/ like OpenAI, so we use Faraday directly
132
55
  def http_client
data/lib/llm_conductor/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module LlmConductor
4
- VERSION = '1.2.0'
4
+ VERSION = '1.3.0'
5
5
  end
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: llm_conductor
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.2.0
4
+ version: 1.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ben Zheng
8
8
  bindir: exe
9
9
  cert_chain: []
10
- date: 2025-10-29 00:00:00.000000000 Z
10
+ date: 2025-11-04 00:00:00.000000000 Z
11
11
  dependencies:
12
12
  - !ruby/object:Gem::Dependency
13
13
  name: activesupport
@@ -154,8 +154,10 @@ files:
154
154
  - Rakefile
155
155
  - VISION_USAGE.md
156
156
  - config/initializers/llm_conductor.rb
157
+ - examples/claude_vision_usage.rb
157
158
  - examples/data_builder_usage.rb
158
159
  - examples/gemini_usage.rb
160
+ - examples/gpt_vision_usage.rb
159
161
  - examples/groq_usage.rb
160
162
  - examples/openrouter_vision_usage.rb
161
163
  - examples/prompt_registration.rb
@@ -166,6 +168,7 @@ files:
166
168
  - lib/llm_conductor/client_factory.rb
167
169
  - lib/llm_conductor/clients/anthropic_client.rb
168
170
  - lib/llm_conductor/clients/base_client.rb
171
+ - lib/llm_conductor/clients/concerns/vision_support.rb
169
172
  - lib/llm_conductor/clients/gemini_client.rb
170
173
  - lib/llm_conductor/clients/gpt_client.rb
171
174
  - lib/llm_conductor/clients/groq_client.rb