llm_conductor 1.2.0 → 1.4.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: c6ed179bb9142839bcc6feab8d06d61c27ff8279406bc7839f6d09ba14cb573f
4
- data.tar.gz: a8ca32fecd9ac81326f7cefcf482f1b6a110b78ca2168c1c8ccbde5e034becb3
3
+ metadata.gz: 8e5bb3310ea1328acac93c59e7bc227e63de52397e51402eee0eb921eb92acc8
4
+ data.tar.gz: cacb73f7d04e46a100581df3b77781a98d0d4277fc726a0b32c266443999f66a
5
5
  SHA512:
6
- metadata.gz: 581da83914c51a3966010d03491c3f57be4ed393bb572f2fdc9d0205f8680f4891f2b058ecf7642ea7bf26bea452a976946b6198d0419afb2e771de3bc112aea
7
- data.tar.gz: 00eb70033cb739b7236b759a30219eb5eb6b72db7bba6c7ee519b98cf186e799cbf4f8696acf237d19a8fbfcca97dd9a189ce4f3b4f8f3d8a7d9ff1729d7eb86
6
+ metadata.gz: 97d9c89718834420532c391c207790b416048b65958f3b6d7feac008f099cc533c644c614a014bc437b1631cf9fa60c2d0e340b8bea8316217e950af5449764b
7
+ data.tar.gz: 4fb3301001cebc258485568ebfa0078b3773b3563dd5149c8cd7527cc9a306ae41785fea6a73f09d1dca19a09f1af92b88611efa283e80efed317ef876f59330
data/.rubocop.yml CHANGED
@@ -112,6 +112,8 @@ Metrics/PerceivedComplexity:
112
112
 
113
113
  Layout/LineLength:
114
114
  Max: 125
115
+ Exclude:
116
+ - 'examples/*.rb'
115
117
 
116
118
  # Performance cops (from .rubocop_todo.yml)
117
119
  Performance/RedundantEqualityComparisonBlock:
data/VISION_USAGE.md CHANGED
@@ -1,9 +1,55 @@
1
1
  # Vision/Multimodal Usage Guide
2
2
 
3
- This guide explains how to use vision/multimodal capabilities with the OpenRouter and Z.ai clients in LLM Conductor.
3
+ This guide explains how to use vision/multimodal capabilities with LLM Conductor. Vision support is available for Claude (Anthropic), GPT (OpenAI), Gemini (Google), OpenRouter, and Z.ai clients.
4
4
 
5
5
  ## Quick Start
6
6
 
7
+ ### Using Claude (Anthropic)
8
+
9
+ ```ruby
10
+ require 'llm_conductor'
11
+
12
+ # Configure
13
+ LlmConductor.configure do |config|
14
+ config.anthropic(api_key: ENV['ANTHROPIC_API_KEY'])
15
+ end
16
+
17
+ # Analyze an image
18
+ response = LlmConductor.generate(
19
+ model: 'claude-sonnet-4-20250514',
20
+ vendor: :anthropic,
21
+ prompt: {
22
+ text: 'What is in this image?',
23
+ images: 'https://example.com/image.jpg'
24
+ }
25
+ )
26
+
27
+ puts response.output
28
+ ```
29
+
30
+ ### Using GPT (OpenAI)
31
+
32
+ ```ruby
33
+ require 'llm_conductor'
34
+
35
+ # Configure
36
+ LlmConductor.configure do |config|
37
+ config.openai(api_key: ENV['OPENAI_API_KEY'])
38
+ end
39
+
40
+ # Analyze an image
41
+ response = LlmConductor.generate(
42
+ model: 'gpt-4o',
43
+ vendor: :openai,
44
+ prompt: {
45
+ text: 'What is in this image?',
46
+ images: 'https://example.com/image.jpg'
47
+ }
48
+ )
49
+
50
+ puts response.output
51
+ ```
52
+
7
53
  ### Using OpenRouter
8
54
 
9
55
  ```ruby
@@ -27,6 +73,29 @@ response = LlmConductor.generate(
27
73
  puts response.output
28
74
  ```
29
75
 
76
+ ### Using Gemini (Google)
77
+
78
+ ```ruby
79
+ require 'llm_conductor'
80
+
81
+ # Configure
82
+ LlmConductor.configure do |config|
83
+ config.gemini(api_key: ENV['GEMINI_API_KEY'])
84
+ end
85
+
86
+ # Analyze an image
87
+ response = LlmConductor.generate(
88
+ model: 'gemini-2.5-flash',
89
+ vendor: :gemini,
90
+ prompt: {
91
+ text: 'What is in this image?',
92
+ images: 'https://cdn.autonomous.ai/production/ecm/230930/10-Comfortable-Office-Chairs-for-Gaming-A-Comprehensive-Review00002.webp'
93
+ }
94
+ )
95
+
96
+ puts response.output
97
+ ```
98
+
30
99
  ### Using Z.ai (Zhipu AI)
31
100
 
32
101
  ```ruby
@@ -52,6 +121,23 @@ puts response.output
52
121
 
53
122
  ## Recommended Models
54
123
 
124
+ ### Claude Models (Anthropic)
125
+
126
+ For vision tasks via Anthropic API:
127
+
128
+ - **`claude-sonnet-4-20250514`** - Claude Sonnet 4 (latest, best for vision) ✅
129
+ - **`claude-opus-4-20250514`** - Claude Opus 4 (maximum quality)
130
+ - **`claude-opus-4-1-20250805`** - Claude Opus 4.1 (newest flagship model)
131
+
132
+ ### GPT Models (OpenAI)
133
+
134
+ For vision tasks via OpenAI API:
135
+
136
+ - **`gpt-4o`** - Latest GPT-4 Omni with advanced vision capabilities ✅
137
+ - **`gpt-4o-mini`** - Fast, cost-effective vision model
138
+ - **`gpt-4-turbo`** - Previous generation with vision support
139
+ - **`gpt-4-vision-preview`** - Legacy vision model (deprecated)
140
+
55
141
  ### OpenRouter Models
56
142
 
57
143
  For vision tasks via OpenRouter, these models work reliably:
@@ -61,6 +147,17 @@ For vision tasks via OpenRouter, these models work reliably:
61
147
  - **`anthropic/claude-3.5-sonnet`** - High quality analysis
62
148
  - **`openai/gpt-4o`** - Best quality (higher cost)
63
149
 
150
+ ### Gemini Models (Google)
151
+
152
+ For vision tasks via Google Gemini API:
153
+
154
+ - **`gemini-2.0-flash`** - Gemini 2.0 Flash (fast, efficient, multimodal) ✅
155
+ - **`gemini-2.5-flash`** - Gemini 2.5 Flash (latest fast model)
156
+ - **`gemini-1.5-pro`** - Gemini 1.5 Pro (high quality, large context window)
157
+ - **`gemini-1.5-flash`** - Gemini 1.5 Flash (previous generation fast model)
158
+
159
+ **Note:** Gemini client automatically fetches images from URLs and encodes them as base64, as required by the Gemini API.
160
+
64
161
  ### Z.ai Models (Zhipu AI)
65
162
 
66
163
  For vision tasks via Z.ai, these GLM models are recommended:
@@ -103,12 +200,12 @@ response = LlmConductor.generate(
103
200
 
104
201
  ### 3. Image with Detail Level
105
202
 
106
- For high-resolution images, specify the detail level:
203
+ For high-resolution images, specify the detail level (supported by GPT and OpenRouter):
107
204
 
108
205
  ```ruby
109
206
  response = LlmConductor.generate(
110
- model: 'openai/gpt-4o-mini',
111
- vendor: :openrouter,
207
+ model: 'gpt-4o',
208
+ vendor: :openai,
112
209
  prompt: {
113
210
  text: 'Analyze this image in detail',
114
211
  images: [
@@ -118,19 +215,22 @@ response = LlmConductor.generate(
118
215
  )
119
216
  ```
120
217
 
121
- Detail levels:
218
+ Detail levels (GPT and OpenRouter only):
122
219
  - `'high'` - Better for detailed analysis (uses more tokens)
123
220
  - `'low'` - Faster, cheaper (default if not specified)
124
221
  - `'auto'` - Let the model decide
125
222
 
223
+ **Note:** Claude (Anthropic), Gemini (Google), and Z.ai don't support the `detail` parameter.
224
+
126
225
  ### 4. Raw Format (Advanced)
127
226
 
128
- For maximum control, use the OpenAI-compatible array format:
227
+ For maximum control, use provider-specific array formats:
129
228
 
229
+ **GPT/OpenRouter Format:**
130
230
  ```ruby
131
231
  response = LlmConductor.generate(
132
- model: 'openai/gpt-4o-mini',
133
- vendor: :openrouter,
232
+ model: 'gpt-4o',
233
+ vendor: :openai,
134
234
  prompt: [
135
235
  { type: 'text', text: 'What is in this image?' },
136
236
  { type: 'image_url', image_url: { url: 'https://example.com/image.jpg' } },
@@ -139,6 +239,30 @@ response = LlmConductor.generate(
139
239
  )
140
240
  ```
141
241
 
242
+ **Claude Format:**
243
+ ```ruby
244
+ response = LlmConductor.generate(
245
+ model: 'claude-sonnet-4-20250514',
246
+ vendor: :anthropic,
247
+ prompt: [
248
+ { type: 'image', source: { type: 'url', url: 'https://example.com/image.jpg' } },
249
+ { type: 'text', text: 'What is in this image? Describe it in detail.' }
250
+ ]
251
+ )
252
+ ```
253
+
254
+ **Gemini Format:**
255
+ ```ruby
256
+ response = LlmConductor.generate(
257
+ model: 'gemini-2.0-flash',
258
+ vendor: :gemini,
259
+ prompt: [
260
+ { type: 'text', text: 'What is in this image? Describe it in detail.' },
261
+ { type: 'image_url', image_url: { url: 'https://example.com/image.jpg' } }
262
+ ]
263
+ )
264
+ ```
265
+
142
266
  ## Text-Only Requests (Backward Compatible)
143
267
 
144
268
  The client still supports regular text-only requests:
@@ -158,6 +282,10 @@ response = LlmConductor.generate(
158
282
  - Maximum file size depends on the model
159
283
  - Use HTTPS URLs when possible
160
284
 
285
+ **Provider-Specific Notes:**
286
+ - **Gemini**: URLs are automatically fetched and base64-encoded by the client before sending to the API
287
+ - **Claude, GPT, OpenRouter, Z.ai**: URLs are sent directly to the API (no preprocessing required)
288
+
161
289
  ## Error Handling
162
290
 
163
291
  ```ruby
@@ -204,12 +332,30 @@ response = LlmConductor.generate(
204
332
 
205
333
  ### Run Examples
206
334
 
335
+ For Claude:
336
+ ```bash
337
+ export ANTHROPIC_API_KEY='your-key'
338
+ ruby examples/claude_vision_usage.rb
339
+ ```
340
+
341
+ For GPT:
342
+ ```bash
343
+ export OPENAI_API_KEY='your-key'
344
+ ruby examples/gpt_vision_usage.rb
345
+ ```
346
+
207
347
  For OpenRouter:
208
348
  ```bash
209
349
  export OPENROUTER_API_KEY='your-key'
210
350
  ruby examples/openrouter_vision_usage.rb
211
351
  ```
212
352
 
353
+ For Gemini:
354
+ ```bash
355
+ export GEMINI_API_KEY='your-key'
356
+ ruby examples/gemini_vision_usage.rb
357
+ ```
358
+
213
359
  For Z.ai:
214
360
  ```bash
215
361
  export ZAI_API_KEY='your-key'
@@ -265,6 +411,9 @@ For production:
265
411
 
266
412
  ## Examples
267
413
 
414
+ - `examples/claude_vision_usage.rb` - Complete Claude vision examples with Claude Sonnet 4
415
+ - `examples/gpt_vision_usage.rb` - Complete GPT vision examples with GPT-4o
416
+ - `examples/gemini_vision_usage.rb` - Complete Gemini vision examples with Gemini 2.0 Flash
268
417
  - `examples/openrouter_vision_usage.rb` - Complete OpenRouter vision examples
269
418
  - `examples/zai_usage.rb` - Complete Z.ai GLM-4.5V examples including vision and text
270
419
 
@@ -273,6 +422,7 @@ For production:
273
422
  - [OpenRouter Documentation](https://openrouter.ai/docs)
274
423
  - [OpenAI Vision API Reference](https://platform.openai.com/docs/guides/vision)
275
424
  - [Anthropic Claude Vision](https://docs.anthropic.com/claude/docs/vision)
425
+ - [Google Gemini API Documentation](https://ai.google.dev/docs)
276
426
  - [Z.ai API Platform](https://api.z.ai/)
277
427
  - [GLM-4.5V Documentation](https://bigmodel.cn/)
278
428
 
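To recap the three prompt shapes the updated guide documents, here is a minimal illustrative sketch. The model, vendor, and image URL are placeholders, and API keys are assumed to be configured as shown earlier in the guide; this is not additional gem code, just a condensed view of the formats above.

```ruby
require 'llm_conductor'

# 1. Plain string prompt (text-only, backward compatible)
LlmConductor.generate(model: 'gpt-4o', vendor: :openai, prompt: 'Describe this gem in one sentence.')

# 2. Hash prompt: the client builds the multimodal content array for you
LlmConductor.generate(
  model: 'gpt-4o',
  vendor: :openai,
  prompt: { text: 'What is in this image?', images: ['https://example.com/image.jpg'] }
)

# 3. Pre-formatted parts array (provider-specific raw format, passed through as-is)
LlmConductor.generate(
  model: 'gpt-4o',
  vendor: :openai,
  prompt: [
    { type: 'text', text: 'What is in this image?' },
    { type: 'image_url', image_url: { url: 'https://example.com/image.jpg' } }
  ]
)
```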
data/examples/claude_vision_usage.rb ADDED
@@ -0,0 +1,138 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require_relative '../lib/llm_conductor'
5
+
6
+ # This example demonstrates using Claude Sonnet 4 vision capabilities
7
+ # Set your Anthropic API key: export ANTHROPIC_API_KEY='your-key-here'
8
+
9
+ puts '=' * 80
10
+ puts 'Claude Sonnet 4 Vision Usage Examples'
11
+ puts '=' * 80
12
+ puts
13
+
14
+ # Check for API key
15
+ api_key = ENV['ANTHROPIC_API_KEY']
16
+ if api_key.nil? || api_key.empty?
17
+ puts 'ERROR: ANTHROPIC_API_KEY environment variable is not set!'
18
+ puts
19
+ puts 'Please set your Anthropic API key:'
20
+ puts ' export ANTHROPIC_API_KEY="your-key-here"'
21
+ puts
22
+ puts 'You can get an API key from: https://console.anthropic.com/'
23
+ exit 1
24
+ end
25
+
26
+ # Configure the client
27
+ LlmConductor.configure do |config|
28
+ config.anthropic(api_key:)
29
+ end
30
+
31
+ # Example 1: Single Image Analysis
32
+ puts "\n1. Single Image Analysis"
33
+ puts '-' * 80
34
+
35
+ begin
36
+ response = LlmConductor.generate(
37
+ model: 'claude-sonnet-4-20250514',
38
+ vendor: :anthropic,
39
+ prompt: {
40
+ text: 'What is in this image? Please describe it in detail.',
41
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
42
+ }
43
+ )
44
+
45
+ puts "Response: #{response.output}"
46
+ puts "Success: #{response.success?}"
47
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
48
+ puts "Metadata: #{response.metadata.inspect}" if response.metadata && !response.metadata.empty?
49
+ rescue StandardError => e
50
+ puts "ERROR: #{e.message}"
51
+ puts "Backtrace: #{e.backtrace.first(5).join("\n")}"
52
+ end
53
+
54
+ # Example 2: Multiple Images Comparison
55
+ puts "\n2. Multiple Images Comparison"
56
+ puts '-' * 80
57
+
58
+ response = LlmConductor.generate(
59
+ model: 'claude-sonnet-4-20250514',
60
+ vendor: :anthropic,
61
+ prompt: {
62
+ text: 'Compare these two images. What are the main differences?',
63
+ images: [
64
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg',
65
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Placeholder_view_vector.svg/1024px-Placeholder_view_vector.svg.png'
66
+ ]
67
+ }
68
+ )
69
+
70
+ puts "Response: #{response.output}"
71
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
72
+
73
+ # Example 3: Image with Specific Question
74
+ puts "\n3. Image with Specific Question"
75
+ puts '-' * 80
76
+
77
+ response = LlmConductor.generate(
78
+ model: 'claude-sonnet-4-20250514',
79
+ vendor: :anthropic,
80
+ prompt: {
81
+ text: 'Is there a wooden boardwalk visible in this image? If yes, describe its condition.',
82
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
83
+ }
84
+ )
85
+
86
+ puts "Response: #{response.output}"
87
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
88
+
89
+ # Example 4: Raw Format (Advanced)
90
+ puts "\n4. Raw Format (Advanced)"
91
+ puts '-' * 80
92
+
93
+ response = LlmConductor.generate(
94
+ model: 'claude-sonnet-4-20250514',
95
+ vendor: :anthropic,
96
+ prompt: [
97
+ { type: 'image',
98
+ source: { type: 'url',
99
+ url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg' } },
100
+ { type: 'text', text: 'Describe the weather conditions in this image.' }
101
+ ]
102
+ )
103
+
104
+ puts "Response: #{response.output}"
105
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
106
+
107
+ # Example 5: Text-Only Request (Backward Compatible)
108
+ puts "\n5. Text-Only Request (Backward Compatible)"
109
+ puts '-' * 80
110
+
111
+ response = LlmConductor.generate(
112
+ model: 'claude-sonnet-4-20250514',
113
+ vendor: :anthropic,
114
+ prompt: 'What is the capital of France?'
115
+ )
116
+
117
+ puts "Response: #{response.output}"
118
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
119
+
120
+ # Example 6: Image Analysis with Detailed Instructions
121
+ puts "\n6. Image Analysis with Detailed Instructions"
122
+ puts '-' * 80
123
+
124
+ response = LlmConductor.generate(
125
+ model: 'claude-sonnet-4-20250514',
126
+ vendor: :anthropic,
127
+ prompt: {
128
+ text: 'Analyze this image and provide: 1) Main subjects, 2) Colors and lighting, 3) Mood or atmosphere, 4) Any notable details',
129
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
130
+ }
131
+ )
132
+
133
+ puts "Response: #{response.output}"
134
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
135
+
136
+ puts "\n#{'=' * 80}"
137
+ puts 'All examples completed successfully!'
138
+ puts '=' * 80
data/examples/gemini_usage.rb CHANGED
@@ -4,7 +4,7 @@ require_relative '../lib/llm_conductor'
4
4
 
5
5
  # Configure Gemini API key
6
6
  LlmConductor.configure do |config|
7
- config.gemini_api_key = ENV['GEMINI_API_KEY'] || 'your_gemini_api_key_here'
7
+ config.gemini(api_key: ENV['GEMINI_API_KEY'] || 'your_gemini_api_key_here')
8
8
  end
9
9
 
10
10
  # Example usage
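The hunk above replaces the old attribute-style setter with the provider-style configuration call. As a minimal sketch of how the same pattern looks across providers, assuming the keyword-argument form shown for the other clients elsewhere in this diff:

```ruby
require 'llm_conductor'

# Illustrative only: provider-style configuration, one keyword-argument call
# per provider, instead of attribute setters like `config.gemini_api_key =`.
LlmConductor.configure do |config|
  config.openai(api_key: ENV['OPENAI_API_KEY'])
  config.anthropic(api_key: ENV['ANTHROPIC_API_KEY'])
  config.gemini(api_key: ENV['GEMINI_API_KEY'])
end
```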
data/examples/gemini_vision_usage.rb ADDED
@@ -0,0 +1,168 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative '../lib/llm_conductor'
4
+
5
+ # Configure Gemini API key
6
+ LlmConductor.configure do |config|
7
+ config.gemini(api_key: ENV['GEMINI_API_KEY'] || 'your_gemini_api_key_here')
8
+ end
9
+
10
+ puts '=' * 80
11
+ puts 'Google Gemini Vision Examples'
12
+ puts '=' * 80
13
+ puts
14
+
15
+ # Example 1: Single image analysis (simple format)
16
+ puts 'Example 1: Single Image Analysis'
17
+ puts '-' * 40
18
+
19
+ response = LlmConductor.generate(
20
+ model: 'gemini-2.0-flash',
21
+ vendor: :gemini,
22
+ prompt: {
23
+ text: 'What is in this image? Describe it in detail.',
24
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
25
+ }
26
+ )
27
+
28
+ puts "Model: #{response.model}"
29
+ puts "Vendor: #{response.metadata[:vendor]}"
30
+ puts "Input tokens: #{response.input_tokens}"
31
+ puts "Output tokens: #{response.output_tokens}"
32
+ puts "\nResponse:"
33
+ puts response.output
34
+ puts
35
+
36
+ # Example 2: Multiple images comparison
37
+ puts '=' * 80
38
+ puts 'Example 2: Multiple Images Comparison'
39
+ puts '-' * 40
40
+
41
+ response = LlmConductor.generate(
42
+ model: 'gemini-2.0-flash',
43
+ vendor: :gemini,
44
+ prompt: {
45
+ text: 'Compare these images. What are the main differences?',
46
+ images: [
47
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg',
48
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Placeholder_view_vector.svg/681px-Placeholder_view_vector.svg.png'
49
+ ]
50
+ }
51
+ )
52
+
53
+ puts "Model: #{response.model}"
54
+ puts "Input tokens: #{response.input_tokens}"
55
+ puts "Output tokens: #{response.output_tokens}"
56
+ puts "\nResponse:"
57
+ puts response.output
58
+ puts
59
+
60
+ # Example 3: Raw format with Gemini-specific structure
61
+ puts '=' * 80
62
+ puts 'Example 3: Raw Format (Gemini-specific)'
63
+ puts '-' * 40
64
+
65
+ response = LlmConductor.generate(
66
+ model: 'gemini-2.0-flash',
67
+ vendor: :gemini,
68
+ prompt: [
69
+ { type: 'text', text: 'Analyze this nature scene:' },
70
+ { type: 'image_url', image_url: { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg' } },
71
+ { type: 'text', text: 'What time of day do you think this photo was taken?' }
72
+ ]
73
+ )
74
+
75
+ puts "Model: #{response.model}"
76
+ puts "Input tokens: #{response.input_tokens}"
77
+ puts "Output tokens: #{response.output_tokens}"
78
+ puts "\nResponse:"
79
+ puts response.output
80
+ puts
81
+
82
+ # Example 4: Image with specific analysis request
83
+ puts '=' * 80
84
+ puts 'Example 4: Specific Analysis Request'
85
+ puts '-' * 40
86
+
87
+ response = LlmConductor.generate(
88
+ model: 'gemini-2.0-flash',
89
+ vendor: :gemini,
90
+ prompt: {
91
+ text: 'Count the number of distinct colors visible in this image and list them.',
92
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
93
+ }
94
+ )
95
+
96
+ puts "Model: #{response.model}"
97
+ puts "\nResponse:"
98
+ puts response.output
99
+ puts
100
+
101
+ # Example 5: Error handling
102
+ puts '=' * 80
103
+ puts 'Example 5: Error Handling'
104
+ puts '-' * 40
105
+
106
+ begin
107
+ response = LlmConductor.generate(
108
+ model: 'gemini-2.0-flash',
109
+ vendor: :gemini,
110
+ prompt: {
111
+ text: 'What is in this image?',
112
+ images: 'https://example.com/nonexistent-image.jpg'
113
+ }
114
+ )
115
+
116
+ if response.success?
117
+ puts 'Success! Response:'
118
+ puts response.output
119
+ else
120
+ puts "Request failed: #{response.metadata[:error]}"
121
+ end
122
+ rescue StandardError => e
123
+ puts "Error occurred: #{e.message}"
124
+ end
125
+ puts
126
+
127
+ # Example 6: Text-only request (backward compatibility)
128
+ puts '=' * 80
129
+ puts 'Example 6: Text-Only Request (No Images)'
130
+ puts '-' * 40
131
+
132
+ response = LlmConductor.generate(
133
+ model: 'gemini-2.0-flash',
134
+ vendor: :gemini,
135
+ prompt: 'Explain how neural networks work in 3 sentences.'
136
+ )
137
+
138
+ puts "Model: #{response.model}"
139
+ puts "Input tokens: #{response.input_tokens}"
140
+ puts "Output tokens: #{response.output_tokens}"
141
+ puts "\nResponse:"
142
+ puts response.output
143
+ puts
144
+
145
+ # Example 7: Image with hash format (URL specified explicitly)
146
+ puts '=' * 80
147
+ puts 'Example 7: Image Hash Format'
148
+ puts '-' * 40
149
+
150
+ response = LlmConductor.generate(
151
+ model: 'gemini-2.0-flash',
152
+ vendor: :gemini,
153
+ prompt: {
154
+ text: 'Describe the mood and atmosphere of this image.',
155
+ images: [
156
+ { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg' }
157
+ ]
158
+ }
159
+ )
160
+
161
+ puts "Model: #{response.model}"
162
+ puts "\nResponse:"
163
+ puts response.output
164
+ puts
165
+
166
+ puts '=' * 80
167
+ puts 'Examples completed!'
168
+ puts '=' * 80
data/examples/gpt_vision_usage.rb ADDED
@@ -0,0 +1,156 @@
1
+ #!/usr/bin/env ruby
2
+ # frozen_string_literal: true
3
+
4
+ require_relative '../lib/llm_conductor'
5
+
6
+ # This example demonstrates using GPT-4o vision capabilities
7
+ # Set your OpenAI API key: export OPENAI_API_KEY='your-key-here'
8
+
9
+ puts '=' * 80
10
+ puts 'GPT-4o Vision Usage Examples'
11
+ puts '=' * 80
12
+ puts
13
+
14
+ # Check for API key
15
+ api_key = ENV['OPENAI_API_KEY']
16
+ if api_key.nil? || api_key.empty?
17
+ puts 'ERROR: OPENAI_API_KEY environment variable is not set!'
18
+ puts
19
+ puts 'Please set your OpenAI API key:'
20
+ puts ' export OPENAI_API_KEY="your-key-here"'
21
+ puts
22
+ puts 'You can get an API key from: https://platform.openai.com/api-keys'
23
+ exit 1
24
+ end
25
+
26
+ # Configure the client
27
+ LlmConductor.configure do |config|
28
+ config.openai(api_key:)
29
+ end
30
+
31
+ # Example 1: Single Image Analysis
32
+ puts "\n1. Single Image Analysis"
33
+ puts '-' * 80
34
+
35
+ response = LlmConductor.generate(
36
+ model: 'gpt-4o',
37
+ vendor: :openai,
38
+ prompt: {
39
+ text: 'What is in this image? Please describe it in detail.',
40
+ images: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg'
41
+ }
42
+ )
43
+
44
+ puts "Response: #{response.output}"
45
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
46
+
47
+ # Example 2: Multiple Images Comparison
48
+ puts "\n2. Multiple Images Comparison"
49
+ puts '-' * 80
50
+
51
+ response = LlmConductor.generate(
52
+ model: 'gpt-4o',
53
+ vendor: :openai,
54
+ prompt: {
55
+ text: 'Compare these two images. What are the main differences?',
56
+ images: [
57
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg',
58
+ 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Placeholder_view_vector.svg/1024px-Placeholder_view_vector.svg.png'
59
+ ]
60
+ }
61
+ )
62
+
63
+ puts "Response: #{response.output}"
64
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
65
+
66
+ # Example 3: Image with Detail Level - High Resolution
67
+ puts "\n3. Image with Detail Level - High Resolution"
68
+ puts '-' * 80
69
+
70
+ response = LlmConductor.generate(
71
+ model: 'gpt-4o',
72
+ vendor: :openai,
73
+ prompt: {
74
+ text: 'Analyze this high-resolution image in detail. What are all the elements you can see?',
75
+ images: [
76
+ { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg', detail: 'high' }
77
+ ]
78
+ }
79
+ )
80
+
81
+ puts "Response: #{response.output}"
82
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
83
+
84
+ # Example 4: Image with Detail Level - Low (Faster, Cheaper)
85
+ puts "\n4. Image with Detail Level - Low (Faster, Cheaper)"
86
+ puts '-' * 80
87
+
88
+ response = LlmConductor.generate(
89
+ model: 'gpt-4o',
90
+ vendor: :openai,
91
+ prompt: {
92
+ text: 'Give me a quick description of this image.',
93
+ images: [
94
+ { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg', detail: 'low' }
95
+ ]
96
+ }
97
+ )
98
+
99
+ puts "Response: #{response.output}"
100
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
101
+
102
+ # Example 5: Raw Format (Advanced)
103
+ puts "\n5. Raw Format (Advanced)"
104
+ puts '-' * 80
105
+
106
+ response = LlmConductor.generate(
107
+ model: 'gpt-4o',
108
+ vendor: :openai,
109
+ prompt: [
110
+ { type: 'text', text: 'What is in this image?' },
111
+ { type: 'image_url',
112
+ image_url: { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg' } },
113
+ { type: 'text', text: 'Describe the weather conditions.' }
114
+ ]
115
+ )
116
+
117
+ puts "Response: #{response.output}"
118
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
119
+
120
+ # Example 6: Text-Only Request (Backward Compatible)
121
+ puts "\n6. Text-Only Request (Backward Compatible)"
122
+ puts '-' * 80
123
+
124
+ response = LlmConductor.generate(
125
+ model: 'gpt-4o',
126
+ vendor: :openai,
127
+ prompt: 'What is the capital of France?'
128
+ )
129
+
130
+ puts "Response: #{response.output}"
131
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
132
+
133
+ # Example 7: Multiple Images with Mixed Detail Levels
134
+ puts "\n7. Multiple Images with Mixed Detail Levels"
135
+ puts '-' * 80
136
+
137
+ response = LlmConductor.generate(
138
+ model: 'gpt-4o',
139
+ vendor: :openai,
140
+ prompt: {
141
+ text: 'Compare these images at different detail levels.',
142
+ images: [
143
+ {
144
+ url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/1024px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg', detail: 'high'
145
+ },
146
+ { url: 'https://upload.wikimedia.org/wikipedia/commons/thumb/3/3f/Placeholder_view_vector.svg/1024px-Placeholder_view_vector.svg.png', detail: 'low' }
147
+ ]
148
+ }
149
+ )
150
+
151
+ puts "Response: #{response.output}"
152
+ puts "Tokens: #{response.input_tokens} input, #{response.output_tokens} output"
153
+
154
+ puts "\n#{'=' * 80}"
155
+ puts 'All examples completed successfully!'
156
+ puts '=' * 80
data/lib/llm_conductor/clients/anthropic_client.rb CHANGED
@@ -1,18 +1,23 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require 'anthropic'
4
+ require_relative 'concerns/vision_support'
4
5
 
5
6
  module LlmConductor
6
7
  module Clients
7
8
  # Anthropic Claude client implementation for accessing Claude models via Anthropic API
9
+ # Supports both text-only and multimodal (vision) requests
8
10
  class AnthropicClient < BaseClient
11
+ include Concerns::VisionSupport
12
+
9
13
  private
10
14
 
11
15
  def generate_content(prompt)
16
+ content = format_content(prompt)
12
17
  response = client.messages.create(
13
18
  model:,
14
19
  max_tokens: 4096,
15
- messages: [{ role: 'user', content: prompt }]
20
+ messages: [{ role: 'user', content: }]
16
21
  )
17
22
 
18
23
  response.content.first.text
@@ -20,6 +25,28 @@ module LlmConductor
20
25
  raise StandardError, "Anthropic API error: #{e.message}"
21
26
  end
22
27
 
28
+ # Anthropic uses a different image format than OpenAI
29
+ # Format: { type: 'image', source: { type: 'url', url: '...' } }
30
+ def format_image_url(url)
31
+ { type: 'image', source: { type: 'url', url: } }
32
+ end
33
+
34
+ def format_image_hash(image_hash)
35
+ # Anthropic doesn't have a 'detail' parameter like OpenAI
36
+ {
37
+ type: 'image',
38
+ source: {
39
+ type: 'url',
40
+ url: image_hash[:url] || image_hash['url']
41
+ }
42
+ }
43
+ end
44
+
45
+ # Anthropic recommends placing images before text
46
+ def images_before_text?
47
+ true
48
+ end
49
+
23
50
  def client
24
51
  @client ||= begin
25
52
  config = LlmConductor.configuration.provider_config(:anthropic)
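A hypothetical illustration of the effect of the overrides above (not code from the gem): with `images_before_text?` returning true and the Anthropic-specific image format, a hash prompt should expand into a content array with the image block ahead of the text block.

```ruby
# Input hash prompt (placeholder URL):
prompt = { text: 'What is in this image?', images: ['https://example.com/image.jpg'] }

# Expected shape after format_content, given the overrides shown above:
expected_content = [
  { type: 'image', source: { type: 'url', url: 'https://example.com/image.jpg' } },
  { type: 'text', text: 'What is in this image?' }
]
```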
data/lib/llm_conductor/clients/concerns/vision_support.rb ADDED
@@ -0,0 +1,159 @@
1
+ # frozen_string_literal: true
2
+
3
+ module LlmConductor
4
+ module Clients
5
+ module Concerns
6
+ # Shared module for vision/multimodal support across different LLM clients
7
+ # Provides common functionality for formatting images and text content
8
+ module VisionSupport
9
+ private
10
+
11
+ # Override token calculation to handle multimodal content
12
+ def calculate_tokens(content)
13
+ case content
14
+ when String then super(content)
15
+ when Hash then calculate_tokens_from_hash(content)
16
+ when Array then calculate_tokens_from_array(content)
17
+ else super(content.to_s)
18
+ end
19
+ end
20
+
21
+ # Calculate tokens from a hash containing text and/or images
22
+ # @param content_hash [Hash] Hash with :text and/or :images keys
23
+ # @return [Integer] Token count for text portion
24
+ def calculate_tokens_from_hash(content_hash)
25
+ text = content_hash[:text] || content_hash['text'] || ''
26
+ # Call the parent class's calculate_tokens with the extracted text
27
+ method(:calculate_tokens).super_method.call(text)
28
+ end
29
+
30
+ # Calculate tokens from an array of content parts
31
+ # @param content_array [Array] Array of content parts with type and text
32
+ # @return [Integer] Token count for all text parts
33
+ def calculate_tokens_from_array(content_array)
34
+ text_parts = extract_text_from_array(content_array)
35
+ # Call the parent class's calculate_tokens with the joined text
36
+ method(:calculate_tokens).super_method.call(text_parts)
37
+ end
38
+
39
+ # Extract and join text from array of content parts
40
+ # @param content_array [Array] Array of content parts
41
+ # @return [String] Joined text from all text parts
42
+ def extract_text_from_array(content_array)
43
+ content_array
44
+ .select { |part| text_part?(part) }
45
+ .map { |part| extract_text_from_part(part) }
46
+ .join(' ')
47
+ end
48
+
49
+ # Check if a content part is a text part
50
+ # @param part [Hash] Content part
51
+ # @return [Boolean] true if part is a text type
52
+ def text_part?(part)
53
+ part[:type] == 'text' || part['type'] == 'text'
54
+ end
55
+
56
+ # Extract text from a content part
57
+ # @param part [Hash] Content part with text
58
+ # @return [String] Text content
59
+ def extract_text_from_part(part)
60
+ part[:text] || part['text'] || ''
61
+ end
62
+
63
+ # Format content based on whether it's a simple string or multimodal content
64
+ # @param prompt [String, Hash, Array] The prompt content
65
+ # @return [String, Array] Formatted content for the API
66
+ def format_content(prompt)
67
+ case prompt
68
+ when Hash
69
+ # Handle hash with text and/or images
70
+ format_multimodal_hash(prompt)
71
+ when Array
72
+ # Already formatted as array of content parts
73
+ prompt
74
+ else
75
+ # Simple string prompt
76
+ prompt.to_s
77
+ end
78
+ end
79
+
80
+ # Format a hash containing text and/or images into multimodal content array
81
+ # @param prompt_hash [Hash] Hash with :text and/or :images keys
82
+ # @return [Array] Array of content parts for the API
83
+ def format_multimodal_hash(prompt_hash)
84
+ content_parts = []
85
+
86
+ # Add image parts (order depends on provider)
87
+ images = prompt_hash[:images] || prompt_hash['images'] || []
88
+ images = [images] unless images.is_a?(Array)
89
+
90
+ if images_before_text?
91
+ # Anthropic recommends images before text
92
+ images.each { |image| content_parts << format_image_part(image) }
93
+ add_text_part(content_parts, prompt_hash)
94
+ else
95
+ # OpenAI/most others: text before images
96
+ add_text_part(content_parts, prompt_hash)
97
+ images.each { |image| content_parts << format_image_part(image) }
98
+ end
99
+
100
+ content_parts
101
+ end
102
+
103
+ # Add text part to content array if present
104
+ # @param content_parts [Array] The content parts array
105
+ # @param prompt_hash [Hash] Hash with :text key
106
+ def add_text_part(content_parts, prompt_hash)
107
+ return unless prompt_hash[:text] || prompt_hash['text']
108
+
109
+ text = prompt_hash[:text] || prompt_hash['text']
110
+ content_parts << { type: 'text', text: }
111
+ end
112
+
113
+ # Format an image into the appropriate API structure
114
+ # This method should be overridden by clients that need different formats
115
+ # @param image [String, Hash] Image URL or hash with url/detail keys
116
+ # @return [Hash] Formatted image part for the API
117
+ def format_image_part(image)
118
+ case image
119
+ when String
120
+ format_image_url(image)
121
+ when Hash
122
+ format_image_hash(image)
123
+ end
124
+ end
125
+
126
+ # Format a simple image URL string
127
+ # Override this in subclasses for provider-specific format
128
+ # @param url [String] Image URL
129
+ # @return [Hash] Formatted image part
130
+ def format_image_url(url)
131
+ # Default: OpenAI format
132
+ { type: 'image_url', image_url: { url: } }
133
+ end
134
+
135
+ # Format an image hash with url and optional detail
136
+ # Override this in subclasses for provider-specific format
137
+ # @param image_hash [Hash] Hash with url and optional detail keys
138
+ # @return [Hash] Formatted image part
139
+ def format_image_hash(image_hash)
140
+ # Default: OpenAI format with detail support
141
+ {
142
+ type: 'image_url',
143
+ image_url: {
144
+ url: image_hash[:url] || image_hash['url'],
145
+ detail: image_hash[:detail] || image_hash['detail']
146
+ }.compact
147
+ }
148
+ end
149
+
150
+ # Whether to place images before text in the content array
151
+ # Override this in subclasses if needed (e.g., Anthropic recommends images first)
152
+ # @return [Boolean] true if images should come before text
153
+ def images_before_text?
154
+ false
155
+ end
156
+ end
157
+ end
158
+ end
159
+ end
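The new concern centralizes multimodal formatting and exposes two override points (`format_image_url`/`format_image_hash` and `images_before_text?`). The sketch below is hypothetical (`ExampleClient` does not exist in the gem) and only illustrates how a client wires the module in, mirroring the real clients in this diff.

```ruby
require 'llm_conductor'

module LlmConductor
  module Clients
    # Hypothetical client for illustration: inherits the OpenAI-style image
    # defaults from VisionSupport and shows the hooks a provider can override.
    class ExampleClient < BaseClient
      include Concerns::VisionSupport

      private

      def generate_content(prompt)
        content = format_content(prompt) # String for a plain prompt, Array of parts otherwise
        # A real client would send `content` to its provider here and return the reply text.
        content.inspect
      end

      # Optional override: change how a bare URL string is wrapped
      def format_image_url(url)
        { type: 'image_url', image_url: { url: } }
      end

      # Optional override: return true to emit image parts before text parts
      def images_before_text?
        false
      end
    end
  end
end
```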
data/lib/llm_conductor/clients/gemini_client.rb CHANGED
@@ -1,17 +1,27 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require 'gemini-ai'
4
+ require 'base64'
5
+ require 'net/http'
6
+ require 'uri'
7
+ require_relative 'concerns/vision_support'
4
8
 
5
9
  module LlmConductor
6
10
  module Clients
7
11
  # Google Gemini client implementation for accessing Gemini models via Google AI API
12
+ # Supports both text-only and multimodal (vision) requests
8
13
  class GeminiClient < BaseClient
14
+ include Concerns::VisionSupport
15
+
9
16
  private
10
17
 
11
18
  def generate_content(prompt)
19
+ content = format_content(prompt)
20
+ parts = build_parts_for_gemini(content)
21
+
12
22
  payload = {
13
23
  contents: [
14
- { parts: [{ text: prompt }] }
24
+ { parts: }
15
25
  ]
16
26
  }
17
27
 
@@ -19,6 +29,100 @@ module LlmConductor
19
29
  response.dig('candidates', 0, 'content', 'parts', 0, 'text')
20
30
  end
21
31
 
32
+ # Build parts array for Gemini API from formatted content
33
+ # Converts VisionSupport format to Gemini's specific format
34
+ # @param content [String, Array] Formatted content from VisionSupport
35
+ # @return [Array] Array of parts in Gemini format
36
+ def build_parts_for_gemini(content)
37
+ case content
38
+ when String
39
+ [{ text: content }]
40
+ when Array
41
+ content.map { |part| convert_to_gemini_part(part) }
42
+ else
43
+ [{ text: content.to_s }]
44
+ end
45
+ end
46
+
47
+ # Convert a VisionSupport formatted part to Gemini format
48
+ # @param part [Hash] Content part with type and data
49
+ # @return [Hash] Gemini-formatted part
50
+ def convert_to_gemini_part(part)
51
+ case part[:type]
52
+ when 'text'
53
+ { text: part[:text] }
54
+ when 'image_url'
55
+ convert_image_url_to_inline_data(part)
56
+ when 'inline_data'
57
+ part # Already in Gemini format
58
+ else
59
+ part
60
+ end
61
+ end
62
+
63
+ # Convert image_url part to Gemini's inline_data format
64
+ # @param part [Hash] Part with image_url
65
+ # @return [Hash] Gemini inline_data format
66
+ def convert_image_url_to_inline_data(part)
67
+ url = part.dig(:image_url, :url)
68
+ {
69
+ inline_data: {
70
+ mime_type: detect_mime_type(url),
71
+ data: fetch_and_encode_image(url)
72
+ }
73
+ }
74
+ end
75
+
76
+ # Fetch image from URL and encode as base64
77
+ # Gemini API requires images to be base64-encoded
78
+ # @param url [String] Image URL
79
+ # @return [String] Base64-encoded image data
80
+ def fetch_and_encode_image(url)
81
+ uri = URI.parse(url)
82
+ response = fetch_image_from_uri(uri)
83
+
84
+ raise StandardError, "HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
85
+
86
+ Base64.strict_encode64(response.body)
87
+ rescue StandardError => e
88
+ raise StandardError, "Error fetching image from #{url}: #{e.message}"
89
+ end
90
+
91
+ # Fetch image from URI using Net::HTTP
92
+ # @param uri [URI] Parsed URI
93
+ # @return [Net::HTTPResponse] HTTP response
94
+ def fetch_image_from_uri(uri)
95
+ http = create_http_client(uri)
96
+ request = Net::HTTP::Get.new(uri.request_uri)
97
+ http.request(request)
98
+ end
99
+
100
+ # Create HTTP client with SSL configuration
101
+ # @param uri [URI] Parsed URI
102
+ # @return [Net::HTTP] Configured HTTP client
103
+ def create_http_client(uri)
104
+ http = Net::HTTP.new(uri.host, uri.port)
105
+ return http unless uri.scheme == 'https'
106
+
107
+ http.use_ssl = true
108
+ http.verify_mode = OpenSSL::SSL::VERIFY_NONE
109
+ http
110
+ end
111
+
112
+ # Detect MIME type from URL file extension
113
+ # @param url [String] Image URL
114
+ # @return [String] MIME type (e.g., 'image/jpeg', 'image/png')
115
+ def detect_mime_type(url)
116
+ extension = File.extname(URI.parse(url).path).downcase
117
+ case extension
118
+ when '.jpg', '.jpeg' then 'image/jpeg'
119
+ when '.png' then 'image/png'
120
+ when '.gif' then 'image/gif'
121
+ when '.webp' then 'image/webp'
122
+ else 'image/jpeg' # Default to jpeg
123
+ end
124
+ end
125
+
22
126
  def client
23
127
  @client ||= begin
24
128
  config = LlmConductor.configuration.provider_config(:gemini)
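For illustration only (not code from the gem): the request shape the Gemini client builds for a text-plus-image prompt, given `build_parts_for_gemini` and `convert_image_url_to_inline_data` above. The image is fetched over HTTP and base64-encoded before sending, since the client targets Gemini's inline-data format.

```ruby
# Placeholder values shown in angle brackets.
payload = {
  contents: [
    {
      parts: [
        { text: 'What is in this image?' },
        {
          inline_data: {
            mime_type: 'image/jpeg',             # from detect_mime_type(url)
            data: '<base64-encoded image bytes>' # from fetch_and_encode_image(url)
          }
        }
      ]
    }
  ]
}
```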
data/lib/llm_conductor/clients/gpt_client.rb CHANGED
@@ -1,13 +1,19 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require_relative 'concerns/vision_support'
4
+
3
5
  module LlmConductor
4
6
  module Clients
5
7
  # OpenAI GPT client implementation for accessing GPT models via OpenAI API
8
+ # Supports both text-only and multimodal (vision) requests
6
9
  class GptClient < BaseClient
10
+ include Concerns::VisionSupport
11
+
7
12
  private
8
13
 
9
14
  def generate_content(prompt)
10
- client.chat(parameters: { model:, messages: [{ role: 'user', content: prompt }] })
15
+ content = format_content(prompt)
16
+ client.chat(parameters: { model:, messages: [{ role: 'user', content: }] })
11
17
  .dig('choices', 0, 'message', 'content')
12
18
  end
13
19
 
data/lib/llm_conductor/clients/openrouter_client.rb CHANGED
@@ -1,32 +1,15 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require_relative 'concerns/vision_support'
4
+
3
5
  module LlmConductor
4
6
  module Clients
5
7
  # OpenRouter client implementation for accessing various LLM providers through OpenRouter API
6
8
  # Supports both text-only and multimodal (vision) requests
7
9
  class OpenrouterClient < BaseClient
8
- private
10
+ include Concerns::VisionSupport
9
11
 
10
- # Override token calculation to handle multimodal content
11
- def calculate_tokens(content)
12
- case content
13
- when String
14
- super(content)
15
- when Hash
16
- # For multimodal content, count tokens only for text part
17
- # Note: This is an approximation as images have variable token counts
18
- text = content[:text] || content['text'] || ''
19
- super(text)
20
- when Array
21
- # For pre-formatted arrays, extract and count text parts
22
- text_parts = content.select { |part| part[:type] == 'text' || part['type'] == 'text' }
23
- .map { |part| part[:text] || part['text'] || '' }
24
- .join(' ')
25
- super(text_parts)
26
- else
27
- super(content.to_s)
28
- end
29
- end
12
+ private
30
13
 
31
14
  def generate_content(prompt)
32
15
  content = format_content(prompt)
@@ -61,66 +44,6 @@ module LlmConductor
61
44
  end
62
45
  end
63
46
 
64
- # Format content based on whether it's a simple string or multimodal content
65
- # @param prompt [String, Hash, Array] The prompt content
66
- # @return [String, Array] Formatted content for the API
67
- def format_content(prompt)
68
- case prompt
69
- when Hash
70
- # Handle hash with text and/or images
71
- format_multimodal_hash(prompt)
72
- when Array
73
- # Already formatted as array of content parts
74
- prompt
75
- else
76
- # Simple string prompt
77
- prompt.to_s
78
- end
79
- end
80
-
81
- # Format a hash containing text and/or images into multimodal content array
82
- # @param prompt_hash [Hash] Hash with :text and/or :images keys
83
- # @return [Array] Array of content parts for the API
84
- def format_multimodal_hash(prompt_hash)
85
- content_parts = []
86
-
87
- # Add text part if present
88
- if prompt_hash[:text] || prompt_hash['text']
89
- text = prompt_hash[:text] || prompt_hash['text']
90
- content_parts << { type: 'text', text: }
91
- end
92
-
93
- # Add image parts if present
94
- images = prompt_hash[:images] || prompt_hash['images'] || []
95
- images = [images] unless images.is_a?(Array)
96
-
97
- images.each do |image|
98
- content_parts << format_image_part(image)
99
- end
100
-
101
- content_parts
102
- end
103
-
104
- # Format an image into the appropriate API structure
105
- # @param image [String, Hash] Image URL or hash with url/detail keys
106
- # @return [Hash] Formatted image part for the API
107
- def format_image_part(image)
108
- case image
109
- when String
110
- # Simple URL string
111
- { type: 'image_url', image_url: { url: image } }
112
- when Hash
113
- # Hash with url and optional detail level
114
- {
115
- type: 'image_url',
116
- image_url: {
117
- url: image[:url] || image['url'],
118
- detail: image[:detail] || image['detail']
119
- }.compact
120
- }
121
- end
122
- end
123
-
124
47
  def client
125
48
  @client ||= begin
126
49
  config = LlmConductor.configuration.provider_config(:openrouter)
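A small illustrative note on this refactor: the formatting helpers removed here match the defaults in `VisionSupport`, so OpenRouter image parts should come out unchanged. Hypothetical example values (placeholder URL):

```ruby
# A bare URL string still becomes:
{ type: 'image_url', image_url: { url: 'https://example.com/image.jpg' } }

# A hash with an optional detail level still becomes (nil keys dropped by .compact):
{ type: 'image_url', image_url: { url: 'https://example.com/image.jpg', detail: 'high' } }
```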
data/lib/llm_conductor/clients/zai_client.rb CHANGED
@@ -1,5 +1,7 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require_relative 'concerns/vision_support'
4
+
3
5
  module LlmConductor
4
6
  module Clients
5
7
  # Z.ai client implementation for accessing GLM models including GLM-4.5V
@@ -8,28 +10,9 @@ module LlmConductor
8
10
  # Note: Z.ai uses OpenAI-compatible API format but with /v4/ path instead of /v1/
9
11
  # We use Faraday directly instead of the ruby-openai gem to properly handle the API path
10
12
  class ZaiClient < BaseClient
11
- private
13
+ include Concerns::VisionSupport
12
14
 
13
- # Override token calculation to handle multimodal content
14
- def calculate_tokens(content)
15
- case content
16
- when String
17
- super(content)
18
- when Hash
19
- # For multimodal content, count tokens only for text part
20
- # Note: This is an approximation as images have variable token counts
21
- text = content[:text] || content['text'] || ''
22
- super(text)
23
- when Array
24
- # For pre-formatted arrays, extract and count text parts
25
- text_parts = content.select { |part| part[:type] == 'text' || part['type'] == 'text' }
26
- .map { |part| part[:text] || part['text'] || '' }
27
- .join(' ')
28
- super(text_parts)
29
- else
30
- super(content.to_s)
31
- end
32
- end
15
+ private
33
16
 
34
17
  def generate_content(prompt)
35
18
  content = format_content(prompt)
@@ -67,66 +50,6 @@ module LlmConductor
67
50
  end
68
51
  end
69
52
 
70
- # Format content based on whether it's a simple string or multimodal content
71
- # @param prompt [String, Hash, Array] The prompt content
72
- # @return [String, Array] Formatted content for the API
73
- def format_content(prompt)
74
- case prompt
75
- when Hash
76
- # Handle hash with text and/or images
77
- format_multimodal_hash(prompt)
78
- when Array
79
- # Already formatted as array of content parts
80
- prompt
81
- else
82
- # Simple string prompt
83
- prompt.to_s
84
- end
85
- end
86
-
87
- # Format a hash containing text and/or images into multimodal content array
88
- # @param prompt_hash [Hash] Hash with :text and/or :images keys
89
- # @return [Array] Array of content parts for the API
90
- def format_multimodal_hash(prompt_hash)
91
- content_parts = []
92
-
93
- # Add text part if present
94
- if prompt_hash[:text] || prompt_hash['text']
95
- text = prompt_hash[:text] || prompt_hash['text']
96
- content_parts << { type: 'text', text: }
97
- end
98
-
99
- # Add image parts if present
100
- images = prompt_hash[:images] || prompt_hash['images'] || []
101
- images = [images] unless images.is_a?(Array)
102
-
103
- images.each do |image|
104
- content_parts << format_image_part(image)
105
- end
106
-
107
- content_parts
108
- end
109
-
110
- # Format an image into the appropriate API structure
111
- # @param image [String, Hash] Image URL or hash with url/detail keys
112
- # @return [Hash] Formatted image part for the API
113
- def format_image_part(image)
114
- case image
115
- when String
116
- # Simple URL string or base64 data
117
- { type: 'image_url', image_url: { url: image } }
118
- when Hash
119
- # Hash with url and optional detail level
120
- {
121
- type: 'image_url',
122
- image_url: {
123
- url: image[:url] || image['url'],
124
- detail: image[:detail] || image['detail']
125
- }.compact
126
- }
127
- end
128
- end
129
-
130
53
  # HTTP client for making requests to Z.ai API
131
54
  # Z.ai uses /v4/ in their path, not /v1/ like OpenAI, so we use Faraday directly
132
55
  def http_client
data/lib/llm_conductor/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module LlmConductor
4
- VERSION = '1.2.0'
4
+ VERSION = '1.4.0'
5
5
  end
metadata CHANGED
@@ -1,13 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: llm_conductor
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.2.0
4
+ version: 1.4.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ben Zheng
8
8
  bindir: exe
9
9
  cert_chain: []
10
- date: 2025-10-29 00:00:00.000000000 Z
10
+ date: 2025-11-13 00:00:00.000000000 Z
11
11
  dependencies:
12
12
  - !ruby/object:Gem::Dependency
13
13
  name: activesupport
@@ -154,8 +154,11 @@ files:
154
154
  - Rakefile
155
155
  - VISION_USAGE.md
156
156
  - config/initializers/llm_conductor.rb
157
+ - examples/claude_vision_usage.rb
157
158
  - examples/data_builder_usage.rb
158
159
  - examples/gemini_usage.rb
160
+ - examples/gemini_vision_usage.rb
161
+ - examples/gpt_vision_usage.rb
159
162
  - examples/groq_usage.rb
160
163
  - examples/openrouter_vision_usage.rb
161
164
  - examples/prompt_registration.rb
@@ -166,6 +169,7 @@ files:
166
169
  - lib/llm_conductor/client_factory.rb
167
170
  - lib/llm_conductor/clients/anthropic_client.rb
168
171
  - lib/llm_conductor/clients/base_client.rb
172
+ - lib/llm_conductor/clients/concerns/vision_support.rb
169
173
  - lib/llm_conductor/clients/gemini_client.rb
170
174
  - lib/llm_conductor/clients/gpt_client.rb
171
175
  - lib/llm_conductor/clients/groq_client.rb