RubyGems - universal_document_processor - Versions diffs - 1.0.5 → 1.1.1 - Mend

universal_document_processor 1.0.5 → 1.1.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (19)

checksums.yaml +4 -4
data/CHANGELOG.md +13 -0
data/README.md +237 -2
data/lib/universal_document_processor/ai_agent.rb +48 -49
data/lib/universal_document_processor/document.rb +130 -13
data/lib/universal_document_processor/processors/archive_processor.rb +26 -0
data/lib/universal_document_processor/processors/base_processor.rb +17 -0
data/lib/universal_document_processor/processors/excel_processor.rb +30 -0
data/lib/universal_document_processor/processors/pdf_processor.rb +21 -1
data/lib/universal_document_processor/processors/text_processor.rb +21 -0
data/lib/universal_document_processor/processors/word_processor.rb +30 -0
data/lib/universal_document_processor/version.rb +1 -1
data/lib/universal_document_processor.rb +10 -0
metadata +1 -6
data/debug_test.rb +0 -35
data/test_ai_dependency.rb +0 -80
data/test_core_functionality.rb +0 -280
data/test_performance_memory.rb +0 -271
data/test_published_gem.rb +0 -349

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4b4c918d869d7ecc4420b740c032d07eb9d5344fc5049f2522c2de92ac5ced17
-  data.tar.gz: acc85eb5cf922ce1e29384fc5624e1095df40a444bc5ee39fff23ce875f8b5a4
+  metadata.gz: '0612949a026d62fd8fd9c9c1372cfa70cdeb8bdd1677475be639cf35cd684f4c'
+  data.tar.gz: 82780d2c062034be663b3d21275e9d27addc1e44f5705de7dc6b23e70293216e
 SHA512:
-  metadata.gz: 9a072e0dda668c534edbcc118591807fe55d8acca8257c2d339d709ca5892f3b6b9eca53a4467763f87977c30016546f6c0fbcb2c81c61c96fd2d9c427905c0f
-  data.tar.gz: c5567a97e9630cd89822afaa151ac4aff39ca6195be4fefe7b67bf72686f29d665c8727bd86d29897c8e7c587c85da4f8abd8495c4c9a4ef48d2b8c22537fd33
+  metadata.gz: 07a4fe1b792226dae8135e6620f455640d0ba137777b238052916d06a0b4f32b113414886479e0ad48b76ead57d9b7d6a577a76748dff926a9575ad124dc7ee5
+  data.tar.gz: fd3b8fb692f87755a657eb1270631c19ca99bcb5fcbb224e4ebcc22c1b19af1eb2aa3b2aed6ff3f94f84a6fdba8d5e255b28575d5fe1b074244e4bc160820f33
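
Reviewer note: the digests above can be checked locally against the unpacked gem members. The sketch below assumes the checksums file has the `SHA256:` / `SHA512:` layout shown in this hunk; `sha256_matches?` is a hypothetical helper name and is not part of the gem or of RubyGems.

```ruby
require 'digest'
require 'yaml'

# Hypothetical helper: true when the SHA256 of the file at `path` equals the
# digest recorded for `entry` under the SHA256 key of a checksums.yaml string.
def sha256_matches?(checksums_yaml, entry, path)
  expected = YAML.safe_load(checksums_yaml).dig('SHA256', entry)
  return false unless expected

  # to_s guards against digests that YAML would otherwise read as numbers,
  # which is presumably why some values in the file are quoted.
  Digest::SHA256.file(path).hexdigest == expected.to_s
end
```

A call such as `sha256_matches?(File.read('checksums.yaml'), 'data.tar.gz', 'data.tar.gz')` would then compare the extracted archive member with the published digest.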

data/CHANGELOG.md CHANGED

@@ -7,6 +7,19 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
+## [1.1.0] - 2025-01-XX
+### Added
+- **ZIP File Creation**: New functionality to create ZIP archives programmatically
+  - `ArchiveProcessor.create_zip()` class method for creating ZIP files
+  - Support for creating archives from individual files or entire directories
+  - Recursive directory archiving with proper path structure preservation
+  - Comprehensive test coverage with error handling
+  - Integration with existing archive processing capabilities
+### Enhanced
+- **ArchiveProcessor**: Extended with ZIP creation capabilities alongside existing extraction features
+- **Archive Support**: Now supports both reading/extracting and creating ZIP archives
 ## [1.2.0] - 2024-01-15
 ### Added
 - **TSV (Tab-Separated Values) File Support**: Complete built-in TSV processing capabilities
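
Reviewer note: the changelog entry mentions recursive directory archiving with path structure preservation, but the `create_zip` implementation is not shown in this hunk. As a rough illustration only, the entry names for such an archive could be derived with the standard library as below; `archive_entries` is a hypothetical name, and this is not necessarily how the gem does it.

```ruby
# Hypothetical helper: lists every regular file under `dir` as a path
# relative to `dir`, which is the name it would carry inside an archive.
# Note that this glob pattern skips dotfiles by default.
def archive_entries(dir)
  Dir.glob('**/*', base: dir)
     .select { |rel| File.file?(File.join(dir, rel)) }
     .sort
end
```

Keeping the names relative to the archived directory is what lets a subfolder such as `subfolder/file3.txt` reappear under the same path on extraction.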

data/README.md CHANGED

@@ -29,6 +29,7 @@ A comprehensive Ruby gem that provides unified document processing capabilities
 - **Table Detection**: Structured data extraction
 - **Character Validation**: Invalid character detection and cleaning
 - **Multi-language Support**: Full Unicode support including Japanese (日本語)
+- **Archive Creation**: Create ZIP files from individual files or directories
 ### **Character & Encoding Support**
 - **Smart encoding detection** (UTF-8, Shift_JIS, EUC-JP, ISO-8859-1)
@@ -110,7 +111,7 @@ result = UniversalDocumentProcessor.process('document.pdf')
   },
   images: [...],
   tables: [...],
-  processed_at: 2024-01-15 10:30:00 UTC
+  processed_at: 2025-07-06 10:30:00 UTC
 }
 ```
@@ -218,6 +219,33 @@ puts "Tables found: #{result[:tables].length}"
 full_text = result[:text_content]
 ```
+### Creating PDF Documents
+```ruby
+# Install Prawn for PDF creation (optional dependency)
+# gem install prawn
+# Create PDF from any supported document format
+pdf_path = UniversalDocumentProcessor.create_pdf('document.docx')
+puts "PDF created at: #{pdf_path}"
+# Or use the convert method
+pdf_path = UniversalDocumentProcessor.convert('spreadsheet.xlsx', :pdf)
+# Check if PDF creation is available
+if UniversalDocumentProcessor.pdf_creation_available?
+  puts "PDF creation is available!"
+else
+  puts "Install 'prawn' gem to enable PDF creation: gem install prawn"
+end
+# The created PDF includes:
+# - Document title and metadata
+# - Full text content with formatting
+# - Tables (if present in original document)
+# - File information and statistics
+```
 ### Processing Excel Spreadsheets
 ```ruby
@@ -412,6 +440,89 @@ summary = japanese_doc.ai_summarize(length: :medium)
 ```ruby
 # Custom AI agent configuration
+## ⚙️ Agentic AI Configuration & Usage
+To enable and use the AI-powered features (agentic AI) in your application, follow these steps:
+### 1. Install AI Dependency
+You need the `ruby-openai` gem for AI features:
+```bash
+gem install ruby-openai
+```
+Or add to your Gemfile:
+```ruby
+gem 'ruby-openai'
+```
+Then run:
+```bash
+bundle install
+```
+### 2. Set Your OpenAI API Key
+You must provide your OpenAI API key for agentic AI features to work. You can do this in two ways:
+#### a) Environment Variable (Recommended)
+Set the API key in your environment (e.g., in `.env`, `application.yml`, or your deployment environment):
+```ruby
+ENV['OPENAI_API_KEY'] = 'your-api-key-here'
+```
+#### b) Pass Directly When Creating the Agent
+```ruby
+agent = UniversalDocumentProcessor.create_ai_agent(api_key: 'your-api-key-here')
+```
+### 3. Rails: Where to Configure
+If you are using Rails, add your configuration to:
+`config/initializers/universal_document_processor.rb`
+Example initializer:
+```ruby
+# config/initializers/universal_document_processor.rb
+require 'universal_document_processor'
+# Set your API key (or use ENV)
+ENV['OPENAI_API_KEY'] ||= 'your-api-key-here' # (or use Rails credentials)
+# Optionally, create a default agent with custom options
+UniversalDocumentProcessor.create_ai_agent(
+  model: 'gpt-4',
+  temperature: 0.7,
+  max_history: 10
+)
+Rails.logger.info "Universal Document Processor with AI agent loaded" if defined?(Rails)
+```
+### 4. Using Agentic AI Features
+You can now use the AI-powered methods:
+```ruby
+summary = UniversalDocumentProcessor.ai_summarize('document.pdf', length: :short)
+insights = UniversalDocumentProcessor.ai_insights('document.pdf')
+classification = UniversalDocumentProcessor.ai_classify('document.pdf')
+key_info = UniversalDocumentProcessor.ai_extract_info('document.pdf', ['dates', 'names', 'amounts'])
+action_items = UniversalDocumentProcessor.ai_action_items('document.pdf')
+translation = UniversalDocumentProcessor.ai_translate('日本語文書.pdf', 'English')
+```
+Or create and use a persistent agent:
+```ruby
 agent = UniversalDocumentProcessor.create_ai_agent(
   api_key: 'your-openai-key',       # OpenAI API key
   model: 'gpt-4',                   # Model to use (gpt-4, gpt-3.5-turbo)
@@ -419,6 +530,130 @@ agent = UniversalDocumentProcessor.create_ai_agent(
   max_history: 20,                  # Conversation memory length
   base_url: 'https://api.openai.com/v1'  # Custom API endpoint
 )
+# Chat about a document
+response = agent.analyze_document('report.pdf')
+```
+---
+**Note:**
+- The API key is required for all AI features.
+- You can override the model, temperature, and other options per agent.
+- For more, see the `USER_GUIDE.md` and the examples above.
+```
+## 📦 Archive Processing (ZIP Creation & Extraction)
+The gem provides comprehensive archive processing capabilities, including both extracting from existing archives and creating new ZIP files.
+### Extracting from Archives
+```ruby
+# Extract text and metadata from ZIP archives
+result = UniversalDocumentProcessor.process('archive.zip')
+# Access archive-specific metadata
+metadata = result[:metadata]
+puts "Archive type: #{metadata[:archive_type]}"           # => "zip"
+puts "Total files: #{metadata[:total_files]}"             # => 15
+puts "Uncompressed size: #{metadata[:total_uncompressed_size]} bytes"
+puts "Compression ratio: #{metadata[:compression_ratio]}%" # => 75%
+puts "Directory structure: #{metadata[:directory_structure]}"
+# Check for specific file types
+puts "File types: #{metadata[:file_types]}"               # => {"txt"=>5, "pdf"=>3, "jpg"=>7}
+puts "Has executables: #{metadata[:has_executable_files]}" # => false
+puts "Largest file: #{metadata[:largest_file][:path]} (#{metadata[:largest_file][:size]} bytes)"
+# Extract text from text files within the archive
+text_content = result[:text_content]
+puts "Combined text from archive: #{text_content.length} characters"
+```
+### Creating ZIP Archives
+```ruby
+# Create ZIP from individual files
+files_to_zip = ['document1.pdf', 'document2.txt', 'image.jpg']
+output_zip = 'my_archive.zip'
+zip_path = UniversalDocumentProcessor::Processors::ArchiveProcessor.create_zip(
+  output_zip,
+  files_to_zip
+)
+puts "ZIP created: #{zip_path}"
+# Create ZIP from entire directory (preserves folder structure)
+directory_to_zip = '/path/to/documents'
+archive_path = UniversalDocumentProcessor::Processors::ArchiveProcessor.create_zip(
+  'directory_backup.zip',
+  directory_to_zip
+)
+puts "Directory archived: #{archive_path}"
+# Working with temporary directories
+require 'tmpdir'
+Dir.mktmpdir do |tmpdir|
+  # Create some test files
+  File.write(File.join(tmpdir, 'file1.txt'), 'Hello from file 1')
+  File.write(File.join(tmpdir, 'file2.txt'), 'Hello from file 2')
+  # Create subdirectory with files
+  subdir = File.join(tmpdir, 'subfolder')
+  Dir.mkdir(subdir)
+  File.write(File.join(subdir, 'file3.txt'), 'Hello from subfolder')
+  # Archive the entire directory structure
+  zip_file = File.join(tmpdir, 'complete_backup.zip')
+  UniversalDocumentProcessor::Processors::ArchiveProcessor.create_zip(zip_file, tmpdir)
+  puts "Archive size: #{File.size(zip_file)} bytes"
+  # Verify archive contents by processing it
+  archive_result = UniversalDocumentProcessor.process(zip_file)
+  puts "Files in archive: #{archive_result[:metadata][:total_files]}"
+end
+# Error handling for ZIP creation
+begin
+  UniversalDocumentProcessor::Processors::ArchiveProcessor.create_zip(
+    '/invalid/path/archive.zip',
+    ['file1.txt', 'file2.txt']
+  )
+rescue => e
+  puts "Error creating ZIP: #{e.message}"
+end
+# Validate input before creating ZIP
+files = ['doc1.pdf', 'doc2.txt']
+files.each do |file|
+  unless File.exist?(file)
+    puts "Warning: #{file} does not exist"
+  end
+end
+```
+### Archive Analysis
+```ruby
+# Analyze archive security and structure
+result = UniversalDocumentProcessor.process('suspicious_archive.zip')
+metadata = result[:metadata]
+# Security analysis
+if metadata[:has_executable_files]
+  puts "⚠️  Archive contains executable files"
+end
+# Directory structure analysis
+structure = metadata[:directory_structure]
+puts "Top-level directories: #{structure.keys.join(', ')}"
+# File type distribution
+file_types = metadata[:file_types]
+puts "Most common file type: #{file_types.max_by{|k,v| v}}"
 ```
 ## 🎌 Japanese Filename Support
@@ -743,7 +978,7 @@ bundle exec rspec
 ## 📝 Changelog
-### Version 1.0.0
+### Version 1.1.0
 - Initial release
 - Support for PDF, Word, Excel, PowerPoint, images, archives
 - Character validation and cleaning
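
Reviewer note: the Archive Analysis example added to the README interpolates `file_types.max_by{|k,v| v}` directly. On a Hash, `max_by` returns a two-element `[key, value]` array (or `nil` for an empty hash), so that message prints the pair rather than just the extension. A small sketch of destructuring the pair follows; `most_common_type` is a hypothetical helper, and the sample hash reuses the values from the README comment.

```ruby
# Sample values taken from the README comment for metadata[:file_types].
file_types = { 'txt' => 5, 'pdf' => 3, 'jpg' => 7 }

# Hypothetical helper: returns the [extension, count] pair with the highest
# count, or nil when the hash is empty.
def most_common_type(file_types)
  file_types.max_by { |_ext, count| count }
end

ext, count = most_common_type(file_types)
puts "Most common file type: #{ext} (#{count} files)"
# prints "Most common file type: jpg (7 files)"
```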

data/lib/universal_document_processor/ai_agent.rb CHANGED

@@ -14,16 +14,16 @@ module UniversalDocumentProcessor
       @max_history = options[:max_history] || 10
       @temperature = options[:temperature] || 0.7
       @ai_enabled = false
       validate_configuration
     end
     # Main document analysis with AI
     def analyze_document(document_result, query = nil)
       ensure_ai_available!
       context = build_document_context(document_result)
       if query
         # Specific query about the document
         analyze_with_query(context, query)
@@ -67,12 +67,12 @@ Please provide:
     # Ask specific questions about a document
     def ask_document_question(document_result, question)
       ensure_ai_available!
       context = build_document_context(document_result)
       prompt = build_question_prompt(context, question)
       response = call_openai_api(prompt)
       add_to_history(question, response)
       response
     end
@@ -80,19 +80,19 @@ Please provide:
     # Summarize document content
     def summarize_document(document_result, length: :medium)
       ensure_ai_available!
       context = build_document_context(document_result)
       length_instruction = case length
       when :short then "in 2-3 sentences"
       when :medium then "in 1-2 paragraphs"
       when :long then "in detail with key points"
       else "concisely"
       end
       prompt = build_summary_prompt(context, length_instruction)
       response = call_openai_api(prompt)
       add_to_history("Summarize document #{length_instruction}", response)
       response
     end
@@ -100,13 +100,13 @@ Please provide:
     # Extract key information from document
     def extract_key_information(document_result, categories = nil)
       ensure_ai_available!
       context = build_document_context(document_result)
       categories ||= ['key_facts', 'important_dates', 'names', 'locations', 'numbers']
       prompt = build_extraction_prompt(context, categories)
       response = call_openai_api(prompt)
       add_to_history("Extract key information: #{categories.join(', ')}", response)
       parse_extraction_response(response)
     end
@@ -114,12 +114,12 @@ Please provide:
     # Translate document content
     def translate_document(document_result, target_language)
       ensure_ai_available!
       context = build_document_context(document_result)
       prompt = build_translation_prompt(context, target_language)
       response = call_openai_api(prompt)
       add_to_history("Translate to #{target_language}", response)
       response
     end
@@ -127,12 +127,12 @@ Please provide:
     # Generate document insights and recommendations
     def generate_insights(document_result)
       ensure_ai_available!
       context = build_document_context(document_result)
       prompt = build_insights_prompt(context)
       response = call_openai_api(prompt)
       add_to_history("Generate insights", response)
       parse_insights_response(response)
     end
@@ -140,12 +140,12 @@ Please provide:
     # Compare multiple documents
     def compare_documents(document_results, comparison_type = :content)
       ensure_ai_available!
       contexts = document_results.map { |doc| build_document_context(doc) }
       prompt = build_comparison_prompt(contexts, comparison_type)
       response = call_openai_api(prompt)
       add_to_history("Compare documents (#{comparison_type})", response)
       response
     end
@@ -153,12 +153,12 @@ Please provide:
     # Classify document type and purpose
     def classify_document(document_result)
       ensure_ai_available!
       context = build_document_context(document_result)
       prompt = build_classification_prompt(context)
       response = call_openai_api(prompt)
       add_to_history("Classify document", response)
       parse_classification_response(response)
     end
@@ -166,12 +166,12 @@ Please provide:
     # Generate action items from document
     def extract_action_items(document_result)
       ensure_ai_available!
       context = build_document_context(document_result)
       prompt = build_action_items_prompt(context)
       response = call_openai_api(prompt)
       add_to_history("Extract action items", response)
       parse_action_items_response(response)
     end
@@ -179,14 +179,14 @@ Please provide:
     # Chat about the document
     def chat(message, document_result = nil)
       ensure_ai_available!
       if document_result
         context = build_document_context(document_result)
         prompt = build_chat_prompt(context, message)
       else
         prompt = build_general_chat_prompt(message)
       end
       response = call_openai_api(prompt)
       add_to_history(message, response)
       response
@@ -200,15 +200,15 @@ Please provide:
     # Get conversation summary
     def conversation_summary
       return "No conversation history" if @conversation_history.empty?
       unless @ai_enabled
         return "AI features are disabled. Cannot generate conversation summary."
       end
       history_text = @conversation_history.map do |entry|
         "Q: #{entry[:question]}\nA: #{entry[:answer]}"
       end.join("\n\n")
       prompt = "Summarize this conversation:\n\n#{history_text}"
       call_openai_api(prompt)
     end
@@ -247,13 +247,13 @@ Please provide:
         tables_count: document_result[:tables]&.length || 0,
         filename_info: document_result[:filename_info] || {}
       }
       # Add Japanese-specific information if available
       if context[:filename_info][:contains_japanese]
         context[:japanese_filename] = true
         context[:japanese_parts] = context[:filename_info][:japanese_parts]
       end
       context
     end
@@ -324,8 +324,7 @@ Please provide:
     def build_comparison_prompt(contexts, comparison_type)
       comparison_content = contexts.map.with_index do |context, index|
-        "Document #{index + 1}: #{context[:filename]}
-Content: #{truncate_content(context[:text_content], 1500)}"
+        "Document #{index + 1}: #{context[:filename]}\nContent: #{truncate_content(context[:text_content], 1500)}"
       end.join("\n\n---\n\n")
       "You are an AI analyst. Compare these documents focusing on #{comparison_type}:
@@ -404,15 +403,15 @@ Please respond helpfully."
     def call_openai_api(prompt)
       uri = URI("#{@base_url}/chat/completions")
       http = Net::HTTP.new(uri.host, uri.port)
       http.use_ssl = true
       http.read_timeout = 60
       request = Net::HTTP::Post.new(uri)
       request['Content-Type'] = 'application/json'
       request['Authorization'] = "Bearer #{@api_key}"
       request.body = {
         model: @model,
         messages: [
@@ -421,16 +420,16 @@ Please respond helpfully."
             content: "You are an intelligent document processing assistant with expertise in analyzing, summarizing, and extracting information from various document types. You support multiple languages including Japanese."
           },
           {
-            role: "user",
+            role: "user",
             content: prompt
           }
         ],
         temperature: @temperature,
         max_tokens: 2000
       }.to_json
       response = http.request(request)
       if response.code.to_i == 200
         result = JSON.parse(response.body)
         result.dig('choices', 0, 'message', 'content') || "No response generated"
@@ -446,14 +445,14 @@ Please respond helpfully."
         answer: answer,
         timestamp: Time.now
       }
       # Keep only the most recent conversations
       @conversation_history = @conversation_history.last(@max_history) if @conversation_history.length > @max_history
     end
     def truncate_content(content, max_length)
       return "" unless content.is_a?(String)
       if content.length > max_length
         "#{content[0...max_length]}...\n\n[Content truncated for analysis]"
       else
@@ -463,16 +462,16 @@ Please respond helpfully."
     def format_file_size(bytes)
       return "0 B" if bytes == 0
       units = ['B', 'KB', 'MB', 'GB']
       size = bytes.to_f
       unit_index = 0
       while size >= 1024 && unit_index < units.length - 1
         size /= 1024
         unit_index += 1
       end
       "#{size.round(2)} #{units[unit_index]}"
     end
@@ -490,7 +489,7 @@ Please respond helpfully."
       rescue JSON::ParserError
         # Fall back to plain text response
       end
       response
     end
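
Reviewer note: in the `call_openai_api` context shown above, `http.use_ssl = true` is set unconditionally, while the README documents `base_url` as a configurable endpoint. If a plain-HTTP endpoint (for example a local proxy) were ever used, TLS would presumably need to follow the URL scheme. The sketch below shows one way to build the request without sending it; `build_chat_request` is a hypothetical helper, the system message is omitted for brevity, and deriving `use_ssl` from the scheme is an assumption rather than the gem's behavior.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Hypothetical helper: prepares the HTTP client and POST request for a chat
# completions call but does not open a connection or send anything.
def build_chat_request(base_url, api_key, model, prompt, temperature: 0.7)
  uri = URI("#{base_url}/chat/completions")
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https') # assumption: skip TLS for http:// URLs
  http.read_timeout = 60

  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/json'
  request['Authorization'] = "Bearer #{api_key}"
  request.body = {
    model: model,
    messages: [{ role: 'user', content: prompt }],
    temperature: temperature,
    max_tokens: 2000
  }.to_json

  [http, request]
end
```

Returning the client and request separately keeps the construction step testable without network access; the caller would still run `http.request(request)` and handle non-200 responses as the original method does.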