universal_document_processor 1.1.0 → 1.1.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +122 -1
- data/lib/universal_document_processor/ai_agent.rb +48 -49
- data/lib/universal_document_processor/document.rb +130 -13
- data/lib/universal_document_processor/processors/base_processor.rb +17 -0
- data/lib/universal_document_processor/processors/excel_processor.rb +30 -0
- data/lib/universal_document_processor/processors/pdf_processor.rb +21 -1
- data/lib/universal_document_processor/processors/text_processor.rb +21 -0
- data/lib/universal_document_processor/processors/word_processor.rb +30 -0
- data/lib/universal_document_processor/version.rb +1 -1
- data/lib/universal_document_processor.rb +10 -0
- metadata +1 -1
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: '0612949a026d62fd8fd9c9c1372cfa70cdeb8bdd1677475be639cf35cd684f4c'
+  data.tar.gz: 82780d2c062034be663b3d21275e9d27addc1e44f5705de7dc6b23e70293216e
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 07a4fe1b792226dae8135e6620f455640d0ba137777b238052916d06a0b4f32b113414886479e0ad48b76ead57d9b7d6a577a76748dff926a9575ad124dc7ee5
+  data.tar.gz: fd3b8fb692f87755a657eb1270631c19ca99bcb5fcbb224e4ebcc22c1b19af1eb2aa3b2aed6ff3f94f84a6fdba8d5e255b28575d5fe1b074244e4bc160820f33
data/README.md
CHANGED

@@ -219,6 +219,33 @@ puts "Tables found: #{result[:tables].length}"
 full_text = result[:text_content]
 ```
 
+### Creating PDF Documents
+
+```ruby
+# Install Prawn for PDF creation (optional dependency)
+# gem install prawn
+
+# Create PDF from any supported document format
+pdf_path = UniversalDocumentProcessor.create_pdf('document.docx')
+puts "PDF created at: #{pdf_path}"
+
+# Or use the convert method
+pdf_path = UniversalDocumentProcessor.convert('spreadsheet.xlsx', :pdf)
+
+# Check if PDF creation is available
+if UniversalDocumentProcessor.pdf_creation_available?
+  puts "PDF creation is available!"
+else
+  puts "Install 'prawn' gem to enable PDF creation: gem install prawn"
+end
+
+# The created PDF includes:
+# - Document title and metadata
+# - Full text content with formatting
+# - Tables (if present in original document)
+# - File information and statistics
+```
+
 ### Processing Excel Spreadsheets
 
 ```ruby
@@ -413,6 +440,89 @@ summary = japanese_doc.ai_summarize(length: :medium)
 
 ```ruby
 # Custom AI agent configuration
+## ⚙️ Agentic AI Configuration & Usage
+
+To enable and use the AI-powered features (agentic AI) in your application, follow these steps:
+
+### 1. Install AI Dependency
+
+You need the `ruby-openai` gem for AI features:
+
+```bash
+gem install ruby-openai
+```
+
+Or add to your Gemfile:
+
+```ruby
+gem 'ruby-openai'
+```
+
+Then run:
+
+```bash
+bundle install
+```
+
+### 2. Set Your OpenAI API Key
+
+You must provide your OpenAI API key for agentic AI features to work. You can do this in two ways:
+
+#### a) Environment Variable (Recommended)
+
+Set the API key in your environment (e.g., in `.env`, `application.yml`, or your deployment environment):
+
+```ruby
+ENV['OPENAI_API_KEY'] = 'your-api-key-here'
+```
+
+#### b) Pass Directly When Creating the Agent
+
+```ruby
+agent = UniversalDocumentProcessor.create_ai_agent(api_key: 'your-api-key-here')
+```
+
+### 3. Rails: Where to Configure
+
+If you are using Rails, add your configuration to:
+
+`config/initializers/universal_document_processor.rb`
+
+Example initializer:
+
+```ruby
+# config/initializers/universal_document_processor.rb
+require 'universal_document_processor'
+
+# Set your API key (or use ENV)
+ENV['OPENAI_API_KEY'] ||= 'your-api-key-here' # (or use Rails credentials)
+
+# Optionally, create a default agent with custom options
+UniversalDocumentProcessor.create_ai_agent(
+  model: 'gpt-4',
+  temperature: 0.7,
+  max_history: 10
+)
+
+Rails.logger.info "Universal Document Processor with AI agent loaded" if defined?(Rails)
+```
+
+### 4. Using Agentic AI Features
+
+You can now use the AI-powered methods:
+
+```ruby
+summary = UniversalDocumentProcessor.ai_summarize('document.pdf', length: :short)
+insights = UniversalDocumentProcessor.ai_insights('document.pdf')
+classification = UniversalDocumentProcessor.ai_classify('document.pdf')
+key_info = UniversalDocumentProcessor.ai_extract_info('document.pdf', ['dates', 'names', 'amounts'])
+action_items = UniversalDocumentProcessor.ai_action_items('document.pdf')
+translation = UniversalDocumentProcessor.ai_translate('日本語文書.pdf', 'English')
+```
+
+Or create and use a persistent agent:
+
+```ruby
 agent = UniversalDocumentProcessor.create_ai_agent(
   api_key: 'your-openai-key', # OpenAI API key
   model: 'gpt-4', # Model to use (gpt-4, gpt-3.5-turbo)
@@ -420,6 +530,17 @@ agent = UniversalDocumentProcessor.create_ai_agent(
   max_history: 20, # Conversation memory length
   base_url: 'https://api.openai.com/v1' # Custom API endpoint
 )
+
+# Chat about a document
+response = agent.analyze_document('report.pdf')
+```
+
+---
+
+**Note:**
+- The API key is required for all AI features.
+- You can override the model, temperature, and other options per agent.
+- For more, see the `USER_GUIDE.md` and the examples above.
 ```
 
 ## 📦 Archive Processing (ZIP Creation & Extraction)
@@ -857,7 +978,7 @@ bundle exec rspec
 
 ## 📝 Changelog
 
-### Version 1.
+### Version 1.1.0
 - Initial release
 - Support for PDF, Word, Excel, PowerPoint, images, archives
 - Character validation and cleaning
data/lib/universal_document_processor/ai_agent.rb
CHANGED

@@ -14,16 +14,16 @@ module UniversalDocumentProcessor
       @max_history = options[:max_history] || 10
       @temperature = options[:temperature] || 0.7
       @ai_enabled = false
-
+
       validate_configuration
     end
 
     # Main document analysis with AI
     def analyze_document(document_result, query = nil)
       ensure_ai_available!
-
+
       context = build_document_context(document_result)
-
+
       if query
         # Specific query about the document
         analyze_with_query(context, query)

@@ -67,12 +67,12 @@ Please provide:
     # Ask specific questions about a document
     def ask_document_question(document_result, question)
       ensure_ai_available!
-
+
       context = build_document_context(document_result)
-
+
       prompt = build_question_prompt(context, question)
       response = call_openai_api(prompt)
-
+
       add_to_history(question, response)
       response
     end

@@ -80,19 +80,19 @@ Please provide:
     # Summarize document content
     def summarize_document(document_result, length: :medium)
       ensure_ai_available!
-
+
       context = build_document_context(document_result)
-
+
       length_instruction = case length
       when :short then "in 2-3 sentences"
       when :medium then "in 1-2 paragraphs"
       when :long then "in detail with key points"
       else "concisely"
       end
-
+
       prompt = build_summary_prompt(context, length_instruction)
       response = call_openai_api(prompt)
-
+
       add_to_history("Summarize document #{length_instruction}", response)
       response
     end

@@ -100,13 +100,13 @@ Please provide:
     # Extract key information from document
     def extract_key_information(document_result, categories = nil)
       ensure_ai_available!
-
+
       context = build_document_context(document_result)
       categories ||= ['key_facts', 'important_dates', 'names', 'locations', 'numbers']
-
+
       prompt = build_extraction_prompt(context, categories)
       response = call_openai_api(prompt)
-
+
       add_to_history("Extract key information: #{categories.join(', ')}", response)
       parse_extraction_response(response)
     end

@@ -114,12 +114,12 @@ Please provide:
     # Translate document content
     def translate_document(document_result, target_language)
       ensure_ai_available!
-
+
       context = build_document_context(document_result)
-
+
       prompt = build_translation_prompt(context, target_language)
       response = call_openai_api(prompt)
-
+
       add_to_history("Translate to #{target_language}", response)
       response
     end
@@ -127,12 +127,12 @@ Please provide:
     # Generate document insights and recommendations
     def generate_insights(document_result)
       ensure_ai_available!
-
+
       context = build_document_context(document_result)
-
+
       prompt = build_insights_prompt(context)
       response = call_openai_api(prompt)
-
+
       add_to_history("Generate insights", response)
       parse_insights_response(response)
     end

@@ -140,12 +140,12 @@ Please provide:
     # Compare multiple documents
     def compare_documents(document_results, comparison_type = :content)
       ensure_ai_available!
-
+
       contexts = document_results.map { |doc| build_document_context(doc) }
-
+
       prompt = build_comparison_prompt(contexts, comparison_type)
       response = call_openai_api(prompt)
-
+
       add_to_history("Compare documents (#{comparison_type})", response)
       response
     end

@@ -153,12 +153,12 @@ Please provide:
     # Classify document type and purpose
     def classify_document(document_result)
       ensure_ai_available!
-
+
       context = build_document_context(document_result)
-
+
       prompt = build_classification_prompt(context)
       response = call_openai_api(prompt)
-
+
       add_to_history("Classify document", response)
       parse_classification_response(response)
     end

@@ -166,12 +166,12 @@ Please provide:
    # Generate action items from document
     def extract_action_items(document_result)
       ensure_ai_available!
-
+
       context = build_document_context(document_result)
-
+
       prompt = build_action_items_prompt(context)
       response = call_openai_api(prompt)
-
+
       add_to_history("Extract action items", response)
       parse_action_items_response(response)
     end

@@ -179,14 +179,14 @@ Please provide:
     # Chat about the document
     def chat(message, document_result = nil)
       ensure_ai_available!
-
+
       if document_result
         context = build_document_context(document_result)
         prompt = build_chat_prompt(context, message)
       else
         prompt = build_general_chat_prompt(message)
       end
-
+
       response = call_openai_api(prompt)
       add_to_history(message, response)
       response

@@ -200,15 +200,15 @@ Please provide:
     # Get conversation summary
     def conversation_summary
       return "No conversation history" if @conversation_history.empty?
-
+
       unless @ai_enabled
         return "AI features are disabled. Cannot generate conversation summary."
       end
-
+
       history_text = @conversation_history.map do |entry|
         "Q: #{entry[:question]}\nA: #{entry[:answer]}"
       end.join("\n\n")
-
+
       prompt = "Summarize this conversation:\n\n#{history_text}"
       call_openai_api(prompt)
     end
@@ -247,13 +247,13 @@ Please provide:
         tables_count: document_result[:tables]&.length || 0,
         filename_info: document_result[:filename_info] || {}
       }
-
+
       # Add Japanese-specific information if available
       if context[:filename_info][:contains_japanese]
         context[:japanese_filename] = true
         context[:japanese_parts] = context[:filename_info][:japanese_parts]
       end
-
+
       context
     end
 

@@ -324,8 +324,7 @@ Please provide:
 
     def build_comparison_prompt(contexts, comparison_type)
      comparison_content = contexts.map.with_index do |context, index|
-        "Document #{index + 1}: #{context[:filename]}
-Content: #{truncate_content(context[:text_content], 1500)}"
+        "Document #{index + 1}: #{context[:filename]}\nContent: #{truncate_content(context[:text_content], 1500)}"
       end.join("\n\n---\n\n")
 
       "You are an AI analyst. Compare these documents focusing on #{comparison_type}:

@@ -404,15 +403,15 @@ Please respond helpfully."
 
     def call_openai_api(prompt)
       uri = URI("#{@base_url}/chat/completions")
-
+
       http = Net::HTTP.new(uri.host, uri.port)
       http.use_ssl = true
       http.read_timeout = 60
-
+
       request = Net::HTTP::Post.new(uri)
       request['Content-Type'] = 'application/json'
       request['Authorization'] = "Bearer #{@api_key}"
-
+
       request.body = {
         model: @model,
         messages: [

@@ -421,16 +420,16 @@ Please respond helpfully."
             content: "You are an intelligent document processing assistant with expertise in analyzing, summarizing, and extracting information from various document types. You support multiple languages including Japanese."
           },
           {
-            role: "user", 
+            role: "user",
             content: prompt
           }
         ],
         temperature: @temperature,
         max_tokens: 2000
       }.to_json
-
+
       response = http.request(request)
-
+
       if response.code.to_i == 200
         result = JSON.parse(response.body)
         result.dig('choices', 0, 'message', 'content') || "No response generated"

@@ -446,14 +445,14 @@ Please respond helpfully."
         answer: answer,
         timestamp: Time.now
       }
-
+
       # Keep only the most recent conversations
       @conversation_history = @conversation_history.last(@max_history) if @conversation_history.length > @max_history
     end
 
     def truncate_content(content, max_length)
       return "" unless content.is_a?(String)
-
+
       if content.length > max_length
         "#{content[0...max_length]}...\n\n[Content truncated for analysis]"
       else

@@ -463,16 +462,16 @@ Please respond helpfully."
 
     def format_file_size(bytes)
       return "0 B" if bytes == 0
-
+
       units = ['B', 'KB', 'MB', 'GB']
       size = bytes.to_f
       unit_index = 0
-
+
       while size >= 1024 && unit_index < units.length - 1
         size /= 1024
         unit_index += 1
       end
-
+
       "#{size.round(2)} #{units[unit_index]}"
     end
 

@@ -490,7 +489,7 @@ Please respond helpfully."
       rescue JSON::ParserError
         # Fall back to plain text response
       end
-
+
       response
     end
 
data/lib/universal_document_processor/document.rb
CHANGED

@@ -2,29 +2,62 @@ module UniversalDocumentProcessor
   class Document
     attr_reader :file_path, :content_type, :file_size, :options, :filename_validation
 
+    class LargeFileError < StandardError; end
+    class FileValidationError < StandardError; end
+    MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
+
     def initialize(file_path_or_io, options = {})
       @file_path = file_path_or_io.is_a?(String) ? normalize_file_path(file_path_or_io) : save_temp_file(file_path_or_io)
       @options = options
+      # 1. Check file existence and readability
+      unless File.exist?(@file_path) && File.readable?(@file_path)
+        raise FileValidationError, "File is missing or unreadable: #{@file_path}"
+      end
       @content_type = detect_content_type
       @file_size = File.size(@file_path)
+      # 2. Large file safeguard
+      if @file_size > MAX_FILE_SIZE
+        raise LargeFileError, "File size #{@file_size} exceeds maximum allowed (#{MAX_FILE_SIZE} bytes)"
+      end
      @filename_validation = validate_filename_encoding
+      # 3. Encoding validation and cleaning for text files
+      if @content_type =~ /text|plain/
+        validation = UniversalDocumentProcessor.validate_file(@file_path)
+        unless validation[:valid]
+          @cleaned_text_content = UniversalDocumentProcessor.clean_text(validation[:content], {
+            remove_null_bytes: true,
+            remove_control_chars: true,
+            normalize_whitespace: true
+          })
+        else
+          @cleaned_text_content = nil
+        end
+      end
     end
 
     def process
-
-
-
-
-
-
-
-
-
-
-
+      begin
+        {
+          file_path: @file_path,
+          content_type: @content_type,
+          file_size: @file_size,
+          text_content: extract_text,
+          metadata: metadata,
+          images: extract_images,
+          tables: extract_tables,
+          filename_info: filename_info,
+          processed_at: Time.current
+        }
+      rescue LargeFileError, FileValidationError => e
+        { error: e.class.name, message: e.message, file_path: @file_path }
+      rescue => e
+        { error: 'ProcessingError', message: e.message, file_path: @file_path }
+      end
     end
 
     def extract_text
+      # Use cleaned text if available (from encoding validation)
+      return @cleaned_text_content if defined?(@cleaned_text_content) && @cleaned_text_content
       processor.extract_text
     rescue => e
       fallback_text_extraction
@@ -253,13 +286,97 @@ module UniversalDocumentProcessor
     end
 
     def convert_to_pdf
-
-
+      ensure_prawn_available!
+
+      output_path = @file_path.gsub(File.extname(@file_path), '.pdf')
+
+      Prawn::Document.generate(output_path) do |pdf|
+        # Add title
+        pdf.font_size 18
+        pdf.text "Document: #{File.basename(@file_path)}", style: :bold
+        pdf.move_down 20
+
+        # Add metadata section
+        pdf.font_size 12
+        pdf.text "Document Information", style: :bold
+        pdf.move_down 10
+
+        metadata_info = metadata
+        pdf.text "File Size: #{format_file_size(@file_size)}"
+        pdf.text "Content Type: #{@content_type}"
+        pdf.text "Created: #{metadata_info[:created_at]}" if metadata_info[:created_at]
+        pdf.text "Modified: #{metadata_info[:modified_at]}" if metadata_info[:modified_at]
+        pdf.move_down 20
+
+        # Add content section
+        pdf.text "Content", style: :bold
+        pdf.move_down 10
+
+        text_content = extract_text
+        if text_content && !text_content.strip.empty?
+          pdf.font_size 10
+          pdf.text text_content
+        else
+          pdf.text "No text content available for this document."
+        end
+
+        # Add tables if available
+        tables = extract_tables
+        unless tables.empty?
+          pdf.start_new_page
+          pdf.font_size 12
+          pdf.text "Tables", style: :bold
+          pdf.move_down 10
+
+          tables.each_with_index do |table, index|
+            pdf.text "Table #{index + 1}", style: :bold
+            pdf.move_down 5
+
+            if table[:content] && !table[:content].empty?
+              # Format table data for Prawn
+              table_data = table[:content].first(20) # Limit to first 20 rows
+              pdf.table(table_data, header: true) do
+                row(0).font_style = :bold
+                cells.size = 8
+                cells.padding = 3
+              end
+            end
+            pdf.move_down 15
+          end
+        end
+      end
+
+      output_path
+    rescue => e
+      raise ProcessingError, "Failed to create PDF: #{e.message}"
     end
 
     def convert_to_html
       # Implementation for HTML conversion
       raise NotImplementedError, "HTML conversion not yet implemented"
     end
+
+    private
+
+    def ensure_prawn_available!
+      unless defined?(Prawn)
+        raise DependencyMissingError, "PDF creation requires the 'prawn' gem. Install it with: gem install prawn -v '~> 2.4'"
+      end
+    end
+
+    def format_file_size(bytes)
+      return "0 B" if bytes == 0
+
+      units = ['B', 'KB', 'MB', 'GB']
+      size = bytes.to_f
+      unit_index = 0
+
+      while size >= 1024 && unit_index < units.length - 1
+        size /= 1024
+        unit_index += 1
+      end
+
+      "#{size.round(2)} #{units[unit_index]}"
+    end
   end
 end
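Worth spelling out how the new safeguards in `document.rb` surface to callers. Below is a minimal caller-side sketch, not part of the diff, assuming the gem is loaded normally and `oversized.pdf` is a hypothetical local file; it uses only the classes and hash keys introduced above (`LargeFileError`, `FileValidationError`, and the `:error`/`:message` keys returned by `Document#process`).

```ruby
require 'universal_document_processor'

begin
  # Document.new now checks existence, readability, and the 50 MB cap up front
  doc = UniversalDocumentProcessor::Document.new('oversized.pdf')

  result = doc.process
  if result[:error]
    # Processing failures come back as an error hash instead of raising
    warn "#{result[:error]}: #{result[:message]}"
  else
    puts result[:text_content]
  end
rescue UniversalDocumentProcessor::Document::LargeFileError,
       UniversalDocumentProcessor::Document::FileValidationError => e
  # Raised by Document#initialize before any processing happens
  warn "Rejected: #{e.message}"
end
```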
data/lib/universal_document_processor/processors/base_processor.rb
CHANGED

@@ -3,6 +3,8 @@ module UniversalDocumentProcessor
    class BaseProcessor
      attr_reader :file_path, :options
 
+      MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
+
      def initialize(file_path, options = {})
        @file_path = file_path
        @options = options

@@ -11,6 +13,17 @@ module UniversalDocumentProcessor
      def extract_text
        # Fallback to universal text extraction
        if defined?(Yomu)
+          # Encoding validation for text files
+          if File.extname(@file_path) =~ /\.(txt|csv|tsv|md|json|xml|html|htm)$/i
+            validation = UniversalDocumentProcessor.validate_file(@file_path)
+            unless validation[:valid]
+              return UniversalDocumentProcessor.clean_text(validation[:content], {
+                remove_null_bytes: true,
+                remove_control_chars: true,
+                normalize_whitespace: true
+              })
+            end
+          end
          Yomu.new(@file_path).text
        else
          raise ProcessingError, "Universal text extraction requires the 'yomu' gem. Install it with: gem install yomu -v '~> 0.2'"

@@ -49,6 +62,10 @@ module UniversalDocumentProcessor
      def validate_file
        raise ProcessingError, "File not found: #{@file_path}" unless File.exist?(@file_path)
        raise ProcessingError, "File is empty: #{@file_path}" if File.zero?(@file_path)
+        # Large file safeguard
+        if File.size(@file_path) > MAX_FILE_SIZE
+          raise ProcessingError, "File size #{File.size(@file_path)} exceeds maximum allowed (#{MAX_FILE_SIZE} bytes)"
+        end
      end
 
      def with_error_handling
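The processors enforce the same 50 MB limit through their own `validate_file`, reporting it as a `ProcessingError`. A small sketch of what that looks like from the caller's side, assuming a hypothetical `big.txt` over the limit (the processor class and `ProcessingError` constant are taken from the hunks in this diff):

```ruby
require 'universal_document_processor'

processor = UniversalDocumentProcessor::Processors::TextProcessor.new('big.txt')

begin
  text = processor.extract_text # validate_file now runs before extraction
  puts text[0, 200]
rescue UniversalDocumentProcessor::ProcessingError => e
  # Missing, empty, or over-limit files are all reported here
  warn e.message
end
```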
data/lib/universal_document_processor/processors/excel_processor.rb
CHANGED

@@ -6,11 +6,32 @@ require 'csv'
 module UniversalDocumentProcessor
   module Processors
     class ExcelProcessor < BaseProcessor
+      MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
+
       def extract_text
+        validate_file
         with_error_handling do
           if @file_path.end_with?('.csv')
+            # Encoding validation for CSV
+            validation = UniversalDocumentProcessor.validate_file(@file_path)
+            unless validation[:valid]
+              return UniversalDocumentProcessor.clean_text(validation[:content], {
+                remove_null_bytes: true,
+                remove_control_chars: true,
+                normalize_whitespace: true
+              })
+            end
             extract_csv_text
           elsif @file_path.end_with?('.tsv')
+            # Encoding validation for TSV
+            validation = UniversalDocumentProcessor.validate_file(@file_path)
+            unless validation[:valid]
+              return UniversalDocumentProcessor.clean_text(validation[:content], {
+                remove_null_bytes: true,
+                remove_control_chars: true,
+                normalize_whitespace: true
+              })
+            end
             extract_tsv_text
           elsif @file_path.end_with?('.xlsx')
             extract_xlsx_text_builtin

@@ -208,6 +229,15 @@ module UniversalDocumentProcessor
 
       private
 
+      def validate_file
+        raise ProcessingError, "File not found: #{@file_path}" unless File.exist?(@file_path)
+        raise ProcessingError, "File is empty: #{@file_path}" if File.zero?(@file_path)
+        # Large file safeguard
+        if File.size(@file_path) > MAX_FILE_SIZE
+          raise ProcessingError, "File size #{File.size(@file_path)} exceeds maximum allowed (#{MAX_FILE_SIZE} bytes)"
+        end
+      end
+
       # CSV Processing Methods
       def extract_csv_text
         content = File.read(@file_path, encoding: 'UTF-8')
data/lib/universal_document_processor/processors/pdf_processor.rb
CHANGED

@@ -1,12 +1,23 @@
 module UniversalDocumentProcessor
   module Processors
     class PdfProcessor < BaseProcessor
+      MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
+
       def extract_text
         ensure_pdf_reader_available!
-
+        validate_file
         with_error_handling do
           reader = PDF::Reader.new(@file_path)
           text = reader.pages.map(&:text).join("\n")
+          # Encoding validation for extracted text
+          validation = UniversalDocumentProcessor.validate_file(@file_path)
+          unless validation[:valid]
+            return UniversalDocumentProcessor.clean_text(validation[:content], {
+              remove_null_bytes: true,
+              remove_control_chars: true,
+              normalize_whitespace: true
+            })
+          end
           text.strip.empty? ? "No text content found in PDF" : text
         end
       rescue => e

@@ -104,6 +115,15 @@ module UniversalDocumentProcessor
         end
       end
 
+      def validate_file
+        raise ProcessingError, "File not found: #{@file_path}" unless File.exist?(@file_path)
+        raise ProcessingError, "File is empty: #{@file_path}" if File.zero?(@file_path)
+        # Large file safeguard
+        if File.size(@file_path) > MAX_FILE_SIZE
+          raise ProcessingError, "File size #{File.size(@file_path)} exceeds maximum allowed (#{MAX_FILE_SIZE} bytes)"
+        end
+      end
+
      def extract_form_fields(reader)
        # Extract PDF form fields if present
        []
data/lib/universal_document_processor/processors/text_processor.rb
CHANGED

@@ -1,7 +1,10 @@
 module UniversalDocumentProcessor
   module Processors
     class TextProcessor < BaseProcessor
+      MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
+
       def extract_text
+        validate_file
         with_error_handling do
           case detect_text_format
           when :rtf

@@ -15,6 +18,15 @@ module UniversalDocumentProcessor
           when :json
             extract_json_text
           else
+            # Encoding validation for plain text
+            validation = UniversalDocumentProcessor.validate_file(@file_path)
+            unless validation[:valid]
+              return UniversalDocumentProcessor.clean_text(validation[:content], {
+                remove_null_bytes: true,
+                remove_control_chars: true,
+                normalize_whitespace: true
+              })
+            end
             extract_plain_text
           end
         end

@@ -81,6 +93,15 @@ module UniversalDocumentProcessor
 
       private
 
+      def validate_file
+        raise ProcessingError, "File not found: #{@file_path}" unless File.exist?(@file_path)
+        raise ProcessingError, "File is empty: #{@file_path}" if File.zero?(@file_path)
+        # Large file safeguard
+        if File.size(@file_path) > MAX_FILE_SIZE
+          raise ProcessingError, "File size #{File.size(@file_path)} exceeds maximum allowed (#{MAX_FILE_SIZE} bytes)"
+        end
+      end
+
       def detect_text_format
         extension = File.extname(@file_path).downcase
         case extension
data/lib/universal_document_processor/processors/word_processor.rb
CHANGED

@@ -1,11 +1,32 @@
 module UniversalDocumentProcessor
   module Processors
     class WordProcessor < BaseProcessor
+      MAX_FILE_SIZE = 50 * 1024 * 1024 # 50 MB
+
       def extract_text
+        validate_file
         with_error_handling do
           if @file_path.end_with?('.docx')
+            # Encoding validation for docx (if possible)
+            validation = UniversalDocumentProcessor.validate_file(@file_path)
+            unless validation[:valid]
+              return UniversalDocumentProcessor.clean_text(validation[:content], {
+                remove_null_bytes: true,
+                remove_control_chars: true,
+                normalize_whitespace: true
+              })
+            end
             extract_docx_text
           elsif @file_path.end_with?('.doc')
+            # Encoding validation for doc (if possible)
+            validation = UniversalDocumentProcessor.validate_file(@file_path)
+            unless validation[:valid]
+              return UniversalDocumentProcessor.clean_text(validation[:content], {
+                remove_null_bytes: true,
+                remove_control_chars: true,
+                normalize_whitespace: true
+              })
+            end
             # Built-in .doc file processing
             fallback_text_extraction
           else

@@ -90,6 +111,15 @@ module UniversalDocumentProcessor
 
      private
 
+      def validate_file
+        raise ProcessingError, "File not found: #{@file_path}" unless File.exist?(@file_path)
+        raise ProcessingError, "File is empty: #{@file_path}" if File.zero?(@file_path)
+        # Large file safeguard
+        if File.size(@file_path) > MAX_FILE_SIZE
+          raise ProcessingError, "File size #{File.size(@file_path)} exceeds maximum allowed (#{MAX_FILE_SIZE} bytes)"
+        end
+      end
+
      def ensure_docx_available!
        unless defined?(Docx)
          raise DependencyMissingError, "DOCX processing requires the 'docx' gem. Install it with: gem install docx -v '~> 0.8'"
data/lib/universal_document_processor.rb
CHANGED

@@ -206,6 +206,16 @@ module UniversalDocumentProcessor
     Document.new(file_path_or_io, options).convert_to(target_format)
   end
 
+  # Create PDF from any supported document
+  def self.create_pdf(file_path, options = {})
+    Document.new(file_path, options).convert_to(:pdf)
+  end
+
+  # Check if PDF creation is available
+  def self.pdf_creation_available?
+    defined?(Prawn)
+  end
+
   # Batch process multiple documents
   def self.batch_process(file_paths, options = {})
     file_paths.map do |file_path|