universal_document_processor 1.0.0

data/README.md ADDED
@@ -0,0 +1,726 @@
1
+ # Universal Document Processor
2
+
3
+ [![Gem Version](https://badge.fury.io/rb/universal_document_processor.svg)](https://badge.fury.io/rb/universal_document_processor)
4
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
5
+ [![Ruby](https://img.shields.io/badge/ruby-%3E%3D%202.7.0-ruby.svg)](https://www.ruby-lang.org/)
6
+
7
+ A comprehensive Ruby gem that provides unified document processing capabilities across multiple file formats. Extract text, metadata, images, and tables from PDFs, Word documents, Excel spreadsheets, PowerPoint presentations, images, archives, and more with a single, consistent API.
8
+
9
+ ## 🎯 Features
10
+
11
+ ### **Unified Document Processing**
12
+ - **Single API** for all document types
13
+ - **Intelligent format detection** and processing
14
+ - **Production-ready** error handling and fallbacks
15
+ - **Extensible architecture** for future enhancements
16
+
17
+ ### **Supported File Formats**
18
+ - **📄 Documents**: PDF, DOC, DOCX, RTF
19
+ - **📊 Spreadsheets**: XLS, XLSX, CSV
20
+ - **📺 Presentations**: PPT, PPTX
21
+ - **🖼️ Images**: JPG, PNG, GIF, BMP, TIFF
22
+ - **📁 Archives**: ZIP, RAR, 7Z
23
+ - **📄 Text**: TXT, HTML, XML, JSON, Markdown
24
+
25
+ ### **Advanced Content Extraction**
26
+ - **Text Extraction**: Full text content from any supported format
27
+ - **Metadata Extraction**: File properties, author, creation date, etc.
28
+ - **Image Extraction**: Embedded images from documents
29
+ - **Table Detection**: Structured data extraction
30
+ - **Character Validation**: Invalid character detection and cleaning
31
+ - **Multi-language Support**: Full Unicode support including Japanese (日本語)
32
+
33
+ ### **Character & Encoding Support**
34
+ - **Smart encoding detection** (UTF-8, Shift_JIS, EUC-JP, ISO-8859-1)
35
+ - **Invalid character detection** and cleaning
36
+ - **Japanese text support** (Hiragana, Katakana, Kanji)
37
+ - **Control character handling**
38
+ - **Text repair and normalization**
39
+
40
+ ## 🚀 Installation
41
+
42
+ Add this line to your application's Gemfile:
43
+
44
+ ```ruby
45
+ gem 'universal_document_processor'
46
+ ```
47
+
48
+ And then execute:
49
+ ```bash
50
+ bundle install
51
+ ```
52
+
53
+ Or install it yourself as:
54
+ ```bash
55
+ gem install universal_document_processor
56
+ ```
57
+
58
+ ### Optional Dependencies
59
+
60
+ For enhanced functionality, install additional gems:
61
+
62
+ ```ruby
63
+ # PDF processing
64
+ gem 'pdf-reader', '~> 2.0'
65
+ gem 'prawn', '~> 2.4'
66
+
67
+ # Microsoft Office documents
68
+ gem 'docx', '~> 0.8'
69
+ gem 'roo', '~> 2.8'
70
+
71
+ # Image processing
72
+ gem 'mini_magick', '~> 4.11'
73
+
74
+ # Universal text extraction fallback
75
+ gem 'yomu', '~> 0.2'
76
+ ```
77
+
78
+ ## 📖 Quick Start
79
+
80
+ ### Basic Usage
81
+
82
+ ```ruby
83
+ require 'universal_document_processor'
84
+
85
+ # Process any document
86
+ result = UniversalDocumentProcessor.process('document.pdf')
87
+
88
+ # Extract text only
89
+ text = UniversalDocumentProcessor.extract_text('document.docx')
90
+
91
+ # Get metadata only
92
+ metadata = UniversalDocumentProcessor.get_metadata('spreadsheet.xlsx')
93
+ ```
94
+
95
+ ### Processing Result
96
+
97
+ ```ruby
98
+ result = UniversalDocumentProcessor.process('document.pdf')
99
+
100
+ # Returns comprehensive information:
101
+ {
102
+ file_path: "document.pdf",
103
+ content_type: "application/pdf",
104
+ file_size: 1024576,
105
+ text_content: "Extracted text content...",
106
+ metadata: {
107
+ title: "Document Title",
108
+ author: "Author Name",
109
+ page_count: 25
110
+ },
111
+ images: [...],
112
+ tables: [...],
113
+ processed_at: 2024-01-15 10:30:00 UTC
114
+ }
115
+ ```
116
+
117
+ ## 🔧 Advanced Usage
118
+
119
+ ### Character Validation and Cleaning
120
+
121
+ ```ruby
122
+ # Analyze text quality and character issues
123
+ analysis = UniversalDocumentProcessor.analyze_text_quality(text)
124
+
125
+ # Returns:
126
+ {
127
+ encoding: "UTF-8",
128
+ valid_encoding: true,
129
+ has_invalid_chars: false,
130
+ has_control_chars: true,
131
+ character_issues: [...],
132
+ statistics: {
133
+ total_chars: 1500,
134
+ japanese_chars: 250,
135
+ hiragana_chars: 100,
136
+ katakana_chars: 50,
137
+ kanji_chars: 100
138
+ },
139
+ japanese_analysis: {
140
+ japanese: true,
141
+ scripts: ['hiragana', 'katakana', 'kanji'],
142
+ mixed_with_latin: true
143
+ }
144
+ }
145
+ ```
146
+
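As a rough illustration of how the `statistics` counts above could be derived, the sketch below uses Ruby's Unicode script properties. `japanese_statistics` is a hypothetical helper written for this example, not part of the gem's API, and the gem's own counting rules may differ. For instance, the prolonged sound mark `ー` belongs to the Common script and is not matched by `\p{Katakana}`, and `\p{Han}` also matches Chinese text.

```ruby
# Hypothetical helper (not the gem's implementation): counts characters
# per Japanese script using Unicode script properties.
def japanese_statistics(text)
  hiragana = text.scan(/\p{Hiragana}/).length
  katakana = text.scan(/\p{Katakana}/).length
  kanji    = text.scan(/\p{Han}/).length # Han ideographs, shared with Chinese
  japanese = hiragana + katakana + kanji

  {
    total_chars: text.length,
    japanese_chars: japanese,
    hiragana_chars: hiragana,
    katakana_chars: katakana,
    kanji_chars: kanji,
    mixed_with_latin: japanese.positive? && text.match?(/\p{Latin}/)
  }
end

stats = japanese_statistics("こんにちは 世界 Hello")
puts stats.inspect
```

Counting by script property keeps the logic independent of the input's original encoding, provided the text has already been converted to UTF-8.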
147
+ ### Text Cleaning
148
+
149
+ ```ruby
150
+ # Clean text by removing invalid characters
151
+ clean_text = UniversalDocumentProcessor.clean_text(corrupted_text, {
152
+ remove_null_bytes: true,
153
+ remove_control_chars: true,
154
+ normalize_whitespace: true
155
+ })
156
+ ```
157
+
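To make the three options above concrete, here is a minimal sketch of what each one might do. `clean_text_sketch` is a hypothetical stand-in, and the exact character ranges are assumptions; the gem's `clean_text` may treat tabs, newlines, or Unicode whitespace differently.

```ruby
# Hypothetical stand-in for the cleaning options shown above.
def clean_text_sketch(text, remove_null_bytes: true,
                      remove_control_chars: true,
                      normalize_whitespace: true)
  cleaned = text.dup
  cleaned = cleaned.delete("\u0000") if remove_null_bytes
  if remove_control_chars
    # Drop C0 control characters and DEL, but keep tab, newline and carriage return.
    cleaned = cleaned.gsub(/[\u0001-\u0008\u000B\u000C\u000E-\u001F\u007F]/, "")
  end
  # Collapse runs of spaces and tabs, then trim both ends.
  cleaned = cleaned.gsub(/[ \t]+/, " ").strip if normalize_whitespace
  cleaned
end

cleaned = clean_text_sketch("Hello\x00World\x01  こんにちは ")
puts cleaned
```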
158
+ ### File Encoding Validation
159
+
160
+ ```ruby
161
+ # Validate file encoding (supports Japanese encodings)
162
+ validation = UniversalDocumentProcessor.validate_file('japanese_document.txt')
163
+
164
+ # Returns:
165
+ {
166
+ detected_encoding: "Shift_JIS",
167
+ valid: true,
168
+ content: "こんにちは",
169
+ analysis: {...}
170
+ }
171
+ ```
172
+
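One simple way to approximate this kind of detection is to try a fixed list of candidate encodings and keep the first one in which the bytes are valid. The sketch below does that with Ruby's built-in `String#force_encoding` and `String#valid_encoding?`. The helper names and the candidate order are assumptions made for illustration, and the gem may use a different heuristic. Because ISO-8859-1 accepts any byte sequence, it acts as the last-resort fallback.

```ruby
# Candidate order is an assumption: stricter encodings first, ISO-8859-1 last.
CANDIDATE_ENCODINGS = [
  Encoding::UTF_8,
  Encoding::Shift_JIS,
  Encoding::EUC_JP,
  Encoding::ISO_8859_1
].freeze

# Hypothetical helper: returns the first candidate in which the bytes are valid.
def detect_encoding_sketch(bytes)
  CANDIDATE_ENCODINGS.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end

# Hypothetical helper: re-tags the bytes with the detected encoding and transcodes.
def to_utf8_sketch(bytes)
  bytes.dup.force_encoding(detect_encoding_sketch(bytes)).encode(Encoding::UTF_8)
end

sjis_bytes = "こんにちは".encode(Encoding::Shift_JIS).b
detected   = detect_encoding_sketch(sjis_bytes)
puts "#{detected.name}: #{to_utf8_sketch(sjis_bytes)}"
```

Validity alone cannot distinguish encodings that accept the same bytes, so short or ambiguous inputs may be misidentified by an approach like this.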
173
+ ### Japanese Text Support
174
+
175
+ ```ruby
176
+ # Check if text contains Japanese
177
+ is_japanese = UniversalDocumentProcessor.japanese_text?("こんにちは World")
178
+ # => true
179
+
180
+ # Detailed Japanese analysis
181
+ japanese_info = UniversalDocumentProcessor.validate_japanese_text("こんにちは 世界")
182
+ # Returns detailed Japanese character analysis
183
+ ```
184
+
185
+ ### Batch Processing
186
+
187
+ ```ruby
188
+ # Process multiple documents
189
+ file_paths = ['file1.pdf', 'file2.docx', 'file3.xlsx']
190
+ results = UniversalDocumentProcessor.batch_process(file_paths)
191
+
192
+ # Returns array with success/error status for each file
193
+ ```
194
+
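The per-file success/error behaviour can be pictured with the following sketch, which wraps each file in its own `rescue` so that one failure does not abort the batch. `batch_process_sketch` and the result keys (`:file`, `:success`, `:result`, `:error`) are hypothetical; the README does not specify the actual shape of each entry.

```ruby
# Hypothetical wrapper: runs the given block once per path and records
# success or failure for each file independently.
def batch_process_sketch(file_paths)
  file_paths.map do |path|
    begin
      { file: path, success: true, result: yield(path) }
    rescue StandardError => e
      { file: path, success: false, error: e.message }
    end
  end
end

# The block stands in for a real processing call.
results = batch_process_sketch(%w[file1.pdf missing.docx]) do |path|
  raise ArgumentError, "file not found: #{path}" unless path == "file1.pdf"
  { text_content: "stub text" }
end

results.each { |r| puts "#{r[:file]}: #{r[:success] ? 'ok' : r[:error]}" }
```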
195
+ ### Document Conversion
196
+
197
+ ```ruby
198
+ # Convert to different formats
199
+ text_content = UniversalDocumentProcessor.convert('document.pdf', :text)
200
+ json_data = UniversalDocumentProcessor.convert('document.docx', :json)
201
+ ```
202
+
203
+ ## 📋 Detailed Examples
204
+
205
+ ### Processing PDF Documents
206
+
207
+ ```ruby
208
+ # Extract comprehensive PDF information
209
+ result = UniversalDocumentProcessor.process('report.pdf')
210
+
211
+ # Access specific data
212
+ puts "Title: #{result[:metadata][:title]}"
213
+ puts "Pages: #{result[:metadata][:page_count]}"
214
+ puts "Images found: #{result[:images].length}"
215
+ puts "Tables found: #{result[:tables].length}"
216
+
217
+ # Get text content
218
+ full_text = result[:text_content]
219
+ ```
220
+
221
+ ### Processing Excel Spreadsheets
222
+
223
+ ```ruby
224
+ # Extract data from Excel files
225
+ result = UniversalDocumentProcessor.process('data.xlsx')
226
+
227
+ # Access spreadsheet-specific metadata
228
+ metadata = result[:metadata]
229
+ puts "Worksheets: #{metadata[:worksheet_count]}"
230
+ puts "Has formulas: #{metadata[:has_formulas]}"
231
+
232
+ # Extract tables/data
233
+ tables = result[:tables]
234
+ tables.each_with_index do |table, index|
235
+ puts "Table #{index + 1}: #{table[:rows]} rows"
236
+ end
237
+ ```
238
+
239
+ ### Processing Word Documents
240
+
241
+ ```ruby
242
+ # Extract from Word documents
243
+ result = UniversalDocumentProcessor.process('report.docx')
244
+
245
+ # Get document structure
246
+ metadata = result[:metadata]
247
+ puts "Word count: #{metadata[:word_count]}"
248
+ puts "Paragraph count: #{metadata[:paragraph_count]}"
249
+
250
+ # Extract embedded images
251
+ images = result[:images]
252
+ puts "Found #{images.length} embedded images"
253
+ ```
254
+
255
+ ### Processing Japanese Documents & Filenames
256
+
257
+ ```ruby
258
+ # Process Japanese content
259
+ japanese_doc = "こんにちは 世界！ Hello World!"
260
+ analysis = UniversalDocumentProcessor.analyze_text_quality(japanese_doc)
261
+
262
+ # Japanese-specific information
263
+ japanese_info = analysis[:japanese_analysis]
264
+ puts "Contains Japanese: #{japanese_info[:japanese]}"
265
+ puts "Scripts found: #{japanese_info[:scripts].join(', ')}"
266
+ puts "Mixed with Latin: #{japanese_info[:mixed_with_latin]}"
267
+
268
+ # Character statistics
269
+ stats = analysis[:statistics]
270
+ puts "Hiragana: #{stats[:hiragana_chars]}"
271
+ puts "Katakana: #{stats[:katakana_chars]}"
272
+ puts "Kanji: #{stats[:kanji_chars]}"
273
+
274
+ # Japanese filename support
275
+ filename = "重要な資料_2024年度.pdf"
276
+ validation = UniversalDocumentProcessor.validate_filename(filename)
277
+ puts "Japanese filename: #{validation[:contains_japanese]}"
278
+ puts "Filename valid: #{validation[:valid]}"
279
+
280
+ # Safe filename generation
281
+ safe_name = UniversalDocumentProcessor.safe_filename("データファイル<重要>.xlsx")
282
+ puts "Safe filename: #{safe_name}" # => "データファイル_重要_.xlsx"
283
+
284
+ # Process documents with Japanese filenames
285
+ result = UniversalDocumentProcessor.process("日本語ファイル.pdf")
286
+ puts "Original filename: #{result[:filename_info][:original_filename]}"
287
+ puts "Contains Japanese: #{result[:filename_info][:contains_japanese]}"
288
+ puts "Japanese parts: #{result[:filename_info][:japanese_parts]}"
289
+ ```
290
+
291
+ ## 🤖 AI Agent Integration
292
+
293
+ The gem includes a powerful AI agent that provides intelligent document analysis and interaction capabilities using OpenAI's GPT models:
294
+
295
+ ### Quick AI Analysis
296
+
297
+ ```ruby
298
+ # Set your OpenAI API key
299
+ ENV['OPENAI_API_KEY'] = 'your-api-key-here'
300
+
301
+ # Quick AI-powered analysis
302
+ summary = UniversalDocumentProcessor.ai_summarize('document.pdf', length: :short)
303
+ insights = UniversalDocumentProcessor.ai_insights('document.pdf')
304
+ classification = UniversalDocumentProcessor.ai_classify('document.pdf')
305
+
306
+ # Extract specific information
307
+ key_info = UniversalDocumentProcessor.ai_extract_info('document.pdf', ['dates', 'names', 'amounts'])
308
+ action_items = UniversalDocumentProcessor.ai_action_items('document.pdf')
309
+
310
+ # Translate documents (great for Japanese documents)
311
+ translation = UniversalDocumentProcessor.ai_translate('日本語文書.pdf', 'English')
312
+ ```
313
+
314
+ ### Interactive AI Agent
315
+
316
+ ```ruby
317
+ # Create a persistent AI agent for conversations
318
+ agent = UniversalDocumentProcessor.create_ai_agent(
319
+ model: 'gpt-4',
320
+ temperature: 0.7,
321
+ max_history: 10
322
+ )
323
+
324
+ # Process document and start conversation
325
+ document = UniversalDocumentProcessor::Document.new('report.pdf')
326
+
327
+ # Ask questions about the document
328
+ response1 = document.ai_chat('What is this document about?')
329
+ response2 = document.ai_chat('What are the key financial figures?')
330
+ response3 = document.ai_chat('Based on our discussion, what should I focus on?')
331
+
332
+ # Get conversation summary
333
+ summary = agent.conversation_summary
334
+ ```
335
+
336
+ ### Advanced AI Features
337
+
338
+ ```ruby
339
+ # Compare multiple documents
340
+ comparison = UniversalDocumentProcessor.ai_compare(
341
+ ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'],
342
+ :content # or :themes, :structure, etc.
343
+ )
344
+
345
+ # Document-specific AI analysis
346
+ document = UniversalDocumentProcessor::Document.new('business_plan.pdf')
347
+
348
+ analysis = document.ai_analyze('What are the growth projections?')
349
+ insights = document.ai_insights
350
+ classification = document.ai_classify
351
+ action_items = document.ai_action_items
352
+
353
+ # Japanese document support
354
+ japanese_doc = UniversalDocumentProcessor::Document.new('プロジェクト計画書.pdf')
355
+ translation = japanese_doc.ai_translate('English')
356
+ summary = japanese_doc.ai_summarize(length: :medium)
357
+ ```
358
+
359
+ ### AI Configuration Options
360
+
361
+ ```ruby
362
+ # Custom AI agent configuration
363
+ agent = UniversalDocumentProcessor.create_ai_agent(
364
+ api_key: 'your-openai-key', # OpenAI API key
365
+ model: 'gpt-4', # Model to use (gpt-4, gpt-3.5-turbo)
366
+ temperature: 0.3, # Response creativity (0.0-1.0)
367
+ max_history: 20, # Conversation memory length
368
+ base_url: 'https://api.openai.com/v1' # Custom API endpoint
369
+ )
370
+ ```
371
+
372
+ ## 🎌 Japanese Filename Support
373
+
374
+ The gem provides comprehensive support for Japanese filenames across all operating systems:
375
+
376
+ ### Basic Filename Validation
377
+
378
+ ```ruby
379
+ # Check if filename contains Japanese characters
380
+ UniversalDocumentProcessor.japanese_filename?("日本語ファイル.pdf")
381
+ # => true
382
+
383
+ # Validate Japanese filename
384
+ validation = UniversalDocumentProcessor.validate_filename("こんにちは世界.docx")
385
+ puts validation[:valid] # => true
386
+ puts validation[:contains_japanese] # => true
387
+ puts validation[:japanese_parts] # => {hiragana: ["こ","ん","に","ち","は"], katakana: [], kanji: ["世","界"]}
388
+
389
+ # Handle mixed language filenames
390
+ validation = UniversalDocumentProcessor.validate_filename("Project_プロジェクト_2024.xlsx")
391
+ puts validation[:contains_japanese] # => true
392
+ ```
393
+
394
+ ### Safe Filename Generation
395
+
396
+ ```ruby
397
+ # Create cross-platform safe filenames
398
+ problematic_name = "データファイル<重要>:管理.xlsx"
399
+ safe_name = UniversalDocumentProcessor.safe_filename(problematic_name)
400
+ puts safe_name # => "データファイル_重要__管理.xlsx"
401
+
402
+ # Handle extremely long Japanese filenames
403
+ long_name = "非常に長いファイル名" * 20 + ".pdf"
404
+ safe_name = UniversalDocumentProcessor.safe_filename(long_name)
405
+ puts safe_name.bytesize <= 200 # => true (safely truncated)
406
+ ```
407
+
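The two behaviours shown above (replacing reserved characters and truncating by byte length) can be sketched as follows. `safe_filename_sketch` is a hypothetical re-implementation for illustration only. The reserved set is the characters Windows forbids in filenames plus ASCII control characters, and the 200-byte limit is taken from the example above. Truncation uses `String#byteslice` followed by `String#scrub` so that a multi-byte character cut in half is dropped rather than left as invalid UTF-8.

```ruby
# Characters Windows rejects in filenames, plus ASCII control characters.
INVALID_FILENAME_CHARS = /[<>:"\/\\|?*\x00-\x1F]/
MAX_FILENAME_BYTES = 200 # limit assumed from the example above

# Hypothetical re-implementation, not the gem's code.
def safe_filename_sketch(name, max_bytes: MAX_FILENAME_BYTES)
  sanitized = name.gsub(INVALID_FILENAME_CHARS, "_")
  return sanitized if sanitized.bytesize <= max_bytes

  ext  = File.extname(sanitized)
  base = File.basename(sanitized, ext)
  # byteslice may split a multi-byte character; scrub removes the broken tail.
  base.byteslice(0, max_bytes - ext.bytesize).scrub("") + ext
end

short_name = safe_filename_sketch("データファイル<重要>:管理.xlsx")
long_name  = safe_filename_sketch("非常に長いファイル名" * 20 + ".pdf")
puts short_name
puts long_name.bytesize
```

Measuring in bytes rather than characters matters here because each Japanese character typically occupies three bytes in UTF-8, while filesystem limits are usually expressed in bytes.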
408
+ ### Encoding Analysis & Normalization
409
+
410
+ ```ruby
411
+ # Analyze filename encoding
412
+ filename = "データファイル.pdf"
413
+ analysis = UniversalDocumentProcessor::Utils::JapaneseFilenameHandler.analyze_filename_encoding(filename)
414
+ puts "Original encoding: #{analysis[:original_encoding]}"
415
+ puts "Recommended encoding: #{analysis[:recommended_encoding]}"
416
+
417
+ # Normalize filename to UTF-8
418
+ normalized = UniversalDocumentProcessor.normalize_filename(filename)
419
+ puts normalized.encoding # => UTF-8
420
+ ```
421
+
422
+ ### Document Processing with Japanese Filenames
423
+
424
+ ```ruby
425
+ # Process documents with Japanese filenames
426
+ result = UniversalDocumentProcessor.process("重要な会議資料.pdf")
427
+
428
+ # Access filename information
429
+ filename_info = result[:filename_info]
430
+ puts "Original: #{filename_info[:original_filename]}"
431
+ puts "Japanese: #{filename_info[:contains_japanese]}"
432
+ puts "Validation: #{filename_info[:validation][:valid]}"
433
+
434
+ # Japanese character breakdown
435
+ japanese_parts = filename_info[:japanese_parts]
436
+ puts "Hiragana: #{japanese_parts[:hiragana]&.join('')}"
437
+ puts "Katakana: #{japanese_parts[:katakana]&.join('')}"
438
+ puts "Kanji: #{japanese_parts[:kanji]&.join('')}"
439
+ ```
440
+
441
+ ### Cross-Platform Compatibility
442
+
443
+ ```ruby
444
+ # Test filename compatibility across platforms
445
+ test_files = [
446
+ "ๆ—ฅๆœฌ่ชžใƒ•ใ‚กใ‚คใƒซ.pdf", # Standard Japanese
447
+ "ใ“ใ‚“ใซใกใฏworld.docx", # Mixed Japanese-English
448
+ "ใƒ‡ใƒผใ‚ฟ_analysis.xlsx", # Japanese with underscore
449
+ "ไผš่ญฐ่ญฐไบ‹้Œฒ๏ผˆ้‡่ฆ๏ผ‰.txt" # Japanese with parentheses
450
+ ]
451
+
452
+ test_files.each do |filename|
453
+ validation = UniversalDocumentProcessor.validate_filename(filename)
454
+ safe_version = UniversalDocumentProcessor.safe_filename(filename)
455
+
456
+ puts "#{filename}:"
457
+ puts " Windows compatible: #{validation[:valid]}"
458
+ puts " Safe version: #{safe_version}"
459
+ puts " Byte size: #{safe_version.bytesize} bytes"
460
+ end
461
+ ```
462
+
463
+ ## 🔍 Character Validation Features
464
+
465
+ ### Detecting Invalid Characters
466
+
467
+ ```ruby
468
+ text_with_issues = "Hello\x00World\x01こんにちは"
469
+ analysis = UniversalDocumentProcessor.analyze_text_quality(text_with_issues)
470
+
471
+ # Check for specific issues
472
+ puts "Has null bytes: #{analysis[:has_null_bytes]}"
473
+ puts "Has control chars: #{analysis[:has_control_chars]}"
474
+ puts "Valid encoding: #{analysis[:valid_encoding]}"
475
+
476
+ # Get detailed issue report
477
+ issues = analysis[:character_issues]
478
+ issues.each do |issue|
479
+ puts "#{issue[:type]}: #{issue[:message]} (#{issue[:severity]})"
480
+ end
481
+ ```
482
+
483
+ ### Text Repair Strategies
484
+
485
+ ```ruby
486
+ corrupted_text = "Hello\x00World\x01こんにちは\uFFFD"
487
+
488
+ # Conservative repair (recommended)
489
+ clean = UniversalDocumentProcessor::Processors::CharacterValidator.repair_text(
490
+ corrupted_text, :conservative
491
+ )
492
+
493
+ # Aggressive repair (removes all non-printable)
494
+ clean = UniversalDocumentProcessor::Processors::CharacterValidator.repair_text(
495
+ corrupted_text, :aggressive
496
+ )
497
+
498
+ # Replace strategy (replaces with safe alternatives)
499
+ clean = UniversalDocumentProcessor::Processors::CharacterValidator.repair_text(
500
+ corrupted_text, :replace
501
+ )
502
+ ```
503
+
504
+ ## 🎛️ Configuration
505
+
506
+ ### Checking Available Features
507
+
508
+ ```ruby
509
+ # Check what features are available based on installed gems
510
+ features = UniversalDocumentProcessor.available_features
511
+ puts "Available features: #{features.join(', ')}"
512
+
513
+ # Check specific dependencies
514
+ puts "PDF processing: #{UniversalDocumentProcessor.dependency_available?(:pdf_reader)}"
515
+ puts "Word processing: #{UniversalDocumentProcessor.dependency_available?(:docx)}"
516
+ puts "Excel processing: #{UniversalDocumentProcessor.dependency_available?(:roo)}"
517
+ ```
518
+
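A common Ruby pattern for this kind of optional-dependency check is to attempt a `require` and treat `LoadError` as "not installed". The sketch below shows that pattern. `dependency_available_sketch?` and the `OPTIONAL_LIBRARIES` mapping are hypothetical names for illustration, and whether the gem detects dependencies this way is an assumption.

```ruby
# Hypothetical mapping from feature key to the path passed to `require`.
OPTIONAL_LIBRARIES = {
  pdf_reader: "pdf/reader",
  docx: "docx",
  roo: "roo"
}.freeze

# Returns true when the library can be loaded, false when it is not installed.
def dependency_available_sketch?(library)
  require library
  true
rescue LoadError
  false
end

OPTIONAL_LIBRARIES.each do |feature, library|
  puts "#{feature}: #{dependency_available_sketch?(library)}"
end
```

Note that this check loads the library as a side effect when it is present, which is usually acceptable since the caller is about to use it anyway.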
519
+ ### Custom Options
520
+
521
+ ```ruby
522
+ # Process with custom options
523
+ options = {
524
+ extract_images: true,
525
+ extract_tables: true,
526
+ clean_text: true,
527
+ validate_encoding: true
528
+ }
529
+
530
+ result = UniversalDocumentProcessor.process('document.pdf', options)
531
+ ```
532
+
533
+ ## 🏗️ Architecture
534
+
535
+ The gem uses a modular processor-based architecture:
536
+
537
+ - **BaseProcessor**: Common functionality and interface
538
+ - **PdfProcessor**: Advanced PDF processing
539
+ - **WordProcessor**: Microsoft Word documents
540
+ - **ExcelProcessor**: Spreadsheet processing
541
+ - **PowerpointProcessor**: Presentation processing
542
+ - **ImageProcessor**: Image analysis and OCR
543
+ - **ArchiveProcessor**: Compressed file handling
544
+ - **TextProcessor**: Plain text and markup files
545
+ - **CharacterValidator**: Text quality and encoding validation
546
+
547
+ ## 🌍 Multi-language Support
548
+
549
+ ### Supported Encodings
550
+ - **UTF-8** (recommended)
551
+ - **Shift_JIS** (Japanese)
552
+ - **EUC-JP** (Japanese)
553
+ - **ISO-8859-1** (Latin-1)
554
+ - **Windows-1252**
555
+ - **ASCII**
556
+
557
+ ### Supported Scripts
558
+ - **Latin** (English, European languages)
559
+ - **Japanese** (Hiragana, Katakana, Kanji)
560
+ - **Chinese** (Simplified/Traditional)
561
+ - **Korean** (Hangul)
562
+ - **Cyrillic** (Russian, etc.)
563
+ - **Arabic**
564
+ - **Hebrew**
565
+
566
+ ## ⚡ Performance
567
+
568
+ ### Benchmarks (Average)
569
+ - **Small PDF (1-10 pages)**: 0.5-2 seconds
570
+ - **Large PDF (100+ pages)**: 5-15 seconds
571
+ - **Word Document**: 0.3-1 second
572
+ - **Excel Spreadsheet**: 0.5-3 seconds
573
+ - **PowerPoint**: 1-5 seconds
574
+ - **Image with OCR**: 2-10 seconds
575
+
576
+ ### Best Practices
577
+ 1. Use **batch processing** for multiple files
578
+ 2. Process files **asynchronously** for better UX
579
+ 3. Implement **caching** for frequently accessed documents
580
+ 4. Set **appropriate timeouts** for large files
581
+ 5. Monitor **memory usage** in production
582
+
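For the timeout recommendation, one option is Ruby's standard `Timeout` module. The sketch below wraps an arbitrary block in a time limit and returns a result hash instead of raising. `with_time_limit` and its return shape are hypothetical. `Timeout` interrupts the running thread, which can be unsafe around some native extensions and I/O, so running heavy processing in a separate worker process is often a more robust choice.

```ruby
require "timeout"

# Hypothetical helper: runs the block, giving up after `seconds`.
def with_time_limit(seconds)
  { success: true, result: Timeout.timeout(seconds) { yield } }
rescue Timeout::Error
  { success: false, error: "processing exceeded #{seconds}s" }
end

# The blocks stand in for real document-processing calls.
fast = with_time_limit(1) { "extracted text" }
slow = with_time_limit(0.05) { sleep 1 }

puts fast.inspect
puts slow.inspect
```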
583
+ ## 🔒 Security
584
+
585
+ ### File Validation
586
+ - MIME type verification prevents file spoofing
587
+ - File size limits prevent resource exhaustion
588
+ - Content scanning for malicious payloads
589
+ - Sandbox processing for untrusted files
590
+
591
+ ### Best Practices
592
+ 1. Always **validate uploaded files** before processing
593
+ 2. Set **reasonable limits** on file size and processing time
594
+ 3. Use **temporary directories** with proper cleanup
595
+ 4. **Log processing activities** for audit trails
596
+ 5. Handle **errors gracefully** without exposing system info
597
+
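The first two practices can be combined into a small pre-flight check that runs before any document is handed to a processor. The sketch below enforces a size limit and compares the file's leading "magic" bytes against the type the client claimed. All names (`validate_upload`, `sniff_content_type`, `MAX_UPLOAD_BYTES`) and the 50 MB limit are assumptions for illustration, and only three well-known signatures are included.

```ruby
require "tempfile"

MAX_UPLOAD_BYTES = 50 * 1024 * 1024 # hypothetical limit

# Well-known file signatures. ZIP also covers DOCX/XLSX/PPTX containers.
MAGIC_BYTES = {
  "application/pdf" => "%PDF-".b,
  "image/png"       => "\x89PNG\r\n\x1A\n".b,
  "application/zip" => "PK\x03\x04".b
}.freeze

# Returns the sniffed content type, or nil when no signature matches.
def sniff_content_type(path)
  header = File.binread(path, 8) || "".b # binread returns nil for empty files
  match = MAGIC_BYTES.find { |_type, magic| header.start_with?(magic) }
  match && match.first
end

# Returns [ok, reason]; reason is nil when the file passes both checks.
def validate_upload(path, expected_type:, max_bytes: MAX_UPLOAD_BYTES)
  return [false, "file too large"] if File.size(path) > max_bytes
  unless sniff_content_type(path) == expected_type
    return [false, "content does not match #{expected_type}"]
  end
  [true, nil]
end

check = Tempfile.create(["sample", ".pdf"]) do |file|
  file.binmode
  file.write("%PDF-1.7\n")
  file.flush
  validate_upload(file.path, expected_type: "application/pdf")
end
puts check.inspect
```

A signature check of this kind only catches mislabelled files; it does not replace content scanning or sandboxing for untrusted input.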
598
+ ## 🧪 Rails Integration
599
+
600
+ ### Controller Example
601
+
602
+ ```ruby
603
+ class DocumentsController < ApplicationController
604
+ def create
605
+ uploaded_file = params[:file]
606
+
607
+ # Process the document
608
+ result = UniversalDocumentProcessor.process(uploaded_file.tempfile.path)
609
+
610
+ # Store in database
611
+ @document = Document.create!(
612
+ filename: uploaded_file.original_filename,
613
+ content_type: result[:content_type],
614
+ text_content: result[:text_content],
615
+ metadata: result[:metadata]
616
+ )
617
+
618
+ render json: { success: true, document: @document }
619
+ rescue UniversalDocumentProcessor::Error => e
620
+ render json: { success: false, error: e.message }, status: 422
621
+ end
622
+ end
623
+ ```
624
+
625
+ ### Background Job Example
626
+
627
+ ```ruby
628
+ class DocumentProcessorJob < ApplicationJob
629
+ def perform(document_id)
630
+ document = Document.find(document_id)
631
+
632
+ result = UniversalDocumentProcessor.process(document.file_path)
633
+
634
+ document.update!(
635
+ text_content: result[:text_content],
636
+ metadata: result[:metadata],
637
+ processed_at: Time.current
638
+ )
639
+ end
640
+ end
641
+ ```
642
+
643
+ ## 🚨 Error Handling
644
+
645
+ The gem provides comprehensive error handling with custom exceptions:
646
+
647
+ ```ruby
648
+ begin
649
+ result = UniversalDocumentProcessor.process('document.pdf')
650
+ rescue UniversalDocumentProcessor::UnsupportedFormatError => e
651
+ # Handle unsupported file format
652
+ rescue UniversalDocumentProcessor::ProcessingError => e
653
+ # Handle processing failure
654
+ rescue UniversalDocumentProcessor::DependencyMissingError => e
655
+ # Handle missing optional dependency
656
+ rescue UniversalDocumentProcessor::Error => e
657
+ # Handle general gem errors
658
+ end
659
+ ```
660
+
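The order of the `rescue` clauses above matters if the specific errors inherit from a common base class, because Ruby uses the first clause that matches. The sketch below demonstrates this with a stand-in hierarchy. The `DocProcessorSketch` module is hypothetical, and the assumption that the gem's specific errors subclass `UniversalDocumentProcessor::Error` is inferred from the examples rather than confirmed.

```ruby
# Stand-in hierarchy mirroring the error names used above (assumed structure).
module DocProcessorSketch
  class Error < StandardError; end
  class UnsupportedFormatError < Error; end
  class ProcessingError < Error; end
  class DependencyMissingError < Error; end
end

# Specific clauses come first; the base class acts as the catch-all.
def classify_failure
  yield
  :ok
rescue DocProcessorSketch::UnsupportedFormatError
  :unsupported_format
rescue DocProcessorSketch::DependencyMissingError
  :missing_dependency
rescue DocProcessorSketch::Error
  :general_error
end

missing = classify_failure { raise DocProcessorSketch::DependencyMissingError, "pdf-reader not installed" }
general = classify_failure { raise DocProcessorSketch::ProcessingError, "corrupt file" }
puts missing
puts general
```

If the base-class clause were listed first, it would swallow every subclass and the more specific handlers would never run.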
661
+ ## 🧪 Testing
662
+
663
+ Run the test suite:
664
+
665
+ ```bash
666
+ bundle exec rspec
667
+ ```
668
+
669
+ Run with coverage:
670
+
671
+ ```bash
672
+ COVERAGE=true bundle exec rspec
673
+ ```
674
+
675
+ ## 🤝 Contributing
676
+
677
+ 1. Fork the repository
678
+ 2. Create your feature branch (`git checkout -b feature/amazing-feature`)
679
+ 3. Commit your changes (`git commit -am 'Add amazing feature'`)
680
+ 4. Push to the branch (`git push origin feature/amazing-feature`)
681
+ 5. Create a Pull Request
682
+
683
+ ### Development Setup
684
+
685
+ ```bash
686
+ git clone https://github.com/yourusername/universal_document_processor.git
687
+ cd universal_document_processor
688
+ bundle install
689
+ bundle exec rspec
690
+ ```
691
+
692
+ ## 📝 Changelog
693
+
694
+ ### Version 1.0.0
695
+ - Initial release
696
+ - Support for PDF, Word, Excel, PowerPoint, images, archives
697
+ - Character validation and cleaning
698
+ - Japanese text support
699
+ - Multi-encoding support
700
+ - Batch processing capabilities
701
+
702
+ ## 🆘 Support
703
+
704
+ - **Issues**: [GitHub Issues](https://github.com/yourusername/universal_document_processor/issues)
705
+ - **Documentation**: [Wiki](https://github.com/yourusername/universal_document_processor/wiki)
706
+ - **Email**: vikas.v.patil1696@gmail.com
707
+
708
+ ## 📄 License
709
+
710
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
711
+
712
+ ## 👨‍💻 Author
713
+
714
+ **Vikas Patil**
715
+ - Email: vikas.v.patil1696@gmail.com
716
+ - GitHub: [@vpatil160](https://github.com/vpatil160)
717
+
718
+ ## ๐Ÿ™ Acknowledgments
719
+
720
+ - Built with Ruby and love ❤️
721
+ - Thanks to all the amazing open source libraries this gem depends on
722
+ - Special thanks to the Ruby community for continuous inspiration
723
+
724
+ ---
725
+
726
+ **Made with ❤️ for the Ruby community**