universal_document_processor 1.0.0
- checksums.yaml +7 -0
- data/AI_USAGE_GUIDE.md +404 -0
- data/CHANGELOG.md +67 -0
- data/GEM_RELEASE_GUIDE.md +288 -0
- data/Gemfile +27 -0
- data/LICENSE +21 -0
- data/README.md +726 -0
- data/Rakefile +36 -0
- data/lib/universal_document_processor/ai_agent.rb +491 -0
- data/lib/universal_document_processor/document.rb +225 -0
- data/lib/universal_document_processor/processors/archive_processor.rb +290 -0
- data/lib/universal_document_processor/processors/base_processor.rb +58 -0
- data/lib/universal_document_processor/processors/character_validator.rb +283 -0
- data/lib/universal_document_processor/processors/excel_processor.rb +219 -0
- data/lib/universal_document_processor/processors/image_processor.rb +172 -0
- data/lib/universal_document_processor/processors/pdf_processor.rb +105 -0
- data/lib/universal_document_processor/processors/powerpoint_processor.rb +214 -0
- data/lib/universal_document_processor/processors/text_processor.rb +360 -0
- data/lib/universal_document_processor/processors/word_processor.rb +137 -0
- data/lib/universal_document_processor/utils/file_detector.rb +83 -0
- data/lib/universal_document_processor/utils/japanese_filename_handler.rb +205 -0
- data/lib/universal_document_processor/version.rb +3 -0
- data/lib/universal_document_processor.rb +223 -0
- metadata +198 -0
data/README.md
ADDED
@@ -0,0 +1,726 @@
|
|
1
|
+
# Universal Document Processor
|
2
|
+
|
3
|
+
[Gem Version](https://badge.fury.io/rb/universal_document_processor)
|
4
|
+
[License: MIT](https://opensource.org/licenses/MIT)
|
5
|
+
[Ruby](https://www.ruby-lang.org/)
|
6
|
+
|
7
|
+
A comprehensive Ruby gem that provides unified document processing capabilities across multiple file formats. Extract text, metadata, images, and tables from PDFs, Word documents, Excel spreadsheets, PowerPoint presentations, images, archives, and more with a single, consistent API.
|
8
|
+
|
9
|
+
## Features
|
10
|
+
|
11
|
+
### **Unified Document Processing**
|
12
|
+
- **Single API** for all document types
|
13
|
+
- **Intelligent format detection** and processing
|
14
|
+
- **Production-ready** error handling and fallbacks
|
15
|
+
- **Extensible architecture** for future enhancements
|
16
|
+
|
17
|
+
### **Supported File Formats**
|
18
|
+
- **Documents**: PDF, DOC, DOCX, RTF
|
19
|
+
- **Spreadsheets**: XLS, XLSX, CSV
|
20
|
+
- **Presentations**: PPT, PPTX
|
21
|
+
- **Images**: JPG, PNG, GIF, BMP, TIFF
|
22
|
+
- **Archives**: ZIP, RAR, 7Z
|
23
|
+
- **Text**: TXT, HTML, XML, JSON, Markdown
|
24
|
+
|
25
|
+
### **Advanced Content Extraction**
|
26
|
+
- **Text Extraction**: Full text content from any supported format
|
27
|
+
- **Metadata Extraction**: File properties, author, creation date, etc.
|
28
|
+
- **Image Extraction**: Embedded images from documents
|
29
|
+
- **Table Detection**: Structured data extraction
|
30
|
+
- **Character Validation**: Invalid character detection and cleaning
|
31
|
+
- **Multi-language Support**: Full Unicode support including Japanese (日本語)
|
32
|
+
|
33
|
+
### **Character & Encoding Support**
|
34
|
+
- **Smart encoding detection** (UTF-8, Shift_JIS, EUC-JP, ISO-8859-1)
|
35
|
+
- **Invalid character detection** and cleaning
|
36
|
+
- **Japanese text support** (Hiragana, Katakana, Kanji)
|
37
|
+
- **Control character handling**
|
38
|
+
- **Text repair and normalization**
|
39
|
+
|
40
|
+
## Installation
|
41
|
+
|
42
|
+
Add this line to your application's Gemfile:
|
43
|
+
|
44
|
+
```ruby
|
45
|
+
gem 'universal_document_processor'
|
46
|
+
```
|
47
|
+
|
48
|
+
And then execute:
|
49
|
+
```bash
|
50
|
+
bundle install
|
51
|
+
```
|
52
|
+
|
53
|
+
Or install it yourself as:
|
54
|
+
```bash
|
55
|
+
gem install universal_document_processor
|
56
|
+
```
|
57
|
+
|
58
|
+
### Optional Dependencies
|
59
|
+
|
60
|
+
For enhanced functionality, install additional gems:
|
61
|
+
|
62
|
+
```ruby
|
63
|
+
# PDF processing
|
64
|
+
gem 'pdf-reader', '~> 2.0'
|
65
|
+
gem 'prawn', '~> 2.4'
|
66
|
+
|
67
|
+
# Microsoft Office documents
|
68
|
+
gem 'docx', '~> 0.8'
|
69
|
+
gem 'roo', '~> 2.8'
|
70
|
+
|
71
|
+
# Image processing
|
72
|
+
gem 'mini_magick', '~> 4.11'
|
73
|
+
|
74
|
+
# Universal text extraction fallback
|
75
|
+
gem 'yomu', '~> 0.2'
|
76
|
+
```
|
77
|
+
|
78
|
+
## Quick Start
|
79
|
+
|
80
|
+
### Basic Usage
|
81
|
+
|
82
|
+
```ruby
|
83
|
+
require 'universal_document_processor'
|
84
|
+
|
85
|
+
# Process any document
|
86
|
+
result = UniversalDocumentProcessor.process('document.pdf')
|
87
|
+
|
88
|
+
# Extract text only
|
89
|
+
text = UniversalDocumentProcessor.extract_text('document.docx')
|
90
|
+
|
91
|
+
# Get metadata only
|
92
|
+
metadata = UniversalDocumentProcessor.get_metadata('spreadsheet.xlsx')
|
93
|
+
```
|
94
|
+
|
95
|
+
### Processing Result
|
96
|
+
|
97
|
+
```ruby
|
98
|
+
result = UniversalDocumentProcessor.process('document.pdf')
|
99
|
+
|
100
|
+
# Returns comprehensive information:
|
101
|
+
{
|
102
|
+
file_path: "document.pdf",
|
103
|
+
content_type: "application/pdf",
|
104
|
+
file_size: 1024576,
|
105
|
+
text_content: "Extracted text content...",
|
106
|
+
metadata: {
|
107
|
+
title: "Document Title",
|
108
|
+
author: "Author Name",
|
109
|
+
page_count: 25
|
110
|
+
},
|
111
|
+
images: [...],
|
112
|
+
tables: [...],
|
113
|
+
processed_at: 2024-01-15 10:30:00 UTC
|
114
|
+
}
|
115
|
+
```
|
116
|
+
|
117
|
+
## Advanced Usage
|
118
|
+
|
119
|
+
### Character Validation and Cleaning
|
120
|
+
|
121
|
+
```ruby
|
122
|
+
# Analyze text quality and character issues
|
123
|
+
analysis = UniversalDocumentProcessor.analyze_text_quality(text)
|
124
|
+
|
125
|
+
# Returns:
|
126
|
+
{
|
127
|
+
encoding: "UTF-8",
|
128
|
+
valid_encoding: true,
|
129
|
+
has_invalid_chars: false,
|
130
|
+
has_control_chars: true,
|
131
|
+
character_issues: [...],
|
132
|
+
statistics: {
|
133
|
+
total_chars: 1500,
|
134
|
+
japanese_chars: 250,
|
135
|
+
hiragana_chars: 100,
|
136
|
+
katakana_chars: 50,
|
137
|
+
kanji_chars: 100
|
138
|
+
},
|
139
|
+
japanese_analysis: {
|
140
|
+
japanese: true,
|
141
|
+
scripts: ['hiragana', 'katakana', 'kanji'],
|
142
|
+
mixed_with_latin: true
|
143
|
+
}
|
144
|
+
}
|
145
|
+
```
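The per-script counts in the `statistics` hash can be approximated with Unicode script properties in Ruby's regular expressions. The sketch below is an illustration under stated assumptions, not the gem's implementation: `japanese_script_stats` is a hypothetical helper, and `\p{Han}` also matches Chinese text, so a kanji count alone does not prove that a string is Japanese.

```ruby
# Hypothetical helper (not part of the gem): counts characters per
# Japanese script using Unicode script properties.
def japanese_script_stats(text)
  hiragana = text.scan(/\p{Hiragana}/).length
  katakana = text.scan(/\p{Katakana}/).length
  kanji    = text.scan(/\p{Han}/).length # Han is shared with Chinese
  japanese = hiragana + katakana + kanji

  {
    total_chars: text.length,
    japanese_chars: japanese,
    hiragana_chars: hiragana,
    katakana_chars: katakana,
    kanji_chars: kanji,
    mixed_with_latin: japanese.positive? && text.match?(/\p{Latin}/)
  }
end

stats = japanese_script_stats("こんにちは 世界 Hello")
puts "Hiragana: #{stats[:hiragana_chars]}, Kanji: #{stats[:kanji_chars]}"
```

Punctuation shared between scripts, such as the prolonged sound mark, is classified as Common by Unicode, so these counts can be slightly lower than a reader's intuition suggests.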
|
146
|
+
|
147
|
+
### Text Cleaning
|
148
|
+
|
149
|
+
```ruby
|
150
|
+
# Clean text by removing invalid characters
|
151
|
+
clean_text = UniversalDocumentProcessor.clean_text(corrupted_text, {
|
152
|
+
remove_null_bytes: true,
|
153
|
+
remove_control_chars: true,
|
154
|
+
normalize_whitespace: true
|
155
|
+
})
|
156
|
+
```
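To make the three options concrete, here is a minimal standard-library sketch of what such a cleaning pass could look like. `clean_text_sketch` is a hypothetical stand-in; the option names mirror the README, but the exact behavior of the gem's `clean_text` is an assumption.

```ruby
# Hypothetical stand-in for a cleaning pass; not the gem's implementation.
def clean_text_sketch(text, remove_null_bytes: true, remove_control_chars: true, normalize_whitespace: true)
  cleaned = text.scrub("") # drop byte sequences that are invalid in the string's encoding
  cleaned = cleaned.gsub("\u0000", "") if remove_null_bytes
  # Strip control characters but keep tab, newline, and carriage return.
  cleaned = cleaned.gsub(/[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]/, "") if remove_control_chars
  # Collapse runs of spaces and tabs, then trim both ends.
  cleaned = cleaned.gsub(/[ \t]+/, " ").strip if normalize_whitespace
  cleaned
end

puts clean_text_sketch("Hello\x00World\x01  こんにちは ")
```

Line breaks are deliberately preserved here so that paragraph structure survives cleaning.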
|
157
|
+
|
158
|
+
### File Encoding Validation
|
159
|
+
|
160
|
+
```ruby
|
161
|
+
# Validate file encoding (supports Japanese encodings)
|
162
|
+
validation = UniversalDocumentProcessor.validate_file('japanese_document.txt')
|
163
|
+
|
164
|
+
# Returns:
|
165
|
+
{
|
166
|
+
detected_encoding: "Shift_JIS",
|
167
|
+
valid: true,
|
168
|
+
content: "こんにちは",
|
169
|
+
analysis: {...}
|
170
|
+
}
|
171
|
+
```
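One common way to guess a file's encoding without extra gems is to try candidate encodings in order and keep the first one for which the bytes are valid. The sketch below assumes that approach; `detect_encoding_sketch` and `to_utf8_sketch` are hypothetical names, and the gem may detect encodings differently.

```ruby
# Hypothetical trial-based detection; the candidate order is an assumption.
def detect_encoding_sketch(bytes, candidates: [Encoding::UTF_8, Encoding::Shift_JIS, Encoding::EUC_JP])
  candidates.find { |enc| bytes.dup.force_encoding(enc).valid_encoding? }
end

# Converts raw bytes to UTF-8, falling back to ISO-8859-1, which accepts any byte.
def to_utf8_sketch(bytes)
  enc = detect_encoding_sketch(bytes) || Encoding::ISO_8859_1
  bytes.dup.force_encoding(enc).encode(Encoding::UTF_8)
end

raw = "こんにちは".encode("Shift_JIS").b # simulate bytes read from a Shift_JIS file
puts detect_encoding_sketch(raw)
puts to_utf8_sketch(raw)
```

Validity checks are only a heuristic: some EUC-JP byte sequences are also structurally valid Shift_JIS, so the candidate order decides ambiguous cases and the result should be treated as a best guess.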
|
172
|
+
|
173
|
+
### Japanese Text Support
|
174
|
+
|
175
|
+
```ruby
|
176
|
+
# Check if text contains Japanese
|
177
|
+
is_japanese = UniversalDocumentProcessor.japanese_text?("こんにちは World")
|
178
|
+
# => true
|
179
|
+
|
180
|
+
# Detailed Japanese analysis
|
181
|
+
japanese_info = UniversalDocumentProcessor.validate_japanese_text("こんにちは 世界")
|
182
|
+
# Returns detailed Japanese character analysis
|
183
|
+
```
|
184
|
+
|
185
|
+
### Batch Processing
|
186
|
+
|
187
|
+
```ruby
|
188
|
+
# Process multiple documents
|
189
|
+
file_paths = ['file1.pdf', 'file2.docx', 'file3.xlsx']
|
190
|
+
results = UniversalDocumentProcessor.batch_process(file_paths)
|
191
|
+
|
192
|
+
# Returns array with success/error status for each file
|
193
|
+
```
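The "success/error status for each file" shape can be pictured as a map over the paths that rescues failures per item, so one bad file does not abort the whole batch. This is a sketch of that pattern only; `batch_process_sketch` and the hash keys are hypothetical, and the gem's actual result format may differ.

```ruby
# Hypothetical per-file error isolation; the block stands in for the real processor.
def batch_process_sketch(paths)
  paths.map do |path|
    begin
      { file: path, success: true, result: yield(path) }
    rescue StandardError => e
      { file: path, success: false, error: e.message }
    end
  end
end

results = batch_process_sketch(%w[ok.txt broken.txt]) do |path|
  raise ArgumentError, "cannot read #{path}" if path.start_with?("broken")
  path.upcase
end

failed = results.reject { |r| r[:success] }
puts "#{failed.length} of #{results.length} files failed"
```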
|
194
|
+
|
195
|
+
### Document Conversion
|
196
|
+
|
197
|
+
```ruby
|
198
|
+
# Convert to different formats
|
199
|
+
text_content = UniversalDocumentProcessor.convert('document.pdf', :text)
|
200
|
+
json_data = UniversalDocumentProcessor.convert('document.docx', :json)
|
201
|
+
```
|
202
|
+
|
203
|
+
## Detailed Examples
|
204
|
+
|
205
|
+
### Processing PDF Documents
|
206
|
+
|
207
|
+
```ruby
|
208
|
+
# Extract comprehensive PDF information
|
209
|
+
result = UniversalDocumentProcessor.process('report.pdf')
|
210
|
+
|
211
|
+
# Access specific data
|
212
|
+
puts "Title: #{result[:metadata][:title]}"
|
213
|
+
puts "Pages: #{result[:metadata][:page_count]}"
|
214
|
+
puts "Images found: #{result[:images].length}"
|
215
|
+
puts "Tables found: #{result[:tables].length}"
|
216
|
+
|
217
|
+
# Get text content
|
218
|
+
full_text = result[:text_content]
|
219
|
+
```
|
220
|
+
|
221
|
+
### Processing Excel Spreadsheets
|
222
|
+
|
223
|
+
```ruby
|
224
|
+
# Extract data from Excel files
|
225
|
+
result = UniversalDocumentProcessor.process('data.xlsx')
|
226
|
+
|
227
|
+
# Access spreadsheet-specific metadata
|
228
|
+
metadata = result[:metadata]
|
229
|
+
puts "Worksheets: #{metadata[:worksheet_count]}"
|
230
|
+
puts "Has formulas: #{metadata[:has_formulas]}"
|
231
|
+
|
232
|
+
# Extract tables/data
|
233
|
+
tables = result[:tables]
|
234
|
+
tables.each_with_index do |table, index|
|
235
|
+
puts "Table #{index + 1}: #{table[:rows]} rows"
|
236
|
+
end
|
237
|
+
```
|
238
|
+
|
239
|
+
### Processing Word Documents
|
240
|
+
|
241
|
+
```ruby
|
242
|
+
# Extract from Word documents
|
243
|
+
result = UniversalDocumentProcessor.process('report.docx')
|
244
|
+
|
245
|
+
# Get document structure
|
246
|
+
metadata = result[:metadata]
|
247
|
+
puts "Word count: #{metadata[:word_count]}"
|
248
|
+
puts "Paragraph count: #{metadata[:paragraph_count]}"
|
249
|
+
|
250
|
+
# Extract embedded images
|
251
|
+
images = result[:images]
|
252
|
+
puts "Found #{images.length} embedded images"
|
253
|
+
```
|
254
|
+
|
255
|
+
### Processing Japanese Documents & Filenames
|
256
|
+
|
257
|
+
```ruby
|
258
|
+
# Process Japanese content
|
259
|
+
japanese_doc = "こんにちは 世界! Hello World!"
|
260
|
+
analysis = UniversalDocumentProcessor.analyze_text_quality(japanese_doc)
|
261
|
+
|
262
|
+
# Japanese-specific information
|
263
|
+
japanese_info = analysis[:japanese_analysis]
|
264
|
+
puts "Contains Japanese: #{japanese_info[:japanese]}"
|
265
|
+
puts "Scripts found: #{japanese_info[:scripts].join(', ')}"
|
266
|
+
puts "Mixed with Latin: #{japanese_info[:mixed_with_latin]}"
|
267
|
+
|
268
|
+
# Character statistics
|
269
|
+
stats = analysis[:statistics]
|
270
|
+
puts "Hiragana: #{stats[:hiragana_chars]}"
|
271
|
+
puts "Katakana: #{stats[:katakana_chars]}"
|
272
|
+
puts "Kanji: #{stats[:kanji_chars]}"
|
273
|
+
|
274
|
+
# Japanese filename support
|
275
|
+
filename = "重要な資料_2024年度.pdf"
|
276
|
+
validation = UniversalDocumentProcessor.validate_filename(filename)
|
277
|
+
puts "Japanese filename: #{validation[:contains_japanese]}"
|
278
|
+
puts "Filename valid: #{validation[:valid]}"
|
279
|
+
|
280
|
+
# Safe filename generation
|
281
|
+
safe_name = UniversalDocumentProcessor.safe_filename("データファイル<重要>.xlsx")
|
282
|
+
puts "Safe filename: #{safe_name}" # => "データファイル_重要_.xlsx"
|
283
|
+
|
284
|
+
# Process documents with Japanese filenames
|
285
|
+
result = UniversalDocumentProcessor.process("日本語ファイル.pdf")
|
286
|
+
puts "Original filename: #{result[:filename_info][:original_filename]}"
|
287
|
+
puts "Contains Japanese: #{result[:filename_info][:contains_japanese]}"
|
288
|
+
puts "Japanese parts: #{result[:filename_info][:japanese_parts]}"
|
289
|
+
```
|
290
|
+
|
291
|
+
## AI Agent Integration
|
292
|
+
|
293
|
+
The gem includes an AI agent that uses OpenAI's GPT models to summarize, classify, translate, compare, and answer questions about documents:
|
294
|
+
|
295
|
+
### Quick AI Analysis
|
296
|
+
|
297
|
+
```ruby
|
298
|
+
# Set your OpenAI API key
|
299
|
+
ENV['OPENAI_API_KEY'] = 'your-api-key-here'
|
300
|
+
|
301
|
+
# Quick AI-powered analysis
|
302
|
+
summary = UniversalDocumentProcessor.ai_summarize('document.pdf', length: :short)
|
303
|
+
insights = UniversalDocumentProcessor.ai_insights('document.pdf')
|
304
|
+
classification = UniversalDocumentProcessor.ai_classify('document.pdf')
|
305
|
+
|
306
|
+
# Extract specific information
|
307
|
+
key_info = UniversalDocumentProcessor.ai_extract_info('document.pdf', ['dates', 'names', 'amounts'])
|
308
|
+
action_items = UniversalDocumentProcessor.ai_action_items('document.pdf')
|
309
|
+
|
310
|
+
# Translate documents (great for Japanese documents)
|
311
|
+
translation = UniversalDocumentProcessor.ai_translate('日本語文書.pdf', 'English')
|
312
|
+
```
|
313
|
+
|
314
|
+
### Interactive AI Agent
|
315
|
+
|
316
|
+
```ruby
|
317
|
+
# Create a persistent AI agent for conversations
|
318
|
+
agent = UniversalDocumentProcessor.create_ai_agent(
|
319
|
+
model: 'gpt-4',
|
320
|
+
temperature: 0.7,
|
321
|
+
max_history: 10
|
322
|
+
)
|
323
|
+
|
324
|
+
# Process document and start conversation
|
325
|
+
document = UniversalDocumentProcessor::Document.new('report.pdf')
|
326
|
+
|
327
|
+
# Ask questions about the document
|
328
|
+
response1 = document.ai_chat('What is this document about?')
|
329
|
+
response2 = document.ai_chat('What are the key financial figures?')
|
330
|
+
response3 = document.ai_chat('Based on our discussion, what should I focus on?')
|
331
|
+
|
332
|
+
# Get conversation summary
|
333
|
+
summary = agent.conversation_summary
|
334
|
+
```
|
335
|
+
|
336
|
+
### Advanced AI Features
|
337
|
+
|
338
|
+
```ruby
|
339
|
+
# Compare multiple documents
|
340
|
+
comparison = UniversalDocumentProcessor.ai_compare(
|
341
|
+
['doc1.pdf', 'doc2.pdf', 'doc3.pdf'],
|
342
|
+
:content # or :themes, :structure, etc.
|
343
|
+
)
|
344
|
+
|
345
|
+
# Document-specific AI analysis
|
346
|
+
document = UniversalDocumentProcessor::Document.new('business_plan.pdf')
|
347
|
+
|
348
|
+
analysis = document.ai_analyze('What are the growth projections?')
|
349
|
+
insights = document.ai_insights
|
350
|
+
classification = document.ai_classify
|
351
|
+
action_items = document.ai_action_items
|
352
|
+
|
353
|
+
# Japanese document support
|
354
|
+
japanese_doc = UniversalDocumentProcessor::Document.new('プロジェクト計画書.pdf')
|
355
|
+
translation = japanese_doc.ai_translate('English')
|
356
|
+
summary = japanese_doc.ai_summarize(length: :medium)
|
357
|
+
```
|
358
|
+
|
359
|
+
### AI Configuration Options
|
360
|
+
|
361
|
+
```ruby
|
362
|
+
# Custom AI agent configuration
|
363
|
+
agent = UniversalDocumentProcessor.create_ai_agent(
|
364
|
+
api_key: 'your-openai-key', # OpenAI API key
|
365
|
+
model: 'gpt-4', # Model to use (gpt-4, gpt-3.5-turbo)
|
366
|
+
temperature: 0.3, # Response creativity (0.0-1.0)
|
367
|
+
max_history: 20, # Conversation memory length
|
368
|
+
base_url: 'https://api.openai.com/v1' # Custom API endpoint
|
369
|
+
)
|
370
|
+
```
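The `max_history` option suggests a bounded conversation memory. As a rough illustration of that idea, the sketch below keeps only the most recent messages; `BoundedHistory` is a hypothetical class, and whether the gem counts individual messages or question/answer pairs is an assumption.

```ruby
# Hypothetical bounded message buffer; not the gem's AI agent.
class BoundedHistory
  def initialize(max_history)
    @max_history = max_history
    @messages = []
  end

  # Appends a message and discards the oldest ones beyond the limit.
  def add(role, content)
    @messages << { role: role, content: content }
    @messages.shift while @messages.length > @max_history
    self
  end

  def to_a
    @messages.dup
  end
end

history = BoundedHistory.new(2)
history.add(:user, "What is this document about?")
history.add(:assistant, "A quarterly report.")
history.add(:user, "What are the key figures?")
puts history.to_a.length
```

A smaller limit keeps requests cheaper but makes the agent forget earlier turns sooner.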
|
371
|
+
|
372
|
+
## Japanese Filename Support
|
373
|
+
|
374
|
+
The gem detects, validates, sanitizes, and normalizes Japanese filenames so that they stay usable across operating systems:
|
375
|
+
|
376
|
+
### Basic Filename Validation
|
377
|
+
|
378
|
+
```ruby
|
379
|
+
# Check if filename contains Japanese characters
|
380
|
+
UniversalDocumentProcessor.japanese_filename?("日本語ファイル.pdf")
|
381
|
+
# => true
|
382
|
+
|
383
|
+
# Validate Japanese filename
|
384
|
+
validation = UniversalDocumentProcessor.validate_filename("こんにちは世界.docx")
|
385
|
+
puts validation[:valid] # => true
|
386
|
+
puts validation[:contains_japanese] # => true
|
387
|
+
puts validation[:japanese_parts] # => {hiragana: ["こ","ん","に","ち","は"], katakana: [], kanji: ["世","界"]}
|
388
|
+
|
389
|
+
# Handle mixed language filenames
|
390
|
+
validation = UniversalDocumentProcessor.validate_filename("Project_プロジェクト_2024.xlsx")
|
391
|
+
puts validation[:contains_japanese] # => true
|
392
|
+
```
|
393
|
+
|
394
|
+
### Safe Filename Generation
|
395
|
+
|
396
|
+
```ruby
|
397
|
+
# Create cross-platform safe filenames
|
398
|
+
problematic_name = "データファイル<重要>:管理.xlsx"
|
399
|
+
safe_name = UniversalDocumentProcessor.safe_filename(problematic_name)
|
400
|
+
puts safe_name # => "データファイル_重要__管理.xlsx"
|
401
|
+
|
402
|
+
# Handle extremely long Japanese filenames
|
403
|
+
long_name = "非常に長いファイル名" * 20 + ".pdf"
|
404
|
+
safe_name = UniversalDocumentProcessor.safe_filename(long_name)
|
405
|
+
puts safe_name.bytesize <= 200 # => true (safely truncated)
|
406
|
+
```
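For readers curious how sanitizing and byte-safe truncation can work together, here is a minimal sketch. It assumes that characters rejected by Windows are replaced with underscores and that the limit applies to the whole name in bytes; `safe_filename_sketch` is a hypothetical helper, not the gem's `safe_filename`.

```ruby
# Hypothetical sanitizer: replaces reserved characters and truncates by bytes.
def safe_filename_sketch(name, max_bytes: 200)
  ext  = File.extname(name)
  base = File.basename(name, ext)
  # Characters Windows rejects in filenames, plus ASCII control characters.
  base = base.gsub(/[<>:"\/\\|?*\x00-\x1f]/, "_")
  budget = max_bytes - ext.bytesize
  # byteslice may cut a multibyte character in half; scrub drops the fragment.
  base = base.byteslice(0, budget).scrub("") if base.bytesize > budget
  base + ext
end

puts safe_filename_sketch("データファイル<重要>:管理.xlsx")
puts safe_filename_sketch("非常に長いファイル名" * 20 + ".pdf").bytesize
```

Truncating by bytes rather than characters matters for Japanese names because each character typically occupies three bytes in UTF-8, while many file systems limit names by byte length.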
|
407
|
+
|
408
|
+
### Encoding Analysis & Normalization
|
409
|
+
|
410
|
+
```ruby
|
411
|
+
# Analyze filename encoding
|
412
|
+
filename = "データファイル.pdf"
|
413
|
+
analysis = UniversalDocumentProcessor::Utils::JapaneseFilenameHandler.analyze_filename_encoding(filename)
|
414
|
+
puts "Original encoding: #{analysis[:original_encoding]}"
|
415
|
+
puts "Recommended encoding: #{analysis[:recommended_encoding]}"
|
416
|
+
|
417
|
+
# Normalize filename to UTF-8
|
418
|
+
normalized = UniversalDocumentProcessor.normalize_filename(filename)
|
419
|
+
puts normalized.encoding # => UTF-8
|
420
|
+
```
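Normalizing a filename usually involves two separate steps: transcoding legacy bytes to UTF-8 and choosing one Unicode normalization form. The sketch below shows both with the standard library. It assumes the caller already knows the source encoding and that NFC is the desired form; `normalize_filename_sketch` is a hypothetical name, and the gem's `normalize_filename` may behave differently.

```ruby
# Hypothetical normalizer: transcode to UTF-8, then compose to NFC.
def normalize_filename_sketch(name, source_encoding: nil)
  str = name.dup
  str.force_encoding(source_encoding) if source_encoding
  utf8 = str.encode(Encoding::UTF_8, invalid: :replace, undef: :replace, replace: "_")
  utf8.scrub("_").unicode_normalize(:nfc)
end

decomposed = "\u30C6\u3099\u30FC\u30BF.pdf" # "デ" written as "テ" plus a combining mark
normalized = normalize_filename_sketch(decomposed)
puts normalized.encoding
puts normalized.length
```

Some file systems and tools return decomposed names, so two filenames that look identical can compare as different strings until both are normalized to the same form.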
|
421
|
+
|
422
|
+
### Document Processing with Japanese Filenames
|
423
|
+
|
424
|
+
```ruby
|
425
|
+
# Process documents with Japanese filenames
|
426
|
+
result = UniversalDocumentProcessor.process("重要な会議資料.pdf")
|
427
|
+
|
428
|
+
# Access filename information
|
429
|
+
filename_info = result[:filename_info]
|
430
|
+
puts "Original: #{filename_info[:original_filename]}"
|
431
|
+
puts "Japanese: #{filename_info[:contains_japanese]}"
|
432
|
+
puts "Validation: #{filename_info[:validation][:valid]}"
|
433
|
+
|
434
|
+
# Japanese character breakdown
|
435
|
+
japanese_parts = filename_info[:japanese_parts]
|
436
|
+
puts "Hiragana: #{japanese_parts[:hiragana]&.join('')}"
|
437
|
+
puts "Katakana: #{japanese_parts[:katakana]&.join('')}"
|
438
|
+
puts "Kanji: #{japanese_parts[:kanji]&.join('')}"
|
439
|
+
```
|
440
|
+
|
441
|
+
### Cross-Platform Compatibility
|
442
|
+
|
443
|
+
```ruby
|
444
|
+
# Test filename compatibility across platforms
|
445
|
+
test_files = [
|
446
|
+
"ๆฅๆฌ่ชใใกใคใซ.pdf", # Standard Japanese
|
447
|
+
"ใใใซใกใฏworld.docx", # Mixed Japanese-English
|
448
|
+
"ใใผใฟ_analysis.xlsx", # Japanese with underscore
|
449
|
+
"ไผ่ญฐ่ญฐไบ้ฒ๏ผ้่ฆ๏ผ.txt" # Japanese with parentheses
|
450
|
+
]
|
451
|
+
|
452
|
+
test_files.each do |filename|
|
453
|
+
validation = UniversalDocumentProcessor.validate_filename(filename)
|
454
|
+
safe_version = UniversalDocumentProcessor.safe_filename(filename)
|
455
|
+
|
456
|
+
puts "#{filename}:"
|
457
|
+
puts " Windows compatible: #{validation[:valid]}"
|
458
|
+
puts " Safe version: #{safe_version}"
|
459
|
+
puts " Byte size: #{safe_version.bytesize} bytes"
|
460
|
+
end
|
461
|
+
```
|
462
|
+
|
463
|
+
## Character Validation Features
|
464
|
+
|
465
|
+
### Detecting Invalid Characters
|
466
|
+
|
467
|
+
```ruby
|
468
|
+
text_with_issues = "Hello\x00World\x01こんにちは"
|
469
|
+
analysis = UniversalDocumentProcessor.analyze_text_quality(text_with_issues)
|
470
|
+
|
471
|
+
# Check for specific issues
|
472
|
+
puts "Has null bytes: #{analysis[:has_null_bytes]}"
|
473
|
+
puts "Has control chars: #{analysis[:has_control_chars]}"
|
474
|
+
puts "Valid encoding: #{analysis[:valid_encoding]}"
|
475
|
+
|
476
|
+
# Get detailed issue report
|
477
|
+
issues = analysis[:character_issues]
|
478
|
+
issues.each do |issue|
|
479
|
+
puts "#{issue[:type]}: #{issue[:message]} (#{issue[:severity]})"
|
480
|
+
end
|
481
|
+
```
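An issue report with `:type`, `:message`, and `:severity` keys can be assembled from a few simple scans. The following is a sketch of that idea under stated assumptions: `character_issues_sketch`, the issue type names, and the severity levels are all hypothetical, not values the gem is known to return.

```ruby
# Hypothetical issue scanner; type names and severities are illustrative.
def character_issues_sketch(text)
  issues = []
  unless text.valid_encoding?
    issues << { type: :invalid_encoding, message: "invalid byte sequence for #{text.encoding}", severity: :high }
  end

  # Scan a scrubbed copy so the pattern matching cannot raise on bad bytes.
  scanned    = text.scrub("")
  null_bytes = scanned.count("\u0000")
  controls   = scanned.scan(/[\x01-\x08\x0B\x0C\x0E-\x1F\x7F]/).length
  replaced   = scanned.count("\uFFFD")

  issues << { type: :null_bytes, message: "#{null_bytes} null byte(s)", severity: :high } if null_bytes.positive?
  issues << { type: :control_chars, message: "#{controls} control character(s)", severity: :medium } if controls.positive?
  issues << { type: :replacement_chars, message: "#{replaced} replacement character(s)", severity: :low } if replaced.positive?
  issues
end

character_issues_sketch("Hello\x00World\x01こんにちは\uFFFD").each do |issue|
  puts "#{issue[:type]}: #{issue[:message]} (#{issue[:severity]})"
end
```

A replacement character (U+FFFD) usually means data was already lost in an earlier decoding step, which is why it is reported rather than silently removed.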
|
482
|
+
|
483
|
+
### Text Repair Strategies
|
484
|
+
|
485
|
+
```ruby
|
486
|
+
corrupted_text = "Hello\x00World\x01こんにちは\uFFFD"
|
487
|
+
|
488
|
+
# Conservative repair (recommended)
|
489
|
+
clean = UniversalDocumentProcessor::Processors::CharacterValidator.repair_text(
|
490
|
+
corrupted_text, :conservative
|
491
|
+
)
|
492
|
+
|
493
|
+
# Aggressive repair (removes all non-printable)
|
494
|
+
clean = UniversalDocumentProcessor::Processors::CharacterValidator.repair_text(
|
495
|
+
corrupted_text, :aggressive
|
496
|
+
)
|
497
|
+
|
498
|
+
# Replace strategy (replaces with safe alternatives)
|
499
|
+
clean = UniversalDocumentProcessor::Processors::CharacterValidator.repair_text(
|
500
|
+
corrupted_text, :replace
|
501
|
+
)
|
502
|
+
```
|
503
|
+
|
504
|
+
## Configuration
|
505
|
+
|
506
|
+
### Checking Available Features
|
507
|
+
|
508
|
+
```ruby
|
509
|
+
# Check what features are available based on installed gems
|
510
|
+
features = UniversalDocumentProcessor.available_features
|
511
|
+
puts "Available features: #{features.join(', ')}"
|
512
|
+
|
513
|
+
# Check specific dependencies
|
514
|
+
puts "PDF processing: #{UniversalDocumentProcessor.dependency_available?(:pdf_reader)}"
|
515
|
+
puts "Word processing: #{UniversalDocumentProcessor.dependency_available?(:docx)}"
|
516
|
+
puts "Excel processing: #{UniversalDocumentProcessor.dependency_available?(:roo)}"
|
517
|
+
```
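Optional dependencies in Ruby are commonly probed by attempting a `require` and rescuing `LoadError`. The sketch below shows that pattern; `dependency_available_sketch?` is a hypothetical helper, and it is an assumption that the gem's `dependency_available?` works the same way.

```ruby
# Hypothetical probe: returns true if the library can be loaded.
def dependency_available_sketch?(library)
  require library
  true
rescue LoadError
  false
end

puts "JSON available: #{dependency_available_sketch?('json')}"
puts "pdf-reader available: #{dependency_available_sketch?('pdf-reader')}"
```

Note that a successful probe also loads the library as a side effect, so the check doubles as lazy loading.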
|
518
|
+
|
519
|
+
### Custom Options
|
520
|
+
|
521
|
+
```ruby
|
522
|
+
# Process with custom options
|
523
|
+
options = {
|
524
|
+
extract_images: true,
|
525
|
+
extract_tables: true,
|
526
|
+
clean_text: true,
|
527
|
+
validate_encoding: true
|
528
|
+
}
|
529
|
+
|
530
|
+
result = UniversalDocumentProcessor.process('document.pdf', options)
|
531
|
+
```
|
532
|
+
|
533
|
+
## Architecture
|
534
|
+
|
535
|
+
The gem uses a modular processor-based architecture:
|
536
|
+
|
537
|
+
- **BaseProcessor**: Common functionality and interface
|
538
|
+
- **PdfProcessor**: Advanced PDF processing
|
539
|
+
- **WordProcessor**: Microsoft Word documents
|
540
|
+
- **ExcelProcessor**: Spreadsheet processing
|
541
|
+
- **PowerpointProcessor**: Presentation processing
|
542
|
+
- **ImageProcessor**: Image analysis and OCR
|
543
|
+
- **ArchiveProcessor**: Compressed file handling
|
544
|
+
- **TextProcessor**: Plain text and markup files
|
545
|
+
- **CharacterValidator**: Text quality and encoding validation
|
546
|
+
|
547
|
+
## Multi-language Support
|
548
|
+
|
549
|
+
### Supported Encodings
|
550
|
+
- **UTF-8** (recommended)
|
551
|
+
- **Shift_JIS** (Japanese)
|
552
|
+
- **EUC-JP** (Japanese)
|
553
|
+
- **ISO-8859-1** (Latin-1)
|
554
|
+
- **Windows-1252**
|
555
|
+
- **ASCII**
|
556
|
+
|
557
|
+
### Supported Scripts
|
558
|
+
- **Latin** (English, European languages)
|
559
|
+
- **Japanese** (Hiragana, Katakana, Kanji)
|
560
|
+
- **Chinese** (Simplified/Traditional)
|
561
|
+
- **Korean** (Hangul)
|
562
|
+
- **Cyrillic** (Russian, etc.)
|
563
|
+
- **Arabic**
|
564
|
+
- **Hebrew**
|
565
|
+
|
566
|
+
## Performance
|
567
|
+
|
568
|
+
### Benchmarks (Average)
|
569
|
+
- **Small PDF (1-10 pages)**: 0.5-2 seconds
|
570
|
+
- **Large PDF (100+ pages)**: 5-15 seconds
|
571
|
+
- **Word Document**: 0.3-1 second
|
572
|
+
- **Excel Spreadsheet**: 0.5-3 seconds
|
573
|
+
- **PowerPoint**: 1-5 seconds
|
574
|
+
- **Image with OCR**: 2-10 seconds
|
575
|
+
|
576
|
+
### Best Practices
|
577
|
+
1. Use **batch processing** for multiple files
|
578
|
+
2. Process files **asynchronously** for better UX
|
579
|
+
3. Implement **caching** for frequently accessed documents
|
580
|
+
4. Set **appropriate timeouts** for large files
|
581
|
+
5. Monitor **memory usage** in production
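Two of the practices above, caching and timeouts, can be combined in a small wrapper around any processing call. The sketch below is one possible arrangement using only the standard library; `CachedProcessor` is a hypothetical class, the cache is a plain in-memory hash keyed by file digest, and the block stands in for the real processing call.

```ruby
require "digest"
require "timeout"

# Hypothetical wrapper: caches results by content hash and enforces a time limit.
class CachedProcessor
  def initialize(timeout_seconds: 30, &processor)
    @timeout_seconds = timeout_seconds
    @processor = processor
    @cache = {}
  end

  def call(path)
    key = Digest::SHA256.file(path).hexdigest
    @cache.fetch(key) do
      # Raises Timeout::Error if processing exceeds the limit; nothing is cached then.
      @cache[key] = Timeout.timeout(@timeout_seconds) { @processor.call(path) }
    end
  end
end

processor = CachedProcessor.new(timeout_seconds: 10) { |path| File.read(path).length }
puts processor.call(__FILE__)
```

Keying by content hash means a renamed copy of the same file is served from the cache, while an edited file is processed again.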
|
582
|
+
|
583
|
+
## Security
|
584
|
+
|
585
|
+
### File Validation
|
586
|
+
- MIME type verification prevents file spoofing
|
587
|
+
- File size limits prevent resource exhaustion
|
588
|
+
- Content scanning for malicious payloads
|
589
|
+
- Sandbox processing for untrusted files
|
590
|
+
|
591
|
+
### Best Practices
|
592
|
+
1. Always **validate uploaded files** before processing
|
593
|
+
2. Set **reasonable limits** on file size and processing time
|
594
|
+
3. Use **temporary directories** with proper cleanup
|
595
|
+
4. **Log processing activities** for audit trails
|
596
|
+
5. Handle **errors gracefully** without exposing system info
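As an illustration of validating an upload before processing, the sketch below enforces a size limit and checks the file's leading bytes against a few well-known signatures. `validate_upload_sketch`, the 50 MB default, and the short signature table are assumptions for the example; this is not how the gem itself is known to verify files.

```ruby
# Hypothetical pre-processing check: size limit plus magic-byte sniffing.
def validate_upload_sketch(path, max_bytes: 50 * 1024 * 1024)
  signatures = {
    "application/pdf" => "%PDF-".b,
    "application/zip" => "PK\x03\x04".b, # also the container for DOCX, XLSX, PPTX
    "image/png"       => "\x89PNG\r\n\x1A\n".b
  }

  size = File.size(path)
  return { valid: false, reason: "file too large" } if size > max_bytes

  header = File.binread(path, 8) || "".b
  type, _signature = signatures.find { |_, sig| header.start_with?(sig) }
  return { valid: false, reason: "unrecognized file signature" } unless type

  { valid: true, content_type: type, size: size }
end
```

Checking the content rather than the filename extension or the client-supplied MIME type means a renamed file is still identified by what it actually contains.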
|
597
|
+
|
598
|
+
## Rails Integration
|
599
|
+
|
600
|
+
### Controller Example
|
601
|
+
|
602
|
+
```ruby
|
603
|
+
class DocumentsController < ApplicationController
|
604
|
+
def create
|
605
|
+
uploaded_file = params[:file]
|
606
|
+
|
607
|
+
# Process the document
|
608
|
+
result = UniversalDocumentProcessor.process(uploaded_file.tempfile.path)
|
609
|
+
|
610
|
+
# Store in database
|
611
|
+
@document = Document.create!(
|
612
|
+
filename: uploaded_file.original_filename,
|
613
|
+
content_type: result[:content_type],
|
614
|
+
text_content: result[:text_content],
|
615
|
+
metadata: result[:metadata]
|
616
|
+
)
|
617
|
+
|
618
|
+
render json: { success: true, document: @document }
|
619
|
+
rescue UniversalDocumentProcessor::Error => e
|
620
|
+
render json: { success: false, error: e.message }, status: 422
|
621
|
+
end
|
622
|
+
end
|
623
|
+
```
|
624
|
+
|
625
|
+
### Background Job Example
|
626
|
+
|
627
|
+
```ruby
|
628
|
+
class DocumentProcessorJob < ApplicationJob
|
629
|
+
def perform(document_id)
|
630
|
+
document = Document.find(document_id)
|
631
|
+
|
632
|
+
result = UniversalDocumentProcessor.process(document.file_path)
|
633
|
+
|
634
|
+
document.update!(
|
635
|
+
text_content: result[:text_content],
|
636
|
+
metadata: result[:metadata],
|
637
|
+
processed_at: Time.current
|
638
|
+
)
|
639
|
+
end
|
640
|
+
end
|
641
|
+
```
|
642
|
+
|
643
|
+
## Error Handling
|
644
|
+
|
645
|
+
The gem provides comprehensive error handling with custom exceptions:
|
646
|
+
|
647
|
+
```ruby
|
648
|
+
begin
|
649
|
+
result = UniversalDocumentProcessor.process('document.pdf')
|
650
|
+
rescue UniversalDocumentProcessor::UnsupportedFormatError => e
|
651
|
+
# Handle unsupported file format
|
652
|
+
rescue UniversalDocumentProcessor::ProcessingError => e
|
653
|
+
# Handle processing failure
|
654
|
+
rescue UniversalDocumentProcessor::DependencyMissingError => e
|
655
|
+
# Handle missing optional dependency
|
656
|
+
rescue UniversalDocumentProcessor::Error => e
|
657
|
+
# Handle general gem errors
|
658
|
+
end
|
659
|
+
```
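The order of the `rescue` clauses matters if the specific errors inherit from a common base class, because Ruby picks the first clause that matches. The sketch below demonstrates this with a stand-in hierarchy: the `DocSketch` module is hypothetical, and the assumption that the gem's errors subclass `UniversalDocumentProcessor::Error` is inferred from the example above rather than confirmed.

```ruby
# Hypothetical hierarchy mirroring the error names used in the README.
module DocSketch
  class Error < StandardError; end
  class UnsupportedFormatError < Error; end
  class ProcessingError < Error; end
  class DependencyMissingError < Error; end
end

# Specific errors are rescued first; the base class acts as a catch-all.
def classify_failure
  yield
  :ok
rescue DocSketch::UnsupportedFormatError
  :unsupported_format
rescue DocSketch::DependencyMissingError
  :missing_dependency
rescue DocSketch::Error
  :general_error
end

puts classify_failure { raise DocSketch::UnsupportedFormatError, "no handler for .xyz" }
puts classify_failure { raise DocSketch::ProcessingError, "corrupt file" }
```

If the base class were listed first, it would match every error from the hierarchy and the more specific clauses would never run.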
|
660
|
+
|
661
|
+
## Testing
|
662
|
+
|
663
|
+
Run the test suite:
|
664
|
+
|
665
|
+
```bash
|
666
|
+
bundle exec rspec
|
667
|
+
```
|
668
|
+
|
669
|
+
Run with coverage:
|
670
|
+
|
671
|
+
```bash
|
672
|
+
COVERAGE=true bundle exec rspec
|
673
|
+
```
|
674
|
+
|
675
|
+
## Contributing
|
676
|
+
|
677
|
+
1. Fork the repository
|
678
|
+
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
|
679
|
+
3. Commit your changes (`git commit -am 'Add amazing feature'`)
|
680
|
+
4. Push to the branch (`git push origin feature/amazing-feature`)
|
681
|
+
5. Create a Pull Request
|
682
|
+
|
683
|
+
### Development Setup
|
684
|
+
|
685
|
+
```bash
|
686
|
+
git clone https://github.com/yourusername/universal_document_processor.git
|
687
|
+
cd universal_document_processor
|
688
|
+
bundle install
|
689
|
+
bundle exec rspec
|
690
|
+
```
|
691
|
+
|
692
|
+
## Changelog
|
693
|
+
|
694
|
+
### Version 1.0.0
|
695
|
+
- Initial release
|
696
|
+
- Support for PDF, Word, Excel, PowerPoint, images, archives
|
697
|
+
- Character validation and cleaning
|
698
|
+
- Japanese text support
|
699
|
+
- Multi-encoding support
|
700
|
+
- Batch processing capabilities
|
701
|
+
|
702
|
+
## Support
|
703
|
+
|
704
|
+
- **Issues**: [GitHub Issues](https://github.com/yourusername/universal_document_processor/issues)
|
705
|
+
- **Documentation**: [Wiki](https://github.com/yourusername/universal_document_processor/wiki)
|
706
|
+
- **Email**: vikas.v.patil1696@gmail.com
|
707
|
+
|
708
|
+
## License
|
709
|
+
|
710
|
+
The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
|
711
|
+
|
712
|
+
## Author
|
713
|
+
|
714
|
+
**Vikas Patil**
|
715
|
+
- Email: vikas.v.patil1696@gmail.com
|
716
|
+
- GitHub: [@vpatil160](https://github.com/vpatil160)
|
717
|
+
|
718
|
+
## Acknowledgments
|
719
|
+
|
720
|
+
- Built with Ruby and love ❤️
|
721
|
+
- Thanks to all the amazing open source libraries this gem depends on
|
722
|
+
- Special thanks to the Ruby community for continuous inspiration
|
723
|
+
|
724
|
+
---
|
725
|
+
|
726
|
+
**Made with ❤️ for the Ruby community**
|