universal_document_processor 1.0.1 → 1.0.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
@@ -5,8 +5,11 @@ module UniversalDocumentProcessor
          with_error_handling do
            if @file_path.end_with?('.docx')
              extract_docx_text
+           elsif @file_path.end_with?('.doc')
+             # Built-in .doc file processing
+             fallback_text_extraction
            else
-             # Fallback for .doc files
+             # Handle other Word formats
              fallback_text_extraction
            end
          end
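Note on the hunk above: `String#end_with?` is case-sensitive, so a path such as `REPORT.DOC` would fall through to the `else` branch. A minimal sketch of a case-insensitive dispatch follows, assuming a hypothetical helper `word_format` that is not part of the gem:

```ruby
# Hypothetical helper, not part of the gem: classify a path by its extension,
# ignoring letter case.
def word_format(path)
  case File.extname(path).downcase
  when '.docx' then :docx
  when '.doc'  then :doc
  else :other
  end
end

puts word_format('report.docx') # docx
puts word_format('REPORT.DOC')  # doc
puts word_format('notes.rtf')   # other
```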
@@ -16,6 +19,8 @@ module UniversalDocumentProcessor
          with_error_handling do
            if @file_path.end_with?('.docx')
              extract_docx_metadata
+           elsif @file_path.end_with?('.doc')
+             extract_doc_metadata
            else
              super
            end
@@ -73,7 +78,12 @@ module UniversalDocumentProcessor
        end

        def supported_operations
-         super + [:extract_images, :extract_tables, :extract_styles, :extract_comments]
+         if @file_path.end_with?('.docx')
+           super + [:extract_images, :extract_tables, :extract_styles, :extract_comments]
+         else
+           # .doc files support basic text and metadata extraction
+           super + [:extract_basic_formatting]
+         end
        end

        private
@@ -126,12 +136,80 @@ module UniversalDocumentProcessor
          0
        end

+       def extract_doc_metadata
+         # Extract basic metadata from .doc files
+         file_stats = File.stat(@file_path)
+         extracted_text = extract_doc_text_builtin
+
+         super.merge({
+           format: 'Microsoft Word Document (.doc)',
+           word_count: count_words(extracted_text),
+           character_count: extracted_text.length,
+           created_at: file_stats.ctime,
+           modified_at: file_stats.mtime,
+           file_size: file_stats.size,
+           extraction_method: 'Built-in binary parsing'
+         })
+       rescue => e
+         super.merge({
+           format: 'Microsoft Word Document (.doc)',
+           extraction_error: e.message
+         })
+       end
+
        def fallback_text_extraction
-         # Use Yomu for .doc files or as fallback
-         Yomu.new(@file_path).text
+         # Built-in .doc file text extraction
+         extract_doc_text_builtin
        rescue => e
          "Unable to extract text from Word document: #{e.message}"
        end
+
+       def extract_doc_text_builtin
+         # Read .doc file as binary and extract readable text
+         content = File.binread(@file_path)
+
+         # .doc files store text in a specific format - extract readable ASCII text
+         # This is a simplified extraction that works for basic .doc files
+         text_content = []
+
+         # Look for text patterns in the binary data
+         # .doc files often have text stored with null bytes between characters
+         content.force_encoding('ASCII-8BIT').scan(/[\x20-\x7E\x0A\x0D]{4,}/) do |match|
+           # Clean up the extracted text
+           cleaned_text = match.gsub(/[\x00-\x1F\x7F-\xFF]/n, ' ').strip
+           text_content << cleaned_text if cleaned_text.length > 3
+         end
+
+         # Try alternative extraction method if first method yields little text
+         if text_content.join(' ').length < 50
+           text_content = extract_doc_alternative_method(content)
+         end
+
+         result = text_content.join("\n").strip
+         result.empty? ? "Text extracted from .doc file (content may be limited due to complex formatting)" : result
+       end
+
+       def extract_doc_alternative_method(content)
+         # Alternative method: look for Word document text patterns
+         text_parts = []
+
+         # .doc files often have text in UTF-16 or with specific markers
+         # Try to find readable text segments
+         content.force_encoding('UTF-16LE').encode('UTF-8', invalid: :replace, undef: :replace).scan(/[[:print:]]{5,}/m) do |match|
+           cleaned = match.strip
+           text_parts << cleaned if cleaned.length > 4 && !cleaned.match?(/^[\x00-\x1F]*$/)
+         end
+
+         # If UTF-16 doesn't work, try scanning for ASCII patterns
+         if text_parts.empty?
+           content.force_encoding('ASCII-8BIT').scan(/[a-zA-Z0-9\s\.\,\!\?\;\:]{10,}/n) do |match|
+             cleaned = match.strip
+             text_parts << cleaned if cleaned.length > 9
+           end
+         end
+
+         text_parts.uniq
+       end
      end
    end
  end
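Note on the new `extract_doc_text_builtin`: its first pass collects runs of printable ASCII bytes, much like the Unix `strings` tool. The sketch below illustrates that idea in isolation, assuming a hypothetical helper `printable_runs` (not the gem's method). It also shows why the UTF-16LE fallback matters: text stored with a null byte after each character never forms a contiguous 4-byte run.

```ruby
# Hypothetical helper illustrating the scanning idea; not the gem's method.
def printable_runs(bytes, min_len = 4)
  # Treat the input as raw bytes so the regex works on arbitrary content.
  data = bytes.dup.force_encoding(Encoding::BINARY)
  # Collect runs of printable ASCII (plus LF/CR) at least min_len bytes long.
  data.scan(/[\x20-\x7E\x0A\x0D]{#{min_len},}/n).map(&:strip).reject(&:empty?)
end

sample = "\x00\x01Hello world\x00\xFFab\x00Second run\x02".b
p printable_runs(sample) # => ["Hello world", "Second run"]

# ASCII text encoded as UTF-16LE has a null byte after every character,
# so the byte-level scan finds nothing and the text must be decoded first.
utf16 = 'Hi there'.encode('UTF-16LE')
p printable_runs(utf16)  # => []
p utf16.encode('UTF-8')  # => "Hi there"
```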
@@ -15,6 +15,7 @@ module UniversalDocumentProcessor
  'htm' => 'text/html',
  'xml' => 'application/xml',
  'csv' => 'text/csv',
+ 'tsv' => 'text/tab-separated-values',
  'json' => 'application/json',
  'jpg' => 'image/jpeg',
  'jpeg' => 'image/jpeg',
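Note on the hunk above: the new `tsv` entry maps the extension to `text/tab-separated-values`. How the gem consults this table is not visible in the diff; a minimal sketch of an extension lookup follows, assuming a hypothetical constant `MIME_TYPES`, a hypothetical helper `mime_type_for`, and an `application/octet-stream` fallback chosen for illustration:

```ruby
# Hypothetical table mirroring a few entries from the hunk above.
MIME_TYPES = {
  'csv'  => 'text/csv',
  'tsv'  => 'text/tab-separated-values',
  'json' => 'application/json'
}.freeze

# Hypothetical helper: resolve a path to a MIME type, with a generic fallback
# (the fallback value is an assumption, not taken from the gem).
def mime_type_for(path)
  ext = File.extname(path).delete_prefix('.').downcase
  MIME_TYPES.fetch(ext, 'application/octet-stream')
end

puts mime_type_for('data.TSV')
puts mime_type_for('blob.bin')
```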
@@ -1,3 +1,3 @@
  module UniversalDocumentProcessor
-   VERSION = "1.0.1"
+   VERSION = "1.0.2"
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: universal_document_processor
  version: !ruby/object:Gem::Version
-   version: 1.0.1
+   version: 1.0.2
  platform: ruby
  authors:
  - Vikas Patil
@@ -65,6 +65,20 @@ dependencies:
      - - "~>"
        - !ruby/object:Gem::Version
          version: '2.3'
+ - !ruby/object:Gem::Dependency
+   name: rexml
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.2'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '3.2'
  - !ruby/object:Gem::Dependency
    name: rspec
    requirement: !ruby/object:Gem::Requirement
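Note on the hunk above: `rexml` is added as a runtime dependency with the pessimistic constraint `~> 3.2`, which accepts any 3.x release from 3.2 upward and rejects 4.0. The diff does not state the motivation; since `rexml` stopped being a default gem in Ruby 3.0, declaring it explicitly is a common fix. A minimal sketch of how RubyGems evaluates the constraint (the version numbers tested are illustrative):

```ruby
require 'rubygems'

# The same constraint string that appears in the gemspec metadata.
requirement = Gem::Requirement.new('~> 3.2')

puts requirement.satisfied_by?(Gem::Version.new('3.2.5')) # true
puts requirement.satisfied_by?(Gem::Version.new('3.4.0')) # true
puts requirement.satisfied_by?(Gem::Version.new('3.1.9')) # false
puts requirement.satisfied_by?(Gem::Version.new('4.0.0')) # false
```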
@@ -145,9 +159,7 @@ executables: []
  extensions: []
  extra_rdoc_files: []
  files:
- - AI_USAGE_GUIDE.md
  - CHANGELOG.md
- - GEM_RELEASE_GUIDE.md
  - Gemfile
  - LICENSE
  - README.md
data/AI_USAGE_GUIDE.md DELETED
@@ -1,404 +0,0 @@
- # 🤖 Universal Document Processor - AI Agent Usage Guide
-
- ## Overview
-
- The Universal Document Processor gem includes powerful AI-powered document analysis capabilities through its built-in **Agentic AI** features. Once you've installed the gem, you can leverage AI to analyze, summarize, extract information, and interact with your documents intelligently.
-
- ## 🚀 Quick Setup
-
- ### 1. Install the Gem
-
- ```bash
- gem install universal_document_processor
- ```
-
- ### 2. Set Up Your OpenAI API Key
-
- ```bash
- # Set environment variable
- export OPENAI_API_KEY="your-openai-api-key-here"
- ```
-
- Or pass it directly in your code:
-
- ```ruby
- options = { api_key: 'your-openai-api-key-here' }
- ```
-
- ### 3. Basic AI Usage
-
- ```ruby
- require 'universal_document_processor'
-
- # Basic AI analysis
- result = UniversalDocumentProcessor.ai_analyze('document.pdf')
- puts result
- ```
-
- ## 🧠 AI Features Overview
-
- ### Available AI Methods
-
- 1. **`ai_analyze`** - Comprehensive document analysis
- 2. **`ai_summarize`** - Generate summaries of different lengths
- 3. **`ai_extract_info`** - Extract specific information categories
- 4. **`ai_translate`** - Translate document content
- 5. **`ai_classify`** - Classify document type and purpose
- 6. **`ai_insights`** - Generate insights and recommendations
- 7. **`ai_action_items`** - Extract actionable items
- 8. **`ai_compare`** - Compare multiple documents
- 9. **`ai_chat`** - Interactive chat about documents
-
- ## 📝 Detailed Usage Examples
-
- ### 1. Document Analysis
-
- #### General Analysis
- ```ruby
- # Analyze any document comprehensively
- analysis = UniversalDocumentProcessor.ai_analyze('report.pdf')
- puts analysis
- ```
-
- #### Specific Query Analysis
- ```ruby
- # Ask specific questions about the document
- analysis = UniversalDocumentProcessor.ai_analyze('contract.pdf', {
-   query: "What are the key terms and conditions?"
- })
- puts analysis
- ```
-
- ### 2. Document Summarization
-
- ```ruby
- # Short summary (2-3 sentences)
- summary = UniversalDocumentProcessor.ai_summarize('document.pdf', length: :short)
-
- # Medium summary (1-2 paragraphs) - default
- summary = UniversalDocumentProcessor.ai_summarize('document.pdf', length: :medium)
-
- # Detailed summary
- summary = UniversalDocumentProcessor.ai_summarize('document.pdf', length: :long)
-
- puts summary
- ```
-
- ### 3. Information Extraction
-
- ```ruby
- # Extract default categories
- info = UniversalDocumentProcessor.ai_extract_info('meeting_notes.pdf')
-
- # Extract specific categories
- info = UniversalDocumentProcessor.ai_extract_info('contract.pdf', [
-   'parties', 'dates', 'financial_terms', 'obligations', 'deadlines'
- ])
-
- puts info
- ```
-
- ### 4. Document Translation
-
- ```ruby
- # Translate to different languages
- spanish_content = UniversalDocumentProcessor.ai_translate('document.pdf', 'Spanish')
- japanese_content = UniversalDocumentProcessor.ai_translate('document.pdf', 'Japanese')
- french_content = UniversalDocumentProcessor.ai_translate('document.pdf', 'French')
-
- puts spanish_content
- ```
-
- ### 5. Document Classification
-
- ```ruby
- # Classify document type and purpose
- classification = UniversalDocumentProcessor.ai_classify('unknown_document.pdf')
-
- # Returns structured information about document type
- puts classification
- ```
-
- ### 6. Generate Insights
-
- ```ruby
- # Get AI-powered insights and recommendations
- insights = UniversalDocumentProcessor.ai_insights('business_plan.pdf')
-
- # Returns analysis of key themes, recommendations, etc.
- puts insights
- ```
-
- ### 7. Extract Action Items
-
- ```ruby
- # Extract actionable items from documents
- action_items = UniversalDocumentProcessor.ai_action_items('meeting_minutes.pdf')
-
- # Returns structured list of tasks, deadlines, assignments
- puts action_items
- ```
-
- ### 8. Compare Documents
-
- ```ruby
- # Compare multiple documents
- comparison = UniversalDocumentProcessor.ai_compare([
-   'version1.pdf',
-   'version2.pdf',
-   'version3.pdf'
- ], :content)
-
- puts comparison
- ```
-
- ## 🎯 Advanced Usage with Document Objects
-
- ### Using Document Objects for More Control
-
- ```ruby
- # Create document object for advanced operations
- doc = UniversalDocumentProcessor::Document.new('complex_document.pdf')
-
- # Use AI methods on the document object
- summary = doc.ai_summarize(length: :medium)
- insights = doc.ai_insights
- action_items = doc.ai_action_items
-
- # Interactive chat about the document
- response = doc.ai_chat("What are the main risks mentioned in this document?")
- puts response
- ```
-
- ### Creating and Reusing AI Agent
-
- ```ruby
- # Create an AI agent with custom configuration
- ai_agent = UniversalDocumentProcessor.create_ai_agent({
-   model: 'gpt-4',
-   temperature: 0.7,
-   api_key: 'your-api-key'
- })
-
- # Process document
- doc_result = UniversalDocumentProcessor.process('document.pdf')
-
- # Use AI agent for multiple operations
- summary = ai_agent.summarize_document(doc_result, length: :short)
- insights = ai_agent.generate_insights(doc_result)
- classification = ai_agent.classify_document(doc_result)
-
- # Interactive chat
- response = ai_agent.chat("Tell me about the financial projections", doc_result)
- ```
-
- ## 🛠️ Configuration Options
-
- ### AI Agent Configuration
-
- ```ruby
- options = {
-   api_key: 'your-openai-api-key',        # OpenAI API key
-   model: 'gpt-4',                        # AI model to use
-   temperature: 0.7,                      # Response creativity (0.0-1.0)
-   max_history: 10,                       # Conversation history limit
-   base_url: 'https://api.openai.com/v1'  # API endpoint
- }
-
- # Use with any AI method
- result = UniversalDocumentProcessor.ai_analyze('document.pdf', options)
- ```
-
- ## 💡 Use Case Examples
-
- ### 1. Legal Document Analysis
-
- ```ruby
- # Analyze legal contracts
- contract_analysis = UniversalDocumentProcessor.ai_analyze('contract.pdf', {
-   query: "Extract all key terms, obligations, and potential risks"
- })
-
- # Extract specific legal information
- legal_info = UniversalDocumentProcessor.ai_extract_info('contract.pdf', [
-   'parties', 'effective_date', 'termination_clauses', 'payment_terms', 'liabilities'
- ])
- ```
-
- ### 2. Business Report Processing
-
- ```ruby
- # Summarize quarterly reports
- summary = UniversalDocumentProcessor.ai_summarize('q4_report.pdf', length: :medium)
-
- # Extract key business metrics
- metrics = UniversalDocumentProcessor.ai_extract_info('q4_report.pdf', [
-   'revenue', 'expenses', 'profit_margins', 'growth_metrics', 'forecasts'
- ])
-
- # Get strategic insights
- insights = UniversalDocumentProcessor.ai_insights('q4_report.pdf')
- ```
-
- ### 3. Meeting Minutes Processing
-
- ```ruby
- # Extract action items from meeting notes
- action_items = UniversalDocumentProcessor.ai_action_items('meeting_notes.pdf')
-
- # Summarize meeting outcomes
- summary = UniversalDocumentProcessor.ai_summarize('meeting_notes.pdf', length: :short)
-
- # Extract key decisions and follow-ups
- decisions = UniversalDocumentProcessor.ai_extract_info('meeting_notes.pdf', [
-   'decisions_made', 'action_items', 'deadlines', 'assigned_people'
- ])
- ```
-
- ### 4. Research Paper Analysis
-
- ```ruby
- # Analyze research papers
- analysis = UniversalDocumentProcessor.ai_analyze('research_paper.pdf', {
-   query: "What are the main findings and methodology used?"
- })
-
- # Extract research data
- research_info = UniversalDocumentProcessor.ai_extract_info('research_paper.pdf', [
-   'hypothesis', 'methodology', 'results', 'conclusions', 'future_work'
- ])
- ```
-
- ## 🔄 Interactive Document Chat
-
- ```ruby
- # Create document object
- doc = UniversalDocumentProcessor::Document.new('document.pdf')
-
- # Start interactive chat session
- puts "Chat with your document (type 'exit' to quit):"
-
- loop do
-   print "> "
-   user_input = gets.chomp
-   break if user_input.downcase == 'exit'
-
-   response = doc.ai_chat(user_input)
-   puts "AI: #{response}\n\n"
- end
- ```
-
- ## 📊 Batch AI Processing
-
- ```ruby
- # Process multiple documents with AI
- documents = ['doc1.pdf', 'doc2.docx', 'doc3.xlsx']
-
- # Batch summarization
- summaries = documents.map do |file|
-   {
-     file: file,
-     summary: UniversalDocumentProcessor.ai_summarize(file, length: :short)
-   }
- end
-
- # Batch classification
- classifications = documents.map do |file|
-   {
-     file: file,
-     classification: UniversalDocumentProcessor.ai_classify(file)
-   }
- end
- ```
-
- ## 🚨 Error Handling
-
- ```ruby
- begin
-   result = UniversalDocumentProcessor.ai_analyze('document.pdf')
-   puts result
- rescue ArgumentError => e
-   puts "Configuration error: #{e.message}"
-   puts "Please check your OpenAI API key"
- rescue UniversalDocumentProcessor::ProcessingError => e
-   puts "Processing error: #{e.message}"
- rescue StandardError => e
-   puts "Unexpected error: #{e.message}"
- end
- ```
-
- ## 🎛️ Environment Variables
-
- Set these environment variables for seamless operation:
-
- ```bash
- # Required
- export OPENAI_API_KEY="your-openai-api-key"
-
- # Optional
- export OPENAI_MODEL="gpt-4"
- export OPENAI_TEMPERATURE="0.7"
- export OPENAI_BASE_URL="https://api.openai.com/v1"
- ```
-
- ## 🔧 Troubleshooting
-
- ### Common Issues and Solutions
-
- 1. **Missing API Key**
-    ```ruby
-    # Error: ArgumentError: OpenAI API key is required
-    # Solution: Set OPENAI_API_KEY environment variable or pass api_key in options
-    ```
-
- 2. **API Rate Limits**
-    ```ruby
-    # Add delays between requests for large batch operations
-    documents.each_with_index do |doc, index|
-      result = UniversalDocumentProcessor.ai_analyze(doc)
-      sleep(1) if index % 10 == 0 # Pause every 10 requests
-    end
-    ```
-
- 3. **Large Documents**
-    ```ruby
-    # For very large documents, consider processing in chunks
-    options = { max_content_length: 10000 }
-    result = UniversalDocumentProcessor.ai_analyze('large_doc.pdf', options)
-    ```
-
- ## 📚 Best Practices
-
- 1. **Optimize API Usage**
-    - Cache results for repeated analysis
-    - Use appropriate summary lengths
-    - Batch similar operations
-
- 2. **Security**
-    - Store API keys securely
-    - Don't log sensitive document content
-    - Use environment variables for configuration
-
- 3. **Performance**
-    - Process documents in parallel when possible
-    - Use specific queries rather than general analysis
-    - Consider document size when choosing AI operations
-
- ## 🎯 Next Steps
-
- 1. **Explore Advanced Features**: Try different AI models and temperature settings
- 2. **Integrate with Your Application**: Build AI-powered document workflows
- 3. **Customize for Your Domain**: Create domain-specific extraction categories
- 4. **Scale Your Usage**: Implement batch processing for large document sets
-
- ## 📞 Support
-
- For issues with AI functionality:
- 1. Check your OpenAI API key and credits
- 2. Verify document format compatibility
- 3. Review error messages for specific guidance
- 4. Consult the main gem documentation for additional features
-
- ---
-
- *This guide covers the AI capabilities of the Universal Document Processor gem. The AI features require an OpenAI API key and internet connection to function.*