ragdoll 0.1.11 → 0.1.12

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/README.md CHANGED
@@ -1,7 +1,7 @@
- <div align="center" style="background-color: yellow; color: black; padding: 20px; margin: 20px 0; border: 2px solid black; font-size: 48px; font-weight: bold;">
- ⚠️ CAUTION ⚠️<br />
- Software Under Development by a Crazy Man
- </div>
+ > [!CAUTION]<br />
+ > **Software Under Development by a Crazy Man**<br />
+ > Gave up on the multi-modal vectorization approach,<br />
+ > now using a unified text-based RAG architecture.
  <br />
  <div align="center">
  <table>
@@ -12,7 +12,8 @@
  </a>
  </td>
  <td width="50%" valign="top">
- <p>Multi-modal RAG (Retrieval-Augmented Generation) is an architecture that integrates multiple data types (such as text, images, and audio) to enhance AI response generation. It combines retrieval-based methods, which fetch relevant information from a knowledge base, with generative large language models (LLMs) that create coherent and contextually appropriate outputs. This approach allows for more comprehensive and engaging user interactions, such as chatbots that respond with both text and images or educational tools that incorporate visual aids into learning materials. By leveraging various modalities, multi-modal RAG systems improve context understanding and user experience.</p>
+ <p><strong>🔄 NEW: Unified Text-Based RAG Architecture</strong></p>
+ <p>Ragdoll has evolved to a unified text-based RAG (Retrieval-Augmented Generation) architecture that converts all media types—text, images, audio, and video—to comprehensive text representations before vectorization. This approach enables true cross-modal search where you can find images through their AI-generated descriptions, audio through transcripts, and all content through a single, powerful text-based search index.</p>
  </td>
  </tr>
  </table>
@@ -20,62 +21,66 @@

  # Ragdoll

- Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.
+ **Unified Text-Based RAG (Retrieval-Augmented Generation) library built on ActiveRecord.** Features PostgreSQL + pgvector for high-performance semantic search with a simplified architecture that converts all media types to searchable text.
+
+ RAG does not have to be hard. The new unified approach eliminates the complexity of multi-modal vectorization while enabling powerful cross-modal search capabilities. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
+
+ ## 🆕 **What's New: Unified Text-Based Architecture**
+
+ Ragdoll 2.0 introduces a revolutionary unified approach:

- RAG does not have to be hard. Every week its getting simpler. The frontier LLM providers are starting to encorporate RAG services. For example OpenAI offers a vector search service. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
+ - **All Media Text**: Images become comprehensive descriptions, audio becomes transcripts
+ - **Single Embedding Model**: One text embedding model for all content types
+ - **Cross-Modal Search**: Find images through descriptions, audio through transcripts
+ - **Simplified Architecture**: No more complex STI (Single Table Inheritance) models
+ - **Better Search**: Unified text index enables more sophisticated queries
+ - **Migration Path**: Smooth transition from the previous multi-modal system

  ## Overview

- Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
+ Ragdoll is a database-first, unified text-based Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search through a simplified unified architecture.

- The library emphasizes a dual-metadata design: LLM-derived semantic metadata for understanding content, and system file metadata for managing assets. With built-in analytics, background processing, and a high-level API, you can go from ingest to answer quickly—and scale confidently.
+ The library converts all document types to rich text representations: PDFs and documents are extracted as text, images are converted to comprehensive AI-generated descriptions, and audio files are transcribed. This unified approach enables powerful cross-modal search while maintaining simplicity.

- ### Why Ragdoll?
+ ### Why the New Unified Architecture?

- - Database-first foundation on ActiveRecord (PostgreSQL + pgvector only) for performance and reliability
- - Multi-modal architecture (text today; image/audio next) via polymorphic content design
- - Dual metadata model separating semantic analysis from file properties
- - Provider-agnostic LLM integration via `ruby_llm` (OpenAI, Anthropic, Google)
- - Production-friendly: background jobs, connection pooling, indexing, and search analytics
- - Simple, ergonomic high-level API to keep your application code clean
+ - **Simplified Complexity**: Single content model instead of multiple polymorphic types
+ - **Cross-Modal Search**: Find images by searching for objects or concepts in their descriptions
+ - **Unified Index**: One text-based search index for all content types
+ - **Better Retrieval**: Text descriptions often contain more searchable information than raw media
+ - **Cost Effective**: Single embedding model instead of specialized models per media type
+ - **Easier Maintenance**: One embedding pipeline to maintain and optimize

  ### Key Capabilities

- - Semantic search with vector similarity (cosine) across polymorphic content
- - Text ingestion, chunking, and embedding generation
- - LLM-powered structured metadata with schema validation
- - Search tracking and analytics (CTR, performance, similarity of queries)
- - Hybrid search (semantic + full-text) planned
- - Extensible model and configuration system
+ - **Universal Text Conversion**: Converts any media type to searchable text
+ - **AI-Powered Descriptions**: Comprehensive image descriptions using vision models
+ - **Audio Transcription**: Speech-to-text conversion for audio content
+ - **Semantic Search**: Vector similarity search across all converted content
+ - **Cross-Modal Retrieval**: Search for images using text descriptions of their content
+ - **Content Quality Assessment**: Automatic scoring of converted content quality
+ - **Migration Support**: Tools to migrate from previous multi-modal architecture

  ## Table of Contents

  - [Quick Start](#quick-start)
+ - [Unified Architecture Guide](#unified-architecture-guide)
+ - [Document Processing Pipeline](#document-processing-pipeline)
+ - [Cross-Modal Search](#cross-modal-search)
+ - [Migration from Multi-Modal](#migration-from-multi-modal)
  - [API Overview](#api-overview)
- - [Search and Retrieval](#search-and-retrieval)
- - [Search Analytics and Tracking](#search-analytics-and-tracking)
- - [System Operations](#system-operations)
  - [Configuration](#configuration)
- - [Current Implementation Status](#current-implementation-status)
- - [Architecture Highlights](#architecture-highlights)
- - [Text Document Processing](#text-document-processing-current)
- - [PostgreSQL + pgvector Configuration](#postgresql--pgvector-configuration)
- - [Performance Features](#performance-features)
  - [Installation](#installation)
  - [Requirements](#requirements)
- - [Use Cases](#use-cases)
- - [Environment Variables](#environment-variables)
+ - [Performance Features](#performance-features)
  - [Troubleshooting](#troubleshooting)
- - [Related Projects](#related-projects)
- - [Key Design Principles](#key-design-principles)
- - [Contributing & Support](#contributing--support)

  ## Quick Start

  ```ruby
  require 'ragdoll'

- # Configure with PostgreSQL + pgvector
+ # Configure with unified text-based architecture
  Ragdoll.configure do |config|
  # Database configuration (PostgreSQL only)
  config.database_config = {
@@ -88,260 +93,234 @@ Ragdoll.configure do |config|
  auto_migrate: true
  }

- # Ruby LLM configuration
- config.ruby_llm_config[:openai][:api_key] = ENV['OPENAI_API_KEY']
- config.ruby_llm_config[:openai][:organization] = ENV['OPENAI_ORGANIZATION']
- config.ruby_llm_config[:openai][:project] = ENV['OPENAI_PROJECT']
+ # Enable unified text-based models
+ config.use_unified_models = true
+
+ # Text conversion settings
+ config.text_conversion = {
+ image_detail_level: :comprehensive, # :minimal, :standard, :comprehensive, :analytical
+ audio_transcription_provider: :openai, # :azure, :google, :whisper_local
+ enable_fallback_descriptions: true
+ }

- # Model configuration
- config.models[:default] = 'openai/gpt-4o'
- config.models[:embedding][:text] = 'text-embedding-3-small'
+ # Single embedding model for all content
+ config.embedding_model = "text-embedding-3-large"
+ config.embedding_provider = :openai

- # Logging configuration
- config.logging_config[:log_level] = :warn
- config.logging_config[:log_filepath] = File.join(Dir.home, '.ragdoll', 'ragdoll.log')
+ # Ruby LLM configuration
+ config.ruby_llm_config[:openai][:api_key] = ENV['OPENAI_API_KEY']
  end

- # Add documents - returns detailed result
+ # Add documents - all types converted to text
  result = Ragdoll.add_document(path: 'research_paper.pdf')
- puts result[:message] # "Document 'research_paper' added successfully with ID 123"
- doc_id = result[:document_id]
+ image_result = Ragdoll.add_document(path: 'diagram.png') # Converted to description
+ audio_result = Ragdoll.add_document(path: 'lecture.mp3') # Converted to transcript

- # Check document status
- status = Ragdoll.document_status(id: doc_id)
- puts status[:message] # Shows processing status and embeddings count
+ # Cross-modal search - find images by describing their content
+ results = Ragdoll.search(query: 'neural network architecture diagram')
+ # This can return the image document if its AI description mentions neural networks

- # Search across content
- results = Ragdoll.search(query: 'neural networks')
+ # Search for audio content by transcript content
+ results = Ragdoll.search(query: 'machine learning discussion')
+ # Returns audio documents whose transcripts mention machine learning

- # Get detailed document information
- document = Ragdoll.get_document(id: doc_id)
+ # Check content quality
+ document = Ragdoll.get_document(id: result[:document_id])
+ puts document[:content_quality_score] # 0.0 to 1.0 rating
  ```

- ## API Overview
+ ## Unified Architecture Guide

- The `Ragdoll` module provides a convenient high-level API for common operations:
+ ### Document Processing Pipeline

- ### Document Management
+ The new unified pipeline converts all media types to searchable text:

  ```ruby
- # Add single document - returns detailed result hash
- result = Ragdoll.add_document(path: 'document.pdf')
- puts result[:success] # true
- puts result[:document_id] # "123"
- puts result[:message] # "Document 'document' added successfully with ID 123"
- puts result[:embeddings_queued] # true
-
- # Add document with force option to override duplicate detection
- result = Ragdoll.add_document(path: 'document.pdf', force: true)
- # Creates new document even if duplicate exists
-
- # Check document processing status
- status = Ragdoll.document_status(id: result[:document_id])
- puts status[:status] # "processed"
- puts status[:embeddings_count] # 15
- puts status[:embeddings_ready] # true
- puts status[:message] # "Document processed successfully with 15 embeddings"
-
- # Get detailed document information
- document = Ragdoll.get_document(id: result[:document_id])
- puts document[:title] # "document"
- puts document[:status] # "processed"
- puts document[:embeddings_count] # 15
- puts document[:content_length] # 5000
+ # Text files: Direct extraction
+ text_doc = Ragdoll.add_document(path: 'article.md')
+ # Content: Original markdown text

- # Update document metadata
- Ragdoll.update_document(id: result[:document_id], title: 'New Title')
+ # PDF/DOCX: Text extraction
+ pdf_doc = Ragdoll.add_document(path: 'research.pdf')
+ # Content: Extracted text from all pages

- # Delete document
- Ragdoll.delete_document(id: result[:document_id])
+ # Images: AI-generated descriptions
+ image_doc = Ragdoll.add_document(path: 'chart.png')
+ # Content: "Bar chart showing quarterly sales data with increasing trend..."

- # List all documents
- documents = Ragdoll.list_documents(limit: 10)
+ # Audio: Speech-to-text transcription
+ audio_doc = Ragdoll.add_document(path: 'meeting.mp3')
+ # Content: "In this meeting we discussed the quarterly results..."

- # System statistics
- stats = Ragdoll.stats
- puts stats[:total_documents] # 50
- puts stats[:total_embeddings] # 1250
+ # Video: Audio transcription + metadata
+ video_doc = Ragdoll.add_document(path: 'presentation.mp4')
+ # Content: Combination of audio transcript and video metadata
  ```

- ### Duplicate Detection
-
- Ragdoll includes sophisticated duplicate detection to prevent redundant document processing:
+ ### Text Conversion Services

  ```ruby
- # Automatic duplicate detection (default behavior)
- result1 = Ragdoll.add_document(path: 'research.pdf')
- result2 = Ragdoll.add_document(path: 'research.pdf')
- # result2 returns the same document_id as result1 (duplicate detected)
-
- # Force adding a duplicate document
- result3 = Ragdoll.add_document(path: 'research.pdf', force: true)
- # Creates a new document with modified location identifier
-
- # Duplicate detection criteria:
- # 1. Exact location/path match
- # 2. File modification time (for files)
- # 3. File content hash (SHA256)
- # 4. Content hash for text
- # 5. File size and metadata similarity
- # 6. Document title and type matching
- ```
+ # Use individual conversion services
+ text_content = Ragdoll::TextExtractionService.extract('document.pdf')
+ image_description = Ragdoll::ImageToTextService.convert('photo.jpg', detail_level: :comprehensive)
+ audio_transcript = Ragdoll::AudioToTextService.transcribe('speech.wav')

- **Duplicate Detection Features:**
- - **Multi-level detection**: Checks location, file hash, content hash, and metadata
- - **Smart similarity**: Detects duplicates even with minor differences (5% content tolerance)
- - **File integrity**: SHA256 hashing for reliable file comparison
- - **URL support**: Content-based detection for web documents
- - **Force option**: Override detection when needed
- - **Performance optimized**: Database indexes for fast lookups
+ # Use unified converter (orchestrates all services)
+ unified_text = Ragdoll::DocumentConverter.convert_to_text('any_file.ext')

- ### Search and Retrieval
+ # Manage documents with unified approach
+ management = Ragdoll::UnifiedDocumentManagement.new
+ document = management.add_document('mixed_media_file.mov')
+ ```
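Editor's note: the unified converter added above is, at its core, a dispatch-by-file-type step. A minimal sketch of that idea follows; the extension table and method names are illustrative stand-ins, not Ragdoll's actual implementation.

```ruby
# Illustrative sketch: route a file to a text-conversion step by extension.
# The step names below are hypothetical stand-ins for Ragdoll's services.
CONVERTERS = {
  %w[.txt .md .html] => :read_text,
  %w[.pdf .docx]     => :extract_text,
  %w[.png .jpg .gif] => :describe_image,
  %w[.mp3 .wav .m4a] => :transcribe_audio
}.freeze

def conversion_step_for(path)
  ext = File.extname(path).downcase
  CONVERTERS.each { |exts, step| return step if exts.include?(ext) }
  :fallback_description # placeholder text when no converter matches
end
```

Unknown extensions fall through to a fallback description, mirroring the `enable_fallback_descriptions` setting shown in the configuration.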

- ```ruby
- # Semantic search across all content types
- results = Ragdoll.search(query: 'artificial intelligence')
-
- # Search with automatic tracking (default)
- results = Ragdoll.search(
- query: 'machine learning',
- session_id: 123, # Optional: track user sessions
- user_id: 456 # Optional: track by user
- )
+ ### Content Quality Assessment

- # Search specific content types
- text_results = Ragdoll.search(query: 'machine learning', content_type: 'text')
- image_results = Ragdoll.search(query: 'neural network diagram', content_type: 'image')
- audio_results = Ragdoll.search(query: 'AI discussion', content_type: 'audio')
+ ```ruby
+ # Get content quality scores
+ document = Ragdoll::UnifiedDocument.find(id)
+ quality = document.content_quality_score # 0.0 to 1.0
+
+ # Quality factors:
+ # - Content length (50-2000 words optimal)
+ # - Original media type (text > documents > descriptions > placeholders)
+ # - Conversion success (full content > partial > fallback)
+
+ # Batch quality assessment
+ stats = Ragdoll::UnifiedContent.stats
+ puts stats[:content_quality_distribution]
+ # => { high: 150, medium: 75, low: 25 }
+ ```
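Editor's note: the quality factors listed in the added lines suggest a length-plus-provenance heuristic. A sketch of one such scoring function follows; the weights, cutoffs, and media-type priors are invented for illustration and are not Ragdoll's actual formula.

```ruby
# Illustrative quality heuristic: blend a word-count score with a
# media-type prior. All numbers here are assumptions for the sketch.
MEDIA_PRIOR = { 'text' => 1.0, 'document' => 0.9, 'image' => 0.7, 'audio' => 0.7 }.freeze

def content_quality_score(text, original_media_type)
  words = text.split.size
  length_score =
    if words.between?(50, 2000) then 1.0  # optimal range per the README
    elsif words < 50 then words / 50.0    # penalize very short conversions
    else 2000.0 / words                   # penalize extremely long blobs
    end
  prior = MEDIA_PRIOR.fetch(original_media_type, 0.5)
  (0.6 * length_score + 0.4 * prior).round(2)
end
```

A short, fallback-style image description scores lower than the same-length native text, which matches the ordering text > documents > descriptions given above.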

- # Advanced search with metadata filters
- results = Ragdoll.search(
- query: 'deep learning',
- classification: 'research',
- keywords: ['AI', 'neural networks'],
- tags: ['technical']
- )
+ ## Cross-Modal Search

- # Get context for RAG applications
- context = Ragdoll.get_context(query: 'machine learning', limit: 5)
+ The unified architecture enables powerful cross-modal search capabilities:

- # Enhanced prompt with context
- enhanced = Ragdoll.enhance_prompt(
- prompt: 'What is machine learning?',
- context_limit: 5
+ ```ruby
+ # Find images by describing their visual content
+ image_results = Ragdoll.search(query: 'red sports car in parking lot')
+ # Returns image documents whose AI descriptions match the query
+
+ # Search for audio by spoken content
+ audio_results = Ragdoll.search(query: 'quarterly sales meeting discussion')
+ # Returns audio documents whose transcripts contain these topics
+
+ # Mixed results across all media types
+ all_results = Ragdoll.search(query: 'artificial intelligence')
+ # Returns text documents, images with AI descriptions, and audio transcripts
+ # all ranked by relevance to the query
+
+ # Filter by original media type while searching text
+ image_only = Ragdoll.search(
+ query: 'machine learning workflow',
+ original_media_type: 'image'
  )

- # Hybrid search combining semantic and full-text
- results = Ragdoll.hybrid_search(
- query: 'neural networks',
- semantic_weight: 0.7,
- text_weight: 0.3
+ # Search with quality filtering
+ high_quality = Ragdoll.search(
+ query: 'deep learning',
+ min_quality_score: 0.7
  )
  ```
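Editor's note: cross-modal ranking works because every item, whatever its original media type, ends up as one text embedding; relevance is then plain cosine similarity in a single vector space. A toy sketch with hand-made three-dimensional vectors (stand-ins for real embedding output, not Ragdoll code):

```ruby
# Toy illustration: once images and audio are text, one cosine-similarity
# ranking covers every media type.
def cosine(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x }))
end

def rank(query_vec, docs)
  docs.sort_by { |d| -cosine(query_vec, d[:embedding]) }
end

docs = [
  { title: 'essay.md',    media: 'text',  embedding: [0.9, 0.1, 0.0] },
  { title: 'diagram.png', media: 'image', embedding: [0.8, 0.2, 0.1] }, # vector of its AI description
  { title: 'talk.mp3',    media: 'audio', embedding: [0.1, 0.9, 0.2] }  # vector of its transcript
]
```

The image competes in the same ranking as the text document because only its description's embedding is compared, which is the whole point of the unified index.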

- ### Keywords Search
+ ## Migration from Multi-Modal

- Ragdoll supports powerful keywords-based search that can be used standalone or combined with semantic search. The keywords system uses PostgreSQL array operations for high performance and supports both partial matching (overlap) and exact matching (contains all).
+ Migrate smoothly from the previous multi-modal architecture:

  ```ruby
- # Keywords-only search (overlap - documents containing any of the keywords)
- results = Ragdoll::Document.search_by_keywords(['machine', 'learning', 'ai'])
+ # Check migration readiness
+ migration_service = Ragdoll::MigrationService.new
+ report = migration_service.create_comparison_report

- # Results are sorted by match count (documents with more keyword matches rank higher)
- results.each do |doc|
- puts "#{doc.title}: #{doc.keywords_match_count} matches"
- end
+ puts "Migration Benefits:"
+ report[:benefits].each { |benefit, description| puts "- #{description}" }

- # Exact keywords search (contains all - documents must have ALL keywords)
- results = Ragdoll::Document.search_by_keywords_all(['ruby', 'programming'])
-
- # Results are sorted by focus (fewer total keywords = more focused document)
- results.each do |doc|
- puts "#{doc.title}: #{doc.total_keywords_count} total keywords"
- end
-
- # Combined semantic + keywords search for best results
- results = Ragdoll.search(
- query: 'artificial intelligence applications',
- keywords: ['ai', 'machine learning', 'neural networks'],
- limit: 10
+ # Migrate all documents
+ results = Ragdoll::MigrationService.migrate_all_documents(
+ batch_size: 50,
+ process_embeddings: true
  )

- # Keywords search with options
- results = Ragdoll::Document.search_by_keywords(
- ['web', 'javascript', 'frontend'],
- limit: 20
- )
+ puts "Migrated: #{results[:migrated]} documents"
+ puts "Errors: #{results[:errors].length}"
+
+ # Validate migration integrity
+ validation = migration_service.validate_migration
+ puts "Validation passed: #{validation[:passed]}/#{validation[:total_checks]} checks"

- # Case-insensitive keyword matching (automatically normalized)
- results = Ragdoll::Document.search_by_keywords(['Python', 'DATA-SCIENCE', 'ai'])
- # Will match documents with keywords: ['python', 'data-science', 'ai']
+ # Migrate individual document
+ migrated_doc = Ragdoll::MigrationService.migrate_document(old_document_id)
  ```
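Editor's note: a `migrate_all_documents`-style API with `batch_size`, a `migrated` count, and an `errors` list implies a batch-and-collect loop underneath. A generic sketch of that pattern follows (the helper name and result shape are assumptions for illustration; the yielded block stands in for the per-document migration step):

```ruby
# Illustrative batching pattern: process IDs in fixed-size slices, tally
# successes, and collect failures instead of aborting the whole run.
def migrate_in_batches(ids, batch_size: 50)
  results = { migrated: 0, errors: [] }
  ids.each_slice(batch_size) do |batch|
    batch.each do |id|
      begin
        yield id # per-document migration step
        results[:migrated] += 1
      rescue StandardError => e
        results[:errors] << { id: id, message: e.message }
      end
    end
  end
  results
end
```

Collecting errors per document is what makes the `results[:errors].length` reporting shown above possible after a partially failed run.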

- **Keywords Search Features:**
- - **High Performance**: Uses PostgreSQL GIN indexes for fast array operations
- - **Flexible Matching**: Supports both overlap (`&&`) and contains (`@>`) operators
- - **Smart Scoring**: Results ordered by match count or document focus
- - **Case Insensitive**: Automatic keyword normalization
- - **Integration Ready**: Works seamlessly with semantic search
- - **Inspired by `find_matching_entries.rb`**: Optimized for PostgreSQL arrays
+ ## API Overview
+
+ ### Unified Document Management

- ### Search Analytics and Tracking
+ ```ruby
+ # Add documents with automatic text conversion
+ result = Ragdoll.add_document(path: 'any_file.ext')
+ puts result[:document_id]
+ puts result[:content_preview] # First 100 characters of converted text
+
+ # Batch processing with unified pipeline
+ files = ['doc.pdf', 'image.jpg', 'audio.mp3']
+ results = Ragdoll::UnifiedDocumentManagement.new.batch_process_documents(files)
+
+ # Reprocess with different conversion settings
+ Ragdoll::UnifiedDocumentManagement.new.reprocess_document(
+ document_id,
+ image_detail_level: :analytical
+ )
+ ```

- Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
+ ### Search API

  ```ruby
- # Get search analytics for the last 30 days
- analytics = Ragdoll::Search.search_analytics(days: 30)
- puts "Total searches: #{analytics[:total_searches]}"
- puts "Unique queries: #{analytics[:unique_queries]}"
- puts "Average execution time: #{analytics[:avg_execution_time]}ms"
- puts "Click-through rate: #{analytics[:click_through_rate]}%"
-
- # Find similar searches using vector similarity
- search = Ragdoll::Search.first
- similar_searches = search.nearest_neighbors(:query_embedding, distance: :cosine).limit(5)
-
- similar_searches.each do |similar|
- puts "Query: #{similar.query}"
- puts "Similarity: #{similar.neighbor_distance}"
- puts "Results: #{similar.results_count}"
- end
+ # Unified search across all content types
+ results = Ragdoll.search(query: 'machine learning algorithms')

- # Track user interactions (clicks on search results)
- search_result = Ragdoll::SearchResult.first
- search_result.mark_as_clicked!
+ # Search with original media type context
+ results.each do |doc|
+ puts "#{doc.title} (originally #{doc.original_media_type})"
+ puts "Quality: #{doc.content_quality_score.round(2)}"
+ puts "Content: #{doc.content[0..100]}..."
+ end

- # Disable tracking for specific searches if needed
- results = Ragdoll.search(
- query: 'private query',
- track_search: false
+ # Advanced search with content quality
+ high_quality_results = Ragdoll.search(
+ query: 'neural networks',
+ min_quality_score: 0.8,
+ limit: 10
  )
  ```

- ### System Operations
+ ### Content Analysis

  ```ruby
- # Get system statistics
- stats = Ragdoll.stats
- # Returns information about documents, content types, embeddings, etc.
+ # Analyze converted content
+ document = Ragdoll::UnifiedDocument.find(id)

- # Health check
- healthy = Ragdoll.healthy?
+ # Check original media type
+ puts document.unified_contents.first.original_media_type # 'image', 'audio', 'text', etc.

- # Get configuration
- config = Ragdoll.configuration
+ # View conversion metadata
+ content = document.unified_contents.first
+ puts content.conversion_method # 'image_to_text', 'audio_transcription', etc.
+ puts content.metadata # Conversion settings and results

- # Reset configuration (useful for testing)
- Ragdoll.reset_configuration!
+ # Quality metrics
+ puts content.word_count
+ puts content.character_count
+ puts content.content_quality_score
  ```

- ### Configuration
+ ## Configuration

  ```ruby
- # Configure the system
  Ragdoll.configure do |config|
- # Database configuration (PostgreSQL only - REQUIRED)
+ # Enable unified text-based architecture
+ config.use_unified_models = true
+
+ # Database configuration (PostgreSQL required)
  config.database_config = {
  adapter: 'postgresql',
  database: 'ragdoll_production',
@@ -352,142 +331,74 @@ Ragdoll.configure do |config|
  auto_migrate: true
  }

- # Ruby LLM configuration for multiple providers
- config.ruby_llm_config[:openai][:api_key] = ENV['OPENAI_API_KEY']
- config.ruby_llm_config[:openai][:organization] = ENV['OPENAI_ORGANIZATION']
- config.ruby_llm_config[:openai][:project] = ENV['OPENAI_PROJECT']
+ # Text conversion settings
+ config.text_conversion = {
+ # Image conversion detail levels:
+ # :minimal - Brief one-sentence description
+ # :standard - Main elements and composition
+ # :comprehensive - Detailed description including objects, colors, mood
+ # :analytical - Thorough analysis including artistic elements
+ image_detail_level: :comprehensive,
+
+ # Audio transcription providers
+ audio_transcription_provider: :openai, # :azure, :google, :whisper_local
+
+ # Fallback behavior
+ enable_fallback_descriptions: true,
+ fallback_timeout: 30 # seconds
+ }
+
+ # Single embedding model for all content types
+ config.embedding_model = "text-embedding-3-large"
+ config.embedding_provider = :openai

+ # Ruby LLM configuration for text conversion
+ config.ruby_llm_config[:openai][:api_key] = ENV['OPENAI_API_KEY']
  config.ruby_llm_config[:anthropic][:api_key] = ENV['ANTHROPIC_API_KEY']
- config.ruby_llm_config[:google][:api_key] = ENV['GOOGLE_API_KEY']

- # Model configuration
- config.models[:default] = 'openai/gpt-4o'
- config.models[:summary] = 'openai/gpt-4o'
- config.models[:keywords] = 'openai/gpt-4o'
- config.models[:embedding][:text] = 'text-embedding-3-small'
- config.models[:embedding][:image] = 'image-embedding-3-small'
- config.models[:embedding][:audio] = 'audio-embedding-3-small'
+ # Vision model configuration for image descriptions
+ config.vision_config = {
+ primary_model: 'gpt-4-vision-preview',
+ fallback_model: 'gemini-pro-vision',
+ temperature: 0.2
+ }

- # Logging configuration
- config.logging_config[:log_level] = :warn # :debug, :info, :warn, :error, :fatal
- config.logging_config[:log_filepath] = File.join(Dir.home, '.ragdoll', 'ragdoll.log')
+ # Audio transcription configuration
+ config.audio_config = {
+ openai: {
+ model: 'whisper-1',
+ temperature: 0.0
+ },
+ azure: {
+ endpoint: ENV['AZURE_SPEECH_ENDPOINT'],
+ api_key: ENV['AZURE_SPEECH_KEY']
+ }
+ }

  # Processing settings
  config.chunking[:text][:max_tokens] = 1000
  config.chunking[:text][:overlap] = 200
  config.search[:similarity_threshold] = 0.7
  config.search[:max_results] = 10
- end
- ```
-
- ## Current Implementation Status
-
- ### ✅ **Fully Implemented**
- - **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files with encoding fallback
- - **Embedding generation**: Text chunking and vector embedding creation
- - **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
- - **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
- - **Search functionality**: Semantic search with cosine similarity and usage analytics
- - **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
- - **Document management**: Add, update, delete, list operations
- - **Duplicate detection**: Multi-level duplicate prevention with file hash, content hash, and metadata comparison
- - **Background processing**: ActiveJob integration for async embedding generation
- - **LLM metadata generation**: AI-powered structured content analysis with schema validation
- - **Logging**: Configurable file-based logging with multiple levels
-
- ### 🚧 **In Development**
- - **Image processing**: Framework exists but vision AI integration needs completion
- - **Audio processing**: Framework exists but speech-to-text integration needs completion
- - **Hybrid search**: Combining semantic and full-text search capabilities
-
- ### 📋 **Planned Features**
- - **Multi-modal search**: Search across text, image, and audio content types
- - **Content-type specific embedding models**: Different models for text, image, audio
- - **Enhanced metadata schemas**: Domain-specific metadata templates
-
- ## Architecture Highlights
-
- ### Dual Metadata Design
-
- Ragdoll uses a sophisticated dual metadata architecture to separate concerns:
-
- - **`metadata` (JSON)**: LLM-generated content analysis including summary, keywords, classification, topics, sentiment, and domain-specific insights
- - **`file_metadata` (JSON)**: System-generated file properties including size, MIME type, dimensions, processing parameters, and technical characteristics
-
- This separation enables both semantic search operations on content meaning and efficient file management operations.
-
- ### Polymorphic Multi-Modal Architecture
-
- The database schema uses polymorphic associations to elegantly support multiple content types:
-
- - **Documents**: Central entity with dual metadata columns
- - **Content Types**: Specialized tables for `text_contents`, `image_contents`, `audio_contents`
- - **Embeddings**: Unified vector storage via polymorphic `embeddable` associations
-
- ## Text Document Processing (Current)
-
- Currently, Ragdoll processes text documents through:
-
- 1. **Content Extraction**: Extracts text from PDF, DOCX, HTML, Markdown, and plain text
- 2. **Metadata Generation**: AI-powered analysis creates structured content metadata
- 3. **Text Chunking**: Splits content into manageable chunks with configurable size/overlap
- 4. **Embedding Generation**: Creates vector embeddings using OpenAI or other providers
- 5. **Database Storage**: Stores in polymorphic multi-modal architecture with dual metadata
- 6. **Search**: Semantic search using cosine similarity with usage analytics
-
- ### Example Usage
-
- ```ruby
- # Add a text document
- result = Ragdoll.add_document(path: 'document.pdf')
-
- # Check processing status
- status = Ragdoll.document_status(id: result[:document_id])
-
- # Search the content
- results = Ragdoll.search(query: 'machine learning')
- ```
-
- ## PostgreSQL + pgvector Configuration
-
- ### Database Setup
-
- ```bash
- # Install PostgreSQL and pgvector
- brew install postgresql pgvector # macOS
- # or
- apt-get install postgresql postgresql-contrib # Ubuntu
-
- # Create database and enable pgvector extension
462
- createdb ragdoll_production
463
- psql -d ragdoll_production -c "CREATE EXTENSION IF NOT EXISTS vector;"
464
- ```
465
383
 
- ### Configuration Example
-
- ```ruby
- Ragdoll.configure do |config|
-   config.database_config = {
-     adapter: 'postgresql',
-     database: 'ragdoll_production',
-     username: 'ragdoll',
-     password: ENV['DATABASE_PASSWORD'],
-     host: 'localhost',
-     port: 5432,
-     pool: 20,
-     auto_migrate: true
+   # Quality thresholds
+   config.quality_thresholds = {
+     high_quality: 0.8,
+     medium_quality: 0.5,
+     min_content_length: 50
  }
  end
  ```
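To make these thresholds concrete, here is a minimal, self-contained sketch of how a converted document's quality score could be bucketed against this configuration. The `THRESHOLDS` constant and `quality_bucket` helper are illustrative only, not part of Ragdoll's API:

```ruby
# Hypothetical helper mirroring the configuration above; names are illustrative.
THRESHOLDS = { high_quality: 0.8, medium_quality: 0.5, min_content_length: 50 }.freeze

def quality_bucket(score, text, thresholds = THRESHOLDS)
  # Content below the minimum length is rejected before scoring.
  return :too_short if text.length < thresholds[:min_content_length]

  if score >= thresholds[:high_quality]
    :high
  elsif score >= thresholds[:medium_quality]
    :medium
  else
    :low
  end
end

quality_bucket(0.9, 'x' * 120)  # a rich AI-generated description rates :high
quality_bucket(0.6, 'x' * 120)  # a usable but thin transcript rates :medium
quality_bucket(0.9, 'x' * 10)   # too little text to be worth indexing
```

Content falling in the low bucket is a natural candidate for reprocessing at a higher detail level, as shown later under Troubleshooting.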
 
  ## Performance Features

- - **Native pgvector**: Hardware-accelerated similarity search
- - **IVFFlat indexing**: Fast approximate nearest neighbor search
- - **Polymorphic embeddings**: Unified search across content types
- - **Batch processing**: Efficient bulk operations
- - **Background jobs**: Asynchronous document processing
- - **Connection pooling**: High-concurrency support
+ - **Unified Index**: Single text-based search index for all content types
+ - **Optimized Conversion**: Efficient text extraction and AI-powered description generation
+ - **Quality Scoring**: Automatic assessment of converted content quality
+ - **Batch Processing**: Efficient bulk document processing with progress tracking
+ - **Smart Caching**: Caches conversion results to avoid reprocessing
+ - **Background Jobs**: Asynchronous processing for large files
+ - **Cross-Modal Optimization**: Specialized optimizations for different media type conversions
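The "Smart Caching" bullet can be sketched in a few self-contained lines. This `ConversionCache` is a hypothetical stand-in, not the gem's actual class: results are keyed by a digest of the raw bytes, so an unchanged file never pays the media-to-text conversion cost twice.

```ruby
require 'digest'

# Illustrative conversion cache: results keyed by SHA-256 of the raw bytes,
# so re-adding an unchanged file skips the expensive media-to-text step.
class ConversionCache
  def initialize
    @store = {}   # digest => converted text
    @misses = 0
  end

  attr_reader :misses

  def fetch(raw_bytes)
    key = Digest::SHA256.hexdigest(raw_bytes)
    @store.fetch(key) do
      @misses += 1                    # converter runs only on a miss
      @store[key] = yield(raw_bytes)
    end
  end
end

cache = ConversionCache.new
convert = ->(bytes) { "description of #{bytes.length} bytes" }  # stand-in converter
cache.fetch('same image bytes', &convert)
cache.fetch('same image bytes', &convert)  # second call is served from cache
puts cache.misses  # prints 1
```

Keying on content rather than filename means a renamed but unmodified file is still a cache hit.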
 
  ## Installation

@@ -497,6 +408,12 @@ brew install postgresql pgvector # macOS
  # or
  apt-get install postgresql postgresql-contrib # Ubuntu

+ # For image processing
+ brew install imagemagick
+
+ # For audio processing (optional, depending on provider)
+ brew install ffmpeg
+
  # Install gem
  gem install ragdoll

@@ -507,61 +424,83 @@ gem 'ragdoll'
  ## Requirements

  - **Ruby**: 3.2+
- - **PostgreSQL**: 12+ with pgvector extension (REQUIRED - no other databases supported)
- - **Dependencies**: activerecord, pg, pgvector, neighbor, ruby_llm, pdf-reader, docx, rubyzip, shrine, rmagick, opensearch-ruby, searchkick, ruby-progressbar
+ - **PostgreSQL**: 12+ with pgvector extension
+ - **ImageMagick**: For image processing and metadata extraction
+ - **FFmpeg**: Optional, for advanced audio/video processing
+ - **Dependencies**: activerecord, pg, pgvector, neighbor, ruby_llm, pdf-reader, docx, rmagick, tempfile

- ## Use Cases
+ ### Vision Model Requirements

- - Internal knowledge bases and chat assistants grounded in your documents
- - Product documentation and support search with analytics and relevance feedback
- - Research corpora exploration (summaries, topics, similarity) across large text sets
- - Incident retrospectives and operational analytics with searchable write-ups
- - Media libraries preparing for text + image + audio pipelines (image/audio in progress)
+ For comprehensive image descriptions:
+ - **OpenAI**: GPT-4 Vision (recommended)
+ - **Google**: Gemini Pro Vision
+ - **Anthropic**: Claude 3 with vision capabilities
+ - **Local**: Ollama with vision-capable models

- ## Environment Variables
+ ### Audio Transcription Requirements

- Set the following as environment variables (do not commit secrets to source control):
-
- - `OPENAI_API_KEY` required for OpenAI models
- - `OPENAI_ORGANIZATION` optional, for OpenAI org scoping
- - `OPENAI_PROJECT` — optional, for OpenAI project scoping
- - `ANTHROPIC_API_KEY` — optional, for Anthropic models
- - `GOOGLE_API_KEY` — optional, for Google models
- - `DATABASE_PASSWORD` — your PostgreSQL password if not using peer auth
+ - **OpenAI**: Whisper API (recommended)
+ - **Azure**: Speech Services
+ - **Google**: Cloud Speech-to-Text
+ - **Local**: Whisper installation
 
  ## Troubleshooting

- ### pgvector extension missing
-
- - Ensure the extension is enabled in your database:
+ ### Image Processing Issues

  ```bash
- psql -d ragdoll_production -c "CREATE EXTENSION IF NOT EXISTS vector;"
+ # Verify ImageMagick installation
+ convert -version
+
+ # Check vision model access
+ irb -r ragdoll
+ > Ragdoll::ImageToTextService.new.convert('test_image.jpg')
  ```
 
- - If the command fails, verify PostgreSQL and pgvector are installed and that you’re connecting to the correct database.
+ ### Audio Processing Issues

- ### Document stuck in "processing"
+ ```bash
+ # For Whisper local installation
+ pip install openai-whisper

- - Confirm your API keys are set and valid.
- - Ensure `auto_migrate: true` in configuration (or run migrations if you manage schema yourself).
- - Check logs at the path configured by `logging_config[:log_filepath]` for errors.
+ # Test audio file support
+ irb -r ragdoll
+ > Ragdoll::AudioToTextService.new.transcribe('test_audio.wav')
+ ```

- ## Related Projects
+ ### Content Quality Issues

- - **ragdoll-cli**: Standalone CLI application using ragdoll
- - **ragdoll-rails**: Rails engine with web interface for ragdoll
+ ```ruby
+ # Check content quality distribution
+ stats = Ragdoll::UnifiedContent.stats
+ puts stats[:content_quality_distribution]
+
+ # Reprocess low-quality content
+ low_quality = Ragdoll::UnifiedDocument.joins(:unified_contents)
+                                       .where('unified_contents.content_quality_score < 0.5')
+
+ low_quality.each do |doc|
+   Ragdoll::UnifiedDocumentManagement.new.reprocess_document(
+     doc.id,
+     image_detail_level: :analytical
+   )
+ end
+ ```
 
- ## Contributing & Support
+ ## Use Cases

- Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request. For questions and feedback, open an issue in this repository.
+ - **Knowledge Bases**: Search across text documents, presentation images, and recorded meetings
+ - **Media Libraries**: Find images by visual content, audio by spoken topics
+ - **Research Collections**: Unified search across papers (text), charts (images), and interviews (audio)
+ - **Documentation Systems**: Search technical docs, architecture diagrams, and explanation videos
+ - **Educational Content**: Find learning materials across all media types through unified text search

  ## Key Design Principles

- 1. **Database-Oriented**: Built on ActiveRecord with PostgreSQL + pgvector for production performance
- 2. **Multi-Modal First**: Text, image, and audio content as first-class citizens via polymorphic architecture
- 3. **Dual Metadata Design**: Separates LLM-generated content analysis from file properties
- 4. **LLM-Enhanced**: Structured metadata generation with schema validation using latest AI capabilities
- 5. **High-Level API**: Simple, intuitive interface for complex operations
- 6. **Scalable**: Designed for production workloads with background processing and proper indexing
- 7. **Extensible**: Easy to add new content types and embedding models through polymorphic design
+ 1. **Unified Text Representation**: All media types converted to searchable text
+ 2. **Cross-Modal Search**: Images findable through descriptions, audio through transcripts
+ 3. **Quality-Driven**: Automatic assessment and optimization of converted content
+ 4. **Simplified Architecture**: Single content model instead of complex polymorphic relationships
+ 5. **AI-Enhanced Conversion**: Leverages latest vision and speech models for rich text conversion
+ 6. **Migration-Friendly**: Smooth transition path from previous multi-modal architecture
+ 7. **Performance-Optimized**: Single embedding model and unified search index for speed
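The first principle can be illustrated with a tiny dispatcher sketch. The converters here are mocked stand-ins (Ragdoll's real services call vision and speech models): every media type funnels through a type-specific converter into plain text, and only that text is embedded and indexed.

```ruby
# Illustrative unified-conversion dispatcher: each file type maps to a
# converter that produces plain text, so the search index sees one format.
CONVERTERS = {
  '.txt' => ->(path) { File.read(path) },
  '.jpg' => ->(path) { "AI-generated description of image #{File.basename(path)}" },
  '.wav' => ->(path) { "transcript of audio #{File.basename(path)}" }
}.freeze

def to_unified_text(path)
  ext = File.extname(path).downcase
  converter = CONVERTERS.fetch(ext) { raise ArgumentError, "unsupported type: #{ext}" }
  converter.call(path)
end

puts to_unified_text('chart.jpg')  # the image becomes searchable text
```

Because everything ends up as text, one embedding model and one index serve every media type, which is the basis of the cross-modal search described above.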