ragdoll 0.1.11 → 0.1.12
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +323 -384
- data/app/models/ragdoll/document.rb +1 -1
- data/app/models/ragdoll/unified_content.rb +216 -0
- data/app/models/ragdoll/unified_document.rb +338 -0
- data/app/services/ragdoll/audio_to_text_service.rb +200 -0
- data/app/services/ragdoll/document_converter.rb +216 -0
- data/app/services/ragdoll/document_processor.rb +197 -331
- data/app/services/ragdoll/image_to_text_service.rb +322 -0
- data/app/services/ragdoll/migration_service.rb +340 -0
- data/app/services/ragdoll/text_extraction_service.rb +422 -0
- data/app/services/ragdoll/unified_document_management.rb +300 -0
- data/db/migrate/20250923000001_create_ragdoll_unified_contents.rb +87 -0
- data/lib/ragdoll/core/version.rb +1 -1
- data/lib/ragdoll/core.rb +7 -0
- metadata +11 -2
data/README.md
CHANGED
@@ -1,7 +1,7 @@

> [!CAUTION]<br />
> **Software Under Development by a Crazy Man**<br />
> Gave up on the multi-modal vectorization approach,<br />
> now using a unified text-based RAG architecture.
<br />
<div align="center">
<table>

@@ -12,7 +12,8 @@

</a>
</td>
<td width="50%" valign="top">
<p><strong>🔄 NEW: Unified Text-Based RAG Architecture</strong></p>
<p>Ragdoll has evolved to a unified text-based RAG (Retrieval-Augmented Generation) architecture that converts all media types—text, images, audio, and video—to comprehensive text representations before vectorization. This approach enables true cross-modal search where you can find images through their AI-generated descriptions, audio through transcripts, and all content through a single, powerful text-based search index.</p>
</td>
</tr>
</table>

@@ -20,62 +21,66 @@

# Ragdoll

**Unified Text-Based RAG (Retrieval-Augmented Generation) library built on ActiveRecord.** Features PostgreSQL + pgvector for high-performance semantic search with a simplified architecture that converts all media types to searchable text.

RAG does not have to be hard. The new unified approach eliminates the complexity of multi-modal vectorization while enabling powerful cross-modal search capabilities. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)

## 🆕 **What's New: Unified Text-Based Architecture**

Ragdoll 2.0 introduces a unified approach:

- **All Media → Text**: Images become comprehensive descriptions, audio becomes transcripts
- **Single Embedding Model**: One text embedding model for all content types
- **Cross-Modal Search**: Find images through descriptions, audio through transcripts
- **Simplified Architecture**: No more complex STI (Single Table Inheritance) models
- **Better Search**: Unified text index enables more sophisticated queries
- **Migration Path**: Smooth transition from the previous multi-modal system

## Overview

Ragdoll is a database-first, unified text-based Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search through a simplified unified architecture.

The library converts all document types to rich text representations: PDFs and documents are extracted as text, images are converted to comprehensive AI-generated descriptions, and audio files are transcribed. This unified approach enables powerful cross-modal search while maintaining simplicity.

### Why the New Unified Architecture?

- **Simplified Complexity**: Single content model instead of multiple polymorphic types
- **Cross-Modal Search**: Find images by searching for objects or concepts in their descriptions
- **Unified Index**: One text-based search index for all content types
- **Better Retrieval**: Text descriptions often contain more searchable information than raw media
- **Cost Effective**: Single embedding model instead of specialized models per media type
- **Easier Maintenance**: One embedding pipeline to maintain and optimize

### Key Capabilities

- **Universal Text Conversion**: Converts any media type to searchable text
- **AI-Powered Descriptions**: Comprehensive image descriptions using vision models
- **Audio Transcription**: Speech-to-text conversion for audio content
- **Semantic Search**: Vector similarity search across all converted content
- **Cross-Modal Retrieval**: Search for images using text descriptions of their content
- **Content Quality Assessment**: Automatic scoring of converted content quality
- **Migration Support**: Tools to migrate from previous multi-modal architecture

## Table of Contents

- [Quick Start](#quick-start)
- [Unified Architecture Guide](#unified-architecture-guide)
- [Document Processing Pipeline](#document-processing-pipeline)
- [Cross-Modal Search](#cross-modal-search)
- [Migration from Multi-Modal](#migration-from-multi-modal)
- [API Overview](#api-overview)
- [Configuration](#configuration)
- [Installation](#installation)
- [Requirements](#requirements)
- [Performance Features](#performance-features)
- [Troubleshooting](#troubleshooting)

## Quick Start

```ruby
require 'ragdoll'

# Configure with unified text-based architecture
Ragdoll.configure do |config|
  # Database configuration (PostgreSQL only)
  config.database_config = {
    # ...
```

@@ -88,260 +93,234 @@ Ragdoll.configure do |config|

```ruby
    auto_migrate: true
  }

  # Enable unified text-based models
  config.use_unified_models = true

  # Text conversion settings
  config.text_conversion = {
    image_detail_level: :comprehensive,    # :minimal, :standard, :comprehensive, :analytical
    audio_transcription_provider: :openai, # :azure, :google, :whisper_local
    enable_fallback_descriptions: true
  }

  # Single embedding model for all content
  config.embedding_model = "text-embedding-3-large"
  config.embedding_provider = :openai

  # Ruby LLM configuration
  config.ruby_llm_config[:openai][:api_key] = ENV['OPENAI_API_KEY']
end

# Add documents - all types converted to text
result = Ragdoll.add_document(path: 'research_paper.pdf')
image_result = Ragdoll.add_document(path: 'diagram.png') # Converted to description
audio_result = Ragdoll.add_document(path: 'lecture.mp3') # Converted to transcript

# Cross-modal search - find images by describing their content
results = Ragdoll.search(query: 'neural network architecture diagram')
# This can return the image document if its AI description mentions neural networks

# Search for audio content by transcript content
results = Ragdoll.search(query: 'machine learning discussion')
# Returns audio documents whose transcripts mention machine learning

# Check content quality
document = Ragdoll.get_document(id: result[:document_id])
puts document[:content_quality_score] # 0.0 to 1.0 rating
```

## Unified Architecture Guide

### Document Processing Pipeline

The new unified pipeline converts all media types to searchable text:

```ruby
# Text files: Direct extraction
text_doc = Ragdoll.add_document(path: 'article.md')
# Content: Original markdown text

# PDF/DOCX: Text extraction
pdf_doc = Ragdoll.add_document(path: 'research.pdf')
# Content: Extracted text from all pages

# Images: AI-generated descriptions
image_doc = Ragdoll.add_document(path: 'chart.png')
# Content: "Bar chart showing quarterly sales data with increasing trend..."

# Audio: Speech-to-text transcription
audio_doc = Ragdoll.add_document(path: 'meeting.mp3')
# Content: "In this meeting we discussed the quarterly results..."

# Video: Audio transcription + metadata
video_doc = Ragdoll.add_document(path: 'presentation.mp4')
# Content: Combination of audio transcript and video metadata
```

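The routing step of this pipeline can be pictured as a lookup from file extension to conversion method. The table and method names below are illustrative assumptions, not the gem's actual internals:

```ruby
require 'pathname'

# Hypothetical extension-to-converter table (illustrative; not Ragdoll's real mapping)
CONVERSIONS = {
  %w[.txt .md .html]       => :text_extraction,
  %w[.pdf .docx]           => :document_text_extraction,
  %w[.png .jpg .jpeg .gif] => :image_to_text,
  %w[.mp3 .wav .m4a]       => :audio_transcription,
  %w[.mp4 .mov]            => :video_transcription
}.freeze

# Pick the conversion method for a given path; unknown types fall back
# to a placeholder description instead of raising.
def conversion_for(path)
  ext = Pathname(path).extname.downcase
  CONVERSIONS.each { |exts, method| return method if exts.include?(ext) }
  :fallback_description
end

conversion_for('diagram.PNG') # => :image_to_text
conversion_for('lecture.mp3') # => :audio_transcription
```

Routing unknown extensions to a fallback rather than failing is in the spirit of the `enable_fallback_descriptions` setting shown earlier.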
### Text Conversion Services

```ruby
# Use individual conversion services
text_content = Ragdoll::TextExtractionService.extract('document.pdf')
image_description = Ragdoll::ImageToTextService.convert('photo.jpg', detail_level: :comprehensive)
audio_transcript = Ragdoll::AudioToTextService.transcribe('speech.wav')

# Use unified converter (orchestrates all services)
unified_text = Ragdoll::DocumentConverter.convert_to_text('any_file.ext')

# Manage documents with unified approach
management = Ragdoll::UnifiedDocumentManagement.new
document = management.add_document('mixed_media_file.mov')
```

### Content Quality Assessment

```ruby
# Get content quality scores
document = Ragdoll::UnifiedDocument.find(id)
quality = document.content_quality_score # 0.0 to 1.0

# Quality factors:
# - Content length (50-2000 words optimal)
# - Original media type (text > documents > descriptions > placeholders)
# - Conversion success (full content > partial > fallback)

# Batch quality assessment
stats = Ragdoll::UnifiedContent.stats
puts stats[:content_quality_distribution]
# => { high: 150, medium: 75, low: 25 }
```

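The factors above can be combined into a simple heuristic. This sketch uses assumed weights and is not the gem's actual scoring code:

```ruby
# Illustrative quality heuristic (assumed weights; not Ragdoll's implementation)
def quality_score(text, conversion_method: :text_extraction)
  return 0.0 if text.nil? || text.strip.empty?

  words = text.split.size
  # Length factor: the 50-2000 word range counts as optimal
  length_factor =
    if words < 50       then words / 50.0
    elsif words <= 2000 then 1.0
    else 2000.0 / words
    end

  # Conversion factor: direct text beats generated descriptions beats fallbacks
  method_factor = {
    text_extraction:     1.0,
    image_to_text:       0.8,
    audio_transcription: 0.8,
    fallback:            0.3
  }.fetch(conversion_method, 0.5)

  (length_factor * method_factor).round(2)
end

quality_score('word ' * 500)                             # => 1.0
quality_score('too short', conversion_method: :fallback) # => 0.01
```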
## Cross-Modal Search

The unified architecture enables powerful cross-modal search capabilities:

```ruby
# Find images by describing their visual content
image_results = Ragdoll.search(query: 'red sports car in parking lot')
# Returns image documents whose AI descriptions match the query

# Search for audio by spoken content
audio_results = Ragdoll.search(query: 'quarterly sales meeting discussion')
# Returns audio documents whose transcripts contain these topics

# Mixed results across all media types
all_results = Ragdoll.search(query: 'artificial intelligence')
# Returns text documents, images with AI descriptions, and audio transcripts,
# all ranked by relevance to the query

# Filter by original media type while searching text
image_only = Ragdoll.search(
  query: 'machine learning workflow',
  original_media_type: 'image'
)

# Search with quality filtering
high_quality = Ragdoll.search(
  query: 'deep learning',
  min_quality_score: 0.7
)
```

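Each of these queries ultimately reduces to vector similarity between the query embedding and the stored text embeddings. pgvector computes this in the database; a plain-Ruby illustration of cosine similarity:

```ruby
# Cosine similarity between two embedding vectors (what pgvector evaluates
# server-side; shown in plain Ruby purely for illustration).
def cosine_similarity(a, b)
  dot  = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.call(a) * norm.call(b))
end

cosine_similarity([1.0, 0.0], [1.0, 0.0]) # => 1.0 (identical direction)
cosine_similarity([1.0, 0.0], [0.0, 1.0]) # => 0.0 (orthogonal)
```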
## Migration from Multi-Modal

Migrate smoothly from the previous multi-modal architecture:

```ruby
# Check migration readiness
migration_service = Ragdoll::MigrationService.new
report = migration_service.create_comparison_report

puts "Migration Benefits:"
report[:benefits].each { |benefit, description| puts "- #{description}" }

# Migrate all documents
results = Ragdoll::MigrationService.migrate_all_documents(
  batch_size: 50,
  process_embeddings: true
)

puts "Migrated: #{results[:migrated]} documents"
puts "Errors: #{results[:errors].length}"

# Validate migration integrity
validation = migration_service.validate_migration
puts "Validation passed: #{validation[:passed]}/#{validation[:total_checks]} checks"

# Migrate individual document
migrated_doc = Ragdoll::MigrationService.migrate_document(old_document_id)
```

## API Overview

### Unified Document Management

```ruby
# Add documents with automatic text conversion
result = Ragdoll.add_document(path: 'any_file.ext')
puts result[:document_id]
puts result[:content_preview] # First 100 characters of converted text

# Batch processing with unified pipeline
files = ['doc.pdf', 'image.jpg', 'audio.mp3']
results = Ragdoll::UnifiedDocumentManagement.new.batch_process_documents(files)

# Reprocess with different conversion settings
Ragdoll::UnifiedDocumentManagement.new.reprocess_document(
  document_id,
  image_detail_level: :analytical
)
```

### Search API

```ruby
# Unified search across all content types
results = Ragdoll.search(query: 'machine learning algorithms')

# Search with original media type context
results.each do |doc|
  puts "#{doc.title} (originally #{doc.original_media_type})"
  puts "Quality: #{doc.content_quality_score.round(2)}"
  puts "Content: #{doc.content[0..100]}..."
end

# Advanced search with content quality
high_quality_results = Ragdoll.search(
  query: 'neural networks',
  min_quality_score: 0.8,
  limit: 10
)
```

### Content Analysis

```ruby
# Analyze converted content
document = Ragdoll::UnifiedDocument.find(id)

# Check original media type
puts document.unified_contents.first.original_media_type # 'image', 'audio', 'text', etc.

# View conversion metadata
content = document.unified_contents.first
puts content.conversion_method # 'image_to_text', 'audio_transcription', etc.
puts content.metadata          # Conversion settings and results

# Quality metrics
puts content.word_count
puts content.character_count
puts content.content_quality_score
```

## Configuration

```ruby
Ragdoll.configure do |config|
  # Enable unified text-based architecture
  config.use_unified_models = true

  # Database configuration (PostgreSQL required)
  config.database_config = {
    adapter: 'postgresql',
    database: 'ragdoll_production',
    # ...
    auto_migrate: true
  }

  # Text conversion settings
  config.text_conversion = {
    # Image conversion detail levels:
    #   :minimal       - Brief one-sentence description
    #   :standard      - Main elements and composition
    #   :comprehensive - Detailed description including objects, colors, mood
    #   :analytical    - Thorough analysis including artistic elements
    image_detail_level: :comprehensive,

    # Audio transcription providers
    audio_transcription_provider: :openai, # :azure, :google, :whisper_local

    # Fallback behavior
    enable_fallback_descriptions: true,
    fallback_timeout: 30 # seconds
  }

  # Single embedding model for all content types
  config.embedding_model = "text-embedding-3-large"
  config.embedding_provider = :openai

  # Ruby LLM configuration for text conversion
  config.ruby_llm_config[:openai][:api_key] = ENV['OPENAI_API_KEY']
  config.ruby_llm_config[:anthropic][:api_key] = ENV['ANTHROPIC_API_KEY']

  # Vision model configuration for image descriptions
  config.vision_config = {
    primary_model: 'gpt-4-vision-preview',
    fallback_model: 'gemini-pro-vision',
    temperature: 0.2
  }

  # Audio transcription configuration
  config.audio_config = {
    openai: {
      model: 'whisper-1',
      temperature: 0.0
    },
    azure: {
      endpoint: ENV['AZURE_SPEECH_ENDPOINT'],
      api_key: ENV['AZURE_SPEECH_KEY']
    }
  }

  # Processing settings
  config.chunking[:text][:max_tokens] = 1000
  config.chunking[:text][:overlap] = 200
  config.search[:similarity_threshold] = 0.7
  config.search[:max_results] = 10

  # Quality thresholds
  config.quality_thresholds = {
    high_quality: 0.8,
    medium_quality: 0.5,
    min_content_length: 50
  }
end
```

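The `chunking` settings control fixed-size chunks with overlap. A word-based sketch of the idea (the gem's tokenizer-based implementation may differ):

```ruby
# Split text into overlapping chunks, word-based for illustration
# (max_tokens and overlap correspond to the chunking config values).
def chunk_text(text, max_tokens: 1000, overlap: 200)
  words = text.split
  step = max_tokens - overlap
  chunks = []
  index = 0
  while index < words.length
    chunks << words[index, max_tokens].join(' ')
    break if index + max_tokens >= words.length # last window reached the end
    index += step
  end
  chunks
end

sample = (1..25).map(&:to_s).join(' ')
parts = chunk_text(sample, max_tokens: 10, overlap: 2)
parts.length            # => 3
parts[1].split.first(2) # => ["9", "10"] (repeats the end of parts[0])
```

The overlap keeps sentence fragments that straddle a chunk boundary retrievable from both neighboring chunks.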
## Performance Features

- **Unified Index**: Single text-based search index for all content types
- **Optimized Conversion**: Efficient text extraction and AI-powered description generation
- **Quality Scoring**: Automatic assessment of converted content quality
- **Batch Processing**: Efficient bulk document processing with progress tracking
- **Smart Caching**: Caches conversion results to avoid reprocessing
- **Background Jobs**: Asynchronous processing for large files
- **Cross-Modal Optimization**: Specialized optimizations for different media type conversions

## Installation

@@ -497,6 +408,12 @@

```bash
brew install postgresql pgvector              # macOS
# or
apt-get install postgresql postgresql-contrib # Ubuntu

# For image processing
brew install imagemagick

# For audio processing (optional, depending on provider)
brew install ffmpeg

# Install gem
gem install ragdoll
```

@@ -507,61 +424,83 @@ gem 'ragdoll'

## Requirements

- **Ruby**: 3.2+
- **PostgreSQL**: 12+ with pgvector extension
- **ImageMagick**: For image processing and metadata extraction
- **FFmpeg**: Optional, for advanced audio/video processing
- **Dependencies**: activerecord, pg, pgvector, neighbor, ruby_llm, pdf-reader, docx, rmagick, tempfile

### Vision Model Requirements

For comprehensive image descriptions:

- **OpenAI**: GPT-4 Vision (recommended)
- **Google**: Gemini Pro Vision
- **Anthropic**: Claude 3 with vision capabilities
- **Local**: Ollama with vision-capable models

### Audio Transcription Requirements

- **OpenAI**: Whisper API (recommended)
- **Azure**: Speech Services
- **Google**: Cloud Speech-to-Text
- **Local**: Whisper installation

## Troubleshooting

### Image Processing Issues

```bash
# Verify ImageMagick installation
convert -version

# Check vision model access
irb -r ragdoll
> Ragdoll::ImageToTextService.new.convert('test_image.jpg')
```

### Audio Processing Issues

```bash
# For Whisper local installation
pip install openai-whisper

# Test audio file support
irb -r ragdoll
> Ragdoll::AudioToTextService.new.transcribe('test_audio.wav')
```

### Content Quality Issues

```ruby
# Check content quality distribution
stats = Ragdoll::UnifiedContent.stats
puts stats[:content_quality_distribution]

# Reprocess low-quality content
low_quality = Ragdoll::UnifiedDocument.joins(:unified_contents)
                                      .where('unified_contents.content_quality_score < 0.5')

low_quality.each do |doc|
  Ragdoll::UnifiedDocumentManagement.new.reprocess_document(
    doc.id,
    image_detail_level: :analytical
  )
end
```

## Use Cases

- **Knowledge Bases**: Search across text documents, presentation images, and recorded meetings
- **Media Libraries**: Find images by visual content, audio by spoken topics
- **Research Collections**: Unified search across papers (text), charts (images), and interviews (audio)
- **Documentation Systems**: Search technical docs, architecture diagrams, and explanation videos
- **Educational Content**: Find learning materials across all media types through unified text search

## Key Design Principles

1. **Unified Text Representation**: All media types converted to searchable text
2. **Cross-Modal Search**: Images findable through descriptions, audio through transcripts
3. **Quality-Driven**: Automatic assessment and optimization of converted content
4. **Simplified Architecture**: Single content model instead of complex polymorphic relationships
5. **AI-Enhanced Conversion**: Leverages latest vision and speech models for rich text conversion
6. **Migration-Friendly**: Smooth transition path from previous multi-modal architecture
7. **Performance-Optimized**: Single embedding model and unified search index for speed