ragdoll 0.1.9 → 0.1.11
- checksums.yaml +4 -4
- data/CHANGELOG.md +68 -4
- data/README.md +86 -1
- data/Rakefile +4 -2
- data/app/models/ragdoll/document.rb +115 -12
- data/app/models/ragdoll/embedding.rb +36 -4
- data/app/models/ragdoll/search.rb +1 -1
- data/app/services/ragdoll/document_management.rb +117 -9
- data/app/services/ragdoll/document_processor.rb +67 -31
- data/app/services/ragdoll/search_engine.rb +13 -2
- data/db/migrate/{001_enable_postgresql_extensions.rb → 20250815234901_enable_postgresql_extensions.rb} +7 -8
- data/db/migrate/20250815234902_create_ragdoll_documents.rb +117 -0
- data/db/migrate/{005_create_ragdoll_embeddings.rb → 20250815234903_create_ragdoll_embeddings.rb} +13 -10
- data/db/migrate/{006_create_ragdoll_contents.rb → 20250815234904_create_ragdoll_contents.rb} +14 -11
- data/db/migrate/{007_create_ragdoll_searches.rb → 20250815234905_create_ragdoll_searches.rb} +24 -20
- data/db/migrate/{008_create_ragdoll_search_results.rb → 20250815234906_create_ragdoll_search_results.rb} +16 -16
- data/lib/ragdoll/core/client.rb +2 -2
- data/lib/ragdoll/core/database.rb +8 -3
- data/lib/ragdoll/core/version.rb +1 -1
- data/lib/tasks/db.rake +63 -15
- metadata +7 -7
- data/db/migrate/004_create_ragdoll_documents.rb +0 -70
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 255dd5c7e6ccbdeeafe2a0ed74382c5fefb4df2e015942c191fb3747c24a6cb8
|
4
|
+
data.tar.gz: 284ceedd72c305d3dcf5385482b983463ec71ec850514655614e47ad71be17a8
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 64d603061ba7742699e84a5bc8933de4cbc1ab9a6b748d74b17371c05d855f95a1a2750da713be57f0e5a4a95f304894a73821b3cb6dbb3118623c9c5a1cbce2
|
7
|
+
data.tar.gz: c59f77cffb7026cf07eedf5f6724d17a430e4d241691b5cf380292f7bb4ee045bfacbc15787603f531187b5b03cb68e69bea7576e71c7ac8600dbe334af248ca
|
data/CHANGELOG.md
CHANGED
@@ -6,7 +6,58 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
6
6
|
|
7
7
|
## [Unreleased]
|
8
8
|
|
9
|
-
|
9
|
+
## [0.1.11] - 2025-01-17
|
10
|
+
|
11
|
+
### Added
|
12
|
+
- **Force Option for Document Addition**: New `force` parameter in document management to override duplicate detection
|
13
|
+
- Allows forced document addition even when duplicate titles exist
|
14
|
+
- Enables overwriting existing documents when needed
|
15
|
+
|
16
|
+
### Fixed
|
17
|
+
- **Search Query Embedding**: Made `query_embedding` parameter optional in search methods
|
18
|
+
- Improved flexibility for search operations that don't require embeddings
|
19
|
+
- Better error handling for search queries without embeddings
|
20
|
+
|
21
|
+
### Changed
|
22
|
+
- **Database Setup**: Enhanced database role handling and setup procedures
|
23
|
+
- Improved database connection configuration
|
24
|
+
- Better handling of database roles and permissions
|
25
|
+
|
26
|
+
### Removed
|
27
|
+
- **Obsolete Migrations**: Removed outdated RagdollDocuments migration files
|
28
|
+
- Cleaned up legacy migration structure
|
29
|
+
- Streamlined database migration path
|
30
|
+
|
31
|
+
## [0.1.10] - 2025-01-15
|
32
|
+
|
33
|
+
### Changed
|
34
|
+
- Continued improvements to search performance and accuracy
|
35
|
+
|
36
|
+
### Added
|
37
|
+
- **Hybrid Search**: Complete implementation combining semantic and full-text search capabilities
|
38
|
+
- Configurable weights for semantic vs text search (default: 70% semantic, 30% text)
|
39
|
+
- Deduplication of results by document ID
|
40
|
+
- Combined scoring system for unified result ranking
|
41
|
+
- **Full-text Search**: PostgreSQL full-text search with tsvector indexing
|
42
|
+
- Per-word match ratio scoring (0.0 to 1.0)
|
43
|
+
- GIN index for high-performance text search
|
44
|
+
- Search across title, summary, keywords, and description fields
|
45
|
+
- **Enhanced Search API**: Complete search type delegation at top-level Ragdoll namespace
|
46
|
+
- `Ragdoll.hybrid_search` method for combined semantic and text search
|
47
|
+
- `Ragdoll::Document.search_content` for full-text search capabilities
|
48
|
+
- Consistent parameter handling across all search methods
|
49
|
+
|
50
|
+
### Changed
|
51
|
+
- **Search Architecture**: Unified search interface supporting semantic, fulltext, and hybrid modes
|
52
|
+
- **Database Schema**: Added search_vector column with GIN indexing for full-text search performance
|
53
|
+
|
54
|
+
### Technical Details
|
55
|
+
- Full-text search uses PostgreSQL's built-in tsvector capabilities
|
56
|
+
- Hybrid search combines cosine similarity (semantic) with text match ratios
|
57
|
+
- Results are ranked by weighted combined scores
|
58
|
+
- All search methods maintain backward compatibility
|
59
|
+
|
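The weighted scoring described for hybrid search in 0.1.10 can be illustrated with a small sketch. This is not Ragdoll's implementation: `combine_results`, the `:document_id` and `:similarity` hash keys, and the keyword arguments are hypothetical names chosen for illustration. Only the 70/30 default weights, the deduplication by document ID, and the ranking by combined score come from the changelog entries above.

```ruby
# Hypothetical sketch of weighted hybrid ranking; not Ragdoll's actual code.
# Each input is an array of hashes such as { document_id: 1, similarity: 0.9 }.
def combine_results(semantic, fulltext, semantic_weight: 0.7, text_weight: 0.3)
  # One entry per document ID, so a document found by both searches appears once.
  merged = Hash.new { |hash, id| hash[id] = { document_id: id, semantic: 0.0, text: 0.0 } }

  semantic.each do |row|
    entry = merged[row[:document_id]]
    entry[:semantic] = [entry[:semantic], row[:similarity]].max
  end

  fulltext.each do |row|
    entry = merged[row[:document_id]]
    entry[:text] = [entry[:text], row[:similarity]].max
  end

  # Weighted sum of the two scores, highest combined score first.
  merged.values
        .map { |e| e.merge(score: semantic_weight * e[:semantic] + text_weight * e[:text]) }
        .sort_by { |e| -e[:score] }
end

semantic = [{ document_id: 1, similarity: 0.9 }, { document_id: 2, similarity: 0.5 }]
fulltext = [{ document_id: 2, similarity: 1.0 }, { document_id: 3, similarity: 1.0 }]
puts combine_results(semantic, fulltext).map { |e| e[:document_id] }.inspect
```

Under these assumptions, a document that matches both searches moderately can outrank one that matches only semantically, which is the behaviour the configurable weights are meant to tune.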
60
|
+
## [0.1.9] - 2025-01-10
|
10
61
|
|
11
62
|
### Added
|
12
63
|
- **Initial CHANGELOG**: Added comprehensive CHANGELOG.md following Keep a Changelog format
|
@@ -40,7 +91,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
40
91
|
- **Test Coverage**: Added coverage directory to .gitignore for cleaner repository state
|
41
92
|
|
42
93
|
### Technical Details
|
43
|
-
- Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
|
94
|
+
- Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
|
44
95
|
- All changes maintain backward compatibility
|
45
96
|
- No breaking API changes
|
46
97
|
|
@@ -141,6 +192,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
141
192
|
- **Database Schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
|
142
193
|
- **Dual Metadata Architecture**: Separate LLM-generated content analysis and file properties
|
143
194
|
- **Search Functionality**: Semantic search with cosine similarity and usage analytics
|
195
|
+
- **Hybrid Search**: Complete implementation combining semantic and full-text search with configurable weights
|
196
|
+
- **Full-text Search**: PostgreSQL tsvector-based text search with GIN indexing
|
144
197
|
- **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
|
145
198
|
- **Document Management**: Add, update, delete, list operations
|
146
199
|
- **Background Processing**: ActiveJob integration for async embedding generation
|
@@ -150,7 +203,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
150
203
|
### 🚧 In Development
|
151
204
|
- **Image Processing**: Framework exists but vision AI integration needs completion
|
152
205
|
- **Audio Processing**: Framework exists but speech-to-text integration needs completion
|
153
|
-
- **Hybrid Search**: Combining semantic and full-text search capabilities
|
154
206
|
|
155
207
|
### 📋 Planned Features
|
156
208
|
- **Multi-modal Search**: Search across text, image, and audio content types
|
@@ -161,6 +213,18 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
|
|
161
213
|
|
162
214
|
## Migration Guide
|
163
215
|
|
216
|
+
### From 0.1.9 to 0.1.10
|
217
|
+
- **New Search Methods**: `Ragdoll.hybrid_search` and `Ragdoll::Document.search_content` methods now available
|
218
|
+
- **Database Migration**: New search_vector column added to documents table with GIN index for full-text search
|
219
|
+
- **API Enhancement**: All search methods now support unified parameter interface
|
220
|
+
- **Backward Compatibility**: Existing `Ragdoll.search` method unchanged, continues to work as before
|
221
|
+
- **CLI Integration**: ragdoll-cli now requires ragdoll >= 0.1.10 for hybrid and full-text search support
|
222
|
+
|
223
|
+
### From 0.1.8 to 0.1.9
|
224
|
+
- **CHANGELOG Addition**: Comprehensive changelog and feature tracking added
|
225
|
+
- **API Method Consistency**: `hybrid_search` method properly delegated to top-level namespace
|
226
|
+
- **No Breaking Changes**: All existing functionality remains compatible
|
227
|
+
|
164
228
|
### From 0.1.7 to 0.1.8
|
165
229
|
- New search tracking tables will be automatically created via migrations
|
166
230
|
- No breaking changes to existing API
|
@@ -198,4 +262,4 @@ This project is licensed under the MIT License - see the LICENSE file for detail
|
|
198
262
|
|
199
263
|
---
|
200
264
|
|
201
|
-
*This changelog is automatically maintained and reflects the actual implementation status of features.*
|
265
|
+
*This changelog is automatically maintained and reflects the actual implementation status of features.*
|
data/README.md
CHANGED
@@ -22,6 +22,8 @@
|
|
22
22
|
|
23
23
|
Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.
|
24
24
|
|
25
|
+
RAG does not have to be hard. Every week it's getting simpler. The frontier LLM providers are starting to incorporate RAG services. For example, OpenAI offers a vector search service. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
|
26
|
+
|
25
27
|
## Overview
|
26
28
|
|
27
29
|
Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
|
@@ -130,6 +132,10 @@ puts result[:document_id] # "123"
|
|
130
132
|
puts result[:message] # "Document 'document' added successfully with ID 123"
|
131
133
|
puts result[:embeddings_queued] # true
|
132
134
|
|
135
|
+
# Add document with force option to override duplicate detection
|
136
|
+
result = Ragdoll.add_document(path: 'document.pdf', force: true)
|
137
|
+
# Creates new document even if duplicate exists
|
138
|
+
|
133
139
|
# Check document processing status
|
134
140
|
status = Ragdoll.document_status(id: result[:document_id])
|
135
141
|
puts status[:status] # "processed"
|
@@ -159,6 +165,37 @@ puts stats[:total_documents] # 50
|
|
159
165
|
puts stats[:total_embeddings] # 1250
|
160
166
|
```
|
161
167
|
|
168
|
+
### Duplicate Detection
|
169
|
+
|
170
|
+
Ragdoll includes sophisticated duplicate detection to prevent redundant document processing:
|
171
|
+
|
172
|
+
```ruby
|
173
|
+
# Automatic duplicate detection (default behavior)
|
174
|
+
result1 = Ragdoll.add_document(path: 'research.pdf')
|
175
|
+
result2 = Ragdoll.add_document(path: 'research.pdf')
|
176
|
+
# result2 returns the same document_id as result1 (duplicate detected)
|
177
|
+
|
178
|
+
# Force adding a duplicate document
|
179
|
+
result3 = Ragdoll.add_document(path: 'research.pdf', force: true)
|
180
|
+
# Creates a new document with modified location identifier
|
181
|
+
|
182
|
+
# Duplicate detection criteria:
|
183
|
+
# 1. Exact location/path match
|
184
|
+
# 2. File modification time (for files)
|
185
|
+
# 3. File content hash (SHA256)
|
186
|
+
# 4. Content hash for text
|
187
|
+
# 5. File size and metadata similarity
|
188
|
+
# 6. Document title and type matching
|
189
|
+
```
|
190
|
+
|
191
|
+
**Duplicate Detection Features:**
|
192
|
+
- **Multi-level detection**: Checks location, file hash, content hash, and metadata
|
193
|
+
- **Smart similarity**: Detects duplicates even with minor differences (5% content tolerance)
|
194
|
+
- **File integrity**: SHA256 hashing for reliable file comparison
|
195
|
+
- **URL support**: Content-based detection for web documents
|
196
|
+
- **Force option**: Override detection when needed
|
197
|
+
- **Performance optimized**: Database indexes for fast lookups
|
198
|
+
|
162
199
|
### Search and Retrieval
|
163
200
|
|
164
201
|
```ruby
|
@@ -202,6 +239,53 @@ results = Ragdoll.hybrid_search(
|
|
202
239
|
)
|
203
240
|
```
|
204
241
|
|
242
|
+
### Keywords Search
|
243
|
+
|
244
|
+
Ragdoll supports powerful keywords-based search that can be used standalone or combined with semantic search. The keywords system uses PostgreSQL array operations for high performance and supports both partial matching (overlap) and exact matching (contains all).
|
245
|
+
|
246
|
+
```ruby
|
247
|
+
# Keywords-only search (overlap - documents containing any of the keywords)
|
248
|
+
results = Ragdoll::Document.search_by_keywords(['machine', 'learning', 'ai'])
|
249
|
+
|
250
|
+
# Results are sorted by match count (documents with more keyword matches rank higher)
|
251
|
+
results.each do |doc|
|
252
|
+
puts "#{doc.title}: #{doc.keywords_match_count} matches"
|
253
|
+
end
|
254
|
+
|
255
|
+
# Exact keywords search (contains all - documents must have ALL keywords)
|
256
|
+
results = Ragdoll::Document.search_by_keywords_all(['ruby', 'programming'])
|
257
|
+
|
258
|
+
# Results are sorted by focus (fewer total keywords = more focused document)
|
259
|
+
results.each do |doc|
|
260
|
+
puts "#{doc.title}: #{doc.total_keywords_count} total keywords"
|
261
|
+
end
|
262
|
+
|
263
|
+
# Combined semantic + keywords search for best results
|
264
|
+
results = Ragdoll.search(
|
265
|
+
query: 'artificial intelligence applications',
|
266
|
+
keywords: ['ai', 'machine learning', 'neural networks'],
|
267
|
+
limit: 10
|
268
|
+
)
|
269
|
+
|
270
|
+
# Keywords search with options
|
271
|
+
results = Ragdoll::Document.search_by_keywords(
|
272
|
+
['web', 'javascript', 'frontend'],
|
273
|
+
limit: 20
|
274
|
+
)
|
275
|
+
|
276
|
+
# Case-insensitive keyword matching (automatically normalized)
|
277
|
+
results = Ragdoll::Document.search_by_keywords(['Python', 'DATA-SCIENCE', 'ai'])
|
278
|
+
# Will match documents with keywords: ['python', 'data-science', 'ai']
|
279
|
+
```
|
280
|
+
|
281
|
+
**Keywords Search Features:**
|
282
|
+
- **High Performance**: Uses PostgreSQL GIN indexes for fast array operations
|
283
|
+
- **Flexible Matching**: Supports both overlap (`&&`) and contains (`@>`) operators
|
284
|
+
- **Smart Scoring**: Results ordered by match count or document focus
|
285
|
+
- **Case Insensitive**: Automatic keyword normalization
|
286
|
+
- **Integration Ready**: Works seamlessly with semantic search
|
287
|
+
- **Inspired by `find_matching_entries.rb`**: Optimized for PostgreSQL arrays
|
288
|
+
|
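The difference between the overlap (`&&`) and contains (`@>`) operators named in the README can be modelled in plain Ruby. This is only an in-memory sketch: the helper names `normalize_keywords`, `overlaps?`, and `contains_all?` are hypothetical, and Ragdoll evaluates these operators inside PostgreSQL, not in Ruby.

```ruby
# Hypothetical in-memory model of the two matching modes (not Ragdoll's API).
def normalize_keywords(keywords)
  Array(keywords).map(&:to_s).map(&:downcase).reject(&:empty?).uniq
end

# Overlap (&&): true when the document shares at least one keyword with the query.
def overlaps?(doc_keywords, query_keywords)
  (normalize_keywords(doc_keywords) & normalize_keywords(query_keywords)).any?
end

# Contains (@>): true when the document has every keyword in the query.
def contains_all?(doc_keywords, query_keywords)
  (normalize_keywords(query_keywords) - normalize_keywords(doc_keywords)).empty?
end

doc = ["python", "data-science", "ai"]
puts overlaps?(doc, ["Python", "rust"])         # one shared keyword is enough
puts contains_all?(doc, ["Python", "rust"])     # "rust" is missing
puts contains_all?(doc, ["AI", "DATA-SCIENCE"]) # every query keyword is present
```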
205
289
|
### Search Analytics and Tracking
|
206
290
|
|
207
291
|
Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
|
@@ -299,13 +383,14 @@ end
|
|
299
383
|
## Current Implementation Status
|
300
384
|
|
301
385
|
### ✅ **Fully Implemented**
|
302
|
-
- **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files
|
386
|
+
- **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files with encoding fallback
|
303
387
|
- **Embedding generation**: Text chunking and vector embedding creation
|
304
388
|
- **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
|
305
389
|
- **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
|
306
390
|
- **Search functionality**: Semantic search with cosine similarity and usage analytics
|
307
391
|
- **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
|
308
392
|
- **Document management**: Add, update, delete, list operations
|
393
|
+
- **Duplicate detection**: Multi-level duplicate prevention with file hash, content hash, and metadata comparison
|
309
394
|
- **Background processing**: ActiveJob integration for async embedding generation
|
310
395
|
- **LLM metadata generation**: AI-powered structured content analysis with schema validation
|
311
396
|
- **Logging**: Configurable file-based logging with multiple levels
|
data/Rakefile
CHANGED
@@ -49,8 +49,10 @@ task :setup_test_db do
|
|
49
49
|
puts "Warning: Could not install pgvector extension: #{e.message}"
|
50
50
|
end
|
51
51
|
|
52
|
-
#
|
53
|
-
|
52
|
+
# Reset and run migrations (drops all tables and re-runs migrations)
|
53
|
+
# This ensures clean state for tests regardless of previous migration versions
|
54
|
+
Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: false, logger: nil))
|
55
|
+
Ragdoll::Core::Database.reset!
|
54
56
|
puts "Test database setup complete"
|
55
57
|
end
|
56
58
|
|
data/app/models/ragdoll/document.rb
CHANGED
@@ -142,10 +142,12 @@ module Ragdoll
|
|
142
142
|
def keywords_array
|
143
143
|
return [] unless keywords.present?
|
144
144
|
|
145
|
+
# After migration, keywords is now a PostgreSQL array
|
145
146
|
case keywords
|
146
147
|
when Array
|
147
|
-
keywords
|
148
|
+
keywords.map(&:to_s).map(&:strip).reject(&:empty?)
|
148
149
|
when String
|
150
|
+
# Fallback for any remaining string data (shouldn't happen after migration)
|
149
151
|
keywords.split(",").map(&:strip).reject(&:empty?)
|
150
152
|
else
|
151
153
|
[]
|
@@ -153,17 +155,23 @@ module Ragdoll
|
|
153
155
|
end
|
154
156
|
|
155
157
|
def add_keyword(keyword)
|
158
|
+
return if keyword.blank?
|
159
|
+
|
156
160
|
current_keywords = keywords_array
|
157
|
-
|
161
|
+
normalized_keyword = keyword.to_s.strip.downcase
|
162
|
+
return if current_keywords.map(&:downcase).include?(normalized_keyword)
|
158
163
|
|
159
|
-
current_keywords <<
|
160
|
-
self.keywords = current_keywords
|
164
|
+
current_keywords << normalized_keyword
|
165
|
+
self.keywords = current_keywords
|
161
166
|
end
|
162
167
|
|
163
168
|
def remove_keyword(keyword)
|
169
|
+
return if keyword.blank?
|
170
|
+
|
164
171
|
current_keywords = keywords_array
|
165
|
-
|
166
|
-
|
172
|
+
normalized_keyword = keyword.to_s.strip.downcase
|
173
|
+
current_keywords.reject! { |k| k.downcase == normalized_keyword }
|
174
|
+
self.keywords = current_keywords
|
167
175
|
end
|
168
176
|
|
169
177
|
# Metadata accessors for common fields
|
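The keyword handling in this hunk (strip, downcase, skip blanks, and ignore case-insensitive repeats) can be tried outside ActiveRecord with a small stand-in. `KeywordList` is a hypothetical class written for this illustration; it mirrors the logic of `add_keyword` and `remove_keyword` but uses plain Ruby checks where the model uses `blank?`.

```ruby
# Hypothetical stand-in for the model's keyword helpers; no ActiveRecord needed.
class KeywordList
  attr_reader :keywords

  def initialize(keywords = [])
    # Same cleanup as keywords_array applies to array values.
    @keywords = Array(keywords).map(&:to_s).map(&:strip).reject(&:empty?)
  end

  # Adds a keyword unless it is blank or already present (case-insensitive).
  def add(keyword)
    normalized = keyword.to_s.strip.downcase
    return if normalized.empty?
    return if @keywords.map(&:downcase).include?(normalized)

    @keywords << normalized
  end

  # Removes every entry that equals the keyword, ignoring case.
  def remove(keyword)
    normalized = keyword.to_s.strip.downcase
    return if normalized.empty?

    @keywords.reject! { |k| k.downcase == normalized }
  end
end

list = KeywordList.new(["Ruby", " rag "])
list.add("RUBY")      # ignored: already present apart from case
list.add(" Search ")  # stored as "search"
list.remove("RAG")
puts list.keywords.inspect
```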
@@ -249,15 +257,110 @@ module Ragdoll
|
|
249
257
|
puts "Metadata generation failed: #{e.message}"
|
250
258
|
end
|
251
259
|
|
252
|
-
# PostgreSQL full-text search on metadata fields
|
260
|
+
# PostgreSQL full-text search on metadata fields with per-word match-ratio [0.0..1.0]
|
253
261
|
def self.search_content(query, **options)
|
254
262
|
return none if query.blank?
|
255
263
|
|
256
|
-
#
|
257
|
-
|
258
|
-
|
259
|
-
|
260
|
-
|
264
|
+
# Split into unique alphanumeric words
|
265
|
+
words = query.downcase.scan(/[[:alnum:]]+/).uniq
|
266
|
+
return none if words.empty?
|
267
|
+
|
268
|
+
limit = options[:limit] || 20
|
269
|
+
threshold = options[:threshold] || 0.0
|
270
|
+
|
271
|
+
# Use precomputed tsvector column if it exists, otherwise build on the fly
|
272
|
+
if column_names.include?("search_vector")
|
273
|
+
tsvector = "#{table_name}.search_vector"
|
274
|
+
else
|
275
|
+
# Build tsvector from title and metadata fields
|
276
|
+
text_expr =
|
277
|
+
"COALESCE(title, '') || ' ' || " \
|
278
|
+
"COALESCE(metadata->>'summary', '') || ' ' || " \
|
279
|
+
"COALESCE(metadata->>'keywords', '') || ' ' || " \
|
280
|
+
"COALESCE(metadata->>'description', '')"
|
281
|
+
tsvector = "to_tsvector('english', #{text_expr})"
|
282
|
+
end
|
283
|
+
|
284
|
+
# Prepare sanitized tsquery terms
|
285
|
+
tsqueries = words.map do |word|
|
286
|
+
sanitize_sql_array(["plainto_tsquery('english', ?)", word])
|
287
|
+
end
|
288
|
+
|
289
|
+
# Combine per-word tsqueries with OR so PostgreSQL can use the GIN index
|
290
|
+
combined_tsquery = tsqueries.join(' || ')
|
291
|
+
|
292
|
+
# Score each match (1 if present, 0 if not), sum them
|
293
|
+
score_terms = tsqueries.map { |tsq| "(#{tsvector} @@ #{tsq})::int" }
|
294
|
+
score_sum = score_terms.join(' + ')
|
295
|
+
|
296
|
+
# Similarity ratio: fraction of query words present
|
297
|
+
similarity_sql = "(#{score_sum})::float / #{words.size}"
|
298
|
+
|
299
|
+
# Start with basic search query
|
300
|
+
query = select("#{table_name}.*, #{similarity_sql} AS fulltext_similarity")
|
301
|
+
|
302
|
+
# Build where conditions
|
303
|
+
conditions = ["#{tsvector} @@ (#{combined_tsquery})"]
|
304
|
+
|
305
|
+
# Add status filter (default to processed unless overridden)
|
306
|
+
status = options[:status] || 'processed'
|
307
|
+
conditions << "#{table_name}.status = '#{status}'"
|
308
|
+
|
309
|
+
# Add document type filter if specified
|
310
|
+
if options[:document_type].present?
|
311
|
+
conditions << sanitize_sql_array(["#{table_name}.document_type = ?", options[:document_type]])
|
312
|
+
end
|
313
|
+
|
314
|
+
# Add threshold filtering if specified
|
315
|
+
if threshold > 0.0
|
316
|
+
conditions << "#{similarity_sql} >= #{threshold}"
|
317
|
+
end
|
318
|
+
|
319
|
+
# Combine all conditions
|
320
|
+
where_clause = conditions.join(' AND ')
|
321
|
+
|
322
|
+
# Materialize to array to avoid COUNT/SELECT alias conflicts in some AR versions
|
323
|
+
query.where(where_clause)
|
324
|
+
.order(Arel.sql("fulltext_similarity DESC, updated_at DESC"))
|
325
|
+
.limit(limit)
|
326
|
+
.to_a
|
327
|
+
end
|
328
|
+
|
329
|
+
# Search documents by keywords using PostgreSQL array operations
|
330
|
+
# Returns documents that match keywords with scoring based on match count
|
331
|
+
# Inspired by find_matching_entries.rb algorithm but optimized for PostgreSQL arrays
|
332
|
+
def self.search_by_keywords(keywords_array, **options)
|
333
|
+
return where("1 = 0") if keywords_array.blank?
|
334
|
+
|
335
|
+
# Normalize keywords to lowercase strings array
|
336
|
+
normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
|
337
|
+
return where("1 = 0") if normalized_keywords.empty?
|
338
|
+
|
339
|
+
limit = options[:limit] || 20
|
340
|
+
|
341
|
+
# Use PostgreSQL array overlap operator with proper array literal
|
342
|
+
quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
|
343
|
+
array_literal = "'{#{quoted_keywords}}'::text[]"
|
344
|
+
where("keywords && #{array_literal}")
|
345
|
+
.order("created_at DESC")
|
346
|
+
.limit(limit)
|
347
|
+
end
|
348
|
+
|
349
|
+
# Find documents that contain ALL specified keywords (exact array matching)
|
350
|
+
def self.search_by_keywords_all(keywords_array, **options)
|
351
|
+
return where("1 = 0") if keywords_array.blank?
|
352
|
+
|
353
|
+
normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
|
354
|
+
return where("1 = 0") if normalized_keywords.empty?
|
355
|
+
|
356
|
+
limit = options[:limit] || 20
|
357
|
+
|
358
|
+
# Use PostgreSQL array contains operator with proper array literal
|
359
|
+
quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
|
360
|
+
array_literal = "'{#{quoted_keywords}}'::text[]"
|
361
|
+
where("keywords @> #{array_literal}")
|
362
|
+
.order("created_at DESC")
|
363
|
+
.limit(limit)
|
261
364
|
end
|
262
365
|
|
263
366
|
# Faceted search by metadata fields
|
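The per-word match ratio computed in SQL by `search_content` is the fraction of unique query words found in a document. A plain-Ruby approximation is sketched below. It is an assumption-laden simplification: `match_ratio` is a hypothetical helper, and it compares exact lowercase words, whereas the SQL version matches through PostgreSQL's `english` text search configuration, which also applies stemming and stop-word removal.

```ruby
# Hypothetical approximation of the per-word match ratio, in the range 0.0..1.0.
def match_ratio(query, document_text)
  # Same tokenization as the model: unique lowercase alphanumeric words.
  words = query.downcase.scan(/[[:alnum:]]+/).uniq
  return 0.0 if words.empty?

  document_words = document_text.downcase.scan(/[[:alnum:]]+/).uniq
  hits = words.count { |word| document_words.include?(word) }
  hits.to_f / words.size
end

puts match_ratio("Ruby vector search", "A Ruby gem for search")
```

With a `threshold` of 0.5, this document would be kept, because two of the three query words are present.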
data/app/models/ragdoll/embedding.rb
CHANGED
@@ -64,10 +64,26 @@ module Ragdoll
|
|
64
64
|
scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]
|
65
65
|
|
66
66
|
# Document-level filters require joining through embeddable (STI Content) to documents
|
67
|
-
|
67
|
+
needs_document_join = filters[:document_type] || filters[:keywords]
|
68
|
+
|
69
|
+
if needs_document_join
|
68
70
|
scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
|
69
71
|
.joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
|
70
|
-
|
72
|
+
end
|
73
|
+
|
74
|
+
if filters[:document_type]
|
75
|
+
scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
|
76
|
+
end
|
77
|
+
|
78
|
+
# Keywords filtering using PostgreSQL array operations
|
79
|
+
if filters[:keywords] && filters[:keywords].any?
|
80
|
+
normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
|
81
|
+
if normalized_keywords.any?
|
82
|
+
# Use PostgreSQL array overlap operator with proper array literal
|
83
|
+
quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
|
84
|
+
array_literal = "'{#{quoted_keywords}}'::text[]"
|
85
|
+
scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
|
86
|
+
end
|
71
87
|
end
|
72
88
|
|
73
89
|
# Use pgvector for similarity search
|
@@ -83,10 +99,26 @@ module Ragdoll
|
|
83
99
|
scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]
|
84
100
|
|
85
101
|
# Document-level filters require joining through embeddable (STI Content) to documents
|
86
|
-
|
102
|
+
needs_document_join = filters[:document_type] || filters[:keywords]
|
103
|
+
|
104
|
+
if needs_document_join
|
87
105
|
scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
|
88
106
|
.joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
|
89
|
-
|
107
|
+
end
|
108
|
+
|
109
|
+
if filters[:document_type]
|
110
|
+
scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
|
111
|
+
end
|
112
|
+
|
113
|
+
# Keywords filtering using PostgreSQL array operations
|
114
|
+
if filters[:keywords] && filters[:keywords].any?
|
115
|
+
normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
|
116
|
+
if normalized_keywords.any?
|
117
|
+
# Use PostgreSQL array overlap operator with proper array literal
|
118
|
+
quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
|
119
|
+
array_literal = "'{#{quoted_keywords}}'::text[]"
|
120
|
+
scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
|
121
|
+
end
|
90
122
|
end
|
91
123
|
|
92
124
|
search_with_pgvector_stats(query_embedding, scope, limit, threshold)
|
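The keyword filters above wrap each keyword in double quotes and interpolate the result into the SQL string, so a keyword containing a double quote or backslash would produce a malformed array literal. One way to handle such input is sketched below. `pg_text_array_literal` is a hypothetical helper, not part of Ragdoll; it builds only the array text, on the assumption that the caller passes it as a bound value (for example `where("keywords && ?::text[]", literal)`) and does not interpolate it.

```ruby
# Hypothetical helper: builds PostgreSQL array text such as {"ruby","rag"}.
# Backslashes and double quotes inside elements are backslash-escaped,
# as PostgreSQL's array input syntax requires for quoted elements.
def pg_text_array_literal(keywords)
  elements = Array(keywords).map(&:to_s).map(&:downcase).reject(&:empty?).map do |keyword|
    escaped = keyword.gsub("\\") { "\\\\" }.gsub('"') { '\\"' }
    "\"#{escaped}\""
  end
  "{#{elements.join(',')}}"
end

puts pg_text_array_literal(["Ruby", 'say "hi"'])
```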
data/app/models/ragdoll/search.rb
CHANGED
@@ -14,7 +14,7 @@ module Ragdoll
|
|
14
14
|
has_many :embeddings, through: :search_results
|
15
15
|
|
16
16
|
validates :query, presence: true
|
17
|
-
validates :query_embedding, presence: true
|
17
|
+
validates :query_embedding, presence: false, allow_nil: true
|
18
18
|
validates :search_type, presence: true, inclusion: { in: %w[semantic hybrid fulltext] }
|
19
19
|
validates :results_count, presence: true, numericality: { greater_than_or_equal_to: 0 }
|
20
20
|
|
data/app/services/ragdoll/document_management.rb
CHANGED
@@ -1,9 +1,11 @@
|
|
1
1
|
# frozen_string_literal: true
|
2
2
|
|
3
|
+
require "securerandom"
|
4
|
+
|
3
5
|
module Ragdoll
|
4
6
|
class DocumentManagement
|
5
7
|
class << self
|
6
|
-
def add_document(location, content, metadata = {})
|
8
|
+
def add_document(location, content, metadata = {}, force: false)
|
7
9
|
# Ensure location is an absolute path if it's a file path
|
8
10
|
absolute_location = location.start_with?("http") || location.start_with?("ftp") ? location : File.expand_path(location)
|
9
11
|
|
@@ -14,17 +16,21 @@ module Ragdoll
|
|
14
16
|
Time.current
|
15
17
|
end
|
16
18
|
|
17
|
-
#
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
|
19
|
+
# Skip duplicate detection if force is true
|
20
|
+
unless force
|
21
|
+
existing_document = find_duplicate_document(absolute_location, content, metadata, file_modified_at)
|
22
|
+
return existing_document.id.to_s if existing_document
|
23
|
+
end
|
22
24
|
|
23
|
-
#
|
24
|
-
|
25
|
+
# Modify location if force is used to avoid unique constraint violation
|
26
|
+
final_location = if force
|
27
|
+
"#{absolute_location}#forced_#{Time.current.to_i}_#{SecureRandom.hex(4)}"
|
28
|
+
else
|
29
|
+
absolute_location
|
30
|
+
end
|
25
31
|
|
26
32
|
document = Ragdoll::Document.create!(
|
27
|
-
location:
|
33
|
+
location: final_location,
|
28
34
|
title: metadata[:title] || metadata["title"] || extract_title_from_location(location),
|
29
35
|
document_type: metadata[:document_type] || metadata["document_type"] || "text",
|
30
36
|
metadata: metadata.is_a?(Hash) ? metadata : {},
|
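The location suffix used for forced additions can be reproduced in isolation. The sketch below is illustrative: `forced_location` is a hypothetical helper name, and it takes the timestamp as an argument (defaulting to `Time.now`) where the hunk uses ActiveSupport's `Time.current`, so that it runs without Rails.

```ruby
require "securerandom"

# Hypothetical helper mirroring the suffix format in the hunk above:
# "<location>#forced_<unix seconds>_<8 hex characters>"
def forced_location(absolute_location, now: Time.now)
  "#{absolute_location}#forced_#{now.to_i}_#{SecureRandom.hex(4)}"
end

puts forced_location("/tmp/research.pdf")
```

Because the stored location no longer equals the original path, code that later looks a forced document up by its exact location would need to account for the suffix.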
@@ -100,6 +106,108 @@ module Ragdoll
|
|
100
106
|
|
101
107
|
private
|
102
108
|
|
109
|
+
def find_duplicate_document(location, content, metadata, file_modified_at)
|
110
|
+
# Primary check: exact location match (simple duplicate detection)
|
111
|
+
existing = Ragdoll::Document.find_by(location: location)
|
112
|
+
return existing if existing
|
113
|
+
|
114
|
+
# Secondary check: exact location and file modification time (for files)
|
115
|
+
existing_with_time = Ragdoll::Document.find_by(
|
116
|
+
location: location,
|
117
|
+
file_modified_at: file_modified_at
|
118
|
+
)
|
119
|
+
return existing_with_time if existing_with_time
|
120
|
+
|
121
|
+
# Enhanced duplicate detection for file-based documents
|
122
|
+
if File.exist?(location) && !location.start_with?("http")
|
123
|
+
file_size = File.size(location)
|
124
|
+
content_hash = calculate_file_hash(location)
|
125
|
+
|
126
|
+
# Check for documents with same file hash (most reliable)
|
127
|
+
potential_duplicates = Ragdoll::Document.where("metadata->>'file_hash' = ?", content_hash)
|
128
|
+
return potential_duplicates.first if potential_duplicates.any?
|
129
|
+
|
130
|
+
# Check for documents with same file size and similar metadata
|
131
|
+
same_size_docs = Ragdoll::Document.where("metadata->>'file_size' = ?", file_size.to_s)
|
132
|
+
same_size_docs.each do |doc|
|
133
|
+
return doc if documents_are_duplicates?(doc, location, content, metadata, file_size, content_hash)
|
134
|
+
end
|
135
|
+
end
|
136
|
+
|
137
|
+
# For non-file documents (URLs, etc), check content-based duplicates
|
138
|
+
unless File.exist?(location)
|
139
|
+
return find_content_based_duplicate(content, metadata)
|
140
|
+
end
|
141
|
+
|
142
|
+
nil
|
143
|
+
end
|
144
|
+
|
145
|
+
def documents_are_duplicates?(existing_doc, location, content, metadata, file_size, content_hash)
|
146
|
+
# Compare multiple factors to determine if documents are duplicates
|
147
|
+
|
148
|
+
# Check filename similarity (basename without extension)
|
149
|
+
existing_basename = File.basename(existing_doc.location, File.extname(existing_doc.location))
|
150
|
+
new_basename = File.basename(location, File.extname(location))
|
151
|
+
return false unless existing_basename == new_basename
|
152
|
+
|
153
|
+
# Check content length similarity (within 5% tolerance)
|
154
|
+
if content.present? && existing_doc.content.present?
|
155
|
+
content_length_diff = (content.length - existing_doc.content.length).abs
|
156
|
+
max_length = [content.length, existing_doc.content.length].max
|
157
|
+
return false if max_length > 0 && (content_length_diff.to_f / max_length) > 0.05
|
158
|
+
end
|
159
|
+
|
160
|
+
# Check key metadata fields
|
161
|
+
existing_metadata = existing_doc.metadata || {}
|
162
|
+
new_metadata = metadata || {}
|
163
|
+
|
164
|
+
# Compare file type/document type
|
165
|
+
return false if existing_doc.document_type != (new_metadata[:document_type] || new_metadata["document_type"] || "text")
|
166
|
+
|
167
|
+
# Compare title if available
|
168
|
+
existing_title = existing_metadata["title"] || existing_doc.title
|
169
|
+
new_title = new_metadata[:title] || new_metadata["title"] || extract_title_from_location(location)
|
170
|
+
return false if existing_title && new_title && existing_title != new_title
|
171
|
+
|
172
|
+
# If we reach here, documents are likely duplicates
|
173
|
+
true
|
174
|
+
end
|
175
|
+
|
176
|
+
def find_content_based_duplicate(content, metadata)
|
177
|
+
return nil unless content.present?
|
178
|
+
|
179
|
+
content_hash = calculate_content_hash(content)
|
180
|
+
title = metadata[:title] || metadata["title"]
|
181
|
+
|
182
|
+
# Look for documents with same content hash
|
183
|
+
Ragdoll::Document.where("metadata->>'content_hash' = ?", content_hash).first ||
|
184
|
+
# Look for documents with same title and similar content length (within 5% tolerance)
|
185
|
+
(title ? find_by_title_and_content_similarity(title, content) : nil)
|
186
|
+
end
|
187
|
+
|
188
|
+
def find_by_title_and_content_similarity(title, content)
|
189
|
+
content_length = content.length
|
190
|
+
tolerance = content_length * 0.05
|
191
|
+
|
192
|
+
Ragdoll::Document.where(title: title).find do |doc|
|
193
|
+
doc.content.present? &&
|
194
|
+
(doc.content.length - content_length).abs <= tolerance
|
195
|
+
end
|
196
|
+
end
|
197
|
+
|
198
|
+
def calculate_file_hash(file_path)
|
199
|
+
require 'digest'
|
200
|
+
Digest::SHA256.file(file_path).hexdigest
|
201
|
+
rescue StandardError => e
|
202
|
+
Rails.logger.warn "Failed to calculate file hash for #{file_path}: #{e.message}" if defined?(Rails)
|
203
|
+
nil
|
204
|
+
end
|
205
|
+
|
206
|
+
def calculate_content_hash(content)
|
207
|
+
require 'digest'
|
208
|
+
Digest::SHA256.hexdigest(content)
|
209
|
+
end
|
210
|
+
|
103
211
|
def extract_title_from_location(location)
|
104
212
|
File.basename(location, File.extname(location))
|
105
213
|
end
|
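The content-based checks in this hunk rest on two small pieces: a SHA256 digest for exact matches and a 5% length tolerance for near matches. They can be exercised on their own as in the sketch below. The method names `content_hash`, `within_length_tolerance?`, and `likely_duplicate?`, as well as the hash keys `:title` and `:content`, are hypothetical; the real code additionally compares locations, file sizes, and document types.

```ruby
require "digest"

# SHA256 hex digest of the text, as in calculate_content_hash.
def content_hash(content)
  Digest::SHA256.hexdigest(content)
end

# True when the two lengths differ by at most `tolerance` of the longer one.
def within_length_tolerance?(a, b, tolerance: 0.05)
  max_length = [a.length, b.length].max
  return true if max_length.zero?

  (a.length - b.length).abs.to_f / max_length <= tolerance
end

# Hypothetical combination: identical content, or same title with similar length.
def likely_duplicate?(new_doc, existing_doc)
  return true if content_hash(new_doc[:content]) == content_hash(existing_doc[:content])

  new_doc[:title] == existing_doc[:title] &&
    within_length_tolerance?(new_doc[:content], existing_doc[:content])
end

existing = { title: "Report", content: "a" * 100 }
revised  = { title: "Report", content: "a" * 96 }
puts likely_duplicate?(revised, existing)
```

Note that a length-only comparison treats any same-titled text of similar size as a duplicate, regardless of what it says, so the tolerance trades precision for recall.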