ragdoll 0.1.9 → 0.1.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: cde84c4b5bbf1e8296bdd762ee78acb2f69663e493ce23b0941ada9d1201bdcd
- data.tar.gz: f8bc456d3c536a295920bc1c806974b2b39f08977a8761604c7a192b83e756d2
+ metadata.gz: 255dd5c7e6ccbdeeafe2a0ed74382c5fefb4df2e015942c191fb3747c24a6cb8
+ data.tar.gz: 284ceedd72c305d3dcf5385482b983463ec71ec850514655614e47ad71be17a8
  SHA512:
- metadata.gz: c1ce0e46be45fe8004930ec231a83a59f31039f4908be2a0e0ba67043237f1ea03bc00991820f6928a6ef5baa6ca910547876f21ddad5a7ead2d6384192e7708
- data.tar.gz: e3f50e1205b4ba755c6a978acb06240b7b1fa729f4fa9bef33f956a9b245ad3d3323612f300902051237ffa71a763fc6db8d8e0fedc4f2761c46a977b42d6958
+ metadata.gz: 64d603061ba7742699e84a5bc8933de4cbc1ab9a6b748d74b17371c05d855f95a1a2750da713be57f0e5a4a95f304894a73821b3cb6dbb3118623c9c5a1cbce2
+ data.tar.gz: c59f77cffb7026cf07eedf5f6724d17a430e4d241691b5cf380292f7bb4ee045bfacbc15787603f531187b5b03cb68e69bea7576e71c7ac8600dbe334af248ca
data/CHANGELOG.md CHANGED
@@ -6,7 +6,58 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

  ## [Unreleased]

- *Note: These features will be included in the next release (likely v0.1.9) featuring comprehensive search tracking and analytics capabilities.*
+ ## [0.1.11] - 2025-01-17
+
+ ### Added
+ - **Force Option for Document Addition**: New `force` parameter in document management to override duplicate detection
+ - Allows forced document addition even when duplicate titles exist
+ - Enables overwriting existing documents when needed
+
+ ### Fixed
+ - **Search Query Embedding**: Made `query_embedding` parameter optional in search methods
+ - Improved flexibility for search operations that don't require embeddings
+ - Better error handling for search queries without embeddings
+
+ ### Changed
+ - **Database Setup**: Enhanced database role handling and setup procedures
+ - Improved database connection configuration
+ - Better handling of database roles and permissions
+
+ ### Removed
+ - **Obsolete Migrations**: Removed outdated RagdollDocuments migration files
+ - Cleaned up legacy migration structure
+ - Streamlined database migration path
+
+ ## [0.1.10] - 2025-01-15
+
+ ### Changed
+ - Continued improvements to search performance and accuracy
+
+ ### Added
+ - **Hybrid Search**: Complete implementation combining semantic and full-text search capabilities
+ - Configurable weights for semantic vs text search (default: 70% semantic, 30% text)
+ - Deduplication of results by document ID
+ - Combined scoring system for unified result ranking
+ - **Full-text Search**: PostgreSQL full-text search with tsvector indexing
+ - Per-word match ratio scoring (0.0 to 1.0)
+ - GIN index for high-performance text search
+ - Search across title, summary, keywords, and description fields
+ - **Enhanced Search API**: Complete search type delegation at top-level Ragdoll namespace
+ - `Ragdoll.hybrid_search` method for combined semantic and text search
+ - `Ragdoll::Document.search_content` for full-text search capabilities
+ - Consistent parameter handling across all search methods
+
+ ### Changed
+ - **Search Architecture**: Unified search interface supporting semantic, fulltext, and hybrid modes
+ - **Database Schema**: Added search_vector column with GIN indexing for full-text search performance
+
+ ### Technical Details
+ - Full-text search uses PostgreSQL's built-in tsvector capabilities
+ - Hybrid search combines cosine similarity (semantic) with text match ratios
+ - Results are ranked by weighted combined scores
+ - All search methods maintain backward compatibility
+
+ ## [0.1.9] - 2025-01-10

  ### Added
  - **Initial CHANGELOG**: Added comprehensive CHANGELOG.md following Keep a Changelog format
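The 0.1.10 notes above describe hybrid scoring only informally (70% semantic / 30% text by default, deduplicated by document ID, ranked by a weighted combined score). A minimal sketch of that arithmetic, not taken from the gem's internals:

```ruby
# Hedged illustration of the weighted combination described in the 0.1.10 notes.
# The 0.7/0.3 weights come from the changelog; the scores and variable names
# are invented for the example.
semantic_weight   = 0.7   # default semantic share
text_weight       = 0.3   # default full-text share
cosine_similarity = 0.82  # hypothetical semantic score for one document
text_match_ratio  = 0.5   # hypothetical fraction of query words matched (0.0..1.0)

combined_score = semantic_weight * cosine_similarity + text_weight * text_match_ratio
# => 0.724 — duplicate document IDs are dropped and results are ranked by this value
```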
@@ -40,7 +91,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  - **Test Coverage**: Added coverage directory to .gitignore for cleaner repository state

  ### Technical Details
- - Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
+ - Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
  - All changes maintain backward compatibility
  - No breaking API changes

@@ -141,6 +192,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  - **Database Schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
  - **Dual Metadata Architecture**: Separate LLM-generated content analysis and file properties
  - **Search Functionality**: Semantic search with cosine similarity and usage analytics
+ - **Hybrid Search**: Complete implementation combining semantic and full-text search with configurable weights
+ - **Full-text Search**: PostgreSQL tsvector-based text search with GIN indexing
  - **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
  - **Document Management**: Add, update, delete, list operations
  - **Background Processing**: ActiveJob integration for async embedding generation
@@ -150,7 +203,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  ### 🚧 In Development
  - **Image Processing**: Framework exists but vision AI integration needs completion
  - **Audio Processing**: Framework exists but speech-to-text integration needs completion
- - **Hybrid Search**: Combining semantic and full-text search capabilities

  ### 📋 Planned Features
  - **Multi-modal Search**: Search across text, image, and audio content types
@@ -161,6 +213,18 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

  ## Migration Guide

+ ### From 0.1.9 to 0.1.10
+ - **New Search Methods**: `Ragdoll.hybrid_search` and `Ragdoll::Document.search_content` methods now available
+ - **Database Migration**: New search_vector column added to documents table with GIN index for full-text search
+ - **API Enhancement**: All search methods now support unified parameter interface
+ - **Backward Compatibility**: Existing `Ragdoll.search` method unchanged, continues to work as before
+ - **CLI Integration**: ragdoll-cli now requires ragdoll >= 0.1.10 for hybrid and full-text search support
+
+ ### From 0.1.8 to 0.1.9
+ - **CHANGELOG Addition**: Comprehensive changelog and feature tracking added
+ - **API Method Consistency**: `hybrid_search` method properly delegated to top-level namespace
+ - **No Breaking Changes**: All existing functionality remains compatible
+
  ### From 0.1.7 to 0.1.8
  - New search tracking tables will be automatically created via migrations
  - No breaking changes to existing API
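The 0.1.9 → 0.1.10 migration note above only names the new column and index. As a rough sketch of what an equivalent ActiveRecord migration looks like (the gem ships its own migrations; the class name and Rails version here are placeholders, only `ragdoll_documents` and `search_vector` come from this diff):

```ruby
# Illustrative only — not the migration shipped with the gem.
class AddSearchVectorToRagdollDocuments < ActiveRecord::Migration[7.1]
  def change
    # tsvector column used by search_content, indexed with GIN for fast @@ matches
    add_column :ragdoll_documents, :search_vector, :tsvector
    add_index  :ragdoll_documents, :search_vector, using: :gin
  end
end
```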
@@ -198,4 +262,4 @@ This project is licensed under the MIT License - see the LICENSE file for detail

  ---

- *This changelog is automatically maintained and reflects the actual implementation status of features.*
+ *This changelog is automatically maintained and reflects the actual implementation status of features.*
data/README.md CHANGED
@@ -22,6 +22,8 @@

  Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.

+ RAG does not have to be hard, and it gets simpler every week. The frontier LLM providers are starting to incorporate RAG services; OpenAI, for example, offers a vector search service. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
+
  ## Overview

  Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
@@ -130,6 +132,10 @@ puts result[:document_id] # "123"
  puts result[:message] # "Document 'document' added successfully with ID 123"
  puts result[:embeddings_queued] # true

+ # Add document with force option to override duplicate detection
+ result = Ragdoll.add_document(path: 'document.pdf', force: true)
+ # Creates new document even if duplicate exists
+
  # Check document processing status
  status = Ragdoll.document_status(id: result[:document_id])
  puts status[:status] # "processed"
@@ -159,6 +165,37 @@ puts stats[:total_documents] # 50
  puts stats[:total_embeddings] # 1250
  ```

+ ### Duplicate Detection
+
+ Ragdoll includes sophisticated duplicate detection to prevent redundant document processing:
+
+ ```ruby
+ # Automatic duplicate detection (default behavior)
+ result1 = Ragdoll.add_document(path: 'research.pdf')
+ result2 = Ragdoll.add_document(path: 'research.pdf')
+ # result2 returns the same document_id as result1 (duplicate detected)
+
+ # Force adding a duplicate document
+ result3 = Ragdoll.add_document(path: 'research.pdf', force: true)
+ # Creates a new document with modified location identifier
+
+ # Duplicate detection criteria:
+ # 1. Exact location/path match
+ # 2. File modification time (for files)
+ # 3. File content hash (SHA256)
+ # 4. Content hash for text
+ # 5. File size and metadata similarity
+ # 6. Document title and type matching
+ ```
+
+ **Duplicate Detection Features:**
+ - **Multi-level detection**: Checks location, file hash, content hash, and metadata
+ - **Smart similarity**: Detects duplicates even with minor differences (5% content tolerance)
+ - **File integrity**: SHA256 hashing for reliable file comparison
+ - **URL support**: Content-based detection for web documents
+ - **Force option**: Override detection when needed
+ - **Performance optimized**: Database indexes for fast lookups
+
  ### Search and Retrieval

  ```ruby
@@ -202,6 +239,53 @@ results = Ragdoll.hybrid_search(
  )
  ```

+ ### Keywords Search
+
+ Ragdoll supports powerful keywords-based search that can be used standalone or combined with semantic search. The keywords system uses PostgreSQL array operations for high performance and supports both partial matching (overlap) and exact matching (contains all).
+
+ ```ruby
+ # Keywords-only search (overlap - documents containing any of the keywords)
+ results = Ragdoll::Document.search_by_keywords(['machine', 'learning', 'ai'])
+
+ # Results are sorted by match count (documents with more keyword matches rank higher)
+ results.each do |doc|
+ puts "#{doc.title}: #{doc.keywords_match_count} matches"
+ end
+
+ # Exact keywords search (contains all - documents must have ALL keywords)
+ results = Ragdoll::Document.search_by_keywords_all(['ruby', 'programming'])
+
+ # Results are sorted by focus (fewer total keywords = more focused document)
+ results.each do |doc|
+ puts "#{doc.title}: #{doc.total_keywords_count} total keywords"
+ end
+
+ # Combined semantic + keywords search for best results
+ results = Ragdoll.search(
+ query: 'artificial intelligence applications',
+ keywords: ['ai', 'machine learning', 'neural networks'],
+ limit: 10
+ )
+
+ # Keywords search with options
+ results = Ragdoll::Document.search_by_keywords(
+ ['web', 'javascript', 'frontend'],
+ limit: 20
+ )
+
+ # Case-insensitive keyword matching (automatically normalized)
+ results = Ragdoll::Document.search_by_keywords(['Python', 'DATA-SCIENCE', 'ai'])
+ # Will match documents with keywords: ['python', 'data-science', 'ai']
+ ```
+
+ **Keywords Search Features:**
+ - **High Performance**: Uses PostgreSQL GIN indexes for fast array operations
+ - **Flexible Matching**: Supports both overlap (`&&`) and contains (`@>`) operators
+ - **Smart Scoring**: Results ordered by match count or document focus
+ - **Case Insensitive**: Automatic keyword normalization
+ - **Integration Ready**: Works seamlessly with semantic search
+ - **Inspired by `find_matching_entries.rb`**: Optimized for PostgreSQL arrays
+
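As a quick illustration of the two operators named in the feature list above, grounded in the `search_by_keywords` / `search_by_keywords_all` implementations further down in this diff (the SQL in the comments is approximate):

```ruby
# Overlap (&&): any shared keyword qualifies
#   ... WHERE keywords && '{"machine","learning","ai"}'::text[]
any_match = Ragdoll::Document.search_by_keywords(%w[machine learning ai], limit: 5)

# Contains (@>): the document must carry every listed keyword
#   ... WHERE keywords @> '{"ruby","programming"}'::text[]
all_match = Ragdoll::Document.search_by_keywords_all(%w[ruby programming], limit: 5)
```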
  ### Search Analytics and Tracking

  Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
@@ -299,13 +383,14 @@ end
  ## Current Implementation Status

  ### ✅ **Fully Implemented**
- - **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files
+ - **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files with encoding fallback
  - **Embedding generation**: Text chunking and vector embedding creation
  - **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
  - **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
  - **Search functionality**: Semantic search with cosine similarity and usage analytics
  - **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
  - **Document management**: Add, update, delete, list operations
+ - **Duplicate detection**: Multi-level duplicate prevention with file hash, content hash, and metadata comparison
  - **Background processing**: ActiveJob integration for async embedding generation
  - **LLM metadata generation**: AI-powered structured content analysis with schema validation
  - **Logging**: Configurable file-based logging with multiple levels
data/Rakefile CHANGED
@@ -49,8 +49,10 @@ task :setup_test_db do
  puts "Warning: Could not install pgvector extension: #{e.message}"
  end

- # Run migrations
- Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: true, logger: nil))
+ # Reset and run migrations (drops all tables and re-runs migrations)
+ # This ensures clean state for tests regardless of previous migration versions
+ Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: false, logger: nil))
+ Ragdoll::Core::Database.reset!
  puts "Test database setup complete"
  end

@@ -142,10 +142,12 @@ module Ragdoll
  def keywords_array
  return [] unless keywords.present?

+ # After migration, keywords is now a PostgreSQL array
  case keywords
  when Array
- keywords
+ keywords.map(&:to_s).map(&:strip).reject(&:empty?)
  when String
+ # Fallback for any remaining string data (shouldn't happen after migration)
  keywords.split(",").map(&:strip).reject(&:empty?)
  else
  []
@@ -153,17 +155,23 @@ module Ragdoll
  end

  def add_keyword(keyword)
+ return if keyword.blank?
+
  current_keywords = keywords_array
- return if current_keywords.include?(keyword.strip)
+ normalized_keyword = keyword.to_s.strip.downcase
+ return if current_keywords.map(&:downcase).include?(normalized_keyword)

- current_keywords << keyword.strip
- self.keywords = current_keywords.join(", ")
+ current_keywords << normalized_keyword
+ self.keywords = current_keywords
  end

  def remove_keyword(keyword)
+ return if keyword.blank?
+
  current_keywords = keywords_array
- current_keywords.delete(keyword.strip)
- self.keywords = current_keywords.join(", ")
+ normalized_keyword = keyword.to_s.strip.downcase
+ current_keywords.reject! { |k| k.downcase == normalized_keyword }
+ self.keywords = current_keywords
  end

  # Metadata accessors for common fields
@@ -249,15 +257,110 @@ module Ragdoll
  puts "Metadata generation failed: #{e.message}"
  end

- # PostgreSQL full-text search on metadata fields
+ # PostgreSQL full-text search on metadata fields with per-word match-ratio [0.0..1.0]
  def self.search_content(query, **options)
  return none if query.blank?

- # Use PostgreSQL's built-in full-text search across metadata fields
- where(
- "to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(metadata->>'summary', '') || ' ' || COALESCE(metadata->>'keywords', '') || ' ' || COALESCE(metadata->>'description', '')) @@ plainto_tsquery('english', ?)",
- query
- ).limit(options[:limit] || 20)
+ # Split into unique alphanumeric words
+ words = query.downcase.scan(/[[:alnum:]]+/).uniq
+ return none if words.empty?
+
+ limit = options[:limit] || 20
+ threshold = options[:threshold] || 0.0
+
+ # Use precomputed tsvector column if it exists, otherwise build on the fly
+ if column_names.include?("search_vector")
+ tsvector = "#{table_name}.search_vector"
+ else
+ # Build tsvector from title and metadata fields
+ text_expr =
+ "COALESCE(title, '') || ' ' || " \
+ "COALESCE(metadata->>'summary', '') || ' ' || " \
+ "COALESCE(metadata->>'keywords', '') || ' ' || " \
+ "COALESCE(metadata->>'description', '')"
+ tsvector = "to_tsvector('english', #{text_expr})"
+ end
+
+ # Prepare sanitized tsquery terms
+ tsqueries = words.map do |word|
+ sanitize_sql_array(["plainto_tsquery('english', ?)", word])
+ end
+
+ # Combine per-word tsqueries with OR so PostgreSQL can use the GIN index
+ combined_tsquery = tsqueries.join(' || ')
+
+ # Score each match (1 if present, 0 if not), sum them
+ score_terms = tsqueries.map { |tsq| "(#{tsvector} @@ #{tsq})::int" }
+ score_sum = score_terms.join(' + ')
+
+ # Similarity ratio: fraction of query words present
+ similarity_sql = "(#{score_sum})::float / #{words.size}"
+
+ # Start with basic search query
+ query = select("#{table_name}.*, #{similarity_sql} AS fulltext_similarity")
+
+ # Build where conditions
+ conditions = ["#{tsvector} @@ (#{combined_tsquery})"]
+
+ # Add status filter (default to processed unless overridden)
+ status = options[:status] || 'processed'
+ conditions << "#{table_name}.status = '#{status}'"
+
+ # Add document type filter if specified
+ if options[:document_type].present?
+ conditions << sanitize_sql_array(["#{table_name}.document_type = ?", options[:document_type]])
+ end
+
+ # Add threshold filtering if specified
+ if threshold > 0.0
+ conditions << "#{similarity_sql} >= #{threshold}"
+ end
+
+ # Combine all conditions
+ where_clause = conditions.join(' AND ')
+
+ # Materialize to array to avoid COUNT/SELECT alias conflicts in some AR versions
+ query.where(where_clause)
+ .order(Arel.sql("fulltext_similarity DESC, updated_at DESC"))
+ .limit(limit)
+ .to_a
+ end
+
+ # Search documents by keywords using PostgreSQL array operations
+ # Returns documents that match keywords with scoring based on match count
+ # Inspired by find_matching_entries.rb algorithm but optimized for PostgreSQL arrays
+ def self.search_by_keywords(keywords_array, **options)
+ return where("1 = 0") if keywords_array.blank?
+
+ # Normalize keywords to lowercase strings array
+ normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
+ return where("1 = 0") if normalized_keywords.empty?
+
+ limit = options[:limit] || 20
+
+ # Use PostgreSQL array overlap operator with proper array literal
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+ array_literal = "'{#{quoted_keywords}}'::text[]"
+ where("keywords && #{array_literal}")
+ .order("created_at DESC")
+ .limit(limit)
+ end
+
+ # Find documents that contain ALL specified keywords (exact array matching)
+ def self.search_by_keywords_all(keywords_array, **options)
+ return where("1 = 0") if keywords_array.blank?
+
+ normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
+ return where("1 = 0") if normalized_keywords.empty?
+
+ limit = options[:limit] || 20
+
+ # Use PostgreSQL array contains operator with proper array literal
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+ array_literal = "'{#{quoted_keywords}}'::text[]"
+ where("keywords @> #{array_literal}")
+ .order("created_at DESC")
+ .limit(limit)
  end

  # Faceted search by metadata fields
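A small usage sketch against the `search_content` method added above; `fulltext_similarity` is the per-word match ratio the method selects, so a three-word query that matches two words in a document scores 2/3 ≈ 0.67:

```ruby
# threshold is the minimum fraction of query words that must be present (0.0..1.0);
# the document list and query here are hypothetical.
docs = Ragdoll::Document.search_content("neural network ruby", limit: 10, threshold: 0.5)
docs.each { |doc| puts "#{doc.title}: #{doc.fulltext_similarity}" }
```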
@@ -64,10 +64,26 @@ module Ragdoll
  scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]

  # Document-level filters require joining through embeddable (STI Content) to documents
- if filters[:document_type]
+ needs_document_join = filters[:document_type] || filters[:keywords]
+
+ if needs_document_join
  scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
  .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
- .where("ragdoll_documents.document_type = ?", filters[:document_type])
+ end
+
+ if filters[:document_type]
+ scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
+ end
+
+ # Keywords filtering using PostgreSQL array operations
+ if filters[:keywords] && filters[:keywords].any?
+ normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
+ if normalized_keywords.any?
+ # Use PostgreSQL array overlap operator with proper array literal
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+ array_literal = "'{#{quoted_keywords}}'::text[]"
+ scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
+ end
  end

  # Use pgvector for similarity search
@@ -83,10 +99,26 @@ module Ragdoll
  scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]

  # Document-level filters require joining through embeddable (STI Content) to documents
- if filters[:document_type]
+ needs_document_join = filters[:document_type] || filters[:keywords]
+
+ if needs_document_join
  scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
  .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
- .where("ragdoll_documents.document_type = ?", filters[:document_type])
+ end
+
+ if filters[:document_type]
+ scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
+ end
+
+ # Keywords filtering using PostgreSQL array operations
+ if filters[:keywords] && filters[:keywords].any?
+ normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
+ if normalized_keywords.any?
+ # Use PostgreSQL array overlap operator with proper array literal
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+ array_literal = "'{#{quoted_keywords}}'::text[]"
+ scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
+ end
  end

  search_with_pgvector_stats(query_embedding, scope, limit, threshold)
@@ -14,7 +14,7 @@ module Ragdoll
  has_many :embeddings, through: :search_results

  validates :query, presence: true
- validates :query_embedding, presence: true
+ validates :query_embedding, presence: false, allow_nil: true
  validates :search_type, presence: true, inclusion: { in: %w[semantic hybrid fulltext] }
  validates :results_count, presence: true, numericality: { greater_than_or_equal_to: 0 }

@@ -1,9 +1,11 @@
  # frozen_string_literal: true

+ require "securerandom"
+
  module Ragdoll
  class DocumentManagement
  class << self
- def add_document(location, content, metadata = {})
+ def add_document(location, content, metadata = {}, force: false)
  # Ensure location is an absolute path if it's a file path
  absolute_location = location.start_with?("http") || location.start_with?("ftp") ? location : File.expand_path(location)

@@ -14,17 +16,21 @@ module Ragdoll
  Time.current
  end

- # Check if document already exists with same location and file_modified_at
- existing_document = Ragdoll::Document.find_by(
- location: absolute_location,
- file_modified_at: file_modified_at
- )
+ # Skip duplicate detection if force is true
+ unless force
+ existing_document = find_duplicate_document(absolute_location, content, metadata, file_modified_at)
+ return existing_document.id.to_s if existing_document
+ end

- # Return existing document ID if found (skip duplicate)
- return existing_document.id.to_s if existing_document
+ # Modify location if force is used to avoid unique constraint violation
+ final_location = if force
+ "#{absolute_location}#forced_#{Time.current.to_i}_#{SecureRandom.hex(4)}"
+ else
+ absolute_location
+ end

  document = Ragdoll::Document.create!(
- location: absolute_location,
+ location: final_location,
  title: metadata[:title] || metadata["title"] || extract_title_from_location(location),
  document_type: metadata[:document_type] || metadata["document_type"] || "text",
  metadata: metadata.is_a?(Hash) ? metadata : {},
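For reference, a call against the signature introduced in the hunk above; the README-level API wraps this as `Ragdoll.add_document(path:, force:)`, and the path, content, and metadata here are made up:

```ruby
# force: true skips find_duplicate_document and appends a "#forced_<timestamp>_<hex>"
# suffix to the stored location so the unique constraint is not violated.
doc_id = Ragdoll::DocumentManagement.add_document(
  "/docs/report.pdf",            # hypothetical file path
  File.read("/docs/report.pdf"), # extracted text content
  { title: "Report" },
  force: true
)
```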
@@ -100,6 +106,108 @@ module Ragdoll

  private

+ def find_duplicate_document(location, content, metadata, file_modified_at)
+ # Primary check: exact location match (simple duplicate detection)
+ existing = Ragdoll::Document.find_by(location: location)
+ return existing if existing
+
+ # Secondary check: exact location and file modification time (for files)
+ existing_with_time = Ragdoll::Document.find_by(
+ location: location,
+ file_modified_at: file_modified_at
+ )
+ return existing_with_time if existing_with_time
+
+ # Enhanced duplicate detection for file-based documents
+ if File.exist?(location) && !location.start_with?("http")
+ file_size = File.size(location)
+ content_hash = calculate_file_hash(location)
+
+ # Check for documents with same file hash (most reliable)
+ potential_duplicates = Ragdoll::Document.where("metadata->>'file_hash' = ?", content_hash)
+ return potential_duplicates.first if potential_duplicates.any?
+
+ # Check for documents with same file size and similar metadata
+ same_size_docs = Ragdoll::Document.where("metadata->>'file_size' = ?", file_size.to_s)
+ same_size_docs.each do |doc|
+ return doc if documents_are_duplicates?(doc, location, content, metadata, file_size, content_hash)
+ end
+ end
+
+ # For non-file documents (URLs, etc), check content-based duplicates
+ unless File.exist?(location)
+ return find_content_based_duplicate(content, metadata)
+ end
+
+ nil
+ end
+
+ def documents_are_duplicates?(existing_doc, location, content, metadata, file_size, content_hash)
+ # Compare multiple factors to determine if documents are duplicates
+
+ # Check filename similarity (basename without extension)
+ existing_basename = File.basename(existing_doc.location, File.extname(existing_doc.location))
+ new_basename = File.basename(location, File.extname(location))
+ return false unless existing_basename == new_basename
+
+ # Check content length similarity (within 5% tolerance)
+ if content.present? && existing_doc.content.present?
+ content_length_diff = (content.length - existing_doc.content.length).abs
+ max_length = [content.length, existing_doc.content.length].max
+ return false if max_length > 0 && (content_length_diff.to_f / max_length) > 0.05
+ end
+
+ # Check key metadata fields
+ existing_metadata = existing_doc.metadata || {}
+ new_metadata = metadata || {}
+
+ # Compare file type/document type
+ return false if existing_doc.document_type != (new_metadata[:document_type] || new_metadata["document_type"] || "text")
+
+ # Compare title if available
+ existing_title = existing_metadata["title"] || existing_doc.title
+ new_title = new_metadata[:title] || new_metadata["title"] || extract_title_from_location(location)
+ return false if existing_title && new_title && existing_title != new_title
+
+ # If we reach here, documents are likely duplicates
+ true
+ end
+
+ def find_content_based_duplicate(content, metadata)
+ return nil unless content.present?
+
+ content_hash = calculate_content_hash(content)
+ title = metadata[:title] || metadata["title"]
+
+ # Look for documents with same content hash
+ Ragdoll::Document.where("metadata->>'content_hash' = ?", content_hash).first ||
+ # Look for documents with same title and similar content length (within 5% tolerance)
+ (title ? find_by_title_and_content_similarity(title, content) : nil)
+ end
+
+ def find_by_title_and_content_similarity(title, content)
+ content_length = content.length
+ tolerance = content_length * 0.05
+
+ Ragdoll::Document.where(title: title).find do |doc|
+ doc.content.present? &&
+ (doc.content.length - content_length).abs <= tolerance
+ end
+ end
+
+ def calculate_file_hash(file_path)
+ require 'digest'
+ Digest::SHA256.file(file_path).hexdigest
+ rescue StandardError => e
+ Rails.logger.warn "Failed to calculate file hash for #{file_path}: #{e.message}" if defined?(Rails)
+ nil
+ end
+
+ def calculate_content_hash(content)
+ require 'digest'
+ Digest::SHA256.hexdigest(content)
+ end
+
  def extract_title_from_location(location)
  File.basename(location, File.extname(location))
  end
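The content-hash path above reduces to a plain SHA256 digest over the extracted text; a tiny worked example (the hash shown is the standard SHA-256 digest of the string "hello"):

```ruby
require "digest"

Digest::SHA256.hexdigest("hello")
# => "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"

# find_content_based_duplicate then looks this value up in the JSONB metadata:
#   Ragdoll::Document.where("metadata->>'content_hash' = ?", content_hash).first
```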