ragdoll 0.1.10 → 0.1.11

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 4f7b2c95ede1523e9e01af70394217387d876da6317fed651df3e27cf337cfe9
- data.tar.gz: a82ae7d541fd06876acb3acaf8f02639234f8b118274621851678a2799c5f559
+ metadata.gz: 255dd5c7e6ccbdeeafe2a0ed74382c5fefb4df2e015942c191fb3747c24a6cb8
+ data.tar.gz: 284ceedd72c305d3dcf5385482b983463ec71ec850514655614e47ad71be17a8
  SHA512:
- metadata.gz: ba14828a6e743677c84072b9f1bb27743e429531ebdd9fbd3d8553add7bbdad070d709cd617dc620fef4ddc6846085ca79d3bb6d32bae8465c6b3b10acc0692f
- data.tar.gz: de630ebf15168b562ef686ec6cd9f1cfe532b5bbf495e33a74085b567cf53ce7bb87e7c5c543756c47bd68c98290221b879a1b4d8e5888aac4916d1c1554fe99
+ metadata.gz: 64d603061ba7742699e84a5bc8933de4cbc1ab9a6b748d74b17371c05d855f95a1a2750da713be57f0e5a4a95f304894a73821b3cb6dbb3118623c9c5a1cbce2
+ data.tar.gz: c59f77cffb7026cf07eedf5f6724d17a430e4d241691b5cf380292f7bb4ee045bfacbc15787603f531187b5b03cb68e69bea7576e71c7ac8600dbe334af248ca
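
The entries above are SHA256 and SHA512 digests of the gem's `metadata.gz` and `data.tar.gz` members. A minimal sketch of checking a file against an expected digest with Ruby's standard `digest` library might look like this; the helper name `digest_matches?` and the throwaway file are illustrative assumptions, not part of the gem.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper: true when the file's digest equals the expected hex string.
def digest_matches?(path, expected_hex, algorithm: Digest::SHA256)
  algorithm.file(path).hexdigest == expected_hex.downcase
end

# Demonstration on a throwaway file (stands in for metadata.gz or data.tar.gz).
Tempfile.create("member") do |file|
  file.write("example bytes")
  file.flush

  expected_sha256 = Digest::SHA256.hexdigest("example bytes")
  expected_sha512 = Digest::SHA512.hexdigest("example bytes")

  puts digest_matches?(file.path, expected_sha256)
  puts digest_matches?(file.path, expected_sha512, algorithm: Digest::SHA512)
end
```

Both algorithms are checked separately because `checksums.yaml` records one digest per algorithm for each member.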
data/CHANGELOG.md CHANGED
@@ -6,6 +6,28 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
  ## [Unreleased]
 
+ ## [0.1.11] - 2025-01-17
+
+ ### Added
+ - **Force Option for Document Addition**: New `force` parameter in document management to override duplicate detection
+ - Allows forced document addition even when duplicate titles exist
+ - Enables overwriting existing documents when needed
+
+ ### Fixed
+ - **Search Query Embedding**: Made `query_embedding` parameter optional in search methods
+ - Improved flexibility for search operations that don't require embeddings
+ - Better error handling for search queries without embeddings
+
+ ### Changed
+ - **Database Setup**: Enhanced database role handling and setup procedures
+ - Improved database connection configuration
+ - Better handling of database roles and permissions
+
+ ### Removed
+ - **Obsolete Migrations**: Removed outdated RagdollDocuments migration files
+ - Cleaned up legacy migration structure
+ - Streamlined database migration path
+
  ## [0.1.10] - 2025-01-15
 
  ### Changed
data/README.md CHANGED
@@ -132,6 +132,10 @@ puts result[:document_id] # "123"
  puts result[:message] # "Document 'document' added successfully with ID 123"
  puts result[:embeddings_queued] # true
 
+ # Add document with force option to override duplicate detection
+ result = Ragdoll.add_document(path: 'document.pdf', force: true)
+ # Creates new document even if duplicate exists
+
  # Check document processing status
  status = Ragdoll.document_status(id: result[:document_id])
  puts status[:status] # "processed"
@@ -161,6 +165,37 @@ puts stats[:total_documents] # 50
  puts stats[:total_embeddings] # 1250
  ```
 
+ ### Duplicate Detection
+
+ Ragdoll includes sophisticated duplicate detection to prevent redundant document processing:
+
+ ```ruby
+ # Automatic duplicate detection (default behavior)
+ result1 = Ragdoll.add_document(path: 'research.pdf')
+ result2 = Ragdoll.add_document(path: 'research.pdf')
+ # result2 returns the same document_id as result1 (duplicate detected)
+
+ # Force adding a duplicate document
+ result3 = Ragdoll.add_document(path: 'research.pdf', force: true)
+ # Creates a new document with modified location identifier
+
+ # Duplicate detection criteria:
+ # 1. Exact location/path match
+ # 2. File modification time (for files)
+ # 3. File content hash (SHA256)
+ # 4. Content hash for text
+ # 5. File size and metadata similarity
+ # 6. Document title and type matching
+ ```
+
+ **Duplicate Detection Features:**
+ - **Multi-level detection**: Checks location, file hash, content hash, and metadata
+ - **Smart similarity**: Detects duplicates even with minor differences (5% content tolerance)
+ - **File integrity**: SHA256 hashing for reliable file comparison
+ - **URL support**: Content-based detection for web documents
+ - **Force option**: Override detection when needed
+ - **Performance optimized**: Database indexes for fast lookups
+
  ### Search and Retrieval
 
  ```ruby
@@ -348,13 +383,14 @@ end
  ## Current Implementation Status
 
  ### ✅ **Fully Implemented**
- - **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files
+ - **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files with encoding fallback
  - **Embedding generation**: Text chunking and vector embedding creation
  - **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
  - **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
  - **Search functionality**: Semantic search with cosine similarity and usage analytics
  - **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
  - **Document management**: Add, update, delete, list operations
+ - **Duplicate detection**: Multi-level duplicate prevention with file hash, content hash, and metadata comparison
  - **Background processing**: ActiveJob integration for async embedding generation
  - **LLM metadata generation**: AI-powered structured content analysis with schema validation
  - **Logging**: Configurable file-based logging with multiple levels
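
The README changes above list both a file hash and a content hash among the duplicate-detection criteria. The difference between the two can be sketched in plain Ruby as follows. The helper names and the `strip` normalisation are assumptions made for this illustration; the diff itself stores the values as `metadata[:file_hash]` and `metadata[:content_hash]`.

```ruby
require "digest"
require "tempfile"

# Hash of the raw bytes on disk.
def file_hash(path)
  Digest::SHA256.file(path).hexdigest
end

# Hash of already-extracted text.
def content_hash(text)
  Digest::SHA256.hexdigest(text)
end

# Two files whose bytes differ only by trailing whitespace.
Tempfile.create("a") do |a|
  Tempfile.create("b") do |b|
    a.write("Same text\n")
    a.flush
    b.write("Same text\n\n")
    b.flush

    # The byte-level hashes differ because the files are not identical.
    puts file_hash(a.path) == file_hash(b.path) # false

    # After normalising the text (an assumption of this sketch), the content hashes agree.
    puts content_hash(File.read(a.path).strip) == content_hash(File.read(b.path).strip) # true
  end
end
```

A byte-level hash catches exact copies under different paths, while a text-level hash can also match documents whose extracted content is identical even though the files differ.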
@@ -14,7 +14,7 @@ module Ragdoll
  has_many :embeddings, through: :search_results
 
  validates :query, presence: true
- validates :query_embedding, presence: true
+ validates :query_embedding, presence: false, allow_nil: true
  validates :search_type, presence: true, inclusion: { in: %w[semantic hybrid fulltext] }
  validates :results_count, presence: true, numericality: { greater_than_or_equal_to: 0 }
 
@@ -1,9 +1,11 @@
  # frozen_string_literal: true
 
+ require "securerandom"
+
  module Ragdoll
  class DocumentManagement
  class << self
- def add_document(location, content, metadata = {})
+ def add_document(location, content, metadata = {}, force: false)
  # Ensure location is an absolute path if it's a file path
  absolute_location = location.start_with?("http") || location.start_with?("ftp") ? location : File.expand_path(location)
 
@@ -14,17 +16,21 @@ module Ragdoll
  Time.current
  end
 
- # Check if document already exists with same location and file_modified_at
- existing_document = Ragdoll::Document.find_by(
- location: absolute_location,
- file_modified_at: file_modified_at
- )
+ # Skip duplicate detection if force is true
+ unless force
+ existing_document = find_duplicate_document(absolute_location, content, metadata, file_modified_at)
+ return existing_document.id.to_s if existing_document
+ end
 
- # Return existing document ID if found (skip duplicate)
- return existing_document.id.to_s if existing_document
+ # Modify location if force is used to avoid unique constraint violation
+ final_location = if force
+ "#{absolute_location}#forced_#{Time.current.to_i}_#{SecureRandom.hex(4)}"
+ else
+ absolute_location
+ end
 
  document = Ragdoll::Document.create!(
- location: absolute_location,
+ location: final_location,
  title: metadata[:title] || metadata["title"] || extract_title_from_location(location),
  document_type: metadata[:document_type] || metadata["document_type"] || "text",
  metadata: metadata.is_a?(Hash) ? metadata : {},
@@ -100,6 +106,108 @@ module Ragdoll
 
  private
 
+ def find_duplicate_document(location, content, metadata, file_modified_at)
+ # Primary check: exact location match (simple duplicate detection)
+ existing = Ragdoll::Document.find_by(location: location)
+ return existing if existing
+
+ # Secondary check: exact location and file modification time (for files)
+ existing_with_time = Ragdoll::Document.find_by(
+ location: location,
+ file_modified_at: file_modified_at
+ )
+ return existing_with_time if existing_with_time
+
+ # Enhanced duplicate detection for file-based documents
+ if File.exist?(location) && !location.start_with?("http")
+ file_size = File.size(location)
+ content_hash = calculate_file_hash(location)
+
+ # Check for documents with same file hash (most reliable)
+ potential_duplicates = Ragdoll::Document.where("metadata->>'file_hash' = ?", content_hash)
+ return potential_duplicates.first if potential_duplicates.any?
+
+ # Check for documents with same file size and similar metadata
+ same_size_docs = Ragdoll::Document.where("metadata->>'file_size' = ?", file_size.to_s)
+ same_size_docs.each do |doc|
+ return doc if documents_are_duplicates?(doc, location, content, metadata, file_size, content_hash)
+ end
+ end
+
+ # For non-file documents (URLs, etc), check content-based duplicates
+ unless File.exist?(location)
+ return find_content_based_duplicate(content, metadata)
+ end
+
+ nil
+ end
+
+ def documents_are_duplicates?(existing_doc, location, content, metadata, file_size, content_hash)
+ # Compare multiple factors to determine if documents are duplicates
+
+ # Check filename similarity (basename without extension)
+ existing_basename = File.basename(existing_doc.location, File.extname(existing_doc.location))
+ new_basename = File.basename(location, File.extname(location))
+ return false unless existing_basename == new_basename
+
+ # Check content length similarity (within 5% tolerance)
+ if content.present? && existing_doc.content.present?
+ content_length_diff = (content.length - existing_doc.content.length).abs
+ max_length = [content.length, existing_doc.content.length].max
+ return false if max_length > 0 && (content_length_diff.to_f / max_length) > 0.05
+ end
+
+ # Check key metadata fields
+ existing_metadata = existing_doc.metadata || {}
+ new_metadata = metadata || {}
+
+ # Compare file type/document type
+ return false if existing_doc.document_type != (new_metadata[:document_type] || new_metadata["document_type"] || "text")
+
+ # Compare title if available
+ existing_title = existing_metadata["title"] || existing_doc.title
+ new_title = new_metadata[:title] || new_metadata["title"] || extract_title_from_location(location)
+ return false if existing_title && new_title && existing_title != new_title
+
+ # If we reach here, documents are likely duplicates
+ true
+ end
+
+ def find_content_based_duplicate(content, metadata)
+ return nil unless content.present?
+
+ content_hash = calculate_content_hash(content)
+ title = metadata[:title] || metadata["title"]
+
+ # Look for documents with same content hash
+ Ragdoll::Document.where("metadata->>'content_hash' = ?", content_hash).first ||
+ # Look for documents with same title and similar content length (within 5% tolerance)
+ (title ? find_by_title_and_content_similarity(title, content) : nil)
+ end
+
+ def find_by_title_and_content_similarity(title, content)
+ content_length = content.length
+ tolerance = content_length * 0.05
+
+ Ragdoll::Document.where(title: title).find do |doc|
+ doc.content.present? &&
+ (doc.content.length - content_length).abs <= tolerance
+ end
+ end
+
+ def calculate_file_hash(file_path)
+ require 'digest'
+ Digest::SHA256.file(file_path).hexdigest
+ rescue StandardError => e
+ Rails.logger.warn "Failed to calculate file hash for #{file_path}: #{e.message}" if defined?(Rails)
+ nil
+ end
+
+ def calculate_content_hash(content)
+ require 'digest'
+ Digest::SHA256.hexdigest(content)
+ end
+
  def extract_title_from_location(location)
  File.basename(location, File.extname(location))
  end
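
Two small pieces of the hunk above can be tried in isolation: the 5% content-length tolerance used in `documents_are_duplicates?`, and the suffix appended to the location when `force` is set. The following is a minimal sketch under stated assumptions. The method names `lengths_similar?` and `forced_location` are hypothetical, and `Time.now` stands in for ActiveSupport's `Time.current` so the example runs without Rails.

```ruby
require "securerandom"

# True when the two strings' lengths differ by at most `tolerance`
# (a fraction) of the longer string's length.
def lengths_similar?(a, b, tolerance: 0.05)
  max_length = [a.length, b.length].max
  return true if max_length.zero?

  ((a.length - b.length).abs.to_f / max_length) <= tolerance
end

# Builds a distinct location string for a forced duplicate, in the same
# shape as the diff: "<location>#forced_<unix time>_<8 hex chars>".
def forced_location(location)
  "#{location}#forced_#{Time.now.to_i}_#{SecureRandom.hex(4)}"
end

puts lengths_similar?("a" * 100, "a" * 96) # true  (4% difference)
puts lengths_similar?("a" * 100, "a" * 90) # false (10% difference)
puts forced_location("/tmp/research.pdf")
```

Note that this compares lengths only, as the diff does. Two texts of similar length with different wording would still pass this particular check, which is why the diff combines it with basename, document type, and title comparisons.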
@@ -99,8 +99,6 @@ module Ragdoll
  else
  parse_text # Default to text parsing for unknown formats
  end
- rescue StandardError => e # StandardError => e
- raise ParseError, "#{__LINE__} Failed to parse #{@file_path}: #{e.message}"
  end
 
  private
@@ -109,6 +107,12 @@ module Ragdoll
  content = ""
  metadata = {}
 
+ # Add file-based metadata for duplicate detection
+ if File.exist?(@file_path)
+ metadata[:file_size] = File.size(@file_path)
+ metadata[:file_hash] = calculate_file_hash(@file_path)
+ end
+
  begin
  PDF::Reader.open(@file_path) do |reader|
  # Extract metadata
@@ -144,6 +148,10 @@ module Ragdoll
  metadata[:title] = extract_title_from_filepath
  end
 
+ # Add content hash for duplicate detection
+ # Ensure content is UTF-8 encoded before checking presence
+ metadata[:content_hash] = calculate_content_hash(content) if content && content.length > 0
+
  {
  content: content.strip,
  metadata: metadata,
@@ -155,6 +163,12 @@ module Ragdoll
  content = ""
  metadata = {}
 
+ # Add file-based metadata for duplicate detection
+ if File.exist?(@file_path)
+ metadata[:file_size] = File.size(@file_path)
+ metadata[:file_hash] = calculate_file_hash(@file_path)
+ end
+
  begin
  doc = Docx::Document.open(@file_path)
 
@@ -204,6 +218,10 @@ module Ragdoll
  metadata[:title] = extract_title_from_filepath
  end
 
+ # Add content hash for duplicate detection
+ # Ensure content is UTF-8 encoded before checking presence
+ metadata[:content_hash] = calculate_content_hash(content) if content && content.length > 0
+
  {
  content: content.strip,
  metadata: metadata,
@@ -212,46 +230,31 @@ module Ragdoll
  end
 
  def parse_text
- content = File.read(@file_path, encoding: "UTF-8")
- metadata = {
- file_size: File.size(@file_path),
- encoding: "UTF-8"
- }
-
+ # Determine document type first (before any IO operations)
  document_type = case @file_extension
  when ".md", ".markdown" then "markdown"
  when ".txt" then "text"
  else "text"
  end
 
- # Parse YAML front matter for markdown files
- if document_type == "markdown" && content.start_with?("---\n")
- front_matter, body_content = parse_yaml_front_matter(content)
- if front_matter
- metadata.merge!(front_matter)
- content = body_content
- end
- end
-
- # Add filepath-based title as fallback if no title was found
- if metadata[:title].nil? || (metadata[:title].is_a?(String) && metadata[:title].strip.empty?)
- metadata[:title] = extract_title_from_filepath
+ begin
+ content = File.read(@file_path, encoding: "UTF-8")
+ encoding = "UTF-8"
+ rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
+ # Try with different encoding - read as ISO-8859-1 and force encoding to UTF-8
+ content = File.read(@file_path, encoding: "ISO-8859-1").encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
+ encoding = "ISO-8859-1"
+ rescue Errno::ENOENT, Errno::EACCES => e
+ raise ParseError, "Failed to read file #{@file_path}: #{e.message}"
  end
 
- {
- content: content,
- metadata: metadata,
- document_type: document_type
- }
- rescue Encoding::InvalidByteSequenceError
- # Try with different encoding
- content = File.read(@file_path, encoding: "ISO-8859-1")
  metadata = {
  file_size: File.size(@file_path),
- encoding: "ISO-8859-1"
+ file_hash: calculate_file_hash(@file_path),
+ encoding: encoding
  }
 
- # Try to parse front matter with different encoding too
+ # Parse YAML front matter for markdown files
  if document_type == "markdown" && content.start_with?("---\n")
  front_matter, body_content = parse_yaml_front_matter(content)
  if front_matter
@@ -265,10 +268,14 @@ module Ragdoll
  metadata[:title] = extract_title_from_filepath
  end
 
+ # Add content hash for duplicate detection
+ # Ensure content is UTF-8 encoded before checking presence
+ metadata[:content_hash] = calculate_content_hash(content) if content && content.length > 0
+
  {
  content: content,
  metadata: metadata,
- document_type: document_type.nil? ? "text" : document_type
+ document_type: document_type
  }
  end
 
@@ -296,6 +303,7 @@ module Ragdoll
 
  metadata = {
  file_size: File.size(@file_path),
+ file_hash: calculate_file_hash(@file_path),
  original_format: "html"
  }
 
@@ -306,6 +314,9 @@ module Ragdoll
  metadata[:title] = extract_title_from_filepath
  end
 
+ # Add content hash for duplicate detection
+ metadata[:content_hash] = calculate_content_hash(clean_content) if clean_content.present?
+
  {
  content: clean_content,
  metadata: metadata,
@@ -318,6 +329,7 @@ module Ragdoll
 
  metadata = {
  file_size: File.size(@file_path),
+ file_hash: calculate_file_hash(@file_path),
  file_type: @file_extension.sub(".", ""),
  original_filename: File.basename(@file_path)
  }
@@ -347,6 +359,10 @@ module Ragdoll
  # Add filepath-based title as fallback
  metadata[:title] = extract_title_from_filepath
 
+ # Add content hash for duplicate detection
+ # Ensure content is UTF-8 encoded before checking presence
+ metadata[:content_hash] = calculate_content_hash(content) if content && content.length > 0
+
  puts "✅ DocumentProcessor: Image parsing complete. Content: '#{content[0..100]}...'"
 
  {
@@ -461,5 +477,25 @@ module Ragdoll
  [nil, content]
  end
  end
+
+ # Calculate SHA256 hash of file content for duplicate detection
+ def calculate_file_hash(file_path)
+ require 'digest'
+ Digest::SHA256.file(file_path).hexdigest
+ rescue StandardError => e
+ Rails.logger.warn "Failed to calculate file hash for #{file_path}: #{e.message}" if defined?(Rails)
+ puts "Warning: Failed to calculate file hash for #{file_path}: #{e.message}"
+ nil
+ end
+
+ # Calculate SHA256 hash of text content for duplicate detection
+ def calculate_content_hash(content)
+ require 'digest'
+ Digest::SHA256.hexdigest(content)
+ rescue StandardError => e
+ Rails.logger.warn "Failed to calculate content hash: #{e.message}" if defined?(Rails)
+ puts "Warning: Failed to calculate content hash: #{e.message}"
+ nil
+ end
  end
  end
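
The `parse_text` changes above add an ISO-8859-1 fallback for files that are not valid UTF-8. One point worth keeping in mind is that, in Ruby, reading with only an external encoding tags the string without validating its bytes, so a fallback may need an explicit validity check. The sketch below shows one way to do that in plain Ruby; the helper name `read_text_with_fallback` is hypothetical and this is not the gem's implementation.

```ruby
require "tempfile"

# Hypothetical helper: returns [utf8_text, source_encoding_name].
def read_text_with_fallback(path)
  raw = File.binread(path)

  # Interpret the bytes as UTF-8 and keep them if they are well-formed.
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return [utf8, "UTF-8"] if utf8.valid_encoding?

  # Every byte value maps to a character in ISO-8859-1, so this
  # transcoding step does not raise on arbitrary input.
  latin1 = raw.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
  [latin1, "ISO-8859-1"]
end

Tempfile.create("latin1") do |file|
  file.binmode
  file.write("caf\xE9".b) # "café" encoded as ISO-8859-1
  file.flush

  text, encoding = read_text_with_fallback(file.path)
  puts text     # café
  puts encoding # ISO-8859-1
end
```

Returning the detected encoding alongside the text mirrors the `encoding` metadata field that the diff records.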
@@ -184,7 +184,7 @@ module Ragdoll
  end
 
  # Document management
- def add_document(path:)
+ def add_document(path:, force: false)
  # Parse the document
  parsed = Ragdoll::DocumentProcessor.parse(path)
 
@@ -197,7 +197,7 @@ module Ragdoll
  title: title,
  document_type: parsed[:document_type],
  **parsed[:metadata]
- })
+ }, force: force)
 
  # Queue background jobs for processing if content is available
  embeddings_queued = false
@@ -3,6 +3,6 @@
 
  module Ragdoll
  module Core
- VERSION = "0.1.10"
+ VERSION = "0.1.11"
  end
  end
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ragdoll
  version: !ruby/object:Gem::Version
- version: 0.1.10
+ version: 0.1.11
  platform: ruby
  authors:
  - Dewayne VanHoozer