RubyGems - ragdoll - Versions diffs - 0.1.10 → 0.1.11 - Mend

ragdoll 0.1.10 → 0.1.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml +4 -4
data/CHANGELOG.md +22 -0
data/README.md +37 -1
data/app/models/ragdoll/search.rb +1 -1
data/app/services/ragdoll/document_management.rb +117 -9
data/app/services/ragdoll/document_processor.rb +67 -31
data/lib/ragdoll/core/client.rb +2 -2
data/lib/ragdoll/core/version.rb +1 -1
metadata +1 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 4f7b2c95ede1523e9e01af70394217387d876da6317fed651df3e27cf337cfe9
-  data.tar.gz: a82ae7d541fd06876acb3acaf8f02639234f8b118274621851678a2799c5f559
+  metadata.gz: 255dd5c7e6ccbdeeafe2a0ed74382c5fefb4df2e015942c191fb3747c24a6cb8
+  data.tar.gz: 284ceedd72c305d3dcf5385482b983463ec71ec850514655614e47ad71be17a8
 SHA512:
-  metadata.gz: ba14828a6e743677c84072b9f1bb27743e429531ebdd9fbd3d8553add7bbdad070d709cd617dc620fef4ddc6846085ca79d3bb6d32bae8465c6b3b10acc0692f
-  data.tar.gz: de630ebf15168b562ef686ec6cd9f1cfe532b5bbf495e33a74085b567cf53ce7bb87e7c5c543756c47bd68c98290221b879a1b4d8e5888aac4916d1c1554fe99
+  metadata.gz: 64d603061ba7742699e84a5bc8933de4cbc1ab9a6b748d74b17371c05d855f95a1a2750da713be57f0e5a4a95f304894a73821b3cb6dbb3118623c9c5a1cbce2
+  data.tar.gz: c59f77cffb7026cf07eedf5f6724d17a430e4d241691b5cf380292f7bb4ee045bfacbc15787603f531187b5b03cb68e69bea7576e71c7ac8600dbe334af248ca
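
The digests above can be checked locally against a downloaded artifact. The following is a minimal sketch, assuming the file is already on disk; `sha256_matches?` is a hypothetical helper name, and a temporary file stands in for `data.tar.gz`.

```ruby
require "digest"
require "tempfile"

# Returns true when the SHA256 hex digest of the file at `path`
# equals `expected_hex` (compared case-insensitively).
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex.downcase
end

# Demonstration with a throwaway file standing in for data.tar.gz.
Tempfile.create("artifact") do |f|
  f.write("example payload")
  f.flush
  expected = Digest::SHA256.hexdigest("example payload")
  puts sha256_matches?(f.path, expected)   # true
  puts sha256_matches?(f.path, "0" * 64)   # false
end
```

For a real check, the expected value would be the `data.tar.gz` entry under `SHA256` in `checksums.yaml`.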

data/CHANGELOG.md CHANGED

@@ -6,6 +6,28 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 ## [Unreleased]
+## [0.1.11] - 2025-01-17
+### Added
+- **Force Option for Document Addition**: New `force` parameter in document management to override duplicate detection
+  - Allows forced document addition even when duplicate titles exist
+  - Enables overwriting existing documents when needed
+### Fixed
+- **Search Query Embedding**: Made `query_embedding` parameter optional in search methods
+  - Improved flexibility for search operations that don't require embeddings
+  - Better error handling for search queries without embeddings
+### Changed
+- **Database Setup**: Enhanced database role handling and setup procedures
+  - Improved database connection configuration
+  - Better handling of database roles and permissions
+### Removed
+- **Obsolete Migrations**: Removed outdated RagdollDocuments migration files
+  - Cleaned up legacy migration structure
+  - Streamlined database migration path
 ## [0.1.10] - 2025-01-15
 ### Changed

data/README.md CHANGED

@@ -132,6 +132,10 @@ puts result[:document_id]     # "123"
 puts result[:message]         # "Document 'document' added successfully with ID 123"
 puts result[:embeddings_queued] # true
+# Add document with force option to override duplicate detection
+result = Ragdoll.add_document(path: 'document.pdf', force: true)
+# Creates new document even if duplicate exists
 # Check document processing status
 status = Ragdoll.document_status(id: result[:document_id])
 puts status[:status]          # "processed"
@@ -161,6 +165,37 @@ puts stats[:total_documents]  # 50
 puts stats[:total_embeddings] # 1250
 ```
+### Duplicate Detection
+Ragdoll includes sophisticated duplicate detection to prevent redundant document processing:
+```ruby
+# Automatic duplicate detection (default behavior)
+result1 = Ragdoll.add_document(path: 'research.pdf')
+result2 = Ragdoll.add_document(path: 'research.pdf')
+# result2 returns the same document_id as result1 (duplicate detected)
+# Force adding a duplicate document
+result3 = Ragdoll.add_document(path: 'research.pdf', force: true)
+# Creates a new document with modified location identifier
+# Duplicate detection criteria:
+# 1. Exact location/path match
+# 2. File modification time (for files)
+# 3. File content hash (SHA256)
+# 4. Content hash for text
+# 5. File size and metadata similarity
+# 6. Document title and type matching
+```
+**Duplicate Detection Features:**
+- **Multi-level detection**: Checks location, file hash, content hash, and metadata
+- **Smart similarity**: Detects duplicates even with minor differences (5% content tolerance)
+- **File integrity**: SHA256 hashing for reliable file comparison
+- **URL support**: Content-based detection for web documents
+- **Force option**: Override detection when needed
+- **Performance optimized**: Database indexes for fast lookups
 ### Search and Retrieval
 ```ruby
@@ -348,13 +383,14 @@ end
 ## Current Implementation Status
 ### ✅ **Fully Implemented**
-- **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files
+- **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files with encoding fallback
 - **Embedding generation**: Text chunking and vector embedding creation
 - **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
 - **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
 - **Search functionality**: Semantic search with cosine similarity and usage analytics
 - **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
 - **Document management**: Add, update, delete, list operations
+- **Duplicate detection**: Multi-level duplicate prevention with file hash, content hash, and metadata comparison
 - **Background processing**: ActiveJob integration for async embedding generation
 - **LLM metadata generation**: AI-powered structured content analysis with schema validation
 - **Logging**: Configurable file-based logging with multiple levels
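
The behavior the README describes (a repeated add returns the existing ID, while `force: true` always creates a new record) can be illustrated without a database. This is a rough sketch only: `TinyStore` is a hypothetical in-memory stand-in, not part of Ragdoll, and it checks only the content hash rather than the full list of criteria.

```ruby
require "digest"

# Hypothetical in-memory stand-in for the document store; not Ragdoll's API.
class TinyStore
  def initialize
    @docs = {}      # id => content
    @by_hash = {}   # content hash => id of the first copy
    @next_id = 0
  end

  # Returns the existing id when identical content was already added,
  # unless force is true, in which case a new record is always created.
  def add(content, force: false)
    hash = Digest::SHA256.hexdigest(content)
    return @by_hash[hash] if !force && @by_hash.key?(hash)

    id = (@next_id += 1).to_s
    @docs[id] = content
    @by_hash[hash] ||= id   # keep pointing at the first copy
    id
  end

  def size
    @docs.size
  end
end

store = TinyStore.new
first  = store.add("quarterly report")
second = store.add("quarterly report")
forced = store.add("quarterly report", force: true)
puts first == second   # true
puts forced == first   # false
puts store.size        # 2
```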

data/app/models/ragdoll/search.rb CHANGED

@@ -14,7 +14,7 @@ module Ragdoll
     has_many :embeddings, through: :search_results
     validates :query, presence: true
-    validates :query_embedding, presence: true
+    validates :query_embedding, presence: false, allow_nil: true
     validates :search_type, presence: true, inclusion: { in: %w[semantic hybrid fulltext] }
     validates :results_count, presence: true, numericality: { greater_than_or_equal_to: 0 }

data/app/services/ragdoll/document_management.rb CHANGED

@@ -1,9 +1,11 @@
 # frozen_string_literal: true
+require "securerandom"
 module Ragdoll
   class DocumentManagement
     class << self
-      def add_document(location, content, metadata = {})
+      def add_document(location, content, metadata = {}, force: false)
         # Ensure location is an absolute path if it's a file path
         absolute_location = location.start_with?("http") || location.start_with?("ftp") ? location : File.expand_path(location)
@@ -14,17 +16,21 @@ module Ragdoll
                              Time.current
                            end
-        # Check if document already exists with same location and file_modified_at
-        existing_document = Ragdoll::Document.find_by(
-          location: absolute_location,
-          file_modified_at: file_modified_at
-        )
+        # Skip duplicate detection if force is true
+        unless force
+          existing_document = find_duplicate_document(absolute_location, content, metadata, file_modified_at)
+          return existing_document.id.to_s if existing_document
+        end
-        # Return existing document ID if found (skip duplicate)
-        return existing_document.id.to_s if existing_document
+        # Modify location if force is used to avoid unique constraint violation
+        final_location = if force
+                           "#{absolute_location}#forced_#{Time.current.to_i}_#{SecureRandom.hex(4)}"
+                         else
+                           absolute_location
+                         end
         document = Ragdoll::Document.create!(
-          location: absolute_location,
+          location: final_location,
           title: metadata[:title] || metadata["title"] || extract_title_from_location(location),
           document_type: metadata[:document_type] || metadata["document_type"] || "text",
           metadata: metadata.is_a?(Hash) ? metadata : {},
@@ -100,6 +106,108 @@ module Ragdoll
       private
+      def find_duplicate_document(location, content, metadata, file_modified_at)
+        # Primary check: exact location match (simple duplicate detection)
+        existing = Ragdoll::Document.find_by(location: location)
+        return existing if existing
+        # Secondary check: exact location and file modification time (for files)
+        existing_with_time = Ragdoll::Document.find_by(
+          location: location,
+          file_modified_at: file_modified_at
+        )
+        return existing_with_time if existing_with_time
+        # Enhanced duplicate detection for file-based documents
+        if File.exist?(location) && !location.start_with?("http")
+          file_size = File.size(location)
+          content_hash = calculate_file_hash(location)
+          # Check for documents with same file hash (most reliable)
+          potential_duplicates = Ragdoll::Document.where("metadata->>'file_hash' = ?", content_hash)
+          return potential_duplicates.first if potential_duplicates.any?
+          # Check for documents with same file size and similar metadata
+          same_size_docs = Ragdoll::Document.where("metadata->>'file_size' = ?", file_size.to_s)
+          same_size_docs.each do |doc|
+            return doc if documents_are_duplicates?(doc, location, content, metadata, file_size, content_hash)
+          end
+        end
+        # For non-file documents (URLs, etc), check content-based duplicates
+        unless File.exist?(location)
+          return find_content_based_duplicate(content, metadata)
+        end
+        nil
+      end
+      def documents_are_duplicates?(existing_doc, location, content, metadata, file_size, content_hash)
+        # Compare multiple factors to determine if documents are duplicates
+        # Check filename similarity (basename without extension)
+        existing_basename = File.basename(existing_doc.location, File.extname(existing_doc.location))
+        new_basename = File.basename(location, File.extname(location))
+        return false unless existing_basename == new_basename
+        # Check content length similarity (within 5% tolerance)
+        if content.present? && existing_doc.content.present?
+          content_length_diff = (content.length - existing_doc.content.length).abs
+          max_length = [content.length, existing_doc.content.length].max
+          return false if max_length > 0 && (content_length_diff.to_f / max_length) > 0.05
+        end
+        # Check key metadata fields
+        existing_metadata = existing_doc.metadata || {}
+        new_metadata = metadata || {}
+        # Compare file type/document type
+        return false if existing_doc.document_type != (new_metadata[:document_type] || new_metadata["document_type"] || "text")
+        # Compare title if available
+        existing_title = existing_metadata["title"] || existing_doc.title
+        new_title = new_metadata[:title] || new_metadata["title"] || extract_title_from_location(location)
+        return false if existing_title && new_title && existing_title != new_title
+        # If we reach here, documents are likely duplicates
+        true
+      end
+      def find_content_based_duplicate(content, metadata)
+        return nil unless content.present?
+        content_hash = calculate_content_hash(content)
+        title = metadata[:title] || metadata["title"]
+        # Look for documents with same content hash
+        Ragdoll::Document.where("metadata->>'content_hash' = ?", content_hash).first ||
+        # Look for documents with same title and similar content length (within 5% tolerance)
+        (title ? find_by_title_and_content_similarity(title, content) : nil)
+      end
+      def find_by_title_and_content_similarity(title, content)
+        content_length = content.length
+        tolerance = content_length * 0.05
+        Ragdoll::Document.where(title: title).find do |doc|
+          doc.content.present? &&
+          (doc.content.length - content_length).abs <= tolerance
+        end
+      end
+      def calculate_file_hash(file_path)
+        require 'digest'
+        Digest::SHA256.file(file_path).hexdigest
+      rescue StandardError => e
+        Rails.logger.warn "Failed to calculate file hash for #{file_path}: #{e.message}" if defined?(Rails)
+        nil
+      end
+      def calculate_content_hash(content)
+        require 'digest'
+        Digest::SHA256.hexdigest(content)
+      end
       def extract_title_from_location(location)
         File.basename(location, File.extname(location))
       end
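
Two small pieces of the change above can be tried in isolation: the suffix appended to a forced document's location, and the 5% length tolerance. The sketch below is an approximation under stated assumptions: `forced_location` and `similar_length?` are hypothetical names, `Time.now` replaces ActiveSupport's `Time.current`, and the tolerance follows the form in `documents_are_duplicates?`, which is relative to the longer of the two contents.

```ruby
require "securerandom"

# Appends a fragment-style suffix so a forced copy gets a distinct location.
# Uses Time.now by default; the diff uses Time.current from ActiveSupport.
def forced_location(location, now: Time.now)
  "#{location}#forced_#{now.to_i}_#{SecureRandom.hex(4)}"
end

# True when the two lengths differ by at most `tolerance` (a fraction)
# of the larger one. Two zero lengths count as similar.
def similar_length?(a_len, b_len, tolerance: 0.05)
  max = [a_len, b_len].max
  return true if max.zero?

  (a_len - b_len).abs.to_f / max <= tolerance
end

puts forced_location("/tmp/research.pdf", now: Time.at(0))
  .start_with?("/tmp/research.pdf#forced_0_")   # true
puts similar_length?(1000, 960)   # true  (4% apart)
puts similar_length?(1000, 940)   # false (6% apart)
```

Note that this tolerance compares lengths only, so two different texts of nearly equal length pass it; in the diff it is combined with basename, document type, and title checks before a match is reported.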

data/app/services/ragdoll/document_processor.rb CHANGED

@@ -99,8 +99,6 @@ module Ragdoll
       else
         parse_text # Default to text parsing for unknown formats
       end
-    rescue StandardError => e # StandardError => e
-      raise ParseError, "#{__LINE__} Failed to parse #{@file_path}: #{e.message}"
     end
     private
@@ -109,6 +107,12 @@ module Ragdoll
       content = ""
       metadata = {}
+      # Add file-based metadata for duplicate detection
+      if File.exist?(@file_path)
+        metadata[:file_size] = File.size(@file_path)
+        metadata[:file_hash] = calculate_file_hash(@file_path)
+      end
       begin
         PDF::Reader.open(@file_path) do |reader|
           # Extract metadata
@@ -144,6 +148,10 @@ module Ragdoll
         metadata[:title] = extract_title_from_filepath
       end
+      # Add content hash for duplicate detection
+      # Ensure content is UTF-8 encoded before checking presence
+      metadata[:content_hash] = calculate_content_hash(content) if content && content.length > 0
       {
         content: content.strip,
         metadata: metadata,
@@ -155,6 +163,12 @@ module Ragdoll
       content = ""
       metadata = {}
+      # Add file-based metadata for duplicate detection
+      if File.exist?(@file_path)
+        metadata[:file_size] = File.size(@file_path)
+        metadata[:file_hash] = calculate_file_hash(@file_path)
+      end
       begin
         doc = Docx::Document.open(@file_path)
@@ -204,6 +218,10 @@ module Ragdoll
         metadata[:title] = extract_title_from_filepath
       end
+      # Add content hash for duplicate detection
+      # Ensure content is UTF-8 encoded before checking presence
+      metadata[:content_hash] = calculate_content_hash(content) if content && content.length > 0
       {
         content: content.strip,
         metadata: metadata,
@@ -212,46 +230,31 @@ module Ragdoll
     end
     def parse_text
-      content = File.read(@file_path, encoding: "UTF-8")
-      metadata = {
-        file_size: File.size(@file_path),
-        encoding: "UTF-8"
-      }
+      # Determine document type first (before any IO operations)
       document_type = case @file_extension
                       when ".md", ".markdown" then "markdown"
                       when ".txt" then "text"
                       else "text"
                       end
-      # Parse YAML front matter for markdown files
-      if document_type == "markdown" && content.start_with?("---\n")
-        front_matter, body_content = parse_yaml_front_matter(content)
-        if front_matter
-          metadata.merge!(front_matter)
-          content = body_content
-        end
-      end
-      # Add filepath-based title as fallback if no title was found
-      if metadata[:title].nil? || (metadata[:title].is_a?(String) && metadata[:title].strip.empty?)
-        metadata[:title] = extract_title_from_filepath
+      begin
+        content = File.read(@file_path, encoding: "UTF-8")
+        encoding = "UTF-8"
+      rescue Encoding::InvalidByteSequenceError, Encoding::UndefinedConversionError
+        # Try with different encoding - read as ISO-8859-1 and force encoding to UTF-8
+        content = File.read(@file_path, encoding: "ISO-8859-1").encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
+        encoding = "ISO-8859-1"
+      rescue Errno::ENOENT, Errno::EACCES => e
+        raise ParseError, "Failed to read file #{@file_path}: #{e.message}"
       end
-      {
-        content: content,
-        metadata: metadata,
-        document_type: document_type
-      }
-    rescue Encoding::InvalidByteSequenceError
-      # Try with different encoding
-      content = File.read(@file_path, encoding: "ISO-8859-1")
       metadata = {
         file_size: File.size(@file_path),
-        encoding: "ISO-8859-1"
+        file_hash: calculate_file_hash(@file_path),
+        encoding: encoding
       }
-      # Try to parse front matter with different encoding too
+      # Parse YAML front matter for markdown files
       if document_type == "markdown" && content.start_with?("---\n")
         front_matter, body_content = parse_yaml_front_matter(content)
         if front_matter
@@ -265,10 +268,14 @@ module Ragdoll
         metadata[:title] = extract_title_from_filepath
       end
+      # Add content hash for duplicate detection
+      # Ensure content is UTF-8 encoded before checking presence
+      metadata[:content_hash] = calculate_content_hash(content) if content && content.length > 0
       {
         content: content,
         metadata: metadata,
-        document_type: document_type.nil? ? "text" : document_type
+        document_type: document_type
       }
     end
@@ -296,6 +303,7 @@ module Ragdoll
       metadata = {
         file_size: File.size(@file_path),
+        file_hash: calculate_file_hash(@file_path),
         original_format: "html"
       }
@@ -306,6 +314,9 @@ module Ragdoll
         metadata[:title] = extract_title_from_filepath
       end
+      # Add content hash for duplicate detection
+      metadata[:content_hash] = calculate_content_hash(clean_content) if clean_content.present?
       {
         content: clean_content,
         metadata: metadata,
@@ -318,6 +329,7 @@ module Ragdoll
       metadata = {
         file_size: File.size(@file_path),
+        file_hash: calculate_file_hash(@file_path),
         file_type: @file_extension.sub(".", ""),
         original_filename: File.basename(@file_path)
       }
@@ -347,6 +359,10 @@ module Ragdoll
       # Add filepath-based title as fallback
       metadata[:title] = extract_title_from_filepath
+      # Add content hash for duplicate detection
+      # Ensure content is UTF-8 encoded before checking presence
+      metadata[:content_hash] = calculate_content_hash(content) if content && content.length > 0
       puts "✅ DocumentProcessor: Image parsing complete. Content: '#{content[0..100]}...'"
       {
@@ -461,5 +477,25 @@ module Ragdoll
         [nil, content]
       end
     end
+    # Calculate SHA256 hash of file content for duplicate detection
+    def calculate_file_hash(file_path)
+      require 'digest'
+      Digest::SHA256.file(file_path).hexdigest
+    rescue StandardError => e
+      Rails.logger.warn "Failed to calculate file hash for #{file_path}: #{e.message}" if defined?(Rails)
+      puts "Warning: Failed to calculate file hash for #{file_path}: #{e.message}"
+      nil
+    end
+    # Calculate SHA256 hash of text content for duplicate detection
+    def calculate_content_hash(content)
+      require 'digest'
+      Digest::SHA256.hexdigest(content)
+    rescue StandardError => e
+      Rails.logger.warn "Failed to calculate content hash: #{e.message}" if defined?(Rails)
+      puts "Warning: Failed to calculate content hash: #{e.message}"
+      nil
+    end
   end
 end
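
The encoding fallback in `parse_text` can also be approached by validating the bytes explicitly. In Ruby, reading a file with only an external encoding tags the string and does not by itself validate the bytes, so this sketch checks `valid_encoding?` before falling back. It is one possible way to get the same effect, not the gem's implementation, and `read_text_with_fallback` is a hypothetical name.

```ruby
require "tempfile"

# Reads a file as UTF-8 when its bytes are valid UTF-8; otherwise
# reinterprets them as ISO-8859-1 and transcodes to UTF-8.
# Returns [content, source_encoding_name].
def read_text_with_fallback(path)
  raw = File.binread(path)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return [utf8, "UTF-8"] if utf8.valid_encoding?

  latin1 = raw.force_encoding(Encoding::ISO_8859_1)
  [latin1.encode(Encoding::UTF_8), "ISO-8859-1"]
end

Tempfile.create("latin1") do |f|
  f.binmode
  f.write("caf\xE9".b)   # "café" encoded as ISO-8859-1
  f.flush
  content, encoding = read_text_with_fallback(f.path)
  puts encoding                 # ISO-8859-1
  puts content == "caf\u00E9"   # true
end
```

Every byte value maps to a character in ISO-8859-1, so the fallback branch does not raise on arbitrary input, although the result is only meaningful if the file really was Latin-1 text.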

data/lib/ragdoll/core/client.rb CHANGED

@@ -184,7 +184,7 @@ module Ragdoll
       end
       # Document management
-      def add_document(path:)
+      def add_document(path:, force: false)
         # Parse the document
         parsed = Ragdoll::DocumentProcessor.parse(path)
@@ -197,7 +197,7 @@ module Ragdoll
                                                    title: title,
                                                    document_type: parsed[:document_type],
                                                    **parsed[:metadata]
-                                                 })
+                                                 }, force: force)
         # Queue background jobs for processing if content is available
         embeddings_queued = false

data/lib/ragdoll/core/version.rb CHANGED

@@ -3,6 +3,6 @@
 module Ragdoll
   module Core
-    VERSION = "0.1.10"
+    VERSION = "0.1.11"
   end
 end

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ragdoll
 version: !ruby/object:Gem::Version
-  version: 0.1.10
+  version: 0.1.11
 platform: ruby
 authors:
 - Dewayne VanHoozer