ragdoll 0.1.9 → 0.1.11

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: cde84c4b5bbf1e8296bdd762ee78acb2f69663e493ce23b0941ada9d1201bdcd
- data.tar.gz: f8bc456d3c536a295920bc1c806974b2b39f08977a8761604c7a192b83e756d2
+ metadata.gz: 255dd5c7e6ccbdeeafe2a0ed74382c5fefb4df2e015942c191fb3747c24a6cb8
+ data.tar.gz: 284ceedd72c305d3dcf5385482b983463ec71ec850514655614e47ad71be17a8
  SHA512:
- metadata.gz: c1ce0e46be45fe8004930ec231a83a59f31039f4908be2a0e0ba67043237f1ea03bc00991820f6928a6ef5baa6ca910547876f21ddad5a7ead2d6384192e7708
- data.tar.gz: e3f50e1205b4ba755c6a978acb06240b7b1fa729f4fa9bef33f956a9b245ad3d3323612f300902051237ffa71a763fc6db8d8e0fedc4f2761c46a977b42d6958
+ metadata.gz: 64d603061ba7742699e84a5bc8933de4cbc1ab9a6b748d74b17371c05d855f95a1a2750da713be57f0e5a4a95f304894a73821b3cb6dbb3118623c9c5a1cbce2
+ data.tar.gz: c59f77cffb7026cf07eedf5f6724d17a430e4d241691b5cf380292f7bb4ee045bfacbc15787603f531187b5b03cb68e69bea7576e71c7ac8600dbe334af248ca
data/CHANGELOG.md CHANGED
@@ -6,7 +6,58 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

  ## [Unreleased]

- *Note: These features will be included in the next release (likely v0.1.9) featuring comprehensive search tracking and analytics capabilities.*
+ ## [0.1.11] - 2025-01-17
+
+ ### Added
+ - **Force Option for Document Addition**: New `force` parameter in document management to override duplicate detection
+ - Allows forced document addition even when duplicate titles exist
+ - Enables overwriting existing documents when needed
+
+ ### Fixed
+ - **Search Query Embedding**: Made `query_embedding` parameter optional in search methods
+ - Improved flexibility for search operations that don't require embeddings
+ - Better error handling for search queries without embeddings
+
+ ### Changed
+ - **Database Setup**: Enhanced database role handling and setup procedures
+ - Improved database connection configuration
+ - Better handling of database roles and permissions
+
+ ### Removed
+ - **Obsolete Migrations**: Removed outdated RagdollDocuments migration files
+ - Cleaned up legacy migration structure
+ - Streamlined database migration path
+
+ ## [0.1.10] - 2025-01-15
+
+ ### Changed
+ - Continued improvements to search performance and accuracy
+
+ ### Added
+ - **Hybrid Search**: Complete implementation combining semantic and full-text search capabilities
+ - Configurable weights for semantic vs text search (default: 70% semantic, 30% text)
+ - Deduplication of results by document ID
+ - Combined scoring system for unified result ranking
+ - **Full-text Search**: PostgreSQL full-text search with tsvector indexing
+ - Per-word match ratio scoring (0.0 to 1.0)
+ - GIN index for high-performance text search
+ - Search across title, summary, keywords, and description fields
+ - **Enhanced Search API**: Complete search type delegation at top-level Ragdoll namespace
+ - `Ragdoll.hybrid_search` method for combined semantic and text search
+ - `Ragdoll::Document.search_content` for full-text search capabilities
+ - Consistent parameter handling across all search methods
+
+ ### Changed
+ - **Search Architecture**: Unified search interface supporting semantic, fulltext, and hybrid modes
+ - **Database Schema**: Added search_vector column with GIN indexing for full-text search performance
+
+ ### Technical Details
+ - Full-text search uses PostgreSQL's built-in tsvector capabilities
+ - Hybrid search combines cosine similarity (semantic) with text match ratios
+ - Results are ranked by weighted combined scores
+ - All search methods maintain backward compatibility
+
+ ## [0.1.9] - 2025-01-10

  ### Added
  - **Initial CHANGELOG**: Added comprehensive CHANGELOG.md following Keep a Changelog format
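The 0.1.10 notes above describe hybrid scoring only informally (70% semantic / 30% text by default, deduplicated by document ID, ranked by a weighted combined score). A minimal sketch of that arithmetic, not taken from the gem's internals:

```ruby
# Hedged illustration of the weighted combination described in the 0.1.10 notes.
# The 0.7/0.3 weights come from the changelog; the scores and variable names
# are invented for the example.
semantic_weight   = 0.7   # default semantic share
text_weight       = 0.3   # default full-text share
cosine_similarity = 0.82  # hypothetical semantic score for one document
text_match_ratio  = 0.5   # hypothetical fraction of query words matched (0.0..1.0)

combined_score = semantic_weight * cosine_similarity + text_weight * text_match_ratio
# => 0.724 — duplicate document IDs are dropped and results are ranked by this value
```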
@@ -40,7 +91,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  - **Test Coverage**: Added coverage directory to .gitignore for cleaner repository state

  ### Technical Details
- - Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
+ - Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
  - All changes maintain backward compatibility
  - No breaking API changes

@@ -141,6 +192,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  - **Database Schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
  - **Dual Metadata Architecture**: Separate LLM-generated content analysis and file properties
  - **Search Functionality**: Semantic search with cosine similarity and usage analytics
+ - **Hybrid Search**: Complete implementation combining semantic and full-text search with configurable weights
+ - **Full-text Search**: PostgreSQL tsvector-based text search with GIN indexing
  - **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
  - **Document Management**: Add, update, delete, list operations
  - **Background Processing**: ActiveJob integration for async embedding generation
@@ -150,7 +203,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
  ### 🚧 In Development
  - **Image Processing**: Framework exists but vision AI integration needs completion
  - **Audio Processing**: Framework exists but speech-to-text integration needs completion
- - **Hybrid Search**: Combining semantic and full-text search capabilities

  ### 📋 Planned Features
  - **Multi-modal Search**: Search across text, image, and audio content types
@@ -161,6 +213,18 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),

  ## Migration Guide

+ ### From 0.1.9 to 0.1.10
+ - **New Search Methods**: `Ragdoll.hybrid_search` and `Ragdoll::Document.search_content` methods now available
+ - **Database Migration**: New search_vector column added to documents table with GIN index for full-text search
+ - **API Enhancement**: All search methods now support unified parameter interface
+ - **Backward Compatibility**: Existing `Ragdoll.search` method unchanged, continues to work as before
+ - **CLI Integration**: ragdoll-cli now requires ragdoll >= 0.1.10 for hybrid and full-text search support
+
+ ### From 0.1.8 to 0.1.9
+ - **CHANGELOG Addition**: Comprehensive changelog and feature tracking added
+ - **API Method Consistency**: `hybrid_search` method properly delegated to top-level namespace
+ - **No Breaking Changes**: All existing functionality remains compatible
+
  ### From 0.1.7 to 0.1.8
  - New search tracking tables will be automatically created via migrations
  - No breaking changes to existing API
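The 0.1.9 → 0.1.10 migration note above only names the new column and index. As a rough sketch of what an equivalent ActiveRecord migration looks like (the gem ships its own migrations; the class name and Rails version here are placeholders, only `ragdoll_documents` and `search_vector` come from this diff):

```ruby
# Illustrative only — not the migration shipped with the gem.
class AddSearchVectorToRagdollDocuments < ActiveRecord::Migration[7.1]
  def change
    # tsvector column used by search_content, indexed with GIN for fast @@ matches
    add_column :ragdoll_documents, :search_vector, :tsvector
    add_index  :ragdoll_documents, :search_vector, using: :gin
  end
end
```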
@@ -198,4 +262,4 @@ This project is licensed under the MIT License - see the LICENSE file for detail

  ---

- *This changelog is automatically maintained and reflects the actual implementation status of features.*
+ *This changelog is automatically maintained and reflects the actual implementation status of features.*
data/README.md CHANGED
@@ -22,6 +22,8 @@

  Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.

+ RAG does not have to be hard, and it gets simpler every week. The frontier LLM providers are starting to incorporate RAG services; OpenAI, for example, offers a vector search service. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
+
  ## Overview

  Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
@@ -130,6 +132,10 @@ puts result[:document_id] # "123"
  puts result[:message] # "Document 'document' added successfully with ID 123"
  puts result[:embeddings_queued] # true

+ # Add document with force option to override duplicate detection
+ result = Ragdoll.add_document(path: 'document.pdf', force: true)
+ # Creates new document even if duplicate exists
+
  # Check document processing status
  status = Ragdoll.document_status(id: result[:document_id])
  puts status[:status] # "processed"
@@ -159,6 +165,37 @@ puts stats[:total_documents] # 50
  puts stats[:total_embeddings] # 1250
  ```

+ ### Duplicate Detection
+
+ Ragdoll includes sophisticated duplicate detection to prevent redundant document processing:
+
+ ```ruby
+ # Automatic duplicate detection (default behavior)
+ result1 = Ragdoll.add_document(path: 'research.pdf')
+ result2 = Ragdoll.add_document(path: 'research.pdf')
+ # result2 returns the same document_id as result1 (duplicate detected)
+
+ # Force adding a duplicate document
+ result3 = Ragdoll.add_document(path: 'research.pdf', force: true)
+ # Creates a new document with modified location identifier
+
+ # Duplicate detection criteria:
+ # 1. Exact location/path match
+ # 2. File modification time (for files)
+ # 3. File content hash (SHA256)
+ # 4. Content hash for text
+ # 5. File size and metadata similarity
+ # 6. Document title and type matching
+ ```
+
+ **Duplicate Detection Features:**
+ - **Multi-level detection**: Checks location, file hash, content hash, and metadata
+ - **Smart similarity**: Detects duplicates even with minor differences (5% content tolerance)
+ - **File integrity**: SHA256 hashing for reliable file comparison
+ - **URL support**: Content-based detection for web documents
+ - **Force option**: Override detection when needed
+ - **Performance optimized**: Database indexes for fast lookups
+
  ### Search and Retrieval

  ```ruby
@@ -202,6 +239,53 @@ results = Ragdoll.hybrid_search(
  )
  ```

+ ### Keywords Search
+
+ Ragdoll supports powerful keywords-based search that can be used standalone or combined with semantic search. The keywords system uses PostgreSQL array operations for high performance and supports both partial matching (overlap) and exact matching (contains all).
+
+ ```ruby
+ # Keywords-only search (overlap - documents containing any of the keywords)
+ results = Ragdoll::Document.search_by_keywords(['machine', 'learning', 'ai'])
+
+ # Results are sorted by match count (documents with more keyword matches rank higher)
+ results.each do |doc|
+ puts "#{doc.title}: #{doc.keywords_match_count} matches"
+ end
+
+ # Exact keywords search (contains all - documents must have ALL keywords)
+ results = Ragdoll::Document.search_by_keywords_all(['ruby', 'programming'])
+
+ # Results are sorted by focus (fewer total keywords = more focused document)
+ results.each do |doc|
+ puts "#{doc.title}: #{doc.total_keywords_count} total keywords"
+ end
+
+ # Combined semantic + keywords search for best results
+ results = Ragdoll.search(
+ query: 'artificial intelligence applications',
+ keywords: ['ai', 'machine learning', 'neural networks'],
+ limit: 10
+ )
+
+ # Keywords search with options
+ results = Ragdoll::Document.search_by_keywords(
+ ['web', 'javascript', 'frontend'],
+ limit: 20
+ )
+
+ # Case-insensitive keyword matching (automatically normalized)
+ results = Ragdoll::Document.search_by_keywords(['Python', 'DATA-SCIENCE', 'ai'])
+ # Will match documents with keywords: ['python', 'data-science', 'ai']
+ ```
+
+ **Keywords Search Features:**
+ - **High Performance**: Uses PostgreSQL GIN indexes for fast array operations
+ - **Flexible Matching**: Supports both overlap (`&&`) and contains (`@>`) operators
+ - **Smart Scoring**: Results ordered by match count or document focus
+ - **Case Insensitive**: Automatic keyword normalization
+ - **Integration Ready**: Works seamlessly with semantic search
+ - **Inspired by `find_matching_entries.rb`**: Optimized for PostgreSQL arrays
+
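As a quick illustration of the two operators named in the feature list above, grounded in the `search_by_keywords` / `search_by_keywords_all` implementations further down in this diff (the SQL in the comments is approximate):

```ruby
# Overlap (&&): any shared keyword qualifies
#   ... WHERE keywords && '{"machine","learning","ai"}'::text[]
any_match = Ragdoll::Document.search_by_keywords(%w[machine learning ai], limit: 5)

# Contains (@>): the document must carry every listed keyword
#   ... WHERE keywords @> '{"ruby","programming"}'::text[]
all_match = Ragdoll::Document.search_by_keywords_all(%w[ruby programming], limit: 5)
```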
  ### Search Analytics and Tracking

  Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
@@ -299,13 +383,14 @@ end
  ## Current Implementation Status

  ### ✅ **Fully Implemented**
- - **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files
+ - **Text document processing**: PDF, DOCX, HTML, Markdown, plain text files with encoding fallback
  - **Embedding generation**: Text chunking and vector embedding creation
  - **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
  - **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
  - **Search functionality**: Semantic search with cosine similarity and usage analytics
  - **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
  - **Document management**: Add, update, delete, list operations
+ - **Duplicate detection**: Multi-level duplicate prevention with file hash, content hash, and metadata comparison
  - **Background processing**: ActiveJob integration for async embedding generation
  - **LLM metadata generation**: AI-powered structured content analysis with schema validation
  - **Logging**: Configurable file-based logging with multiple levels
data/Rakefile CHANGED
@@ -49,8 +49,10 @@ task :setup_test_db do
  puts "Warning: Could not install pgvector extension: #{e.message}"
  end

- # Run migrations
- Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: true, logger: nil))
+ # Reset and run migrations (drops all tables and re-runs migrations)
+ # This ensures clean state for tests regardless of previous migration versions
+ Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: false, logger: nil))
+ Ragdoll::Core::Database.reset!
  puts "Test database setup complete"
  end

@@ -142,10 +142,12 @@ module Ragdoll
  def keywords_array
  return [] unless keywords.present?

+ # After migration, keywords is now a PostgreSQL array
  case keywords
  when Array
- keywords
+ keywords.map(&:to_s).map(&:strip).reject(&:empty?)
  when String
+ # Fallback for any remaining string data (shouldn't happen after migration)
  keywords.split(",").map(&:strip).reject(&:empty?)
  else
  []
@@ -153,17 +155,23 @@ module Ragdoll
  end

  def add_keyword(keyword)
+ return if keyword.blank?
+
  current_keywords = keywords_array
- return if current_keywords.include?(keyword.strip)
+ normalized_keyword = keyword.to_s.strip.downcase
+ return if current_keywords.map(&:downcase).include?(normalized_keyword)

- current_keywords << keyword.strip
- self.keywords = current_keywords.join(", ")
+ current_keywords << normalized_keyword
+ self.keywords = current_keywords
  end

  def remove_keyword(keyword)
+ return if keyword.blank?
+
  current_keywords = keywords_array
- current_keywords.delete(keyword.strip)
- self.keywords = current_keywords.join(", ")
+ normalized_keyword = keyword.to_s.strip.downcase
+ current_keywords.reject! { |k| k.downcase == normalized_keyword }
+ self.keywords = current_keywords
  end

  # Metadata accessors for common fields
@@ -249,15 +257,110 @@ module Ragdoll
  puts "Metadata generation failed: #{e.message}"
  end

- # PostgreSQL full-text search on metadata fields
+ # PostgreSQL full-text search on metadata fields with per-word match-ratio [0.0..1.0]
  def self.search_content(query, **options)
  return none if query.blank?

- # Use PostgreSQL's built-in full-text search across metadata fields
- where(
- "to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(metadata->>'summary', '') || ' ' || COALESCE(metadata->>'keywords', '') || ' ' || COALESCE(metadata->>'description', '')) @@ plainto_tsquery('english', ?)",
- query
- ).limit(options[:limit] || 20)
+ # Split into unique alphanumeric words
+ words = query.downcase.scan(/[[:alnum:]]+/).uniq
+ return none if words.empty?
+
+ limit = options[:limit] || 20
+ threshold = options[:threshold] || 0.0
+
+ # Use precomputed tsvector column if it exists, otherwise build on the fly
+ if column_names.include?("search_vector")
+ tsvector = "#{table_name}.search_vector"
+ else
+ # Build tsvector from title and metadata fields
+ text_expr =
+ "COALESCE(title, '') || ' ' || " \
+ "COALESCE(metadata->>'summary', '') || ' ' || " \
+ "COALESCE(metadata->>'keywords', '') || ' ' || " \
+ "COALESCE(metadata->>'description', '')"
+ tsvector = "to_tsvector('english', #{text_expr})"
+ end
+
+ # Prepare sanitized tsquery terms
+ tsqueries = words.map do |word|
+ sanitize_sql_array(["plainto_tsquery('english', ?)", word])
+ end
+
+ # Combine per-word tsqueries with OR so PostgreSQL can use the GIN index
+ combined_tsquery = tsqueries.join(' || ')
+
+ # Score each match (1 if present, 0 if not), sum them
+ score_terms = tsqueries.map { |tsq| "(#{tsvector} @@ #{tsq})::int" }
+ score_sum = score_terms.join(' + ')
+
+ # Similarity ratio: fraction of query words present
+ similarity_sql = "(#{score_sum})::float / #{words.size}"
+
+ # Start with basic search query
+ query = select("#{table_name}.*, #{similarity_sql} AS fulltext_similarity")
+
+ # Build where conditions
+ conditions = ["#{tsvector} @@ (#{combined_tsquery})"]
+
+ # Add status filter (default to processed unless overridden)
+ status = options[:status] || 'processed'
+ conditions << "#{table_name}.status = '#{status}'"
+
+ # Add document type filter if specified
+ if options[:document_type].present?
+ conditions << sanitize_sql_array(["#{table_name}.document_type = ?", options[:document_type]])
+ end
+
+ # Add threshold filtering if specified
+ if threshold > 0.0
+ conditions << "#{similarity_sql} >= #{threshold}"
+ end
+
+ # Combine all conditions
+ where_clause = conditions.join(' AND ')
+
+ # Materialize to array to avoid COUNT/SELECT alias conflicts in some AR versions
+ query.where(where_clause)
+ .order(Arel.sql("fulltext_similarity DESC, updated_at DESC"))
+ .limit(limit)
+ .to_a
+ end
+
+ # Search documents by keywords using PostgreSQL array operations
+ # Returns documents that match keywords with scoring based on match count
+ # Inspired by find_matching_entries.rb algorithm but optimized for PostgreSQL arrays
+ def self.search_by_keywords(keywords_array, **options)
+ return where("1 = 0") if keywords_array.blank?
+
+ # Normalize keywords to lowercase strings array
+ normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
+ return where("1 = 0") if normalized_keywords.empty?
+
+ limit = options[:limit] || 20
+
+ # Use PostgreSQL array overlap operator with proper array literal
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+ array_literal = "'{#{quoted_keywords}}'::text[]"
+ where("keywords && #{array_literal}")
+ .order("created_at DESC")
+ .limit(limit)
+ end
+
+ # Find documents that contain ALL specified keywords (exact array matching)
+ def self.search_by_keywords_all(keywords_array, **options)
+ return where("1 = 0") if keywords_array.blank?
+
+ normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
+ return where("1 = 0") if normalized_keywords.empty?
+
+ limit = options[:limit] || 20
+
+ # Use PostgreSQL array contains operator with proper array literal
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+ array_literal = "'{#{quoted_keywords}}'::text[]"
+ where("keywords @> #{array_literal}")
+ .order("created_at DESC")
+ .limit(limit)
  end

  # Faceted search by metadata fields
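A small usage sketch against the `search_content` method added above; `fulltext_similarity` is the per-word match ratio the method selects, so a three-word query that matches two words in a document scores 2/3 ≈ 0.67:

```ruby
# threshold is the minimum fraction of query words that must be present (0.0..1.0);
# the document list and query here are hypothetical.
docs = Ragdoll::Document.search_content("neural network ruby", limit: 10, threshold: 0.5)
docs.each { |doc| puts "#{doc.title}: #{doc.fulltext_similarity}" }
```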
@@ -64,10 +64,26 @@ module Ragdoll
  scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]

  # Document-level filters require joining through embeddable (STI Content) to documents
- if filters[:document_type]
+ needs_document_join = filters[:document_type] || filters[:keywords]
+
+ if needs_document_join
  scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
  .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
- .where("ragdoll_documents.document_type = ?", filters[:document_type])
+ end
+
+ if filters[:document_type]
+ scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
+ end
+
+ # Keywords filtering using PostgreSQL array operations
+ if filters[:keywords] && filters[:keywords].any?
+ normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
+ if normalized_keywords.any?
+ # Use PostgreSQL array overlap operator with proper array literal
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+ array_literal = "'{#{quoted_keywords}}'::text[]"
+ scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
+ end
  end

  # Use pgvector for similarity search
@@ -83,10 +99,26 @@ module Ragdoll
  scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]

  # Document-level filters require joining through embeddable (STI Content) to documents
- if filters[:document_type]
+ needs_document_join = filters[:document_type] || filters[:keywords]
+
+ if needs_document_join
  scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
  .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
- .where("ragdoll_documents.document_type = ?", filters[:document_type])
+ end
+
+ if filters[:document_type]
+ scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
+ end
+
+ # Keywords filtering using PostgreSQL array operations
+ if filters[:keywords] && filters[:keywords].any?
+ normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
+ if normalized_keywords.any?
+ # Use PostgreSQL array overlap operator with proper array literal
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+ array_literal = "'{#{quoted_keywords}}'::text[]"
+ scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
+ end
  end

  search_with_pgvector_stats(query_embedding, scope, limit, threshold)
@@ -14,7 +14,7 @@ module Ragdoll
  has_many :embeddings, through: :search_results

  validates :query, presence: true
- validates :query_embedding, presence: true
+ validates :query_embedding, presence: false, allow_nil: true
  validates :search_type, presence: true, inclusion: { in: %w[semantic hybrid fulltext] }
  validates :results_count, presence: true, numericality: { greater_than_or_equal_to: 0 }

@@ -1,9 +1,11 @@
  # frozen_string_literal: true

+ require "securerandom"
+
  module Ragdoll
  class DocumentManagement
  class << self
- def add_document(location, content, metadata = {})
+ def add_document(location, content, metadata = {}, force: false)
  # Ensure location is an absolute path if it's a file path
  absolute_location = location.start_with?("http") || location.start_with?("ftp") ? location : File.expand_path(location)

@@ -14,17 +16,21 @@ module Ragdoll
  Time.current
  end

- # Check if document already exists with same location and file_modified_at
- existing_document = Ragdoll::Document.find_by(
- location: absolute_location,
- file_modified_at: file_modified_at
- )
+ # Skip duplicate detection if force is true
+ unless force
+ existing_document = find_duplicate_document(absolute_location, content, metadata, file_modified_at)
+ return existing_document.id.to_s if existing_document
+ end

- # Return existing document ID if found (skip duplicate)
- return existing_document.id.to_s if existing_document
+ # Modify location if force is used to avoid unique constraint violation
+ final_location = if force
+ "#{absolute_location}#forced_#{Time.current.to_i}_#{SecureRandom.hex(4)}"
+ else
+ absolute_location
+ end

  document = Ragdoll::Document.create!(
- location: absolute_location,
+ location: final_location,
  title: metadata[:title] || metadata["title"] || extract_title_from_location(location),
  document_type: metadata[:document_type] || metadata["document_type"] || "text",
  metadata: metadata.is_a?(Hash) ? metadata : {},
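For reference, a call against the signature introduced in the hunk above; the README-level API wraps this as `Ragdoll.add_document(path:, force:)`, and the path, content, and metadata here are made up:

```ruby
# force: true skips find_duplicate_document and appends a "#forced_<timestamp>_<hex>"
# suffix to the stored location so the unique constraint is not violated.
doc_id = Ragdoll::DocumentManagement.add_document(
  "/docs/report.pdf",            # hypothetical file path
  File.read("/docs/report.pdf"), # extracted text content
  { title: "Report" },
  force: true
)
```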
@@ -100,6 +106,108 @@ module Ragdoll

  private

+ def find_duplicate_document(location, content, metadata, file_modified_at)
+ # Primary check: exact location match (simple duplicate detection)
+ existing = Ragdoll::Document.find_by(location: location)
+ return existing if existing
+
+ # Secondary check: exact location and file modification time (for files)
+ existing_with_time = Ragdoll::Document.find_by(
+ location: location,
+ file_modified_at: file_modified_at
+ )
+ return existing_with_time if existing_with_time
+
+ # Enhanced duplicate detection for file-based documents
+ if File.exist?(location) && !location.start_with?("http")
+ file_size = File.size(location)
+ content_hash = calculate_file_hash(location)
+
+ # Check for documents with same file hash (most reliable)
+ potential_duplicates = Ragdoll::Document.where("metadata->>'file_hash' = ?", content_hash)
+ return potential_duplicates.first if potential_duplicates.any?
+
+ # Check for documents with same file size and similar metadata
+ same_size_docs = Ragdoll::Document.where("metadata->>'file_size' = ?", file_size.to_s)
+ same_size_docs.each do |doc|
+ return doc if documents_are_duplicates?(doc, location, content, metadata, file_size, content_hash)
+ end
+ end
+
+ # For non-file documents (URLs, etc), check content-based duplicates
+ unless File.exist?(location)
+ return find_content_based_duplicate(content, metadata)
+ end
+
+ nil
+ end
+
+ def documents_are_duplicates?(existing_doc, location, content, metadata, file_size, content_hash)
+ # Compare multiple factors to determine if documents are duplicates
+
+ # Check filename similarity (basename without extension)
+ existing_basename = File.basename(existing_doc.location, File.extname(existing_doc.location))
+ new_basename = File.basename(location, File.extname(location))
+ return false unless existing_basename == new_basename
+
+ # Check content length similarity (within 5% tolerance)
+ if content.present? && existing_doc.content.present?
+ content_length_diff = (content.length - existing_doc.content.length).abs
+ max_length = [content.length, existing_doc.content.length].max
+ return false if max_length > 0 && (content_length_diff.to_f / max_length) > 0.05
+ end
+
+ # Check key metadata fields
+ existing_metadata = existing_doc.metadata || {}
+ new_metadata = metadata || {}
+
+ # Compare file type/document type
+ return false if existing_doc.document_type != (new_metadata[:document_type] || new_metadata["document_type"] || "text")
+
+ # Compare title if available
+ existing_title = existing_metadata["title"] || existing_doc.title
+ new_title = new_metadata[:title] || new_metadata["title"] || extract_title_from_location(location)
+ return false if existing_title && new_title && existing_title != new_title
+
+ # If we reach here, documents are likely duplicates
+ true
+ end
+
+ def find_content_based_duplicate(content, metadata)
+ return nil unless content.present?
+
+ content_hash = calculate_content_hash(content)
+ title = metadata[:title] || metadata["title"]
+
+ # Look for documents with same content hash
+ Ragdoll::Document.where("metadata->>'content_hash' = ?", content_hash).first ||
+ # Look for documents with same title and similar content length (within 5% tolerance)
+ (title ? find_by_title_and_content_similarity(title, content) : nil)
+ end
+
+ def find_by_title_and_content_similarity(title, content)
+ content_length = content.length
+ tolerance = content_length * 0.05
+
+ Ragdoll::Document.where(title: title).find do |doc|
+ doc.content.present? &&
+ (doc.content.length - content_length).abs <= tolerance
+ end
+ end
+
+ def calculate_file_hash(file_path)
+ require 'digest'
+ Digest::SHA256.file(file_path).hexdigest
+ rescue StandardError => e
+ Rails.logger.warn "Failed to calculate file hash for #{file_path}: #{e.message}" if defined?(Rails)
+ nil
+ end
+
+ def calculate_content_hash(content)
+ require 'digest'
+ Digest::SHA256.hexdigest(content)
+ end
+
  def extract_title_from_location(location)
  File.basename(location, File.extname(location))
  end
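The content-hash path above reduces to a plain SHA256 digest over the extracted text; a tiny worked example (the hash shown is the standard SHA-256 digest of the string "hello"):

```ruby
require "digest"

Digest::SHA256.hexdigest("hello")
# => "2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824"

# find_content_based_duplicate then looks this value up in the JSONB metadata:
#   Ragdoll::Document.where("metadata->>'content_hash' = ?", content_hash).first
```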