RubyGems - ragdoll - Versions diffs - 0.1.3 → 0.1.9 - Mend

ragdoll 0.1.3 → 0.1.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (20)

checksums.yaml +4 -4
data/CHANGELOG.md +201 -0
data/README.md +160 -31
data/Rakefile +0 -3
data/app/models/ragdoll/embedding.rb +74 -0
data/app/models/ragdoll/search.rb +165 -0
data/app/models/ragdoll/search_result.rb +121 -0
data/app/services/ragdoll/configuration_service.rb +3 -3
data/app/services/ragdoll/document_processor.rb +124 -1
data/app/services/ragdoll/embedding_service.rb +10 -0
data/app/services/ragdoll/search_engine.rb +64 -6
data/db/migrate/007_create_ragdoll_searches.rb +73 -0
data/db/migrate/008_create_ragdoll_search_results.rb +49 -0
data/lib/ragdoll/core/client.rb +75 -8
data/lib/ragdoll/core/model.rb +13 -0
data/lib/ragdoll/core/version.rb +1 -1
data/lib/ragdoll/core.rb +2 -0
data/lib/ragdoll.rb +17 -0
data/lib/tasks/db.rake +13 -13
metadata +371 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 2016536d66d295c1fe5054aedb77526271692d7562131df9de9e1ad756309459
-  data.tar.gz: 725a221ab132fd9ce77f623114c034d675c626428c9d5d8c72e45e275b08feea
+  metadata.gz: cde84c4b5bbf1e8296bdd762ee78acb2f69663e493ce23b0941ada9d1201bdcd
+  data.tar.gz: f8bc456d3c536a295920bc1c806974b2b39f08977a8761604c7a192b83e756d2
 SHA512:
-  metadata.gz: 221c7d3408a9ec1b4c2f735bf733ae40aab896fdc07858b69de0866acb684c1eb65a3fb054342a6d20cd8a6e0b4e3f0c866f1df3a5bd8e5a475d6c3d72062b1a
-  data.tar.gz: 3228762fd152ff2a2fd5c0f514ae39e11e483dba698b8139f6c0696437a70209fb0576d67fb271eed45c4c7a2c08247dcbd68a2eab8f19cda144c01d38c2299f
+  metadata.gz: c1ce0e46be45fe8004930ec231a83a59f31039f4908be2a0e0ba67043237f1ea03bc00991820f6928a6ef5baa6ca910547876f21ddad5a7ead2d6384192e7708
+  data.tar.gz: e3f50e1205b4ba755c6a978acb06240b7b1fa729f4fa9bef33f956a9b245ad3d3323612f300902051237ffa71a763fc6db8d8e0fedc4f2761c46a977b42d6958
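
The digests in `checksums.yaml` are the SHA-256 and SHA-512 hashes of the two archive members inside the `.gem` file, `metadata.gz` and `data.tar.gz`. A minimal sketch of recomputing them with Ruby's standard `digest` library, assuming the gem has already been unpacked into a local directory (a `.gem` file is a plain tar archive). The helper name `gem_member_digests` and the directory argument are illustrative and not part of RubyGems:

```ruby
require "digest"

# Hypothetical helper: hashes each archive member found in `dir` and returns
# a nested hash shaped like checksums.yaml (algorithm => { member => hex digest }).
def gem_member_digests(dir, members: %w[metadata.gz data.tar.gz])
  {
    "SHA256" => members.to_h { |m| [m, Digest::SHA256.file(File.join(dir, m)).hexdigest] },
    "SHA512" => members.to_h { |m| [m, Digest::SHA512.file(File.join(dir, m)).hexdigest] }
  }
end
```

Comparing the returned values with the `+` lines above would confirm that a locally downloaded 0.1.9 package matches what the registry published.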

data/CHANGELOG.md ADDED

@@ -0,0 +1,201 @@
+# Changelog
+All notable changes to the Ragdoll Core project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [Unreleased]
+*Note: These features will be included in the next release (likely v0.1.9) featuring comprehensive search tracking and analytics capabilities.*
+### Added
+- **Initial CHANGELOG**: Added comprehensive CHANGELOG.md following Keep a Changelog format
+  - Complete version history from git log analysis
+  - Feature status tracking (implemented vs planned)
+  - Migration guides and breaking changes documentation
+  - Structured release notes with proper categorization
+- **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
+  - Automatic search recording with vector embeddings for similarity analysis
+  - Click-through rate tracking and user engagement monitoring
+  - Session and user behavior tracking capabilities
+  - Performance metrics including execution time and result quality analysis
+  - Search similarity analysis using vector embeddings
+  - Automatic cleanup of orphaned and unused searches
+- **Enhanced README**: Updated documentation with search tracking examples and analytics usage
+  - Comprehensive search analytics examples and usage patterns
+  - Updated API examples to use proper top-level Ragdoll methods
+  - Added search tracking configuration and usage examples
+- **API Method Consistency**: Added `hybrid_search` delegation to top-level Ragdoll namespace
+  - Complete documentation with examples and parameter descriptions
+  - Consistent API experience across all search methods
+  - Verified method availability at both Ragdoll and Ragdoll::Core levels
+### Fixed
+- **Model Resolution Warning**: Fixed "undefined method 'empty?' for an instance of Ragdoll::Core::Model" warning
+  - Added defensive `empty?` method to Model class
+  - Enhanced constructor to handle polymorphic Model objects
+  - Added nil/empty checks in embedding service
+### Changed
+- **Test Coverage**: Added coverage directory to .gitignore for cleaner repository state
+### Technical Details
+- Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
+- All changes maintain backward compatibility
+- No breaking API changes
+## [0.1.8] - 2025-01-04
+### Added
+- **Search Analytics Foundation**: Added `Ragdoll::Search` model with query embedding and result tracking capabilities
+- **Embedding Service Enhancements**: Fallback mechanism for model resolution in embedding service
+- **Test Coverage**: Added coverage directory to gitignore and improved test infrastructure
+### Changed
+- Updated Gemfile.lock with latest gem versions
+- Enhanced runtime dependencies and version management
+### Fixed
+- Package directory exclusion in gitignore
+## [0.1.7] - 2025-01-04
+### Added
+- **Multi-Modal Content Models**: Added AudioContent model for comprehensive audio processing support
+- **Background Job Processing**: New Ragdoll job classes for asynchronous document processing
+- **Metadata Schemas**: Structured metadata schemas for text and image documents with validation
+### Changed
+- Updated ragdoll gem dependencies
+- Improved submodule management for documentation
+## [0.1.6] - 2025-01-04
+### Added
+- **Documentation Restructure**: Replaced local docs with ragdoll-docs submodule
+- **Conventional Commits**: Updated and restructured Conventional Commits specification
+- **CI/CD Improvements**: Enhanced GitHub Actions workflow and dropped JRuby support for RMagick compatibility
+### Fixed
+- Test skipping logic for CI environments
+- Automated release workflow adjustments
+## [0.1.5] - 2025-01-04
+### Added
+- Enhanced document processing pipeline
+- Improved error handling and logging
+### Fixed
+- Version management and release process refinements
+## [0.1.4] - 2025-01-04
+### Added
+- Extended multi-modal architecture support
+- Performance optimizations for large document processing
+### Changed
+- Refined version numbering and release process
+## [0.1.3] - 2025-01-04
+### Added
+- **Core RAG Architecture**: Multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord
+- **PostgreSQL + pgvector Integration**: High-performance semantic search with vector similarity
+- **Polymorphic Content Architecture**: Unified handling of text, image, and audio content types
+- **Dual Metadata Design**: Separation of LLM-generated content analysis and system file properties
+- **Document Processing Pipeline**: Support for PDF, DOCX, HTML, Markdown, and plain text files
+- **Embedding Generation**: Text chunking and vector embedding creation with multiple LLM provider support
+- **Semantic Search**: Cosine similarity search with usage analytics
+- **Background Processing**: ActiveJob integration for asynchronous document processing
+- **Logging System**: Configurable file-based logging with multiple levels
+### Technical Features
+- **Database Schema**: Multi-modal polymorphic architecture optimized for PostgreSQL
+- **IVFFlat Indexing**: Fast approximate nearest neighbor search for vector similarity
+- **Connection Pooling**: High-concurrency support for production workloads
+- **Configuration Management**: Comprehensive configuration system for LLM providers and processing settings
+## [0.1.1] - 2024-12-XX
+### Added
+- Initial project structure and basic functionality
+- Core document management capabilities
+- Basic search and retrieval features
+## [0.0.2] - 2024-12-XX
+### Added
+- Initial alpha release
+- Basic RAG architecture foundation
+- PostgreSQL database integration
+---
+## Feature Status
+### ✅ Fully Implemented
+- **Text Document Processing**: PDF, DOCX, HTML, Markdown, plain text files
+- **Embedding Generation**: Text chunking and vector embedding creation
+- **Database Schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
+- **Dual Metadata Architecture**: Separate LLM-generated content analysis and file properties
+- **Search Functionality**: Semantic search with cosine similarity and usage analytics
+- **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
+- **Document Management**: Add, update, delete, list operations
+- **Background Processing**: ActiveJob integration for async embedding generation
+- **LLM Metadata Generation**: AI-powered structured content analysis with schema validation
+- **Logging**: Configurable file-based logging with multiple levels
+### 🚧 In Development
+- **Image Processing**: Framework exists but vision AI integration needs completion
+- **Audio Processing**: Framework exists but speech-to-text integration needs completion
+- **Hybrid Search**: Combining semantic and full-text search capabilities
+### 📋 Planned Features
+- **Multi-modal Search**: Search across text, image, and audio content types
+- **Content-type Specific Embedding Models**: Different models for text, image, audio
+- **Enhanced Metadata Schemas**: Domain-specific metadata templates
+---
+## Migration Guide
+### From 0.1.7 to 0.1.8
+- New search tracking tables will be automatically created via migrations
+- No breaking changes to existing API
+- Search tracking is enabled by default but can be disabled per search
+### From 0.1.6 to 0.1.7
+- AudioContent model added - existing installations will auto-migrate
+- New background job classes available for improved processing
+- Metadata schemas provide enhanced validation
+### From 0.1.5 to 0.1.6
+- Documentation moved to submodule - update local references
+- CI/CD improvements may affect development workflows
+- JRuby support removed due to RMagick dependency
+---
+## Breaking Changes
+### Version 0.1.6
+- **JRuby Support Removed**: RMagick dependency incompatibility
+- **Documentation Structure**: Local docs replaced with submodule
+---
+## Contributors
+- **Dewayne VanHoozer** - Primary developer and maintainer
+---
+## License
+This project is licensed under the MIT License - see the LICENSE file for details.
+---
+*This changelog is automatically maintained and reflects the actual implementation status of features.*

data/README.md CHANGED

@@ -18,17 +18,63 @@
   </table>
 </div>
-# Ragdoll::Core
+# Ragdoll
 Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.
+## Overview
+Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
+The library emphasizes a dual-metadata design: LLM-derived semantic metadata for understanding content, and system file metadata for managing assets. With built-in analytics, background processing, and a high-level API, you can go from ingest to answer quickly—and scale confidently.
+### Why Ragdoll?
+- Database-first foundation on ActiveRecord (PostgreSQL + pgvector only) for performance and reliability
+- Multi-modal architecture (text today; image/audio next) via polymorphic content design
+- Dual metadata model separating semantic analysis from file properties
+- Provider-agnostic LLM integration via `ruby_llm` (OpenAI, Anthropic, Google)
+- Production-friendly: background jobs, connection pooling, indexing, and search analytics
+- Simple, ergonomic high-level API to keep your application code clean
+### Key Capabilities
+- Semantic search with vector similarity (cosine) across polymorphic content
+- Text ingestion, chunking, and embedding generation
+- LLM-powered structured metadata with schema validation
+- Search tracking and analytics (CTR, performance, similarity of queries)
+- Hybrid search (semantic + full-text) planned
+- Extensible model and configuration system
+## Table of Contents
+- [Quick Start](#quick-start)
+- [API Overview](#api-overview)
+- [Search and Retrieval](#search-and-retrieval)
+- [Search Analytics and Tracking](#search-analytics-and-tracking)
+- [System Operations](#system-operations)
+- [Configuration](#configuration)
+- [Current Implementation Status](#current-implementation-status)
+- [Architecture Highlights](#architecture-highlights)
+- [Text Document Processing](#text-document-processing-current)
+- [PostgreSQL + pgvector Configuration](#postgresql--pgvector-configuration)
+- [Performance Features](#performance-features)
+- [Installation](#installation)
+- [Requirements](#requirements)
+- [Use Cases](#use-cases)
+- [Environment Variables](#environment-variables)
+- [Troubleshooting](#troubleshooting)
+- [Related Projects](#related-projects)
+- [Key Design Principles](#key-design-principles)
+- [Contributing & Support](#contributing--support)
 ## Quick Start
 ```ruby
 require 'ragdoll'
 # Configure with PostgreSQL + pgvector
-Ragdoll::Core.configure do |config|
+Ragdoll.configure do |config|
   # Database configuration (PostgreSQL only)
   config.database_config = {
     adapter: 'postgresql',
@@ -55,22 +101,22 @@ Ragdoll::Core.configure do |config|
 end
 # Add documents - returns detailed result
-result = Ragdoll::Core.add_document(path: 'research_paper.pdf')
+result = Ragdoll.add_document(path: 'research_paper.pdf')
 puts result[:message]  # "Document 'research_paper' added successfully with ID 123"
 doc_id = result[:document_id]
 # Check document status
-status = Ragdoll::Core.document_status(id: doc_id)
+status = Ragdoll.document_status(id: doc_id)
 puts status[:message]  # Shows processing status and embeddings count
 # Search across content
-results = Ragdoll::Core.search(query: 'neural networks')
+results = Ragdoll.search(query: 'neural networks')
 # Get detailed document information
-document = Ragdoll::Core.get_document(id: doc_id)
+document = Ragdoll.get_document(id: doc_id)
 ```
-## High-Level API
+## API Overview
 The `Ragdoll` module provides a convenient high-level API for common operations:
@@ -78,37 +124,37 @@ The `Ragdoll` module provides a convenient high-level API for common operations:
 ```ruby
 # Add single document - returns detailed result hash
-result = Ragdoll::Core.add_document(path: 'document.pdf')
+result = Ragdoll.add_document(path: 'document.pdf')
 puts result[:success]         # true
 puts result[:document_id]     # "123"
 puts result[:message]         # "Document 'document' added successfully with ID 123"
 puts result[:embeddings_queued] # true
 # Check document processing status
-status = Ragdoll::Core.document_status(id: result[:document_id])
+status = Ragdoll.document_status(id: result[:document_id])
 puts status[:status]          # "processed"
 puts status[:embeddings_count] # 15
 puts status[:embeddings_ready] # true
 puts status[:message]         # "Document processed successfully with 15 embeddings"
 # Get detailed document information
-document = Ragdoll::Core.get_document(id: result[:document_id])
+document = Ragdoll.get_document(id: result[:document_id])
 puts document[:title]         # "document"
 puts document[:status]        # "processed"
 puts document[:embeddings_count] # 15
 puts document[:content_length]   # 5000
 # Update document metadata
-Ragdoll::Core.update_document(id: result[:document_id], title: 'New Title')
+Ragdoll.update_document(id: result[:document_id], title: 'New Title')
 # Delete document
-Ragdoll::Core.delete_document(id: result[:document_id])
+Ragdoll.delete_document(id: result[:document_id])
 # List all documents
-documents = Ragdoll::Core.list_documents(limit: 10)
+documents = Ragdoll.list_documents(limit: 10)
 # System statistics
-stats = Ragdoll::Core.stats
+stats = Ragdoll.stats
 puts stats[:total_documents]  # 50
 puts stats[:total_embeddings] # 1250
 ```
@@ -117,15 +163,22 @@ puts stats[:total_embeddings] # 1250
 ```ruby
 # Semantic search across all content types
-results = Ragdoll::Core.search(query: 'artificial intelligence')
+results = Ragdoll.search(query: 'artificial intelligence')
+# Search with automatic tracking (default)
+results = Ragdoll.search(
+  query: 'machine learning',
+  session_id: 123,  # Optional: track user sessions
+  user_id:    456   # Optional: track by user
+)
 # Search specific content types
-text_results = Ragdoll::Core.search(query: 'machine learning', content_type: 'text')
-image_results = Ragdoll::Core.search(query: 'neural network diagram', content_type: 'image')
-audio_results = Ragdoll::Core.search(query: 'AI discussion', content_type: 'audio')
+text_results = Ragdoll.search(query: 'machine learning', content_type: 'text')
+image_results = Ragdoll.search(query: 'neural network diagram', content_type: 'image')
+audio_results = Ragdoll.search(query: 'AI discussion', content_type: 'audio')
 # Advanced search with metadata filters
-results = Ragdoll::Core.search(
+results = Ragdoll.search(
   query: 'deep learning',
   classification: 'research',
   keywords: ['AI', 'neural networks'],
@@ -133,44 +186,77 @@ results = Ragdoll::Core.search(
 )
 # Get context for RAG applications
-context = Ragdoll::Core.get_context(query: 'machine learning', limit: 5)
+context = Ragdoll.get_context(query: 'machine learning', limit: 5)
 # Enhanced prompt with context
-enhanced = Ragdoll::Core.enhance_prompt(
+enhanced = Ragdoll.enhance_prompt(
   prompt: 'What is machine learning?',
   context_limit: 5
 )
 # Hybrid search combining semantic and full-text
-results = Ragdoll::Core.hybrid_search(
+results = Ragdoll.hybrid_search(
   query: 'neural networks',
   semantic_weight: 0.7,
   text_weight: 0.3
 )
 ```
+### Search Analytics and Tracking
+Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
+```ruby
+# Get search analytics for the last 30 days
+analytics = Ragdoll::Search.search_analytics(days: 30)
+puts "Total searches: #{analytics[:total_searches]}"
+puts "Unique queries: #{analytics[:unique_queries]}"
+puts "Average execution time: #{analytics[:avg_execution_time]}ms"
+puts "Click-through rate: #{analytics[:click_through_rate]}%"
+# Find similar searches using vector similarity
+search = Ragdoll::Search.first
+similar_searches = search.nearest_neighbors(:query_embedding, distance: :cosine).limit(5)
+similar_searches.each do |similar|
+  puts "Query: #{similar.query}"
+  puts "Similarity: #{similar.neighbor_distance}"
+  puts "Results: #{similar.results_count}"
+end
+# Track user interactions (clicks on search results)
+search_result = Ragdoll::SearchResult.first
+search_result.mark_as_clicked!
+# Disable tracking for specific searches if needed
+results = Ragdoll.search(
+  query: 'private query',
+  track_search: false
+)
+```
 ### System Operations
 ```ruby
 # Get system statistics
-stats = Ragdoll::Core.stats
+stats = Ragdoll.stats
 # Returns information about documents, content types, embeddings, etc.
 # Health check
-healthy = Ragdoll::Core.healthy?
+healthy = Ragdoll.healthy?
 # Get configuration
-config = Ragdoll::Core.configuration
+config = Ragdoll.configuration
 # Reset configuration (useful for testing)
-Ragdoll::Core.reset_configuration!
+Ragdoll.reset_configuration!
 ```
 ### Configuration
 ```ruby
 # Configure the system
-Ragdoll::Core.configure do |config|
+Ragdoll.configure do |config|
   # Database configuration (PostgreSQL only - REQUIRED)
   config.database_config = {
     adapter: 'postgresql',
@@ -218,6 +304,7 @@ end
 - **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
 - **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
 - **Search functionality**: Semantic search with cosine similarity and usage analytics
+- **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
 - **Document management**: Add, update, delete, list operations
 - **Background processing**: ActiveJob integration for async embedding generation
 - **LLM metadata generation**: AI-powered structured content analysis with schema validation
@@ -264,15 +351,16 @@ Currently, Ragdoll processes text documents through:
 6. **Search**: Semantic search using cosine similarity with usage analytics
 ### Example Usage
 ```ruby
 # Add a text document
-result = Ragdoll::Core.add_document(path: 'document.pdf')
+result = Ragdoll.add_document(path: 'document.pdf')
 # Check processing status
-status = Ragdoll::Core.document_status(id: result[:document_id])
+status = Ragdoll.document_status(id: result[:document_id])
 # Search the content
-results = Ragdoll::Core.search(query: 'machine learning')
+results = Ragdoll.search(query: 'machine learning')
 ```
 ## PostgreSQL + pgvector Configuration
@@ -293,7 +381,7 @@ psql -d ragdoll_production -c "CREATE EXTENSION IF NOT EXISTS vector;"
 ### Configuration Example
 ```ruby
-Ragdoll::Core.configure do |config|
+Ragdoll.configure do |config|
   config.database_config = {
     adapter: 'postgresql',
     database: 'ragdoll_production',
@@ -337,11 +425,52 @@ gem 'ragdoll'
 - **PostgreSQL**: 12+ with pgvector extension (REQUIRED - no other databases supported)
 - **Dependencies**: activerecord, pg, pgvector, neighbor, ruby_llm, pdf-reader, docx, rubyzip, shrine, rmagick, opensearch-ruby, searchkick, ruby-progressbar
+## Use Cases
+- Internal knowledge bases and chat assistants grounded in your documents
+- Product documentation and support search with analytics and relevance feedback
+- Research corpora exploration (summaries, topics, similarity) across large text sets
+- Incident retrospectives and operational analytics with searchable write-ups
+- Media libraries preparing for text + image + audio pipelines (image/audio in progress)
+## Environment Variables
+Set the following as environment variables (do not commit secrets to source control):
+- `OPENAI_API_KEY` — required for OpenAI models
+- `OPENAI_ORGANIZATION` — optional, for OpenAI org scoping
+- `OPENAI_PROJECT` — optional, for OpenAI project scoping
+- `ANTHROPIC_API_KEY` — optional, for Anthropic models
+- `GOOGLE_API_KEY` — optional, for Google models
+- `DATABASE_PASSWORD` — your PostgreSQL password if not using peer auth
+## Troubleshooting
+### pgvector extension missing
+- Ensure the extension is enabled in your database:
+```bash
+psql -d ragdoll_production -c "CREATE EXTENSION IF NOT EXISTS vector;"
+```
+- If the command fails, verify PostgreSQL and pgvector are installed and that you’re connecting to the correct database.
+### Document stuck in "processing"
+- Confirm your API keys are set and valid.
+- Ensure `auto_migrate: true` in configuration (or run migrations if you manage schema yourself).
+- Check logs at the path configured by `logging_config[:log_filepath]` for errors.
 ## Related Projects
 - **ragdoll-cli**: Standalone CLI application using ragdoll
 - **ragdoll-rails**: Rails engine with web interface for ragdoll
+## Contributing & Support
+Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request. For questions and feedback, open an issue in this repository.
 ## Key Design Principles
 1. **Database-Oriented**: Built on ActiveRecord with PostgreSQL + pgvector for production performance
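
The environment variables added to the README in this diff can be checked before `Ragdoll.configure` runs. A minimal sketch, assuming only `OPENAI_API_KEY` is treated as mandatory (the README calls it "required for OpenAI models") and the others are optional. `missing_env_keys` and `check_env!` are hypothetical helpers, not part of ragdoll:

```ruby
# Variable names come from the README diff; the helpers are illustrative only.
REQUIRED_ENV = %w[OPENAI_API_KEY].freeze
OPTIONAL_ENV = %w[
  OPENAI_ORGANIZATION OPENAI_PROJECT ANTHROPIC_API_KEY GOOGLE_API_KEY DATABASE_PASSWORD
].freeze

# Returns the names in `keys` that are unset or blank in `env`.
def missing_env_keys(keys, env = ENV)
  keys.select { |k| env[k].to_s.strip.empty? }
end

# Raises if a required variable is missing; otherwise returns the optional
# variables that are unset, so the caller can log them.
def check_env!(env = ENV)
  missing = missing_env_keys(REQUIRED_ENV, env)
  raise KeyError, "Missing required environment variables: #{missing.join(', ')}" unless missing.empty?

  missing_env_keys(OPTIONAL_ENV, env)
end
```

Calling `check_env!` at boot turns a missing key into an immediate error, which may be easier to diagnose than a document that stays in "processing".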

data/Rakefile CHANGED

@@ -1,8 +1,5 @@
 # frozen_string_literal: true
-require "simplecov"
-SimpleCov.start
 # Suppress bundler/rubygems warnings
 $VERBOSE = nil

data/app/models/ragdoll/embedding.rb CHANGED

@@ -11,6 +11,8 @@ module Ragdoll
     has_neighbors :embedding_vector
     belongs_to :embeddable, polymorphic: true
+    has_many :search_results, class_name: "Ragdoll::SearchResult", dependent: :destroy
+    has_many :searches, through: :search_results
     validates :embeddable_id,    presence: true
     validates :embeddable_type,  presence: true
@@ -72,6 +74,24 @@ module Ragdoll
       search_with_pgvector(query_embedding, scope, limit, threshold)
     end
+    # Enhanced search that returns both results and similarity statistics
+    def self.search_similar_with_stats(query_embedding, limit: 20, threshold: 0.8, filters: {})
+      # Apply filters
+      scope = all
+      scope = scope.where(embeddable_id: filters[:embeddable_id]) if filters[:embeddable_id]
+      scope = scope.where(embeddable_type: filters[:embeddable_type]) if filters[:embeddable_type]
+      scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]
+      # Document-level filters require joining through embeddable (STI Content) to documents
+      if filters[:document_type]
+        scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
+                     .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
+                     .where("ragdoll_documents.document_type = ?", filters[:document_type])
+      end
+      search_with_pgvector_stats(query_embedding, scope, limit, threshold)
+    end
     # Fast search using pgvector with neighbor gem
     def self.search_with_pgvector(query_embedding, scope, limit, threshold)
       # Use pgvector for similarity search
@@ -103,6 +123,60 @@ module Ragdoll
       results
     end
+    # Enhanced search with statistics
+    def self.search_with_pgvector_stats(query_embedding, scope, limit, threshold)
+      # Use pgvector for similarity search - get more results to analyze
+      # Note: We convert to array immediately to avoid SQL conflicts with count operations
+      neighbor_results = scope
+                         .includes(:embeddable)
+                         .nearest_neighbors(:embedding_vector, query_embedding, distance: "cosine")
+                         .limit([limit * 3, 50].max) # Get enough for statistics
+                         .to_a # Convert to array to avoid SQL conflicts
+      results = []
+      all_similarities = []
+      highest_similarity = 0.0
+      lowest_similarity = 1.0
+      total_checked = neighbor_results.length
+      neighbor_results.each do |embedding|
+        # Calculate cosine similarity (neighbor returns distance, we want similarity)
+        similarity = 1.0 - embedding.neighbor_distance
+        all_similarities << similarity
+        highest_similarity = similarity if similarity > highest_similarity
+        lowest_similarity = similarity if similarity < lowest_similarity
+        next if similarity < threshold
+        usage_score = calculate_usage_score(embedding)
+        combined_score = similarity + usage_score
+        results << build_result_hash(embedding, query_embedding, similarity, highest_similarity,
+                                     usage_score, combined_score)
+      end
+      # Sort by combined score and limit
+      results = results.sort_by { |r| -r[:combined_score] }.take(limit)
+      mark_embeddings_as_used(results)
+      # Calculate statistics
+      stats = {
+        total_embeddings_checked: total_checked,
+        threshold_used: threshold,
+        highest_similarity: highest_similarity,
+        lowest_similarity: lowest_similarity,
+        average_similarity: all_similarities.empty? ? 0.0 : (all_similarities.sum / all_similarities.length),
+        similarities_above_threshold: all_similarities.count { |s| s >= threshold },
+        total_similarities_calculated: all_similarities.length
+      }
+      {
+        results: results,
+        statistics: stats
+      }
+    end
     private
     # Calculate usage score for ranking