ragdoll 0.1.8 → 0.1.10

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 7fb2f70ebe6d95bfcfca1ba44e84f140f1d75d17e27ead66ce9b7643f3571688
4
- data.tar.gz: 61e3ccb7dc45bb6196e70770d4eaed9cae17602a9b442e5f525752c1e4a53445
3
+ metadata.gz: 4f7b2c95ede1523e9e01af70394217387d876da6317fed651df3e27cf337cfe9
4
+ data.tar.gz: a82ae7d541fd06876acb3acaf8f02639234f8b118274621851678a2799c5f559
5
5
  SHA512:
6
- metadata.gz: 318e00ff0df2e4b075b9379ffc4a13de4700c4fa6c2c544be8678b700e4810d7cc80479eed3f709e6f25891a394741a8dccfc8e1fed6017d31607946c9267549
7
- data.tar.gz: a8261e8a3f2740599564f4dd3b2c31914903339035664c01bfdea4800227858f071d25675ffd17c419b59d47baf8c0eb91313600355ac86bfc8d21eaf5e34add
6
+ metadata.gz: ba14828a6e743677c84072b9f1bb27743e429531ebdd9fbd3d8553add7bbdad070d709cd617dc620fef4ddc6846085ca79d3bb6d32bae8465c6b3b10acc0692f
7
+ data.tar.gz: de630ebf15168b562ef686ec6cd9f1cfe532b5bbf495e33a74085b567cf53ce7bb87e7c5c543756c47bd68c98290221b879a1b4d8e5888aac4916d1c1554fe99
data/CHANGELOG.md ADDED
@@ -0,0 +1,243 @@
1
+ # Changelog
2
+
3
+ All notable changes to the Ragdoll Core project will be documented in this file.
4
+
5
+ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
6
+
7
+ ## [Unreleased]
8
+
9
+ ## [0.1.10] - 2025-01-15
10
+
11
+ ### Changed
12
+ - Continued improvements to search performance and accuracy
13
+
14
+ ### Added
15
+ - **Hybrid Search**: Complete implementation combining semantic and full-text search capabilities
16
+ - Configurable weights for semantic vs text search (default: 70% semantic, 30% text)
17
+ - Deduplication of results by document ID
18
+ - Combined scoring system for unified result ranking
19
+ - **Full-text Search**: PostgreSQL full-text search with tsvector indexing
20
+ - Per-word match ratio scoring (0.0 to 1.0)
21
+ - GIN index for high-performance text search
22
+ - Search across title, summary, keywords, and description fields
23
+ - **Enhanced Search API**: Complete search type delegation at top-level Ragdoll namespace
24
+ - `Ragdoll.hybrid_search` method for combined semantic and text search
25
+ - `Ragdoll::Document.search_content` for full-text search capabilities
26
+ - Consistent parameter handling across all search methods
27
+
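The per-word match ratio listed above can be illustrated in plain Ruby. This is a minimal sketch, assuming the ratio is the fraction of distinct query words that appear in the searched text. The gem computes its score inside PostgreSQL, so the helper name `word_match_ratio` and the tokenization used here are hypothetical.

```ruby
# Hypothetical helper: fraction of distinct query words found in the text.
# Illustrates a 0.0..1.0 per-word match ratio; not the gem's SQL implementation.
def word_match_ratio(query, text)
  query_words = query.downcase.scan(/\w+/).uniq
  return 0.0 if query_words.empty?

  text_words = text.downcase.scan(/\w+/).uniq
  matched = query_words.count { |word| text_words.include?(word) }
  matched.to_f / query_words.size
end

# One of the two query words ("neural") appears in the text.
puts word_match_ratio('neural networks', 'An introduction to neural models') # prints 0.5
```

A ratio of 1.0 means every query word was found, and 0.0 means none were.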
28
+ ### Changed
29
+ - **Search Architecture**: Unified search interface supporting semantic, fulltext, and hybrid modes
30
+ - **Database Schema**: Added search_vector column with GIN indexing for full-text search performance
31
+
32
+ ### Technical Details
33
+ - Full-text search uses PostgreSQL's built-in tsvector capabilities
34
+ - Hybrid search combines cosine similarity (semantic) with text match ratios
35
+ - Results are ranked by weighted combined scores
36
+ - All search methods maintain backward compatibility
37
+
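The weighted ranking described in these technical details might look roughly like the sketch below. It assumes each search returns a score between 0.0 and 1.0 per document ID and that a document missing from one search contributes 0.0 for that side. The method name `hybrid_rank` and the hash-based inputs are hypothetical; only the 0.7/0.3 default weights and the deduplication by document ID come from the changelog.

```ruby
# Hypothetical sketch of weighted hybrid ranking; names are illustrative.
# Each input maps a document ID to a score in 0.0..1.0.
def hybrid_rank(semantic_scores, text_scores, semantic_weight: 0.7, text_weight: 0.3)
  # The union of IDs deduplicates documents found by both searches.
  doc_ids = semantic_scores.keys | text_scores.keys

  combined = doc_ids.map do |id|
    score = semantic_weight * semantic_scores.fetch(id, 0.0) +
            text_weight * text_scores.fetch(id, 0.0)
    { document_id: id, combined_score: score.round(4) }
  end

  # Highest combined score first.
  combined.sort_by { |result| -result[:combined_score] }
end

semantic = { 1 => 0.9, 2 => 0.4 } # cosine similarity per document
text     = { 2 => 1.0, 3 => 0.5 } # text match ratio per document
hybrid_rank(semantic, text).each { |result| puts result.inspect }
```

Document 2 appears in both inputs but only once in the output, with both scores contributing to its combined score.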
38
+ ## [0.1.9] - 2025-01-10
39
+
40
+ ### Added
41
+ - **Initial CHANGELOG**: Added comprehensive CHANGELOG.md following Keep a Changelog format
42
+ - Complete version history from git log analysis
43
+ - Feature status tracking (implemented vs planned)
44
+ - Migration guides and breaking changes documentation
45
+ - Structured release notes with proper categorization
46
+ - **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
47
+ - Automatic search recording with vector embeddings for similarity analysis
48
+ - Click-through rate tracking and user engagement monitoring
49
+ - Session and user behavior tracking capabilities
50
+ - Performance metrics including execution time and result quality analysis
51
+ - Search similarity analysis using vector embeddings
52
+ - Automatic cleanup of orphaned and unused searches
53
+ - **Enhanced README**: Updated documentation with search tracking examples and analytics usage
54
+ - Comprehensive search analytics examples and usage patterns
55
+ - Updated API examples to use proper top-level Ragdoll methods
56
+ - Added search tracking configuration and usage examples
57
+ - **API Method Consistency**: Added `hybrid_search` delegation to top-level Ragdoll namespace
58
+ - Complete documentation with examples and parameter descriptions
59
+ - Consistent API experience across all search methods
60
+ - Verified method availability at both Ragdoll and Ragdoll::Core levels
61
+
62
+ ### Fixed
63
+ - **Model Resolution Warning**: Fixed "undefined method 'empty?' for an instance of Ragdoll::Core::Model" warning
64
+ - Added defensive `empty?` method to Model class
65
+ - Enhanced constructor to handle polymorphic Model objects
66
+ - Added nil/empty checks in embedding service
67
+
68
+ ### Changed
69
+ - **Test Coverage**: Added coverage directory to .gitignore for cleaner repository state
70
+
71
+ ### Technical Details
72
+ - Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
73
+ - All changes maintain backward compatibility
74
+ - No breaking API changes
75
+
76
+ ## [0.1.8] - 2025-01-04
77
+
78
+ ### Added
79
+ - **Search Analytics Foundation**: Added `Ragdoll::Search` model with query embedding and result tracking capabilities
80
+ - **Embedding Service Enhancements**: Fallback mechanism for model resolution in embedding service
81
+ - **Test Coverage**: Added coverage directory to gitignore and improved test infrastructure
82
+
83
+ ### Changed
84
+ - Updated Gemfile.lock with latest gem versions
85
+ - Enhanced runtime dependencies and version management
86
+
87
+ ### Fixed
88
+ - Package directory exclusion in gitignore
89
+
90
+ ## [0.1.7] - 2025-01-04
91
+
92
+ ### Added
93
+ - **Multi-Modal Content Models**: Added AudioContent model for comprehensive audio processing support
94
+ - **Background Job Processing**: New Ragdoll job classes for asynchronous document processing
95
+ - **Metadata Schemas**: Structured metadata schemas for text and image documents with validation
96
+
97
+ ### Changed
98
+ - Updated ragdoll gem dependencies
99
+ - Improved submodule management for documentation
100
+
101
+ ## [0.1.6] - 2025-01-04
102
+
103
+ ### Added
104
+ - **Documentation Restructure**: Replaced local docs with ragdoll-docs submodule
105
+ - **Conventional Commits**: Updated and restructured Conventional Commits specification
106
+ - **CI/CD Improvements**: Enhanced GitHub Actions workflow and dropped JRuby support for RMagick compatibility
107
+
108
+ ### Fixed
109
+ - Test skipping logic for CI environments
110
+ - Automated release workflow adjustments
111
+
112
+ ## [0.1.5] - 2025-01-04
113
+
114
+ ### Added
115
+ - Enhanced document processing pipeline
116
+ - Improved error handling and logging
117
+
118
+ ### Fixed
119
+ - Version management and release process refinements
120
+
121
+ ## [0.1.4] - 2025-01-04
122
+
123
+ ### Added
124
+ - Extended multi-modal architecture support
125
+ - Performance optimizations for large document processing
126
+
127
+ ### Changed
128
+ - Refined version numbering and release process
129
+
130
+ ## [0.1.3] - 2025-01-04
131
+
132
+ ### Added
133
+ - **Core RAG Architecture**: Multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord
134
+ - **PostgreSQL + pgvector Integration**: High-performance semantic search with vector similarity
135
+ - **Polymorphic Content Architecture**: Unified handling of text, image, and audio content types
136
+ - **Dual Metadata Design**: Separation of LLM-generated content analysis and system file properties
137
+ - **Document Processing Pipeline**: Support for PDF, DOCX, HTML, Markdown, and plain text files
138
+ - **Embedding Generation**: Text chunking and vector embedding creation with multiple LLM provider support
139
+ - **Semantic Search**: Cosine similarity search with usage analytics
140
+ - **Background Processing**: ActiveJob integration for asynchronous document processing
141
+ - **Logging System**: Configurable file-based logging with multiple levels
142
+
143
+ ### Technical Features
144
+ - **Database Schema**: Multi-modal polymorphic architecture optimized for PostgreSQL
145
+ - **IVFFlat Indexing**: Fast approximate nearest neighbor search for vector similarity
146
+ - **Connection Pooling**: High-concurrency support for production workloads
147
+ - **Configuration Management**: Comprehensive configuration system for LLM providers and processing settings
148
+
149
+ ## [0.1.1] - 2024-12-XX
150
+
151
+ ### Added
152
+ - Initial project structure and basic functionality
153
+ - Core document management capabilities
154
+ - Basic search and retrieval features
155
+
156
+ ## [0.0.2] - 2024-12-XX
157
+
158
+ ### Added
159
+ - Initial alpha release
160
+ - Basic RAG architecture foundation
161
+ - PostgreSQL database integration
162
+
163
+ ---
164
+
165
+ ## Feature Status
166
+
167
+ ### ✅ Fully Implemented
168
+ - **Text Document Processing**: PDF, DOCX, HTML, Markdown, plain text files
169
+ - **Embedding Generation**: Text chunking and vector embedding creation
170
+ - **Database Schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
171
+ - **Dual Metadata Architecture**: Separate LLM-generated content analysis and file properties
172
+ - **Search Functionality**: Semantic search with cosine similarity and usage analytics
173
+ - **Hybrid Search**: Complete implementation combining semantic and full-text search with configurable weights
174
+ - **Full-text Search**: PostgreSQL tsvector-based text search with GIN indexing
175
+ - **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
176
+ - **Document Management**: Add, update, delete, list operations
177
+ - **Background Processing**: ActiveJob integration for async embedding generation
178
+ - **LLM Metadata Generation**: AI-powered structured content analysis with schema validation
179
+ - **Logging**: Configurable file-based logging with multiple levels
180
+
181
+ ### 🚧 In Development
182
+ - **Image Processing**: Framework exists but vision AI integration needs completion
183
+ - **Audio Processing**: Framework exists but speech-to-text integration needs completion
184
+
185
+ ### 📋 Planned Features
186
+ - **Multi-modal Search**: Search across text, image, and audio content types
187
+ - **Content-type Specific Embedding Models**: Different models for text, image, audio
188
+ - **Enhanced Metadata Schemas**: Domain-specific metadata templates
189
+
190
+ ---
191
+
192
+ ## Migration Guide
193
+
194
+ ### From 0.1.9 to 0.1.10
195
+ - **New Search Methods**: `Ragdoll.hybrid_search` and `Ragdoll::Document.search_content` methods now available
196
+ - **Database Migration**: New search_vector column added to documents table with GIN index for full-text search
197
+ - **API Enhancement**: All search methods now support unified parameter interface
198
+ - **Backward Compatibility**: Existing `Ragdoll.search` method unchanged, continues to work as before
199
+ - **CLI Integration**: ragdoll-cli now requires ragdoll >= 0.1.10 for hybrid and full-text search support
200
+
201
+ ### From 0.1.8 to 0.1.9
202
+ - **CHANGELOG Addition**: Comprehensive changelog and feature tracking added
203
+ - **API Method Consistency**: `hybrid_search` method properly delegated to top-level namespace
204
+ - **No Breaking Changes**: All existing functionality remains compatible
205
+
206
+ ### From 0.1.7 to 0.1.8
207
+ - New search tracking tables will be automatically created via migrations
208
+ - No breaking changes to existing API
209
+ - Search tracking is enabled by default but can be disabled per search
210
+
211
+ ### From 0.1.6 to 0.1.7
212
+ - AudioContent model added - existing installations will auto-migrate
213
+ - New background job classes available for improved processing
214
+ - Metadata schemas provide enhanced validation
215
+
216
+ ### From 0.1.5 to 0.1.6
217
+ - Documentation moved to submodule - update local references
218
+ - CI/CD improvements may affect development workflows
219
+ - JRuby support removed due to RMagick dependency
220
+
221
+ ---
222
+
223
+ ## Breaking Changes
224
+
225
+ ### Version 0.1.6
226
+ - **JRuby Support Removed**: RMagick dependency incompatibility
227
+ - **Documentation Structure**: Local docs replaced with submodule
228
+
229
+ ---
230
+
231
+ ## Contributors
232
+
233
+ - **Dewayne VanHoozer** - Primary developer and maintainer
234
+
235
+ ---
236
+
237
+ ## License
238
+
239
+ This project is licensed under the MIT License - see the LICENSE file for details.
240
+
241
+ ---
242
+
243
+ *This changelog is automatically maintained and reflects the actual implementation status of features.*
data/README.md CHANGED
@@ -18,17 +18,65 @@
18
18
  </table>
19
19
  </div>
20
20
 
21
- # Ragdoll::Core
21
+ # Ragdoll
22
22
 
23
23
  Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.
24
24
 
25
+ RAG does not have to be hard. Every week it's getting simpler. The frontier LLM providers are starting to incorporate RAG services. For example, OpenAI offers a vector search service. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
26
+
27
+ ## Overview
28
+
29
+ Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
30
+
31
+ The library emphasizes a dual-metadata design: LLM-derived semantic metadata for understanding content, and system file metadata for managing assets. With built-in analytics, background processing, and a high-level API, you can go from ingest to answer quickly—and scale confidently.
32
+
33
+ ### Why Ragdoll?
34
+
35
+ - Database-first foundation on ActiveRecord (PostgreSQL + pgvector only) for performance and reliability
36
+ - Multi-modal architecture (text today; image/audio next) via polymorphic content design
37
+ - Dual metadata model separating semantic analysis from file properties
38
+ - Provider-agnostic LLM integration via `ruby_llm` (OpenAI, Anthropic, Google)
39
+ - Production-friendly: background jobs, connection pooling, indexing, and search analytics
40
+ - Simple, ergonomic high-level API to keep your application code clean
41
+
42
+ ### Key Capabilities
43
+
44
+ - Semantic search with vector similarity (cosine) across polymorphic content
45
+ - Text ingestion, chunking, and embedding generation
46
+ - LLM-powered structured metadata with schema validation
47
+ - Search tracking and analytics (CTR, performance, similarity of queries)
48
+ - Hybrid search (semantic + full-text) with configurable weights
49
+ - Extensible model and configuration system
50
+
51
+ ## Table of Contents
52
+
53
+ - [Quick Start](#quick-start)
54
+ - [API Overview](#api-overview)
55
+ - [Search and Retrieval](#search-and-retrieval)
56
+ - [Search Analytics and Tracking](#search-analytics-and-tracking)
57
+ - [System Operations](#system-operations)
58
+ - [Configuration](#configuration)
59
+ - [Current Implementation Status](#current-implementation-status)
60
+ - [Architecture Highlights](#architecture-highlights)
61
+ - [Text Document Processing](#text-document-processing-current)
62
+ - [PostgreSQL + pgvector Configuration](#postgresql--pgvector-configuration)
63
+ - [Performance Features](#performance-features)
64
+ - [Installation](#installation)
65
+ - [Requirements](#requirements)
66
+ - [Use Cases](#use-cases)
67
+ - [Environment Variables](#environment-variables)
68
+ - [Troubleshooting](#troubleshooting)
69
+ - [Related Projects](#related-projects)
70
+ - [Key Design Principles](#key-design-principles)
71
+ - [Contributing & Support](#contributing--support)
72
+
25
73
  ## Quick Start
26
74
 
27
75
  ```ruby
28
76
  require 'ragdoll'
29
77
 
30
78
  # Configure with PostgreSQL + pgvector
31
- Ragdoll::Core.configure do |config|
79
+ Ragdoll.configure do |config|
32
80
  # Database configuration (PostgreSQL only)
33
81
  config.database_config = {
34
82
  adapter: 'postgresql',
@@ -55,22 +103,22 @@ Ragdoll::Core.configure do |config|
55
103
  end
56
104
 
57
105
  # Add documents - returns detailed result
58
- result = Ragdoll::Core.add_document(path: 'research_paper.pdf')
106
+ result = Ragdoll.add_document(path: 'research_paper.pdf')
59
107
  puts result[:message] # "Document 'research_paper' added successfully with ID 123"
60
108
  doc_id = result[:document_id]
61
109
 
62
110
  # Check document status
63
- status = Ragdoll::Core.document_status(id: doc_id)
111
+ status = Ragdoll.document_status(id: doc_id)
64
112
  puts status[:message] # Shows processing status and embeddings count
65
113
 
66
114
  # Search across content
67
- results = Ragdoll::Core.search(query: 'neural networks')
115
+ results = Ragdoll.search(query: 'neural networks')
68
116
 
69
117
  # Get detailed document information
70
- document = Ragdoll::Core.get_document(id: doc_id)
118
+ document = Ragdoll.get_document(id: doc_id)
71
119
  ```
72
120
 
73
- ## High-Level API
121
+ ## API Overview
74
122
 
75
123
  The `Ragdoll` module provides a convenient high-level API for common operations:
76
124
 
@@ -78,37 +126,37 @@ The `Ragdoll` module provides a convenient high-level API for common operations:
78
126
 
79
127
  ```ruby
80
128
  # Add single document - returns detailed result hash
81
- result = Ragdoll::Core.add_document(path: 'document.pdf')
129
+ result = Ragdoll.add_document(path: 'document.pdf')
82
130
  puts result[:success] # true
83
131
  puts result[:document_id] # "123"
84
132
  puts result[:message] # "Document 'document' added successfully with ID 123"
85
133
  puts result[:embeddings_queued] # true
86
134
 
87
135
  # Check document processing status
88
- status = Ragdoll::Core.document_status(id: result[:document_id])
136
+ status = Ragdoll.document_status(id: result[:document_id])
89
137
  puts status[:status] # "processed"
90
138
  puts status[:embeddings_count] # 15
91
139
  puts status[:embeddings_ready] # true
92
140
  puts status[:message] # "Document processed successfully with 15 embeddings"
93
141
 
94
142
  # Get detailed document information
95
- document = Ragdoll::Core.get_document(id: result[:document_id])
143
+ document = Ragdoll.get_document(id: result[:document_id])
96
144
  puts document[:title] # "document"
97
145
  puts document[:status] # "processed"
98
146
  puts document[:embeddings_count] # 15
99
147
  puts document[:content_length] # 5000
100
148
 
101
149
  # Update document metadata
102
- Ragdoll::Core.update_document(id: result[:document_id], title: 'New Title')
150
+ Ragdoll.update_document(id: result[:document_id], title: 'New Title')
103
151
 
104
152
  # Delete document
105
- Ragdoll::Core.delete_document(id: result[:document_id])
153
+ Ragdoll.delete_document(id: result[:document_id])
106
154
 
107
155
  # List all documents
108
- documents = Ragdoll::Core.list_documents(limit: 10)
156
+ documents = Ragdoll.list_documents(limit: 10)
109
157
 
110
158
  # System statistics
111
- stats = Ragdoll::Core.stats
159
+ stats = Ragdoll.stats
112
160
  puts stats[:total_documents] # 50
113
161
  puts stats[:total_embeddings] # 1250
114
162
  ```
@@ -117,15 +165,22 @@ puts stats[:total_embeddings] # 1250
117
165
 
118
166
  ```ruby
119
167
  # Semantic search across all content types
120
- results = Ragdoll::Core.search(query: 'artificial intelligence')
168
+ results = Ragdoll.search(query: 'artificial intelligence')
169
+
170
+ # Search with automatic tracking (default)
171
+ results = Ragdoll.search(
172
+ query: 'machine learning',
173
+ session_id: 123, # Optional: track user sessions
174
+ user_id: 456 # Optional: track by user
175
+ )
121
176
 
122
177
  # Search specific content types
123
- text_results = Ragdoll::Core.search(query: 'machine learning', content_type: 'text')
124
- image_results = Ragdoll::Core.search(query: 'neural network diagram', content_type: 'image')
125
- audio_results = Ragdoll::Core.search(query: 'AI discussion', content_type: 'audio')
178
+ text_results = Ragdoll.search(query: 'machine learning', content_type: 'text')
179
+ image_results = Ragdoll.search(query: 'neural network diagram', content_type: 'image')
180
+ audio_results = Ragdoll.search(query: 'AI discussion', content_type: 'audio')
126
181
 
127
182
  # Advanced search with metadata filters
128
- results = Ragdoll::Core.search(
183
+ results = Ragdoll.search(
129
184
  query: 'deep learning',
130
185
  classification: 'research',
131
186
  keywords: ['AI', 'neural networks'],
@@ -133,44 +188,124 @@ results = Ragdoll::Core.search(
133
188
  )
134
189
 
135
190
  # Get context for RAG applications
136
- context = Ragdoll::Core.get_context(query: 'machine learning', limit: 5)
191
+ context = Ragdoll.get_context(query: 'machine learning', limit: 5)
137
192
 
138
193
  # Enhanced prompt with context
139
- enhanced = Ragdoll::Core.enhance_prompt(
194
+ enhanced = Ragdoll.enhance_prompt(
140
195
  prompt: 'What is machine learning?',
141
196
  context_limit: 5
142
197
  )
143
198
 
144
199
  # Hybrid search combining semantic and full-text
145
- results = Ragdoll::Core.hybrid_search(
200
+ results = Ragdoll.hybrid_search(
146
201
  query: 'neural networks',
147
202
  semantic_weight: 0.7,
148
203
  text_weight: 0.3
149
204
  )
150
205
  ```
151
206
 
207
+ ### Keywords Search
208
+
209
+ Ragdoll supports keyword-based search that can be used standalone or combined with semantic search. The keywords system uses PostgreSQL array operations for performance and supports two matching modes: overlap (a document matches if it has any of the keywords) and contains all (a document matches only if it has every keyword).
210
+
211
+ ```ruby
212
+ # Keywords-only search (overlap - documents containing any of the keywords)
213
+ results = Ragdoll::Document.search_by_keywords(['machine', 'learning', 'ai'])
214
+
215
+ # Results are sorted by match count (documents with more keyword matches rank higher)
216
+ results.each do |doc|
217
+ puts "#{doc.title}: #{doc.keywords_match_count} matches"
218
+ end
219
+
220
+ # Exact keywords search (contains all - documents must have ALL keywords)
221
+ results = Ragdoll::Document.search_by_keywords_all(['ruby', 'programming'])
222
+
223
+ # Results are sorted by focus (fewer total keywords = more focused document)
224
+ results.each do |doc|
225
+ puts "#{doc.title}: #{doc.total_keywords_count} total keywords"
226
+ end
227
+
228
+ # Combined semantic + keywords search for best results
229
+ results = Ragdoll.search(
230
+ query: 'artificial intelligence applications',
231
+ keywords: ['ai', 'machine learning', 'neural networks'],
232
+ limit: 10
233
+ )
234
+
235
+ # Keywords search with options
236
+ results = Ragdoll::Document.search_by_keywords(
237
+ ['web', 'javascript', 'frontend'],
238
+ limit: 20
239
+ )
240
+
241
+ # Case-insensitive keyword matching (automatically normalized)
242
+ results = Ragdoll::Document.search_by_keywords(['Python', 'DATA-SCIENCE', 'ai'])
243
+ # Will match documents with keywords: ['python', 'data-science', 'ai']
244
+ ```
245
+
246
+ **Keywords Search Features:**
247
+ - **High Performance**: Uses PostgreSQL GIN indexes for fast array operations
248
+ - **Flexible Matching**: Supports both overlap (`&&`) and contains (`@>`) operators
249
+ - **Smart Scoring**: Results ordered by match count or document focus
250
+ - **Case Insensitive**: Automatic keyword normalization
251
+ - **Integration Ready**: Works seamlessly with semantic search
252
+ - **Inspired by `find_matching_entries.rb`**: Optimized for PostgreSQL arrays
253
+
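The two matching modes and their orderings can be sketched in memory as follows. This is only an analogue under stated assumptions: the gem performs the matching in PostgreSQL with the `&&` and `@>` array operators, while the helpers `search_any` and `search_all` and the hash-based documents below are hypothetical.

```ruby
# Hypothetical in-memory analogue of the two keyword matching modes.
# Documents are plain hashes with :title and :keywords.

# Lowercase and trim keywords so matching is case-insensitive.
def normalize_keywords(keywords)
  keywords.map { |keyword| keyword.downcase.strip }
end

# Overlap: a document matches if it has ANY keyword; more matches rank first.
def search_any(docs, keywords)
  wanted = normalize_keywords(keywords)
  docs.map { |doc| [doc, (normalize_keywords(doc[:keywords]) & wanted).size] }
      .select { |_doc, count| count.positive? }
      .sort_by { |_doc, count| -count }
      .map(&:first)
end

# Contains all: a document must have EVERY keyword; fewer total keywords rank first.
def search_all(docs, keywords)
  wanted = normalize_keywords(keywords)
  docs.select { |doc| (wanted - normalize_keywords(doc[:keywords])).empty? }
      .sort_by { |doc| doc[:keywords].size }
end

docs = [
  { title: 'Guide', keywords: ['ruby', 'programming', 'web'] },
  { title: 'Note',  keywords: ['Ruby'] },
  { title: 'Other', keywords: ['python'] }
]

puts search_any(docs, ['RUBY', 'web']).map { |doc| doc[:title] }.inspect
puts search_all(docs, ['ruby']).map { |doc| doc[:title] }.inspect
```

Sorting the contains-all results by total keyword count mirrors the "focus" ordering described above: a document with fewer keywords overall is treated as more specific to the query.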
254
+ ### Search Analytics and Tracking
255
+
256
+ Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
257
+
258
+ ```ruby
259
+ # Get search analytics for the last 30 days
260
+ analytics = Ragdoll::Search.search_analytics(days: 30)
261
+ puts "Total searches: #{analytics[:total_searches]}"
262
+ puts "Unique queries: #{analytics[:unique_queries]}"
263
+ puts "Average execution time: #{analytics[:avg_execution_time]}ms"
264
+ puts "Click-through rate: #{analytics[:click_through_rate]}%"
265
+
266
+ # Find similar searches using vector similarity
267
+ search = Ragdoll::Search.first
268
+ similar_searches = search.nearest_neighbors(:query_embedding, distance: :cosine).limit(5)
269
+
270
+ similar_searches.each do |similar|
271
+ puts "Query: #{similar.query}"
272
+ puts "Similarity: #{similar.neighbor_distance}"
273
+ puts "Results: #{similar.results_count}"
274
+ end
275
+
276
+ # Track user interactions (clicks on search results)
277
+ search_result = Ragdoll::SearchResult.first
278
+ search_result.mark_as_clicked!
279
+
280
+ # Disable tracking for specific searches if needed
281
+ results = Ragdoll.search(
282
+ query: 'private query',
283
+ track_search: false
284
+ )
285
+ ```
286
+
152
287
  ### System Operations
153
288
 
154
289
  ```ruby
155
290
  # Get system statistics
156
- stats = Ragdoll::Core.stats
291
+ stats = Ragdoll.stats
157
292
  # Returns information about documents, content types, embeddings, etc.
158
293
 
159
294
  # Health check
160
- healthy = Ragdoll::Core.healthy?
295
+ healthy = Ragdoll.healthy?
161
296
 
162
297
  # Get configuration
163
- config = Ragdoll::Core.configuration
298
+ config = Ragdoll.configuration
164
299
 
165
300
  # Reset configuration (useful for testing)
166
- Ragdoll::Core.reset_configuration!
301
+ Ragdoll.reset_configuration!
167
302
  ```
168
303
 
169
304
  ### Configuration
170
305
 
171
306
  ```ruby
172
307
  # Configure the system
173
- Ragdoll::Core.configure do |config|
308
+ Ragdoll.configure do |config|
174
309
  # Database configuration (PostgreSQL only - REQUIRED)
175
310
  config.database_config = {
176
311
  adapter: 'postgresql',
@@ -218,6 +353,7 @@ end
218
353
  - **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
219
354
  - **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
220
355
  - **Search functionality**: Semantic search with cosine similarity and usage analytics
356
+ - **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
221
357
  - **Document management**: Add, update, delete, list operations
222
358
  - **Background processing**: ActiveJob integration for async embedding generation
223
359
  - **LLM metadata generation**: AI-powered structured content analysis with schema validation
@@ -264,15 +400,16 @@ Currently, Ragdoll processes text documents through:
264
400
  6. **Search**: Semantic search using cosine similarity with usage analytics
265
401
 
266
402
  ### Example Usage
403
+
267
404
  ```ruby
268
405
  # Add a text document
269
- result = Ragdoll::Core.add_document(path: 'document.pdf')
406
+ result = Ragdoll.add_document(path: 'document.pdf')
270
407
 
271
408
  # Check processing status
272
- status = Ragdoll::Core.document_status(id: result[:document_id])
409
+ status = Ragdoll.document_status(id: result[:document_id])
273
410
 
274
411
  # Search the content
275
- results = Ragdoll::Core.search(query: 'machine learning')
412
+ results = Ragdoll.search(query: 'machine learning')
276
413
  ```
277
414
 
278
415
  ## PostgreSQL + pgvector Configuration
@@ -293,7 +430,7 @@ psql -d ragdoll_production -c "CREATE EXTENSION IF NOT EXISTS vector;"
293
430
  ### Configuration Example
294
431
 
295
432
  ```ruby
296
- Ragdoll::Core.configure do |config|
433
+ Ragdoll.configure do |config|
297
434
  config.database_config = {
298
435
  adapter: 'postgresql',
299
436
  database: 'ragdoll_production',
@@ -337,11 +474,52 @@ gem 'ragdoll'
337
474
  - **PostgreSQL**: 12+ with pgvector extension (REQUIRED - no other databases supported)
338
475
  - **Dependencies**: activerecord, pg, pgvector, neighbor, ruby_llm, pdf-reader, docx, rubyzip, shrine, rmagick, opensearch-ruby, searchkick, ruby-progressbar
339
476
 
477
+ ## Use Cases
478
+
479
+ - Internal knowledge bases and chat assistants grounded in your documents
480
+ - Product documentation and support search with analytics and relevance feedback
481
+ - Research corpora exploration (summaries, topics, similarity) across large text sets
482
+ - Incident retrospectives and operational analytics with searchable write-ups
483
+ - Media libraries preparing for text + image + audio pipelines (image/audio in progress)
484
+
485
+ ## Environment Variables
486
+
487
+ Set the following as environment variables (do not commit secrets to source control):
488
+
489
+ - `OPENAI_API_KEY` — required for OpenAI models
490
+ - `OPENAI_ORGANIZATION` — optional, for OpenAI org scoping
491
+ - `OPENAI_PROJECT` — optional, for OpenAI project scoping
492
+ - `ANTHROPIC_API_KEY` — optional, for Anthropic models
493
+ - `GOOGLE_API_KEY` — optional, for Google models
494
+ - `DATABASE_PASSWORD` — your PostgreSQL password if not using peer auth
495
+
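A startup check for these variables could look like the sketch below. It assumes an OpenAI-backed setup, where only `OPENAI_API_KEY` is required; the helper name `missing_env_vars` is hypothetical and not part of the gem.

```ruby
# Hypothetical startup check: report required variables that are unset or blank.
# `env` defaults to the process environment; any hash-like object also works.
def missing_env_vars(env = ENV, required: ['OPENAI_API_KEY'])
  required.select { |name| env[name].nil? || env[name].strip.empty? }
end

missing = missing_env_vars
warn "Missing environment variables: #{missing.join(', ')}" unless missing.empty?
```

Passing a different `required:` list adapts the check to other providers, for example `['ANTHROPIC_API_KEY']`.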
496
+ ## Troubleshooting
497
+
498
+ ### pgvector extension missing
499
+
500
+ - Ensure the extension is enabled in your database:
501
+
502
+ ```bash
503
+ psql -d ragdoll_production -c "CREATE EXTENSION IF NOT EXISTS vector;"
504
+ ```
505
+
506
+ - If the command fails, verify PostgreSQL and pgvector are installed and that you’re connecting to the correct database.
507
+
508
+ ### Document stuck in "processing"
509
+
510
+ - Confirm your API keys are set and valid.
511
+ - Ensure `auto_migrate: true` is set in your configuration (or run the migrations yourself if you manage the schema).
512
+ - Check logs at the path configured by `logging_config[:log_filepath]` for errors.
513
+
340
514
  ## Related Projects
341
515
 
342
516
  - **ragdoll-cli**: Standalone CLI application using ragdoll
343
517
  - **ragdoll-rails**: Rails engine with web interface for ragdoll
344
518
 
519
+ ## Contributing & Support
520
+
521
+ Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request. For questions and feedback, open an issue in this repository.
522
+
345
523
  ## Key Design Principles
346
524
 
347
525
  1. **Database-Oriented**: Built on ActiveRecord with PostgreSQL + pgvector for production performance
data/Rakefile CHANGED
@@ -1,8 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
- require "simplecov"
4
- SimpleCov.start
5
-
6
3
  # Suppress bundler/rubygems warnings
7
4
  $VERBOSE = nil
8
5
 
@@ -52,8 +49,10 @@ task :setup_test_db do
52
49
  puts "Warning: Could not install pgvector extension: #{e.message}"
53
50
  end
54
51
 
55
- # Run migrations
56
- Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: true, logger: nil))
52
+ # Reset and run migrations (drops all tables and re-runs migrations)
53
+ # This ensures clean state for tests regardless of previous migration versions
54
+ Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: false, logger: nil))
55
+ Ragdoll::Core::Database.reset!
57
56
  puts "Test database setup complete"
58
57
  end
59
58