ragdoll 0.1.8 → 0.1.10
- checksums.yaml +4 -4
- data/CHANGELOG.md +243 -0
- data/README.md +209 -31
- data/Rakefile +4 -5
- data/app/models/ragdoll/document.rb +115 -12
- data/app/models/ragdoll/embedding.rb +108 -2
- data/app/models/ragdoll/search.rb +165 -0
- data/app/models/ragdoll/search_result.rb +121 -0
- data/app/services/ragdoll/configuration_service.rb +3 -3
- data/app/services/ragdoll/document_processor.rb +124 -1
- data/app/services/ragdoll/embedding_service.rb +10 -0
- data/app/services/ragdoll/search_engine.rb +75 -6
- data/db/migrate/{001_enable_postgresql_extensions.rb → 20250815234901_enable_postgresql_extensions.rb} +7 -8
- data/db/migrate/20250815234902_create_ragdoll_documents.rb +117 -0
- data/db/migrate/{005_create_ragdoll_embeddings.rb → 20250815234903_create_ragdoll_embeddings.rb} +13 -10
- data/db/migrate/{006_create_ragdoll_contents.rb → 20250815234904_create_ragdoll_contents.rb} +14 -11
- data/db/migrate/20250815234905_create_ragdoll_searches.rb +77 -0
- data/db/migrate/20250815234906_create_ragdoll_search_results.rb +49 -0
- data/lib/ragdoll/core/client.rb +75 -8
- data/lib/ragdoll/core/database.rb +8 -3
- data/lib/ragdoll/core/model.rb +13 -0
- data/lib/ragdoll/core/version.rb +1 -1
- data/lib/ragdoll/core.rb +2 -0
- data/lib/ragdoll.rb +17 -0
- data/lib/tasks/db.rake +75 -27
- metadata +375 -6
- data/db/migrate/004_create_ragdoll_documents.rb +0 -70
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 4f7b2c95ede1523e9e01af70394217387d876da6317fed651df3e27cf337cfe9
+data.tar.gz: a82ae7d541fd06876acb3acaf8f02639234f8b118274621851678a2799c5f559
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: ba14828a6e743677c84072b9f1bb27743e429531ebdd9fbd3d8553add7bbdad070d709cd617dc620fef4ddc6846085ca79d3bb6d32bae8465c6b3b10acc0692f
+data.tar.gz: de630ebf15168b562ef686ec6cd9f1cfe532b5bbf495e33a74085b567cf53ce7bb87e7c5c543756c47bd68c98290221b879a1b4d8e5888aac4916d1c1554fe99
data/CHANGELOG.md
ADDED
@@ -0,0 +1,243 @@
+# Changelog
+
+All notable changes to the Ragdoll Core project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [Unreleased]
+
+## [0.1.10] - 2025-01-15
+
+### Changed
+- Continued improvements to search performance and accuracy
+
+### Added
+- **Hybrid Search**: Complete implementation combining semantic and full-text search capabilities
+- Configurable weights for semantic vs text search (default: 70% semantic, 30% text)
+- Deduplication of results by document ID
+- Combined scoring system for unified result ranking
+- **Full-text Search**: PostgreSQL full-text search with tsvector indexing
+- Per-word match ratio scoring (0.0 to 1.0)
+- GIN index for high-performance text search
+- Search across title, summary, keywords, and description fields
+- **Enhanced Search API**: Complete search type delegation at top-level Ragdoll namespace
+- `Ragdoll.hybrid_search` method for combined semantic and text search
+- `Ragdoll::Document.search_content` for full-text search capabilities
+- Consistent parameter handling across all search methods
+
+### Changed
+- **Search Architecture**: Unified search interface supporting semantic, fulltext, and hybrid modes
+- **Database Schema**: Added search_vector column with GIN indexing for full-text search performance
+
+### Technical Details
+- Full-text search uses PostgreSQL's built-in tsvector capabilities
+- Hybrid search combines cosine similarity (semantic) with text match ratios
+- Results are ranked by weighted combined scores
+- All search methods maintain backward compatibility
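The hybrid ranking described in the 0.1.10 entry above can be illustrated with a small, self-contained sketch. It is not Ragdoll's implementation: the method names `match_ratio` and `hybrid_rank`, the hash keys, and the choice to keep the best score per document are assumptions made for illustration. Only the 0.7/0.3 default weights, the 0.0 to 1.0 per-word match ratio, and deduplication by document ID come from the changelog.

```ruby
# Illustrative only: hypothetical helpers, not part of the ragdoll gem.

# Fraction of distinct query words that appear in the text (0.0 to 1.0).
def match_ratio(query, text)
  words = query.downcase.scan(/\w+/).uniq
  return 0.0 if words.empty?

  haystack = text.downcase.scan(/\w+/).uniq
  (words & haystack).size.to_f / words.size
end

# Merge semantic and full-text hits into one ranked list.
# Semantic hits look like { document_id:, similarity: },
# text hits look like { document_id:, ratio: } (assumed shapes).
def hybrid_rank(semantic_hits, text_hits, semantic_weight: 0.7, text_weight: 0.3)
  scores = Hash.new { |h, id| h[id] = { similarity: 0.0, ratio: 0.0 } }

  # Deduplicate by document ID, keeping the best score seen per document.
  semantic_hits.each do |hit|
    entry = scores[hit[:document_id]]
    entry[:similarity] = [entry[:similarity], hit[:similarity]].max
  end
  text_hits.each do |hit|
    entry = scores[hit[:document_id]]
    entry[:ratio] = [entry[:ratio], hit[:ratio]].max
  end

  ranked = scores.map do |id, s|
    { document_id: id,
      combined_score: semantic_weight * s[:similarity] + text_weight * s[:ratio] }
  end
  ranked.sort_by { |r| -r[:combined_score] }
end
```

A document that appears in only one of the two result sets still receives a score; the missing component simply contributes zero.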
+
+## [0.1.9] - 2025-01-10
+
+### Added
+- **Initial CHANGELOG**: Added comprehensive CHANGELOG.md following Keep a Changelog format
+- Complete version history from git log analysis
+- Feature status tracking (implemented vs planned)
+- Migration guides and breaking changes documentation
+- Structured release notes with proper categorization
+- **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
+- Automatic search recording with vector embeddings for similarity analysis
+- Click-through rate tracking and user engagement monitoring
+- Session and user behavior tracking capabilities
+- Performance metrics including execution time and result quality analysis
+- Search similarity analysis using vector embeddings
+- Automatic cleanup of orphaned and unused searches
+- **Enhanced README**: Updated documentation with search tracking examples and analytics usage
+- Comprehensive search analytics examples and usage patterns
+- Updated API examples to use proper top-level Ragdoll methods
+- Added search tracking configuration and usage examples
+- **API Method Consistency**: Added `hybrid_search` delegation to top-level Ragdoll namespace
+- Complete documentation with examples and parameter descriptions
+- Consistent API experience across all search methods
+- Verified method availability at both Ragdoll and Ragdoll::Core levels
+
+### Fixed
+- **Model Resolution Warning**: Fixed "undefined method 'empty?' for an instance of Ragdoll::Core::Model" warning
+- Added defensive `empty?` method to Model class
+- Enhanced constructor to handle polymorphic Model objects
+- Added nil/empty checks in embedding service
+
+### Changed
+- **Test Coverage**: Added coverage directory to .gitignore for cleaner repository state
+
+### Technical Details
+- Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
+- All changes maintain backward compatibility
+- No breaking API changes
+
+## [0.1.8] - 2025-01-04
+
+### Added
+- **Search Analytics Foundation**: Added `Ragdoll::Search` model with query embedding and result tracking capabilities
+- **Embedding Service Enhancements**: Fallback mechanism for model resolution in embedding service
+- **Test Coverage**: Added coverage directory to gitignore and improved test infrastructure
+
+### Changed
+- Updated Gemfile.lock with latest gem versions
+- Enhanced runtime dependencies and version management
+
+### Fixed
+- Package directory exclusion in gitignore
+
+## [0.1.7] - 2025-01-04
+
+### Added
+- **Multi-Modal Content Models**: Added AudioContent model for comprehensive audio processing support
+- **Background Job Processing**: New Ragdoll job classes for asynchronous document processing
+- **Metadata Schemas**: Structured metadata schemas for text and image documents with validation
+
+### Changed
+- Updated ragdoll gem dependencies
+- Improved submodule management for documentation
+
+## [0.1.6] - 2025-01-04
+
+### Added
+- **Documentation Restructure**: Replaced local docs with ragdoll-docs submodule
+- **Conventional Commits**: Updated and restructured Conventional Commits specification
+- **CI/CD Improvements**: Enhanced GitHub Actions workflow and dropped JRuby support for RMagick compatibility
+
+### Fixed
+- Test skipping logic for CI environments
+- Automated release workflow adjustments
+
+## [0.1.5] - 2025-01-04
+
+### Added
+- Enhanced document processing pipeline
+- Improved error handling and logging
+
+### Fixed
+- Version management and release process refinements
+
+## [0.1.4] - 2025-01-04
+
+### Added
+- Extended multi-modal architecture support
+- Performance optimizations for large document processing
+
+### Changed
+- Refined version numbering and release process
+
+## [0.1.3] - 2025-01-04
+
+### Added
+- **Core RAG Architecture**: Multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord
+- **PostgreSQL + pgvector Integration**: High-performance semantic search with vector similarity
+- **Polymorphic Content Architecture**: Unified handling of text, image, and audio content types
+- **Dual Metadata Design**: Separation of LLM-generated content analysis and system file properties
+- **Document Processing Pipeline**: Support for PDF, DOCX, HTML, Markdown, and plain text files
+- **Embedding Generation**: Text chunking and vector embedding creation with multiple LLM provider support
+- **Semantic Search**: Cosine similarity search with usage analytics
+- **Background Processing**: ActiveJob integration for asynchronous document processing
+- **Logging System**: Configurable file-based logging with multiple levels
+
+### Technical Features
+- **Database Schema**: Multi-modal polymorphic architecture optimized for PostgreSQL
+- **IVFFlat Indexing**: Fast approximate nearest neighbor search for vector similarity
+- **Connection Pooling**: High-concurrency support for production workloads
+- **Configuration Management**: Comprehensive configuration system for LLM providers and processing settings
+
+## [0.1.1] - 2024-12-XX
+
+### Added
+- Initial project structure and basic functionality
+- Core document management capabilities
+- Basic search and retrieval features
+
+## [0.0.2] - 2024-12-XX
+
+### Added
+- Initial alpha release
+- Basic RAG architecture foundation
+- PostgreSQL database integration
+
+---
+
+## Feature Status
+
+### ✅ Fully Implemented
+- **Text Document Processing**: PDF, DOCX, HTML, Markdown, plain text files
+- **Embedding Generation**: Text chunking and vector embedding creation
+- **Database Schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
+- **Dual Metadata Architecture**: Separate LLM-generated content analysis and file properties
+- **Search Functionality**: Semantic search with cosine similarity and usage analytics
+- **Hybrid Search**: Complete implementation combining semantic and full-text search with configurable weights
+- **Full-text Search**: PostgreSQL tsvector-based text search with GIN indexing
+- **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
+- **Document Management**: Add, update, delete, list operations
+- **Background Processing**: ActiveJob integration for async embedding generation
+- **LLM Metadata Generation**: AI-powered structured content analysis with schema validation
+- **Logging**: Configurable file-based logging with multiple levels
+
+### 🚧 In Development
+- **Image Processing**: Framework exists but vision AI integration needs completion
+- **Audio Processing**: Framework exists but speech-to-text integration needs completion
+
+### 📋 Planned Features
+- **Multi-modal Search**: Search across text, image, and audio content types
+- **Content-type Specific Embedding Models**: Different models for text, image, audio
+- **Enhanced Metadata Schemas**: Domain-specific metadata templates
+
+---
+
+## Migration Guide
+
+### From 0.1.9 to 0.1.10
+- **New Search Methods**: `Ragdoll.hybrid_search` and `Ragdoll::Document.search_content` methods now available
+- **Database Migration**: New search_vector column added to documents table with GIN index for full-text search
+- **API Enhancement**: All search methods now support unified parameter interface
+- **Backward Compatibility**: Existing `Ragdoll.search` method unchanged, continues to work as before
+- **CLI Integration**: ragdoll-cli now requires ragdoll >= 0.1.10 for hybrid and full-text search support
+
+### From 0.1.8 to 0.1.9
+- **CHANGELOG Addition**: Comprehensive changelog and feature tracking added
+- **API Method Consistency**: `hybrid_search` method properly delegated to top-level namespace
+- **No Breaking Changes**: All existing functionality remains compatible
+
+### From 0.1.7 to 0.1.8
+- New search tracking tables will be automatically created via migrations
+- No breaking changes to existing API
+- Search tracking is enabled by default but can be disabled per search
+
+### From 0.1.6 to 0.1.7
+- AudioContent model added - existing installations will auto-migrate
+- New background job classes available for improved processing
+- Metadata schemas provide enhanced validation
+
+### From 0.1.5 to 0.1.6
+- Documentation moved to submodule - update local references
+- CI/CD improvements may affect development workflows
+- JRuby support removed due to RMagick dependency
+
+---
+
+## Breaking Changes
+
+### Version 0.1.6
+- **JRuby Support Removed**: RMagick dependency incompatibility
+- **Documentation Structure**: Local docs replaced with submodule
+
+---
+
+## Contributors
+
+- **Dewayne VanHoozer** - Primary developer and maintainer
+
+---
+
+## License
+
+This project is licensed under the MIT License - see the LICENSE file for details.
+
+---
+
+*This changelog is automatically maintained and reflects the actual implementation status of features.*
data/README.md
CHANGED
@@ -18,17 +18,65 @@
 </table>
 </div>
 
-# Ragdoll
+# Ragdoll
 
 Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.
 
+RAG does not have to be hard. Every week it's getting simpler. The frontier LLM providers are starting to incorporate RAG services. For example, OpenAI offers a vector search service. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
+
+## Overview
+
+Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
+
+The library emphasizes a dual-metadata design: LLM-derived semantic metadata for understanding content, and system file metadata for managing assets. With built-in analytics, background processing, and a high-level API, you can go from ingest to answer quickly—and scale confidently.
+
+### Why Ragdoll?
+
+- Database-first foundation on ActiveRecord (PostgreSQL + pgvector only) for performance and reliability
+- Multi-modal architecture (text today; image/audio next) via polymorphic content design
+- Dual metadata model separating semantic analysis from file properties
+- Provider-agnostic LLM integration via `ruby_llm` (OpenAI, Anthropic, Google)
+- Production-friendly: background jobs, connection pooling, indexing, and search analytics
+- Simple, ergonomic high-level API to keep your application code clean
+
+### Key Capabilities
+
+- Semantic search with vector similarity (cosine) across polymorphic content
+- Text ingestion, chunking, and embedding generation
+- LLM-powered structured metadata with schema validation
+- Search tracking and analytics (CTR, performance, similarity of queries)
+- Hybrid search (semantic + full-text) planned
+- Extensible model and configuration system
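Cosine similarity, the measure named in the first capability above, can be computed in plain Ruby as shown here. This is a minimal sketch for intuition only. The README attributes the actual search to PostgreSQL + pgvector, and the method name `cosine_similarity` is hypothetical rather than part of the gem.

```ruby
# Illustrative only: a hypothetical helper, not part of the ragdoll gem.
def cosine_similarity(a, b)
  raise ArgumentError, "vectors must have the same length" unless a.size == b.size

  dot = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  # A zero vector has no direction, so treat its similarity as 0.0.
  return 0.0 if norm_a.zero? || norm_b.zero?

  dot / (norm_a * norm_b)
end
```

Values close to 1.0 mean the two embeddings point in nearly the same direction; values near 0.0 mean they are unrelated.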
+
+## Table of Contents
+
+- [Quick Start](#quick-start)
+- [API Overview](#api-overview)
+- [Search and Retrieval](#search-and-retrieval)
+- [Search Analytics and Tracking](#search-analytics-and-tracking)
+- [System Operations](#system-operations)
+- [Configuration](#configuration)
+- [Current Implementation Status](#current-implementation-status)
+- [Architecture Highlights](#architecture-highlights)
+- [Text Document Processing](#text-document-processing-current)
+- [PostgreSQL + pgvector Configuration](#postgresql--pgvector-configuration)
+- [Performance Features](#performance-features)
+- [Installation](#installation)
+- [Requirements](#requirements)
+- [Use Cases](#use-cases)
+- [Environment Variables](#environment-variables)
+- [Troubleshooting](#troubleshooting)
+- [Related Projects](#related-projects)
+- [Key Design Principles](#key-design-principles)
+- [Contributing & Support](#contributing--support)
+
 ## Quick Start
 
 ```ruby
 require 'ragdoll'
 
 # Configure with PostgreSQL + pgvector
-Ragdoll
+Ragdoll.configure do |config|
 # Database configuration (PostgreSQL only)
 config.database_config = {
 adapter: 'postgresql',
@@ -55,22 +103,22 @@ Ragdoll::Core.configure do |config|
 end
 
 # Add documents - returns detailed result
-result = Ragdoll
+result = Ragdoll.add_document(path: 'research_paper.pdf')
 puts result[:message] # "Document 'research_paper' added successfully with ID 123"
 doc_id = result[:document_id]
 
 # Check document status
-status = Ragdoll
+status = Ragdoll.document_status(id: doc_id)
 puts status[:message] # Shows processing status and embeddings count
 
 # Search across content
-results = Ragdoll
+results = Ragdoll.search(query: 'neural networks')
 
 # Get detailed document information
-document = Ragdoll
+document = Ragdoll.get_document(id: doc_id)
 ```
 
-##
+## API Overview
 
 The `Ragdoll` module provides a convenient high-level API for common operations:
 
@@ -78,37 +126,37 @@ The `Ragdoll` module provides a convenient high-level API for common operations:
 
 ```ruby
 # Add single document - returns detailed result hash
-result = Ragdoll
+result = Ragdoll.add_document(path: 'document.pdf')
 puts result[:success] # true
 puts result[:document_id] # "123"
 puts result[:message] # "Document 'document' added successfully with ID 123"
 puts result[:embeddings_queued] # true
 
 # Check document processing status
-status = Ragdoll
+status = Ragdoll.document_status(id: result[:document_id])
 puts status[:status] # "processed"
 puts status[:embeddings_count] # 15
 puts status[:embeddings_ready] # true
 puts status[:message] # "Document processed successfully with 15 embeddings"
 
 # Get detailed document information
-document = Ragdoll
+document = Ragdoll.get_document(id: result[:document_id])
 puts document[:title] # "document"
 puts document[:status] # "processed"
 puts document[:embeddings_count] # 15
 puts document[:content_length] # 5000
 
 # Update document metadata
-Ragdoll
+Ragdoll.update_document(id: result[:document_id], title: 'New Title')
 
 # Delete document
-Ragdoll
+Ragdoll.delete_document(id: result[:document_id])
 
 # List all documents
-documents = Ragdoll
+documents = Ragdoll.list_documents(limit: 10)
 
 # System statistics
-stats = Ragdoll
+stats = Ragdoll.stats
 puts stats[:total_documents] # 50
 puts stats[:total_embeddings] # 1250
 ```
@@ -117,15 +165,22 @@ puts stats[:total_embeddings] # 1250
 
 ```ruby
 # Semantic search across all content types
-results = Ragdoll
+results = Ragdoll.search(query: 'artificial intelligence')
+
+# Search with automatic tracking (default)
+results = Ragdoll.search(
+query: 'machine learning',
+session_id: 123, # Optional: track user sessions
+user_id: 456 # Optional: track by user
+)
 
 # Search specific content types
-text_results = Ragdoll
-image_results = Ragdoll
-audio_results = Ragdoll
+text_results = Ragdoll.search(query: 'machine learning', content_type: 'text')
+image_results = Ragdoll.search(query: 'neural network diagram', content_type: 'image')
+audio_results = Ragdoll.search(query: 'AI discussion', content_type: 'audio')
 
 # Advanced search with metadata filters
-results = Ragdoll
+results = Ragdoll.search(
 query: 'deep learning',
 classification: 'research',
 keywords: ['AI', 'neural networks'],
@@ -133,44 +188,124 @@ results = Ragdoll::Core.search(
 )
 
 # Get context for RAG applications
-context = Ragdoll
+context = Ragdoll.get_context(query: 'machine learning', limit: 5)
 
 # Enhanced prompt with context
-enhanced = Ragdoll
+enhanced = Ragdoll.enhance_prompt(
 prompt: 'What is machine learning?',
 context_limit: 5
 )
 
 # Hybrid search combining semantic and full-text
-results = Ragdoll
+results = Ragdoll.hybrid_search(
 query: 'neural networks',
 semantic_weight: 0.7,
 text_weight: 0.3
 )
 ```
 
+### Keywords Search
+
+Ragdoll supports powerful keywords-based search that can be used standalone or combined with semantic search. The keywords system uses PostgreSQL array operations for high performance and supports both partial matching (overlap) and exact matching (contains all).
+
+```ruby
+# Keywords-only search (overlap - documents containing any of the keywords)
+results = Ragdoll::Document.search_by_keywords(['machine', 'learning', 'ai'])
+
+# Results are sorted by match count (documents with more keyword matches rank higher)
+results.each do |doc|
+puts "#{doc.title}: #{doc.keywords_match_count} matches"
+end
+
+# Exact keywords search (contains all - documents must have ALL keywords)
+results = Ragdoll::Document.search_by_keywords_all(['ruby', 'programming'])
+
+# Results are sorted by focus (fewer total keywords = more focused document)
+results.each do |doc|
+puts "#{doc.title}: #{doc.total_keywords_count} total keywords"
+end
+
+# Combined semantic + keywords search for best results
+results = Ragdoll.search(
+query: 'artificial intelligence applications',
+keywords: ['ai', 'machine learning', 'neural networks'],
+limit: 10
+)
+
+# Keywords search with options
+results = Ragdoll::Document.search_by_keywords(
+['web', 'javascript', 'frontend'],
+limit: 20
+)
+
+# Case-insensitive keyword matching (automatically normalized)
+results = Ragdoll::Document.search_by_keywords(['Python', 'DATA-SCIENCE', 'ai'])
+# Will match documents with keywords: ['python', 'data-science', 'ai']
+```
+
+**Keywords Search Features:**
+- **High Performance**: Uses PostgreSQL GIN indexes for fast array operations
+- **Flexible Matching**: Supports both overlap (`&&`) and contains (`@>`) operators
+- **Smart Scoring**: Results ordered by match count or document focus
+- **Case Insensitive**: Automatic keyword normalization
+- **Integration Ready**: Works seamlessly with semantic search
+- **Inspired by `find_matching_entries.rb`**: Optimized for PostgreSQL arrays
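To make the two matching modes above concrete, here is an in-memory sketch of the overlap and contains-all semantics, with the orderings the README describes. It does not call Ragdoll or PostgreSQL. `Doc`, `normalize`, `overlap_search`, and `contains_all_search` are hypothetical names, and the normalization step (strip and downcase) is an assumption about what "automatically normalized" means.

```ruby
# Illustrative only: in-memory stand-ins, not the ragdoll gem's API.
Doc = Struct.new(:title, :keywords)

def normalize(keywords)
  keywords.map { |k| k.to_s.strip.downcase }.uniq
end

# Overlap (like SQL `&&`): any shared keyword matches;
# documents sharing more keywords rank higher.
def overlap_search(docs, keywords)
  wanted = normalize(keywords)
  docs.map { |d| [d, (normalize(d.keywords) & wanted).size] }
      .select { |_, count| count.positive? }
      .sort_by { |_, count| -count }
      .map(&:first)
end

# Contains all (like SQL `@>`): every requested keyword must be present;
# documents with fewer total keywords (more focused) rank higher.
def contains_all_search(docs, keywords)
  wanted = normalize(keywords)
  docs.select { |d| (wanted - normalize(d.keywords)).empty? }
      .sort_by { |d| d.keywords.size }
end
```

Overlap is the more forgiving mode and suits exploratory queries, while contains-all narrows results to documents tagged with every requested term.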
+
+### Search Analytics and Tracking
+
+Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
+
+```ruby
+# Get search analytics for the last 30 days
+analytics = Ragdoll::Search.search_analytics(days: 30)
+puts "Total searches: #{analytics[:total_searches]}"
+puts "Unique queries: #{analytics[:unique_queries]}"
+puts "Average execution time: #{analytics[:avg_execution_time]}ms"
+puts "Click-through rate: #{analytics[:click_through_rate]}%"
+
+# Find similar searches using vector similarity
+search = Ragdoll::Search.first
+similar_searches = search.nearest_neighbors(:query_embedding, distance: :cosine).limit(5)
+
+similar_searches.each do |similar|
+puts "Query: #{similar.query}"
+puts "Similarity: #{similar.neighbor_distance}"
+puts "Results: #{similar.results_count}"
+end
+
+# Track user interactions (clicks on search results)
+search_result = Ragdoll::SearchResult.first
+search_result.mark_as_clicked!
+
+# Disable tracking for specific searches if needed
+results = Ragdoll.search(
+query: 'private query',
+track_search: false
+)
+```
+
 ### System Operations
 
 ```ruby
 # Get system statistics
-stats = Ragdoll
+stats = Ragdoll.stats
 # Returns information about documents, content types, embeddings, etc.
 
 # Health check
-healthy = Ragdoll
+healthy = Ragdoll.healthy?
 
 # Get configuration
-config = Ragdoll
+config = Ragdoll.configuration
 
 # Reset configuration (useful for testing)
-Ragdoll
+Ragdoll.reset_configuration!
 ```
 
 ### Configuration
 
 ```ruby
 # Configure the system
-Ragdoll
+Ragdoll.configure do |config|
 # Database configuration (PostgreSQL only - REQUIRED)
 config.database_config = {
 adapter: 'postgresql',
@@ -218,6 +353,7 @@ end
 - **Database schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
 - **Dual metadata architecture**: Separate LLM-generated content analysis and file properties
 - **Search functionality**: Semantic search with cosine similarity and usage analytics
+- **Search tracking system**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
 - **Document management**: Add, update, delete, list operations
 - **Background processing**: ActiveJob integration for async embedding generation
 - **LLM metadata generation**: AI-powered structured content analysis with schema validation
@@ -264,15 +400,16 @@ Currently, Ragdoll processes text documents through:
 6. **Search**: Semantic search using cosine similarity with usage analytics
 
 ### Example Usage
+
 ```ruby
 # Add a text document
-result = Ragdoll
+result = Ragdoll.add_document(path: 'document.pdf')
 
 # Check processing status
-status = Ragdoll
+status = Ragdoll.document_status(id: result[:document_id])
 
 # Search the content
-results = Ragdoll
+results = Ragdoll.search(query: 'machine learning')
 ```
 
 ## PostgreSQL + pgvector Configuration
@@ -293,7 +430,7 @@ psql -d ragdoll_production -c "CREATE EXTENSION IF NOT EXISTS vector;"
 ### Configuration Example
 
 ```ruby
-Ragdoll
+Ragdoll.configure do |config|
 config.database_config = {
 adapter: 'postgresql',
 database: 'ragdoll_production',
@@ -337,11 +474,52 @@ gem 'ragdoll'
 - **PostgreSQL**: 12+ with pgvector extension (REQUIRED - no other databases supported)
 - **Dependencies**: activerecord, pg, pgvector, neighbor, ruby_llm, pdf-reader, docx, rubyzip, shrine, rmagick, opensearch-ruby, searchkick, ruby-progressbar
 
+## Use Cases
+
+- Internal knowledge bases and chat assistants grounded in your documents
+- Product documentation and support search with analytics and relevance feedback
+- Research corpora exploration (summaries, topics, similarity) across large text sets
+- Incident retrospectives and operational analytics with searchable write-ups
+- Media libraries preparing for text + image + audio pipelines (image/audio in progress)
+
+## Environment Variables
+
+Set the following as environment variables (do not commit secrets to source control):
+
+- `OPENAI_API_KEY` — required for OpenAI models
+- `OPENAI_ORGANIZATION` — optional, for OpenAI org scoping
+- `OPENAI_PROJECT` — optional, for OpenAI project scoping
+- `ANTHROPIC_API_KEY` — optional, for Anthropic models
+- `GOOGLE_API_KEY` — optional, for Google models
+- `DATABASE_PASSWORD` — your PostgreSQL password if not using peer auth
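One way to read the variables listed above at boot is sketched below. The helper name `llm_env_settings` and the layout of the returned hash are hypothetical. How the values are passed into Ragdoll's configuration is left out because this part of the README does not spell it out.

```ruby
# Illustrative only: a hypothetical helper, not part of the ragdoll gem.
# Accepts any hash-like object so it can be exercised without touching ENV.
def llm_env_settings(env = ENV)
  {
    # Treated as required here, following the README's wording for OpenAI
    # models; fetch raises KeyError when the variable is missing.
    openai_api_key: env.fetch("OPENAI_API_KEY"),
    openai_organization: env["OPENAI_ORGANIZATION"],
    openai_project: env["OPENAI_PROJECT"],
    anthropic_api_key: env["ANTHROPIC_API_KEY"],
    google_api_key: env["GOOGLE_API_KEY"],
    database_password: env["DATABASE_PASSWORD"]
  }.compact # drop optional variables that are not set
end
```

Failing fast on a missing required key at startup tends to be easier to diagnose than a document that stays in "processing" because an API call was rejected later.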
+
+## Troubleshooting
+
+### pgvector extension missing
+
+- Ensure the extension is enabled in your database:
+
+```bash
+psql -d ragdoll_production -c "CREATE EXTENSION IF NOT EXISTS vector;"
+```
+
+- If the command fails, verify PostgreSQL and pgvector are installed and that you’re connecting to the correct database.
+
+### Document stuck in "processing"
+
+- Confirm your API keys are set and valid.
+- Ensure `auto_migrate: true` in configuration (or run migrations if you manage schema yourself).
+- Check logs at the path configured by `logging_config[:log_filepath]` for errors.
+
 ## Related Projects
 
 - **ragdoll-cli**: Standalone CLI application using ragdoll
 - **ragdoll-rails**: Rails engine with web interface for ragdoll
 
+## Contributing & Support
+
+Contributions are welcome! If you find a bug or have a feature request, please open an issue or submit a pull request. For questions and feedback, open an issue in this repository.
+
 ## Key Design Principles
 
 1. **Database-Oriented**: Built on ActiveRecord with PostgreSQL + pgvector for production performance
data/Rakefile
CHANGED
@@ -1,8 +1,5 @@
 # frozen_string_literal: true
 
-require "simplecov"
-SimpleCov.start
-
 # Suppress bundler/rubygems warnings
 $VERBOSE = nil
 
@@ -52,8 +49,10 @@ task :setup_test_db do
 puts "Warning: Could not install pgvector extension: #{e.message}"
 end
 
-#
-
+# Reset and run migrations (drops all tables and re-runs migrations)
+# This ensures clean state for tests regardless of previous migration versions
+Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: false, logger: nil))
+Ragdoll::Core::Database.reset!
 puts "Test database setup complete"
 end
 