ragdoll 0.1.9 → 0.1.10
- checksums.yaml +4 -4
- data/CHANGELOG.md +46 -4
- data/README.md +49 -0
- data/Rakefile +4 -2
- data/app/models/ragdoll/document.rb +115 -12
- data/app/models/ragdoll/embedding.rb +36 -4
- data/app/services/ragdoll/search_engine.rb +13 -2
- data/db/migrate/{001_enable_postgresql_extensions.rb → 20250815234901_enable_postgresql_extensions.rb} +7 -8
- data/db/migrate/20250815234902_create_ragdoll_documents.rb +117 -0
- data/db/migrate/{005_create_ragdoll_embeddings.rb → 20250815234903_create_ragdoll_embeddings.rb} +13 -10
- data/db/migrate/{006_create_ragdoll_contents.rb → 20250815234904_create_ragdoll_contents.rb} +14 -11
- data/db/migrate/{007_create_ragdoll_searches.rb → 20250815234905_create_ragdoll_searches.rb} +24 -20
- data/db/migrate/{008_create_ragdoll_search_results.rb → 20250815234906_create_ragdoll_search_results.rb} +16 -16
- data/lib/ragdoll/core/database.rb +8 -3
- data/lib/ragdoll/core/version.rb +1 -1
- data/lib/tasks/db.rake +63 -15
- metadata +7 -7
- data/db/migrate/004_create_ragdoll_documents.rb +0 -70
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 4f7b2c95ede1523e9e01af70394217387d876da6317fed651df3e27cf337cfe9
+  data.tar.gz: a82ae7d541fd06876acb3acaf8f02639234f8b118274621851678a2799c5f559
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ba14828a6e743677c84072b9f1bb27743e429531ebdd9fbd3d8553add7bbdad070d709cd617dc620fef4ddc6846085ca79d3bb6d32bae8465c6b3b10acc0692f
+  data.tar.gz: de630ebf15168b562ef686ec6cd9f1cfe532b5bbf495e33a74085b567cf53ce7bb87e7c5c543756c47bd68c98290221b879a1b4d8e5888aac4916d1c1554fe99
data/CHANGELOG.md
CHANGED
@@ -6,7 +6,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
 ## [Unreleased]
 
-
+## [0.1.10] - 2025-01-15
+
+### Changed
+- Continued improvements to search performance and accuracy
+
+### Added
+- **Hybrid Search**: Complete implementation combining semantic and full-text search capabilities
+  - Configurable weights for semantic vs text search (default: 70% semantic, 30% text)
+  - Deduplication of results by document ID
+  - Combined scoring system for unified result ranking
+- **Full-text Search**: PostgreSQL full-text search with tsvector indexing
+  - Per-word match ratio scoring (0.0 to 1.0)
+  - GIN index for high-performance text search
+  - Search across title, summary, keywords, and description fields
+- **Enhanced Search API**: Complete search type delegation at top-level Ragdoll namespace
+  - `Ragdoll.hybrid_search` method for combined semantic and text search
+  - `Ragdoll::Document.search_content` for full-text search capabilities
+  - Consistent parameter handling across all search methods
+
+### Changed
+- **Search Architecture**: Unified search interface supporting semantic, fulltext, and hybrid modes
+- **Database Schema**: Added search_vector column with GIN indexing for full-text search performance
+
+### Technical Details
+- Full-text search uses PostgreSQL's built-in tsvector capabilities
+- Hybrid search combines cosine similarity (semantic) with text match ratios
+- Results are ranked by weighted combined scores
+- All search methods maintain backward compatibility
+
+## [0.1.9] - 2025-01-10
 
 ### Added
 - **Initial CHANGELOG**: Added comprehensive CHANGELOG.md following Keep a Changelog format
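The weighted ranking described in the 0.1.10 entry can be sketched in plain Ruby. This is an illustration only, not Ragdoll's implementation: `combine_hybrid_results`, the `:document_id` and `:score` hash keys, and the choice to keep the best score per document are all assumptions. Only the 70/30 default weights and the deduplication by document ID come from the changelog text.

```ruby
# Hypothetical sketch of weighted hybrid ranking; not taken from the gem.
# Default weights mirror the changelog: 70% semantic, 30% full-text.
def combine_hybrid_results(semantic, fulltext, semantic_weight: 0.7, text_weight: 0.3)
  combined = Hash.new { |hash, id| hash[id] = { document_id: id, semantic: 0.0, text: 0.0 } }

  # Keep the best score per document so each document ID appears only once.
  semantic.each do |result|
    entry = combined[result[:document_id]]
    entry[:semantic] = [entry[:semantic], result[:score]].max
  end
  fulltext.each do |result|
    entry = combined[result[:document_id]]
    entry[:text] = [entry[:text], result[:score]].max
  end

  # Rank by the weighted sum, highest first.
  combined.values
          .map { |e| e.merge(combined_score: semantic_weight * e[:semantic] + text_weight * e[:text]) }
          .sort_by { |e| -e[:combined_score] }
end

semantic_hits = [{ document_id: 1, score: 0.9 }, { document_id: 2, score: 0.4 }]
fulltext_hits = [{ document_id: 2, score: 1.0 }, { document_id: 3, score: 0.5 }]
ranked = combine_hybrid_results(semantic_hits, fulltext_hits)
```

A document found by only one search type still receives a score here, with the missing component counted as zero; whether the gem treats that case the same way is not visible in this diff.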
@@ -40,7 +69,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - **Test Coverage**: Added coverage directory to .gitignore for cleaner repository state
 
 ### Technical Details
-- Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
+- Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
 - All changes maintain backward compatibility
 - No breaking API changes
 
@@ -141,6 +170,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - **Database Schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
 - **Dual Metadata Architecture**: Separate LLM-generated content analysis and file properties
 - **Search Functionality**: Semantic search with cosine similarity and usage analytics
+- **Hybrid Search**: Complete implementation combining semantic and full-text search with configurable weights
+- **Full-text Search**: PostgreSQL tsvector-based text search with GIN indexing
 - **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
 - **Document Management**: Add, update, delete, list operations
 - **Background Processing**: ActiveJob integration for async embedding generation
@@ -150,7 +181,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 ### 🚧 In Development
 - **Image Processing**: Framework exists but vision AI integration needs completion
 - **Audio Processing**: Framework exists but speech-to-text integration needs completion
-- **Hybrid Search**: Combining semantic and full-text search capabilities
 
 ### 📋 Planned Features
 - **Multi-modal Search**: Search across text, image, and audio content types
@@ -161,6 +191,18 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
 ## Migration Guide
 
+### From 0.1.9 to 0.1.10
+- **New Search Methods**: `Ragdoll.hybrid_search` and `Ragdoll::Document.search_content` methods now available
+- **Database Migration**: New search_vector column added to documents table with GIN index for full-text search
+- **API Enhancement**: All search methods now support unified parameter interface
+- **Backward Compatibility**: Existing `Ragdoll.search` method unchanged, continues to work as before
+- **CLI Integration**: ragdoll-cli now requires ragdoll >= 0.1.10 for hybrid and full-text search support
+
+### From 0.1.8 to 0.1.9
+- **CHANGELOG Addition**: Comprehensive changelog and feature tracking added
+- **API Method Consistency**: `hybrid_search` method properly delegated to top-level namespace
+- **No Breaking Changes**: All existing functionality remains compatible
+
 ### From 0.1.7 to 0.1.8
 - New search tracking tables will be automatically created via migrations
 - No breaking changes to existing API
@@ -198,4 +240,4 @@ This project is licensed under the MIT License - see the LICENSE file for detail
 
 ---
 
-*This changelog is automatically maintained and reflects the actual implementation status of features.*
+*This changelog is automatically maintained and reflects the actual implementation status of features.*
data/README.md
CHANGED
@@ -22,6 +22,8 @@
 
 Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.
 
+RAG does not have to be hard. Every week it's getting simpler. The frontier LLM providers are starting to incorporate RAG services. For example, OpenAI offers a vector search service. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
+
 ## Overview
 
 Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
@@ -202,6 +204,53 @@ results = Ragdoll.hybrid_search(
 )
 ```
 
+### Keywords Search
+
+Ragdoll supports powerful keywords-based search that can be used standalone or combined with semantic search. The keywords system uses PostgreSQL array operations for high performance and supports both partial matching (overlap) and exact matching (contains all).
+
+```ruby
+# Keywords-only search (overlap - documents containing any of the keywords)
+results = Ragdoll::Document.search_by_keywords(['machine', 'learning', 'ai'])
+
+# Results are sorted by match count (documents with more keyword matches rank higher)
+results.each do |doc|
+  puts "#{doc.title}: #{doc.keywords_match_count} matches"
+end
+
+# Exact keywords search (contains all - documents must have ALL keywords)
+results = Ragdoll::Document.search_by_keywords_all(['ruby', 'programming'])
+
+# Results are sorted by focus (fewer total keywords = more focused document)
+results.each do |doc|
+  puts "#{doc.title}: #{doc.total_keywords_count} total keywords"
+end
+
+# Combined semantic + keywords search for best results
+results = Ragdoll.search(
+  query: 'artificial intelligence applications',
+  keywords: ['ai', 'machine learning', 'neural networks'],
+  limit: 10
+)
+
+# Keywords search with options
+results = Ragdoll::Document.search_by_keywords(
+  ['web', 'javascript', 'frontend'],
+  limit: 20
+)
+
+# Case-insensitive keyword matching (automatically normalized)
+results = Ragdoll::Document.search_by_keywords(['Python', 'DATA-SCIENCE', 'ai'])
+# Will match documents with keywords: ['python', 'data-science', 'ai']
+```
+
+**Keywords Search Features:**
+- **High Performance**: Uses PostgreSQL GIN indexes for fast array operations
+- **Flexible Matching**: Supports both overlap (`&&`) and contains (`@>`) operators
+- **Smart Scoring**: Results ordered by match count or document focus
+- **Case Insensitive**: Automatic keyword normalization
+- **Integration Ready**: Works seamlessly with semantic search
+- **Inspired by `find_matching_entries.rb`**: Optimized for PostgreSQL arrays
+
 ### Search Analytics and Tracking
 
 Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
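The two matching modes named in the README addition map onto PostgreSQL's `&&` (overlap) and `@>` (contains) array operators. A plain-Ruby analogue may make the difference concrete. The helper names below (`normalize_keywords`, `overlap?`, `contains_all?`, `match_count`) are hypothetical and exist only for this sketch, and the extra `strip` during normalization is an assumption.

```ruby
# Plain-Ruby illustration of the two PostgreSQL array operators; not gem code.
# `&&` (overlap): the arrays share at least one element.
# `@>` (contains): the left array includes every element of the right array.
def normalize_keywords(keywords)
  Array(keywords).map { |k| k.to_s.strip.downcase }.reject(&:empty?).uniq
end

def overlap?(document_keywords, query_keywords)
  (normalize_keywords(document_keywords) & normalize_keywords(query_keywords)).any?
end

def contains_all?(document_keywords, query_keywords)
  (normalize_keywords(query_keywords) - normalize_keywords(document_keywords)).empty?
end

# Number of query keywords present on the document.
def match_count(document_keywords, query_keywords)
  (normalize_keywords(document_keywords) & normalize_keywords(query_keywords)).size
end

doc_keywords = %w[ruby rails postgresql]
any_match = overlap?(doc_keywords, ['Ruby', 'python'])
all_match = contains_all?(doc_keywords, ['Ruby', 'python'])
```

Note that in this diff's version of `document.rb`, both `search_by_keywords` and `search_by_keywords_all` order by `created_at DESC`, so the `match_count` helper here only illustrates the ranking idea the README describes.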
data/Rakefile
CHANGED
@@ -49,8 +49,10 @@ task :setup_test_db do
     puts "Warning: Could not install pgvector extension: #{e.message}"
   end
 
-  #
-
+  # Reset and run migrations (drops all tables and re-runs migrations)
+  # This ensures clean state for tests regardless of previous migration versions
+  Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: false, logger: nil))
+  Ragdoll::Core::Database.reset!
   puts "Test database setup complete"
 end
 
data/app/models/ragdoll/document.rb
CHANGED
@@ -142,10 +142,12 @@ module Ragdoll
 def keywords_array
   return [] unless keywords.present?
 
+  # After migration, keywords is now a PostgreSQL array
   case keywords
   when Array
-    keywords
+    keywords.map(&:to_s).map(&:strip).reject(&:empty?)
   when String
+    # Fallback for any remaining string data (shouldn't happen after migration)
     keywords.split(",").map(&:strip).reject(&:empty?)
   else
     []
@@ -153,17 +155,23 @@ module Ragdoll
 end
 
 def add_keyword(keyword)
+  return if keyword.blank?
+
   current_keywords = keywords_array
-
+  normalized_keyword = keyword.to_s.strip.downcase
+  return if current_keywords.map(&:downcase).include?(normalized_keyword)
 
-  current_keywords <<
-  self.keywords = current_keywords
+  current_keywords << normalized_keyword
+  self.keywords = current_keywords
 end
 
 def remove_keyword(keyword)
+  return if keyword.blank?
+
   current_keywords = keywords_array
-
-
+  normalized_keyword = keyword.to_s.strip.downcase
+  current_keywords.reject! { |k| k.downcase == normalized_keyword }
+  self.keywords = current_keywords
 end
 
 # Metadata accessors for common fields
@@ -249,15 +257,110 @@ module Ragdoll
   puts "Metadata generation failed: #{e.message}"
 end
 
-# PostgreSQL full-text search on metadata fields
+# PostgreSQL full-text search on metadata fields with per-word match-ratio [0.0..1.0]
 def self.search_content(query, **options)
   return none if query.blank?
 
-  #
-
-
-
-
+  # Split into unique alphanumeric words
+  words = query.downcase.scan(/[[:alnum:]]+/).uniq
+  return none if words.empty?
+
+  limit = options[:limit] || 20
+  threshold = options[:threshold] || 0.0
+
+  # Use precomputed tsvector column if it exists, otherwise build on the fly
+  if column_names.include?("search_vector")
+    tsvector = "#{table_name}.search_vector"
+  else
+    # Build tsvector from title and metadata fields
+    text_expr =
+      "COALESCE(title, '') || ' ' || " \
+      "COALESCE(metadata->>'summary', '') || ' ' || " \
+      "COALESCE(metadata->>'keywords', '') || ' ' || " \
+      "COALESCE(metadata->>'description', '')"
+    tsvector = "to_tsvector('english', #{text_expr})"
+  end
+
+  # Prepare sanitized tsquery terms
+  tsqueries = words.map do |word|
+    sanitize_sql_array(["plainto_tsquery('english', ?)", word])
+  end
+
+  # Combine per-word tsqueries with OR so PostgreSQL can use the GIN index
+  combined_tsquery = tsqueries.join(' || ')
+
+  # Score each match (1 if present, 0 if not), sum them
+  score_terms = tsqueries.map { |tsq| "(#{tsvector} @@ #{tsq})::int" }
+  score_sum = score_terms.join(' + ')
+
+  # Similarity ratio: fraction of query words present
+  similarity_sql = "(#{score_sum})::float / #{words.size}"
+
+  # Start with basic search query
+  query = select("#{table_name}.*, #{similarity_sql} AS fulltext_similarity")
+
+  # Build where conditions
+  conditions = ["#{tsvector} @@ (#{combined_tsquery})"]
+
+  # Add status filter (default to processed unless overridden)
+  status = options[:status] || 'processed'
+  conditions << "#{table_name}.status = '#{status}'"
+
+  # Add document type filter if specified
+  if options[:document_type].present?
+    conditions << sanitize_sql_array(["#{table_name}.document_type = ?", options[:document_type]])
+  end
+
+  # Add threshold filtering if specified
+  if threshold > 0.0
+    conditions << "#{similarity_sql} >= #{threshold}"
+  end
+
+  # Combine all conditions
+  where_clause = conditions.join(' AND ')
+
+  # Materialize to array to avoid COUNT/SELECT alias conflicts in some AR versions
+  query.where(where_clause)
+       .order(Arel.sql("fulltext_similarity DESC, updated_at DESC"))
+       .limit(limit)
+       .to_a
+end
+
+# Search documents by keywords using PostgreSQL array operations
+# Returns documents that match keywords with scoring based on match count
+# Inspired by find_matching_entries.rb algorithm but optimized for PostgreSQL arrays
+def self.search_by_keywords(keywords_array, **options)
+  return where("1 = 0") if keywords_array.blank?
+
+  # Normalize keywords to lowercase strings array
+  normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
+  return where("1 = 0") if normalized_keywords.empty?
+
+  limit = options[:limit] || 20
+
+  # Use PostgreSQL array overlap operator with proper array literal
+  quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+  array_literal = "'{#{quoted_keywords}}'::text[]"
+  where("keywords && #{array_literal}")
+    .order("created_at DESC")
+    .limit(limit)
+end
+
+# Find documents that contain ALL specified keywords (exact array matching)
+def self.search_by_keywords_all(keywords_array, **options)
+  return where("1 = 0") if keywords_array.blank?
+
+  normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
+  return where("1 = 0") if normalized_keywords.empty?
+
+  limit = options[:limit] || 20
+
+  # Use PostgreSQL array contains operator with proper array literal
+  quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+  array_literal = "'{#{quoted_keywords}}'::text[]"
+  where("keywords @> #{array_literal}")
+    .order("created_at DESC")
+    .limit(limit)
 end
 
 # Faceted search by metadata fields
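The per-word match ratio that `search_content` computes in SQL (one point per query word found, divided by the number of unique query words) can be approximated in plain Ruby. The sketch below is an analogue under simplifying assumptions, not the gem's code: `query_words` and `match_ratio` are hypothetical names, and matching is exact on whole words, whereas PostgreSQL's `'english'` text search configuration also stems words and drops stop words.

```ruby
# Plain-Ruby analogue of the match-ratio score; illustrative only.
# Splits text into unique lowercase alphanumeric words, as the diff does for the query.
def query_words(text)
  text.to_s.downcase.scan(/[[:alnum:]]+/).uniq
end

# Fraction of unique query words that also occur in the text, in 0.0..1.0.
def match_ratio(query, text)
  words = query_words(query)
  return 0.0 if words.empty?

  text_words = query_words(text)
  matched = words.count { |word| text_words.include?(word) }
  matched.to_f / words.size
end

ratio = match_ratio("Ruby vector search", "Vector search in PostgreSQL")
```

Here two of the three query words occur in the text, so the ratio is two thirds. A `threshold` option like the one in the diff would then filter out rows whose ratio falls below the chosen value.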
data/app/models/ragdoll/embedding.rb
CHANGED
@@ -64,10 +64,26 @@ module Ragdoll
 scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]
 
 # Document-level filters require joining through embeddable (STI Content) to documents
-
+needs_document_join = filters[:document_type] || filters[:keywords]
+
+if needs_document_join
   scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
                .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
-
+end
+
+if filters[:document_type]
+  scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
+end
+
+# Keywords filtering using PostgreSQL array operations
+if filters[:keywords] && filters[:keywords].any?
+  normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
+  if normalized_keywords.any?
+    # Use PostgreSQL array overlap operator with proper array literal
+    quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+    array_literal = "'{#{quoted_keywords}}'::text[]"
+    scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
+  end
 end
 
 # Use pgvector for similarity search
@@ -83,10 +99,26 @@ module Ragdoll
 scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]
 
 # Document-level filters require joining through embeddable (STI Content) to documents
-
+needs_document_join = filters[:document_type] || filters[:keywords]
+
+if needs_document_join
   scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
                .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
-
+end
+
+if filters[:document_type]
+  scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
+end
+
+# Keywords filtering using PostgreSQL array operations
+if filters[:keywords] && filters[:keywords].any?
+  normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
+  if normalized_keywords.any?
+    # Use PostgreSQL array overlap operator with proper array literal
+    quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
+    array_literal = "'{#{quoted_keywords}}'::text[]"
+    scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
+  end
 end
 
 search_with_pgvector_stats(query_embedding, scope, limit, threshold)
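The keyword filters in this diff build the array literal by wrapping each keyword in double quotes and interpolating the result into the SQL string, without escaping quotes or backslashes inside a keyword. One possible hardening, sketched below under assumptions, is to escape each element and pass the literal as a bind value. `pg_text_array_literal` is a hypothetical helper, not part of Ragdoll or ActiveRecord, and the sketch has not been run against a live database.

```ruby
# Hypothetical helper: builds the text of a PostgreSQL text[] literal such as {"ai","ml"}.
# Inside a double-quoted array element, backslashes and double quotes must be backslash-escaped.
def pg_text_array_literal(values)
  elements = values.map do |value|
    escaped = value.to_s.gsub("\\") { "\\\\" }.gsub('"') { '\\"' }
    %("#{escaped}")
  end
  "{#{elements.join(',')}}"
end

literal = pg_text_array_literal(%w[ai ml])

# Assumed usage with a bind parameter, so the adapter quotes the literal itself:
#   where("keywords && ?::text[]", pg_text_array_literal(normalized_keywords))
```

An alternative that avoids hand-built literals is `where("keywords && ARRAY[?]::text[]", normalized_keywords)`, which relies on ActiveRecord expanding a non-empty array bound to `?` into a quoted, comma-separated list.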
data/app/services/ragdoll/search_engine.rb
CHANGED
@@ -33,6 +33,10 @@ module Ragdoll
 threshold = options[:threshold] || search_config[:similarity_threshold]
 filters = options[:filters] || {}
 
+# Extract keywords option and normalize
+keywords = options[:keywords] || []
+keywords = Array(keywords).map(&:to_s).reject(&:empty?)
+
 # Extract tracking options
 session_id = options[:session_id]
 user_id = options[:user_id]
@@ -49,6 +53,11 @@ module Ragdoll
   return [] if query_embedding.nil?
 end
 
+# Add keywords to filters if provided
+if keywords.any?
+  filters[:keywords] = keywords
+end
+
 # Search using ActiveRecord models with statistics
 # Try enhanced search first, fall back to original if it fails
 begin
@@ -81,13 +90,15 @@ module Ragdoll
   }
 end
 
+search_type = keywords.any? ? "semantic_with_keywords" : "semantic"
+
 Ragdoll::Search.record_search(
   query: query_string,
   query_embedding: query_embedding,
   results: search_results,
-  search_type:
+  search_type: search_type,
   filters: filters,
-  options: { limit: limit, threshold: threshold },
+  options: { limit: limit, threshold: threshold, keywords: keywords },
   execution_time_ms: execution_time,
   session_id: session_id,
   user_id: user_id
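data/db/migrate/{001_enable_postgresql_extensions.rb → 20250815234901_enable_postgresql_extensions.rb}
RENAMED
@@ -1,8 +1,5 @@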
 class EnablePostgresqlExtensions < ActiveRecord::Migration[7.0]
   def up
-    # This migration is now handled by the db:create rake task
-    # Just ensure required extensions are available
-
     # Vector similarity search (required for embeddings)
     execute "CREATE EXTENSION IF NOT EXISTS vector"
 
@@ -15,9 +12,11 @@ class EnablePostgresqlExtensions < ActiveRecord::Migration[7.0]
   end
 
   def down
-
-
-
-
+    # Extensions are typically not dropped as they might be used by other databases
+    # If you really need to drop them, uncomment the following:
+    # execute "DROP EXTENSION IF EXISTS vector"
+    # execute "DROP EXTENSION IF EXISTS unaccent"
+    # execute "DROP EXTENSION IF EXISTS pg_trgm"
+    # execute "DROP EXTENSION IF EXISTS \"uuid-ossp\""
   end
-end
+end
data/db/migrate/20250815234902_create_ragdoll_documents.rb
ADDED
@@ -0,0 +1,117 @@
+class CreateRagdollDocuments < ActiveRecord::Migration[7.0]
+  # For concurrent index creation (PostgreSQL)
+  disable_ddl_transaction!
+
+  def up
+    create_table :ragdoll_documents,
+      comment: "Core documents table with LLM-generated structured metadata" do |t|
+
+      t.string :location, null: false,
+        comment: "Source location of document (file path, URL, or identifier)"
+
+      t.string :title, null: false,
+        comment: "Human-readable document title for display and search"
+
+      t.text :summary, null: false, default: "",
+        comment: "LLM-generated summary of document content"
+
+      t.string :document_type, null: false, default: "text",
+        comment: "Document format type"
+
+      t.string :status, null: false, default: "pending",
+        comment: "Document processing status"
+
+      t.json :metadata, default: {},
+        comment: "LLM-generated structured metadata about the file"
+
+      t.timestamp :file_modified_at, null: false, default: -> { "CURRENT_TIMESTAMP" },
+        comment: "Timestamp when the source file was last modified"
+
+      t.timestamps null: false,
+        comment: "Standard creation and update timestamps"
+
+      # Add tsvector column for full-text search
+      t.tsvector :search_vector
+
+      # Add keywords as array column
+      t.text :keywords, array: true, default: []
+    end
+
+    ###########
+    # Indexes #
+    ###########
+
+    add_index :ragdoll_documents, :location, unique: true,
+      comment: "Unique index for document source lookup"
+
+    add_index :ragdoll_documents, :title,
+      comment: "Index for title-based search"
+
+    add_index :ragdoll_documents, :document_type,
+      comment: "Index for filtering by document type"
+
+    add_index :ragdoll_documents, :status,
+      comment: "Index for filtering by processing status"
+
+    add_index :ragdoll_documents, :created_at,
+      comment: "Index for chronological sorting"
+
+    add_index :ragdoll_documents, [:document_type, :status],
+      comment: "Composite index for type+status filtering"
+
+    # Full-text search index
+    execute <<-SQL
+      CREATE INDEX CONCURRENTLY index_ragdoll_documents_on_fulltext_search
+      ON ragdoll_documents
+      USING gin(to_tsvector('english',
+        COALESCE(title, '') || ' ' ||
+        COALESCE(metadata->>'summary', '') || ' ' ||
+        COALESCE(metadata->>'keywords', '') || ' ' ||
+        COALESCE(metadata->>'description', '')
+      ))
+    SQL
+
+    add_index :ragdoll_documents, "(metadata->>'document_type')",
+      name: "index_ragdoll_documents_on_metadata_type",
+      comment: "Index for filtering by document type"
+
+    add_index :ragdoll_documents, "(metadata->>'classification')",
+      name: "index_ragdoll_documents_on_metadata_classification",
+      comment: "Index for filtering by document classification"
+
+    # GIN index on search_vector
+    add_index :ragdoll_documents, :search_vector, using: :gin, algorithm: :concurrently
+
+    # GIN index on keywords array
+    add_index :ragdoll_documents, :keywords, using: :gin,
+      name: 'index_ragdoll_documents_on_keywords_gin'
+
+    # Trigger to keep search_vector up to date on INSERT/UPDATE
+    execute <<-SQL
+      CREATE FUNCTION ragdoll_documents_vector_update() RETURNS trigger AS $$
+      BEGIN
+        NEW.search_vector := to_tsvector('english',
+          COALESCE(NEW.title, '') || ' ' ||
+          COALESCE(NEW.metadata->>'summary', '') || ' ' ||
+          COALESCE(NEW.metadata->>'keywords', '') || ' ' ||
+          COALESCE(NEW.metadata->>'description', '')
+        );
+        RETURN NEW;
+      END
+      $$ LANGUAGE plpgsql;
+
+      CREATE TRIGGER ragdoll_search_vector_update
+      BEFORE INSERT OR UPDATE ON ragdoll_documents
+      FOR EACH ROW EXECUTE FUNCTION ragdoll_documents_vector_update();
+    SQL
+  end
+
+  def down
+    execute <<-SQL
+      DROP TRIGGER IF EXISTS ragdoll_search_vector_update ON ragdoll_documents;
+      DROP FUNCTION IF EXISTS ragdoll_documents_vector_update();
+    SQL
+
+    drop_table :ragdoll_documents
+  end
+end
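The trigger in this migration rebuilds `search_vector` from four fields on every insert or update. As a rough illustration of which text ends up indexed, the concatenation step can be mirrored in plain Ruby. `search_vector_source` is a hypothetical name, the sketch assumes the metadata values are strings stored under string keys, and it covers only the `COALESCE(...) || ' ' || ...` part, not `to_tsvector` itself.

```ruby
# Plain-Ruby analogue of the text the trigger passes to to_tsvector; illustrative only.
# nil.to_s is "" in Ruby, which plays the role of COALESCE(value, '') in the SQL.
def search_vector_source(title:, metadata: {})
  fields = [title, metadata["summary"], metadata["keywords"], metadata["description"]]
  fields.map(&:to_s).join(" ")
end

source_text = search_vector_source(
  title: "Intro",
  metadata: { "summary" => "A guide" }
)
```

Missing fields contribute an empty string, so the separators remain and the result can carry trailing spaces; `to_tsvector` ignores that whitespace when it tokenizes.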
data/db/migrate/{005_create_ragdoll_embeddings.rb → 20250815234903_create_ragdoll_embeddings.rb}
RENAMED
@@ -3,7 +3,7 @@ class CreateRagdollEmbeddings < ActiveRecord::Migration[7.0]
 create_table :ragdoll_embeddings,
   comment: "Polymorphic vector embeddings storage for semantic similarity search" do |t|
 
-
+  t.references :embeddable, polymorphic: true, null: false,
     comment: "Polymorphic reference to embeddable content"
 
   t.text :content, null: false, default: "",
@@ -26,16 +26,19 @@ class CreateRagdollEmbeddings < ActiveRecord::Migration[7.0]
 
       t.timestamps null: false,
         comment: "Standard creation and update timestamps"
+    end
 
-
-
-
+    ###########
+    # Indexes #
+    ###########
 
-
-
+    add_index :ragdoll_embeddings, [:embeddable_type, :embeddable_id],
+      comment: "Index for finding embeddings by embeddable content"
 
-
-
-
+    add_index :ragdoll_embeddings, :embedding_vector,
+      using: :ivfflat,
+      opclass: :vector_cosine_ops,
+      name: "index_ragdoll_embeddings_on_embedding_vector_cosine",
+      comment: "IVFFlat index for fast cosine similarity search"
   end
-end
+end
data/db/migrate/{006_create_ragdoll_contents.rb → 20250815234904_create_ragdoll_contents.rb}
RENAMED
@@ -29,19 +29,22 @@ class CreateRagdollContents < ActiveRecord::Migration[7.0]
 
       t.timestamps null: false,
         comment: "Standard creation and update timestamps"
+    end
 
-
-
-
+    ###########
+    # Indexes #
+    ###########
 
-
-
+    add_index :ragdoll_contents, :embedding_model,
+      comment: "Index for filtering by embedding model"
 
-
-
+    add_index :ragdoll_contents, :type,
+      comment: "Index for filtering by content type"
 
-
-
-
+    execute <<-SQL
+      CREATE INDEX index_ragdoll_contents_on_fulltext_search
+      ON ragdoll_contents
+      USING gin(to_tsvector('english', COALESCE(content, '')))
+    SQL
   end
-end
+end
data/db/migrate/{007_create_ragdoll_searches.rb → 20250815234905_create_ragdoll_searches.rb}
RENAMED
@@ -41,33 +41,37 @@ class CreateRagdollSearches < ActiveRecord::Migration[7.0]
 
       t.timestamps null: false,
         comment: "Standard creation and update timestamps"
+    end
 
-
-
-
+    ###########
+    # Indexes #
+    ###########
 
-
-
-
+    add_index :ragdoll_searches, :query_embedding,
+      using: :ivfflat,
+      opclass: :vector_cosine_ops,
+      name: "index_ragdoll_searches_on_query_embedding_cosine",
+      comment: "IVFFlat index for finding similar search queries"
 
-
-
+    add_index :ragdoll_searches, :search_type,
+      comment: "Index for filtering by search type"
 
-
-
+    add_index :ragdoll_searches, :session_id,
+      comment: "Index for grouping searches by session"
 
-
-
+    add_index :ragdoll_searches, :user_id,
+      comment: "Index for filtering searches by user"
 
-
-
+    add_index :ragdoll_searches, :created_at,
+      comment: "Index for chronological search history"
 
-
-
+    add_index :ragdoll_searches, :results_count,
+      comment: "Index for analyzing search effectiveness"
 
-
-
-
-
+    execute <<-SQL
+      CREATE INDEX index_ragdoll_searches_on_fulltext_query
+      ON ragdoll_searches
+      USING gin(to_tsvector('english', query))
+    SQL
   end
 end
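data/db/migrate/{008_create_ragdoll_search_results.rb → 20250815234906_create_ragdoll_search_results.rb}
RENAMED
@@ -24,26 +24,26 @@ class CreateRagdollSearchResults < ActiveRecord::Migration[7.0]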
 
       t.timestamps null: false,
         comment: "Standard creation and update timestamps"
+    end
 
-
-
-
+    ###########
+    # Indexes #
+    ###########
 
-
-
-
+    add_index :ragdoll_search_results, [:search_id, :result_rank],
+      name: "idx_search_results_search_rank",
+      comment: "Index for retrieving results in ranked order"
 
-
-
-
+    add_index :ragdoll_search_results, [:embedding_id, :similarity_score],
+      name: "idx_search_results_embedding_score",
+      comment: "Index for analyzing embedding performance"
 
-
-
-
+    add_index :ragdoll_search_results, :similarity_score,
+      name: "idx_search_results_similarity",
+      comment: "Index for similarity score analysis"
 
-
-
-
-    end
+    add_index :ragdoll_search_results, [:clicked, :clicked_at],
+      name: "idx_search_results_clicks",
+      comment: "Index for click-through analysis"
   end
 end
data/lib/ragdoll/core/database.rb
CHANGED
@@ -90,10 +90,10 @@ module Ragdoll
         # Drop all tables in correct order (respecting foreign key constraints)
         # Order: dependent tables first, then parent tables
         tables_to_drop = %w[
+          ragdoll_search_results
+          ragdoll_searches
           ragdoll_embeddings
-
-          ragdoll_image_contents
-          ragdoll_audio_contents
+          ragdoll_contents
           ragdoll_documents
           schema_migrations
         ]
@@ -109,6 +109,11 @@ module Ragdoll
           end
         end
 
+        # Also drop any functions/triggers that might exist
+        if ActiveRecord::Base.connection.adapter_name.downcase.include?("postgresql")
+          ActiveRecord::Base.connection.execute("DROP FUNCTION IF EXISTS ragdoll_documents_vector_update() CASCADE")
+        end
+
         migrate!
       end
 
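The comment in the hunk above says tables are dropped with dependent tables first, then parent tables. As a rough illustration of how such an order can be derived rather than hand-maintained, here is a minimal sketch using Ruby's standard `TSort` module. The `DropOrder` class and the foreign-key map are hypothetical and assumed for illustration only; they are not read from the gem's schema.

```ruby
require "tsort"

# Hypothetical helper (not part of ragdoll): derives a safe DROP order
# from a map of table => tables it references via foreign keys.
class DropOrder
  include TSort

  def initialize(references)
    @references = references
  end

  # TSort hook: enumerate every table in the graph.
  def tsort_each_node(&block)
    @references.each_key(&block)
  end

  # TSort hook: enumerate the tables a given table references.
  def tsort_each_child(table, &block)
    @references.fetch(table, []).each(&block)
  end

  # tsort lists referenced (parent) tables first; reversing it puts
  # dependent tables ahead of the tables they point at.
  def drop_order
    tsort.reverse
  end
end

# Assumed relationships, for illustration only.
references = {
  "ragdoll_search_results" => %w[ragdoll_searches ragdoll_embeddings],
  "ragdoll_searches"       => [],
  "ragdoll_embeddings"     => %w[ragdoll_contents],
  "ragdoll_contents"       => %w[ragdoll_documents],
  "ragdoll_documents"      => []
}

order = DropOrder.new(references).drop_order
puts order
```

Any order produced this way drops a table before every table it references, which is the property the hard-coded `tables_to_drop` list is meant to guarantee.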
data/lib/ragdoll/core/version.rb
CHANGED
data/lib/tasks/db.rake
CHANGED
@@ -25,22 +25,17 @@ namespace :db do
     )
 
     # Run individual SQL commands to avoid transaction block issues
-
-      ActiveRecord::Base.connection.execute("DROP DATABASE IF EXISTS ragdoll_development")
-    rescue => e
-      puts "Note: #{e.message}" if e.message.include?("does not exist")
-    end
-
-    begin
-      ActiveRecord::Base.connection.execute("DROP ROLE IF EXISTS ragdoll")
-    rescue => e
-      puts "Note: #{e.message}" if e.message.include?("does not exist")
-    end
+    # Note: Removed the DROP DATABASE/ROLE here since that should be done via db:drop task
 
     begin
       ActiveRecord::Base.connection.execute("CREATE ROLE ragdoll WITH LOGIN CREATEDB")
+      puts "Role 'ragdoll' created successfully"
     rescue => e
-
+      if e.message.include?("already exists")
+        puts "Note: Role 'ragdoll' already exists, continuing..."
+      else
+        raise e
+      end
     end
 
     begin
@@ -50,8 +45,16 @@ namespace :db do
         ENCODING = 'UTF8'
         CONNECTION LIMIT = -1
       SQL
+      puts "Database 'ragdoll_development' created successfully"
     rescue => e
-
+      if e.message.include?("already exists")
+        puts "ERROR: Database 'ragdoll_development' already exists!"
+        puts "Please run 'rake db:drop' first to remove the existing database, then run 'rake db:create' again."
+        puts "Or use 'rake db:reset' to drop, create, and migrate in one step."
+        exit 1
+      else
+        raise e
+      end
     end
 
     ActiveRecord::Base.connection.execute("GRANT ALL PRIVILEGES ON DATABASE ragdoll_development TO ragdoll")
@@ -97,8 +100,53 @@ namespace :db do
     puts "Dropping database with config: #{config.database.inspect}"
 
     case config.database[:adapter]
-    when "postgresql"
-      puts "
+    when "postgresql"
+      puts "PostgreSQL database drop - running as superuser to drop database and role..."
+
+      # Connect as superuser to drop database and role
+      ActiveRecord::Base.establish_connection(
+        adapter: 'postgresql',
+        database: 'postgres', # Connect to postgres database initially
+        username: ENV.fetch('POSTGRES_SUPERUSER', 'postgres'),
+        password: ENV['POSTGRES_SUPERUSER_PASSWORD'],
+        host: config.database[:host] || 'localhost',
+        port: config.database[:port] || 5432
+      )
+
+      # Drop the database if it exists
+      begin
+        ActiveRecord::Base.connection.execute("DROP DATABASE IF EXISTS ragdoll_development")
+        puts "Database 'ragdoll_development' dropped successfully"
+      rescue => e
+        puts "Error dropping database: #{e.message}"
+      end
+
+      # Optionally drop the role (commented out by default to preserve user)
+      # begin
+      # ActiveRecord::Base.connection.execute("DROP ROLE IF EXISTS ragdoll")
+      # puts "Role 'ragdoll' dropped successfully"
+      # rescue => e
+      # puts "Error dropping role: #{e.message}"
+      # end
+
+    when "mysql2"
+      puts "MySQL database drop - connecting to drop database..."
+
+      # Connect without specifying database
+      ActiveRecord::Base.establish_connection(
+        adapter: 'mysql2',
+        username: config.database[:username],
+        password: config.database[:password],
+        host: config.database[:host] || 'localhost',
+        port: config.database[:port] || 3306
+      )
+
+      begin
+        ActiveRecord::Base.connection.execute("DROP DATABASE IF EXISTS #{config.database[:database]}")
+        puts "Database '#{config.database[:database]}' dropped successfully"
+      rescue => e
+        puts "Error dropping database: #{e.message}"
+      end
     end
 
     puts "Database drop completed"
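The `db:create` changes in this file replace a silent `rescue` with a pattern that tolerates an "already exists" error and re-raises anything else. A minimal, database-free sketch of that pattern follows; the helper name `create_unless_exists` is hypothetical and does not appear in the gem.

```ruby
# Hypothetical helper (not part of ragdoll): runs the block and treats
# an "already exists" failure as benign, re-raising every other error.
def create_unless_exists(label)
  yield
  "#{label} created"
rescue StandardError => e
  # A bare `raise` inside a rescue clause re-raises the current exception.
  raise unless e.message.include?("already exists")
  "#{label} already exists, continuing"
end

created  = create_unless_exists("role") { :ok }
existing = create_unless_exists("role") { raise 'role "ragdoll" already exists' }

puts created
puts existing
```

Matching on the error message is brittle, since the wording can vary by server version and locale; where the adapter exposes a specific exception class, rescuing that class would likely be more robust.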
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: ragdoll
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.10
 platform: ruby
 authors:
 - Dewayne VanHoozer
@@ -406,12 +406,12 @@ files:
 - app/services/ragdoll/search_engine.rb
 - app/services/ragdoll/text_chunker.rb
 - app/services/ragdoll/text_generation_service.rb
-- db/migrate/
-- db/migrate/
-- db/migrate/
-- db/migrate/
-- db/migrate/
-- db/migrate/
+- db/migrate/20250815234901_enable_postgresql_extensions.rb
+- db/migrate/20250815234902_create_ragdoll_documents.rb
+- db/migrate/20250815234903_create_ragdoll_embeddings.rb
+- db/migrate/20250815234904_create_ragdoll_contents.rb
+- db/migrate/20250815234905_create_ragdoll_searches.rb
+- db/migrate/20250815234906_create_ragdoll_search_results.rb
 - lib/ragdoll-core.rb
 - lib/ragdoll.rb
 - lib/ragdoll/core.rb
data/db/migrate/004_create_ragdoll_documents.rb
DELETED
@@ -1,70 +0,0 @@
-class CreateRagdollDocuments < ActiveRecord::Migration[7.0]
-  def change
-    create_table :ragdoll_documents,
-      comment: "Core documents table with LLM-generated structured metadata" do |t|
-
-      t.string :location, null: false,
-        comment: "Source location of document (file path, URL, or identifier)"
-
-      t.string :title, null: false,
-        comment: "Human-readable document title for display and search"
-
-      t.text :summary, null: false, default: "",
-        comment: "LLM-generated summary of document content"
-
-      t.text :keywords , null: false, default: "",
-        comment: "LLM-generated comma-separated keywords of document"
-
-      t.string :document_type, null: false, default: "text",
-        comment: "Document format type"
-
-      t.string :status, null: false, default: "pending",
-        comment: "Document processing status"
-
-      t.json :metadata, default: {},
-        comment: "LLM-generated structured metadata about the file"
-
-      t.timestamp :file_modified_at, null: false, default: -> { "CURRENT_TIMESTAMP" },
-        comment: "Timestamp when the source file was last modified"
-
-      t.timestamps null: false,
-        comment: "Standard creation and update timestamps"
-
-      ###########
-      # Indexes #
-      ###########
-
-      t.index :location, unique: true,
-        comment: "Unique index for document source lookup"
-
-      t.index :title,
-        comment: "Index for title-based search"
-
-      t.index :document_type,
-        comment: "Index for filtering by document type"
-
-      t.index :status,
-        comment: "Index for filtering by processing status"
-
-      t.index :created_at,
-        comment: "Index for chronological sorting"
-
-      t.index %i[document_type status],
-        comment: "Composite index for type+status filtering"
-
-      t.index "to_tsvector('english', COALESCE(title, '') ||
-        ' ' ||
-        COALESCE(metadata->>'summary', '') ||
-        ' ' || COALESCE(metadata->>'keywords', '') ||
-        ' ' || COALESCE(metadata->>'description', ''))",
-        using: :gin, name: "index_ragdoll_documents_on_fulltext_search",
-        comment: "Full-text search across title and metadata fields"
-
-      t.index "(metadata->>'document_type')", name: "index_ragdoll_documents_on_metadata_type",
-        comment: "Index for filtering by document type"
-
-      t.index "(metadata->>'classification')", name: "index_ragdoll_documents_on_metadata_classification",
-        comment: "Index for filtering by document classification"
-    end
-  end
-end