ragdoll 0.1.9 → 0.1.10

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: cde84c4b5bbf1e8296bdd762ee78acb2f69663e493ce23b0941ada9d1201bdcd
- data.tar.gz: f8bc456d3c536a295920bc1c806974b2b39f08977a8761604c7a192b83e756d2
+ metadata.gz: 4f7b2c95ede1523e9e01af70394217387d876da6317fed651df3e27cf337cfe9
+ data.tar.gz: a82ae7d541fd06876acb3acaf8f02639234f8b118274621851678a2799c5f559
  SHA512:
- metadata.gz: c1ce0e46be45fe8004930ec231a83a59f31039f4908be2a0e0ba67043237f1ea03bc00991820f6928a6ef5baa6ca910547876f21ddad5a7ead2d6384192e7708
- data.tar.gz: e3f50e1205b4ba755c6a978acb06240b7b1fa729f4fa9bef33f956a9b245ad3d3323612f300902051237ffa71a763fc6db8d8e0fedc4f2761c46a977b42d6958
+ metadata.gz: ba14828a6e743677c84072b9f1bb27743e429531ebdd9fbd3d8553add7bbdad070d709cd617dc620fef4ddc6846085ca79d3bb6d32bae8465c6b3b10acc0692f
+ data.tar.gz: de630ebf15168b562ef686ec6cd9f1cfe532b5bbf495e33a74085b567cf53ce7bb87e7c5c543756c47bd68c98290221b879a1b4d8e5888aac4916d1c1554fe99
data/CHANGELOG.md CHANGED
@@ -6,7 +6,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
6
6
 
7
7
  ## [Unreleased]
8
8
 
9
- *Note: These features will be included in the next release (likely v0.1.9) featuring comprehensive search tracking and analytics capabilities.*
9
+ ## [0.1.10] - 2025-01-15
10
+
11
+ ### Changed
12
+ - Continued improvements to search performance and accuracy
13
+
14
+ ### Added
15
+ - **Hybrid Search**: Complete implementation combining semantic and full-text search capabilities
16
+ - Configurable weights for semantic vs text search (default: 70% semantic, 30% text)
17
+ - Deduplication of results by document ID
18
+ - Combined scoring system for unified result ranking
19
+ - **Full-text Search**: PostgreSQL full-text search with tsvector indexing
20
+ - Per-word match ratio scoring (0.0 to 1.0)
21
+ - GIN index for high-performance text search
22
+ - Search across title, summary, keywords, and description fields
23
+ - **Enhanced Search API**: Complete search type delegation at top-level Ragdoll namespace
24
+ - `Ragdoll.hybrid_search` method for combined semantic and text search
25
+ - `Ragdoll::Document.search_content` for full-text search capabilities
26
+ - Consistent parameter handling across all search methods
27
+
28
+ ### Changed
29
+ - **Search Architecture**: Unified search interface supporting semantic, fulltext, and hybrid modes
30
+ - **Database Schema**: Added search_vector column with GIN indexing for full-text search performance
31
+
32
+ ### Technical Details
33
+ - Full-text search uses PostgreSQL's built-in tsvector capabilities
34
+ - Hybrid search combines cosine similarity (semantic) with text match ratios
35
+ - Results are ranked by weighted combined scores
36
+ - All search methods maintain backward compatibility
37
+
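As a rough illustration of the weighted ranking described in the 0.1.10 notes above, the following sketch merges two result lists in plain Ruby. It is not Ragdoll's implementation: the `combine_hybrid` helper, the hash keys (`:document_id`, `:similarity`, `:fulltext_similarity`), and the rule of keeping the best score per document are assumptions made for illustration. Only the 70/30 default weights and the deduplication by document ID come from the changelog.

```ruby
# Hypothetical helper: merges semantic and full-text results into one ranked list.
# Input rows are plain hashes; the keys used here are assumptions, not Ragdoll's API.
def combine_hybrid(semantic, fulltext, semantic_weight: 0.7, text_weight: 0.3)
  scores = Hash.new { |hash, id| hash[id] = { semantic: 0.0, text: 0.0 } }

  # Keep the best semantic score per document (deduplicates by document ID).
  semantic.each do |row|
    entry = scores[row[:document_id]]
    entry[:semantic] = [entry[:semantic], row[:similarity]].max
  end

  # Keep the best text match ratio per document.
  fulltext.each do |row|
    entry = scores[row[:document_id]]
    entry[:text] = [entry[:text], row[:fulltext_similarity]].max
  end

  # Weighted sum per document, highest combined score first.
  combined = scores.map do |id, s|
    { document_id: id,
      combined_score: (semantic_weight * s[:semantic]) + (text_weight * s[:text]) }
  end
  combined.sort_by { |row| -row[:combined_score] }
end

semantic_hits = [
  { document_id: 1, similarity: 0.9 },
  { document_id: 1, similarity: 0.6 }, # second chunk of the same document
  { document_id: 2, similarity: 0.5 }
]
fulltext_hits = [
  { document_id: 2, fulltext_similarity: 1.0 },
  { document_id: 3, fulltext_similarity: 0.5 }
]

ranked = combine_hybrid(semantic_hits, fulltext_hits)
ranked.each { |row| puts format("doc %d: %.2f", row[:document_id], row[:combined_score]) }
```

In this example a document that matches moderately in both modes can outrank one that matches strongly in only one, which is the practical effect of combining the two scores.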
38
+ ## [0.1.9] - 2025-01-10
10
39
 
11
40
  ### Added
12
41
  - **Initial CHANGELOG**: Added comprehensive CHANGELOG.md following Keep a Changelog format
@@ -40,7 +69,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
40
69
  - **Test Coverage**: Added coverage directory to .gitignore for cleaner repository state
41
70
 
42
71
  ### Technical Details
43
- - Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
72
+ - Commits: `9186067`, `cb952d3`, `e902a5f`, `632527b`
44
73
  - All changes maintain backward compatibility
45
74
  - No breaking API changes
46
75
 
@@ -141,6 +170,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
141
170
  - **Database Schema**: Multi-modal polymorphic architecture with PostgreSQL + pgvector
142
171
  - **Dual Metadata Architecture**: Separate LLM-generated content analysis and file properties
143
172
  - **Search Functionality**: Semantic search with cosine similarity and usage analytics
173
+ - **Hybrid Search**: Complete implementation combining semantic and full-text search with configurable weights
174
+ - **Full-text Search**: PostgreSQL tsvector-based text search with GIN indexing
144
175
  - **Search Tracking System**: Comprehensive analytics with query embeddings, click-through tracking, and performance monitoring
145
176
  - **Document Management**: Add, update, delete, list operations
146
177
  - **Background Processing**: ActiveJob integration for async embedding generation
@@ -150,7 +181,6 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
150
181
  ### 🚧 In Development
151
182
  - **Image Processing**: Framework exists but vision AI integration needs completion
152
183
  - **Audio Processing**: Framework exists but speech-to-text integration needs completion
153
- - **Hybrid Search**: Combining semantic and full-text search capabilities
154
184
 
155
185
  ### 📋 Planned Features
156
186
  - **Multi-modal Search**: Search across text, image, and audio content types
@@ -161,6 +191,18 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
161
191
 
162
192
  ## Migration Guide
163
193
 
194
+ ### From 0.1.9 to 0.1.10
195
+ - **New Search Methods**: `Ragdoll.hybrid_search` and `Ragdoll::Document.search_content` methods now available
196
+ - **Database Migration**: New search_vector column added to documents table with GIN index for full-text search
197
+ - **API Enhancement**: All search methods now support unified parameter interface
198
+ - **Backward Compatibility**: Existing `Ragdoll.search` method unchanged, continues to work as before
199
+ - **CLI Integration**: ragdoll-cli now requires ragdoll >= 0.1.10 for hybrid and full-text search support
200
+
201
+ ### From 0.1.8 to 0.1.9
202
+ - **CHANGELOG Addition**: Comprehensive changelog and feature tracking added
203
+ - **API Method Consistency**: `hybrid_search` method properly delegated to top-level namespace
204
+ - **No Breaking Changes**: All existing functionality remains compatible
205
+
164
206
  ### From 0.1.7 to 0.1.8
165
207
  - New search tracking tables will be automatically created via migrations
166
208
  - No breaking changes to existing API
@@ -198,4 +240,4 @@ This project is licensed under the MIT License - see the LICENSE file for detail
198
240
 
199
241
  ---
200
242
 
201
- *This changelog is automatically maintained and reflects the actual implementation status of features.*
243
+ *This changelog is automatically maintained and reflects the actual implementation status of features.*
data/README.md CHANGED
@@ -22,6 +22,8 @@
22
22
 
23
23
  Database-oriented multi-modal RAG (Retrieval-Augmented Generation) library built on ActiveRecord. Features PostgreSQL + pgvector for high-performance semantic search, polymorphic content architecture, and dual metadata design for sophisticated document analysis.
24
24
 
25
+ RAG does not have to be hard. Every week it's getting simpler. The frontier LLM providers are starting to incorporate RAG services. For example, OpenAI offers a vector search service. See: [https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/](https://0x1eef.github.io/posts/an-introduction-to-rag-with-llm.rb/)
26
+
25
27
  ## Overview
26
28
 
27
29
  Ragdoll is a database-first, multi-modal Retrieval-Augmented Generation (RAG) library for Ruby. It pairs PostgreSQL + pgvector with an ActiveRecord-driven schema to deliver fast, production-grade semantic search and clean data modeling. Today it ships with robust text processing; image and audio pipelines are scaffolded and actively being completed.
@@ -202,6 +204,53 @@ results = Ragdoll.hybrid_search(
202
204
  )
203
205
  ```
204
206
 
207
+ ### Keywords Search
208
+
209
+ Ragdoll supports powerful keywords-based search that can be used standalone or combined with semantic search. The keywords system uses PostgreSQL array operations for high performance and supports both partial matching (overlap) and exact matching (contains all).
210
+
211
+ ```ruby
212
+ # Keywords-only search (overlap - documents containing any of the keywords)
213
+ results = Ragdoll::Document.search_by_keywords(['machine', 'learning', 'ai'])
214
+
215
+ # Results are sorted by match count (documents with more keyword matches rank higher)
216
+ results.each do |doc|
217
+ puts "#{doc.title}: #{doc.keywords_match_count} matches"
218
+ end
219
+
220
+ # Exact keywords search (contains all - documents must have ALL keywords)
221
+ results = Ragdoll::Document.search_by_keywords_all(['ruby', 'programming'])
222
+
223
+ # Results are sorted by focus (fewer total keywords = more focused document)
224
+ results.each do |doc|
225
+ puts "#{doc.title}: #{doc.total_keywords_count} total keywords"
226
+ end
227
+
228
+ # Combined semantic + keywords search for best results
229
+ results = Ragdoll.search(
230
+ query: 'artificial intelligence applications',
231
+ keywords: ['ai', 'machine learning', 'neural networks'],
232
+ limit: 10
233
+ )
234
+
235
+ # Keywords search with options
236
+ results = Ragdoll::Document.search_by_keywords(
237
+ ['web', 'javascript', 'frontend'],
238
+ limit: 20
239
+ )
240
+
241
+ # Case-insensitive keyword matching (automatically normalized)
242
+ results = Ragdoll::Document.search_by_keywords(['Python', 'DATA-SCIENCE', 'ai'])
243
+ # Will match documents with keywords: ['python', 'data-science', 'ai']
244
+ ```
245
+
246
+ **Keywords Search Features:**
247
+ - **High Performance**: Uses PostgreSQL GIN indexes for fast array operations
248
+ - **Flexible Matching**: Supports both overlap (`&&`) and contains (`@>`) operators
249
+ - **Smart Scoring**: Results ordered by match count or document focus
250
+ - **Case Insensitive**: Automatic keyword normalization
251
+ - **Integration Ready**: Works seamlessly with semantic search
252
+ - **Inspired by `find_matching_entries.rb`**: Optimized for PostgreSQL arrays
253
+
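The two matching modes described above can be illustrated without a database. The sketch below mimics the semantics of PostgreSQL's `&&` (overlap) and `@>` (contains) operators on plain Ruby arrays and orders results the way the README comments describe. The sample data and the helper names `normalize`, `search_any`, and `search_all` are hypothetical. Note that the `search_by_keywords` code in this same diff orders by `created_at DESC`, so the match-count ordering here illustrates the README's description rather than the shipped query.

```ruby
# Lowercase and strip, mirroring the case-insensitive normalization.
def normalize(keywords)
  Array(keywords).map { |k| k.to_s.strip.downcase }.reject(&:empty?).uniq
end

# Overlap (like &&): any keyword matches; more matches rank higher.
def search_any(docs, keywords)
  wanted = normalize(keywords)
  docs.map { |doc| [doc, (doc[:keywords] & wanted).size] }
      .select { |_, count| count.positive? }
      .sort_by { |_, count| -count }
end

# Contains (like @>): all keywords must be present; fewer total keywords rank higher.
def search_all(docs, keywords)
  wanted = normalize(keywords)
  return [] if wanted.empty?

  docs.select { |doc| (wanted - doc[:keywords]).empty? }
      .sort_by { |doc| doc[:keywords].size }
end

# Hypothetical in-memory stand-in for documents with a keywords array.
docs = [
  { title: "Intro to ML", keywords: %w[machine learning ai] },
  { title: "Ruby Basics", keywords: %w[ruby programming] },
  { title: "Ruby for AI", keywords: %w[ruby programming ai machine] }
]

search_any(docs, %w[Machine LEARNING ai]).each do |doc, count|
  puts "#{doc[:title]}: #{count} matches"
end

search_all(docs, %w[ruby programming]).each do |doc|
  puts "#{doc[:title]}: #{doc[:keywords].size} total keywords"
end
```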
205
254
  ### Search Analytics and Tracking
206
255
 
207
256
  Ragdoll automatically tracks all searches to provide comprehensive analytics and improve search relevance over time:
data/Rakefile CHANGED
@@ -49,8 +49,10 @@ task :setup_test_db do
  puts "Warning: Could not install pgvector extension: #{e.message}"
  end
  
- # Run migrations
- Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: true, logger: nil))
+ # Reset and run migrations (drops all tables and re-runs migrations)
+ # This ensures clean state for tests regardless of previous migration versions
+ Ragdoll::Core::Database.setup(test_db_config.merge(auto_migrate: false, logger: nil))
+ Ragdoll::Core::Database.reset!
  puts "Test database setup complete"
  end
  
@@ -142,10 +142,12 @@ module Ragdoll
142
142
  def keywords_array
143
143
  return [] unless keywords.present?
144
144
 
145
+ # After migration, keywords is now a PostgreSQL array
145
146
  case keywords
146
147
  when Array
147
- keywords
148
+ keywords.map(&:to_s).map(&:strip).reject(&:empty?)
148
149
  when String
150
+ # Fallback for any remaining string data (shouldn't happen after migration)
149
151
  keywords.split(",").map(&:strip).reject(&:empty?)
150
152
  else
151
153
  []
@@ -153,17 +155,23 @@ module Ragdoll
153
155
  end
154
156
 
155
157
  def add_keyword(keyword)
158
+ return if keyword.blank?
159
+
156
160
  current_keywords = keywords_array
157
- return if current_keywords.include?(keyword.strip)
161
+ normalized_keyword = keyword.to_s.strip.downcase
162
+ return if current_keywords.map(&:downcase).include?(normalized_keyword)
158
163
 
159
- current_keywords << keyword.strip
160
- self.keywords = current_keywords.join(", ")
164
+ current_keywords << normalized_keyword
165
+ self.keywords = current_keywords
161
166
  end
162
167
 
163
168
  def remove_keyword(keyword)
169
+ return if keyword.blank?
170
+
164
171
  current_keywords = keywords_array
165
- current_keywords.delete(keyword.strip)
166
- self.keywords = current_keywords.join(", ")
172
+ normalized_keyword = keyword.to_s.strip.downcase
173
+ current_keywords.reject! { |k| k.downcase == normalized_keyword }
174
+ self.keywords = current_keywords
167
175
  end
168
176
 
169
177
  # Metadata accessors for common fields
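The case-insensitive add and remove behaviour introduced in this hunk can be tried outside ActiveRecord. The sketch below is a simplified stand-in, not the model code: the two functions are hypothetical, they return a new array instead of assigning to a `keywords` attribute, and an empty-string check replaces Rails' `blank?`.

```ruby
# Hypothetical plain-Ruby versions of the keyword helpers.
# Adds the keyword in lowercase unless an equal keyword (ignoring case) exists.
def add_keyword(keywords, keyword)
  normalized = keyword.to_s.strip.downcase
  return keywords if normalized.empty?
  return keywords if keywords.any? { |k| k.downcase == normalized }

  keywords + [normalized]
end

# Removes every keyword equal to the given one, ignoring case.
def remove_keyword(keywords, keyword)
  normalized = keyword.to_s.strip.downcase
  return keywords if normalized.empty?

  keywords.reject { |k| k.downcase == normalized }
end

keywords = ["Ruby", "search"]
keywords = add_keyword(keywords, " RUBY ")     # ignored: already present ignoring case
keywords = add_keyword(keywords, "PostgreSQL") # stored as "postgresql"
keywords = remove_keyword(keywords, "SEARCH")
p keywords
```

As in the diff, a newly added keyword is stored in lowercase, while keywords that were already present keep their original casing.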
@@ -249,15 +257,110 @@ module Ragdoll
249
257
  puts "Metadata generation failed: #{e.message}"
250
258
  end
251
259
 
252
- # PostgreSQL full-text search on metadata fields
260
+ # PostgreSQL full-text search on metadata fields with per-word match-ratio [0.0..1.0]
253
261
  def self.search_content(query, **options)
254
262
  return none if query.blank?
255
263
 
256
- # Use PostgreSQL's built-in full-text search across metadata fields
257
- where(
258
- "to_tsvector('english', COALESCE(title, '') || ' ' || COALESCE(metadata->>'summary', '') || ' ' || COALESCE(metadata->>'keywords', '') || ' ' || COALESCE(metadata->>'description', '')) @@ plainto_tsquery('english', ?)",
259
- query
260
- ).limit(options[:limit] || 20)
264
+ # Split into unique alphanumeric words
265
+ words = query.downcase.scan(/[[:alnum:]]+/).uniq
266
+ return none if words.empty?
267
+
268
+ limit = options[:limit] || 20
269
+ threshold = options[:threshold] || 0.0
270
+
271
+ # Use precomputed tsvector column if it exists, otherwise build on the fly
272
+ if column_names.include?("search_vector")
273
+ tsvector = "#{table_name}.search_vector"
274
+ else
275
+ # Build tsvector from title and metadata fields
276
+ text_expr =
277
+ "COALESCE(title, '') || ' ' || " \
278
+ "COALESCE(metadata->>'summary', '') || ' ' || " \
279
+ "COALESCE(metadata->>'keywords', '') || ' ' || " \
280
+ "COALESCE(metadata->>'description', '')"
281
+ tsvector = "to_tsvector('english', #{text_expr})"
282
+ end
283
+
284
+ # Prepare sanitized tsquery terms
285
+ tsqueries = words.map do |word|
286
+ sanitize_sql_array(["plainto_tsquery('english', ?)", word])
287
+ end
288
+
289
+ # Combine per-word tsqueries with OR so PostgreSQL can use the GIN index
290
+ combined_tsquery = tsqueries.join(' || ')
291
+
292
+ # Score each match (1 if present, 0 if not), sum them
293
+ score_terms = tsqueries.map { |tsq| "(#{tsvector} @@ #{tsq})::int" }
294
+ score_sum = score_terms.join(' + ')
295
+
296
+ # Similarity ratio: fraction of query words present
297
+ similarity_sql = "(#{score_sum})::float / #{words.size}"
298
+
299
+ # Start with basic search query
300
+ query = select("#{table_name}.*, #{similarity_sql} AS fulltext_similarity")
301
+
302
+ # Build where conditions
303
+ conditions = ["#{tsvector} @@ (#{combined_tsquery})"]
304
+
305
+ # Add status filter (default to processed unless overridden)
306
+ status = options[:status] || 'processed'
307
+ conditions << sanitize_sql_array(["#{table_name}.status = ?", status])
308
+
309
+ # Add document type filter if specified
310
+ if options[:document_type].present?
311
+ conditions << sanitize_sql_array(["#{table_name}.document_type = ?", options[:document_type]])
312
+ end
313
+
314
+ # Add threshold filtering if specified
315
+ if threshold > 0.0
316
+ conditions << "#{similarity_sql} >= #{threshold}"
317
+ end
318
+
319
+ # Combine all conditions
320
+ where_clause = conditions.join(' AND ')
321
+
322
+ # Materialize to array to avoid COUNT/SELECT alias conflicts in some AR versions
323
+ query.where(where_clause)
324
+ .order(Arel.sql("fulltext_similarity DESC, updated_at DESC"))
325
+ .limit(limit)
326
+ .to_a
327
+ end
328
+
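To make the scoring in `search_content` concrete, here is a database-free sketch of the same ratio: the fraction of unique query words found in a document's text. It reuses the `scan(/[[:alnum:]]+/)` tokenization from the method above, but `match_ratio` is a hypothetical helper, and plain token comparison stands in for PostgreSQL's `tsvector` matching, so stemming and stop-word handling are not reproduced.

```ruby
# Hypothetical helper: fraction of unique query words that appear in the text.
# Plain token comparison stands in for `tsvector @@ tsquery` (no stemming).
def match_ratio(query, text)
  words = query.downcase.scan(/[[:alnum:]]+/).uniq
  return 0.0 if words.empty?

  tokens = text.downcase.scan(/[[:alnum:]]+/).uniq
  matched = words.count { |word| tokens.include?(word) }
  matched.to_f / words.size
end

# Two of the three query words are present in the text.
puts match_ratio("ruby vector search", "Vector search in PostgreSQL")

# Repeated query words are counted once, so this is a full match.
puts match_ratio("ruby ruby", "Learning Ruby")
```

A `threshold` option, as in the method above, would then simply drop documents whose ratio falls below the requested value.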
329
+ # Search documents by keywords using PostgreSQL array operations
330
+ # Returns documents that match keywords with scoring based on match count
331
+ # Inspired by find_matching_entries.rb algorithm but optimized for PostgreSQL arrays
332
+ def self.search_by_keywords(keywords_array, **options)
333
+ return where("1 = 0") if keywords_array.blank?
334
+
335
+ # Normalize keywords to lowercase strings array
336
+ normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
337
+ return where("1 = 0") if normalized_keywords.empty?
338
+
339
+ limit = options[:limit] || 20
340
+
341
+ # Use PostgreSQL array overlap operator with proper array literal
342
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
343
+ array_literal = "'{#{quoted_keywords}}'::text[]"
344
+ where("keywords && #{array_literal}")
345
+ .order("created_at DESC")
346
+ .limit(limit)
347
+ end
348
+
349
+ # Find documents that contain ALL specified keywords (exact array matching)
350
+ def self.search_by_keywords_all(keywords_array, **options)
351
+ return where("1 = 0") if keywords_array.blank?
352
+
353
+ normalized_keywords = Array(keywords_array).map(&:to_s).map(&:downcase).reject(&:empty?)
354
+ return where("1 = 0") if normalized_keywords.empty?
355
+
356
+ limit = options[:limit] || 20
357
+
358
+ # Use PostgreSQL array contains operator with proper array literal
359
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
360
+ array_literal = "'{#{quoted_keywords}}'::text[]"
361
+ where("keywords @> #{array_literal}")
362
+ .order("created_at DESC")
363
+ .limit(limit)
261
364
  end
262
365
 
263
366
  # Faceted search by metadata fields
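The keyword methods above build their array literals by wrapping each keyword in double quotes. As a cautious sketch, assuming a keyword could ever contain a double quote or a backslash, the helper below escapes those characters before building the `{...}` literal. `pg_text_array_literal` is a hypothetical name and not part of Ragdoll; its result is meant to be passed as a bind value (for example `where("keywords && ?::text[]", literal)`) rather than interpolated into the SQL string.

```ruby
# Hypothetical helper: builds a PostgreSQL text[] literal such as {"a","b"}.
# Inside a quoted array element, backslashes and double quotes are backslash-escaped.
def pg_text_array_literal(keywords)
  elements = Array(keywords).map { |k| k.to_s.downcase }.reject(&:empty?)
  quoted = elements.map do |keyword|
    escaped = keyword.gsub(/[\\"]/) { |char| "\\#{char}" }
    "\"#{escaped}\""
  end
  "{#{quoted.join(',')}}"
end

puts pg_text_array_literal(%w[Ruby AI])
puts pg_text_array_literal(['say "hi"'])
```

Letting the database driver quote the final literal keeps single quotes in keywords from terminating the SQL string, which the escaping inside the array literal alone does not cover.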
@@ -64,10 +64,26 @@ module Ragdoll
64
64
  scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]
65
65
 
66
66
  # Document-level filters require joining through embeddable (STI Content) to documents
67
- if filters[:document_type]
67
+ needs_document_join = filters[:document_type] || filters[:keywords]
68
+
69
+ if needs_document_join
68
70
  scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
69
71
  .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
70
- .where("ragdoll_documents.document_type = ?", filters[:document_type])
72
+ end
73
+
74
+ if filters[:document_type]
75
+ scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
76
+ end
77
+
78
+ # Keywords filtering using PostgreSQL array operations
79
+ if filters[:keywords] && filters[:keywords].any?
80
+ normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
81
+ if normalized_keywords.any?
82
+ # Use PostgreSQL array overlap operator with proper array literal
83
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
84
+ array_literal = "'{#{quoted_keywords}}'::text[]"
85
+ scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
86
+ end
71
87
  end
72
88
 
73
89
  # Use pgvector for similarity search
@@ -83,10 +99,26 @@ module Ragdoll
83
99
  scope = scope.by_model(filters[:embedding_model]) if filters[:embedding_model]
84
100
 
85
101
  # Document-level filters require joining through embeddable (STI Content) to documents
86
- if filters[:document_type]
102
+ needs_document_join = filters[:document_type] || filters[:keywords]
103
+
104
+ if needs_document_join
87
105
  scope = scope.joins("JOIN ragdoll_contents ON ragdoll_contents.id = ragdoll_embeddings.embeddable_id")
88
106
  .joins("JOIN ragdoll_documents ON ragdoll_documents.id = ragdoll_contents.document_id")
89
- .where("ragdoll_documents.document_type = ?", filters[:document_type])
107
+ end
108
+
109
+ if filters[:document_type]
110
+ scope = scope.where("ragdoll_documents.document_type = ?", filters[:document_type])
111
+ end
112
+
113
+ # Keywords filtering using PostgreSQL array operations
114
+ if filters[:keywords] && filters[:keywords].any?
115
+ normalized_keywords = Array(filters[:keywords]).map(&:to_s).map(&:downcase).reject(&:empty?)
116
+ if normalized_keywords.any?
117
+ # Use PostgreSQL array overlap operator with proper array literal
118
+ quoted_keywords = normalized_keywords.map { |k| "\"#{k}\"" }.join(',')
119
+ array_literal = "'{#{quoted_keywords}}'::text[]"
120
+ scope = scope.where("ragdoll_documents.keywords && #{array_literal}")
121
+ end
90
122
  end
91
123
 
92
124
  search_with_pgvector_stats(query_embedding, scope, limit, threshold)
@@ -33,6 +33,10 @@ module Ragdoll
33
33
  threshold = options[:threshold] || search_config[:similarity_threshold]
34
34
  filters = options[:filters] || {}
35
35
 
36
+ # Extract keywords option and normalize
37
+ keywords = options[:keywords] || []
38
+ keywords = Array(keywords).map(&:to_s).reject(&:empty?)
39
+
36
40
  # Extract tracking options
37
41
  session_id = options[:session_id]
38
42
  user_id = options[:user_id]
@@ -49,6 +53,11 @@ module Ragdoll
49
53
  return [] if query_embedding.nil?
50
54
  end
51
55
 
56
+ # Add keywords to filters if provided
57
+ if keywords.any?
58
+ filters[:keywords] = keywords
59
+ end
60
+
52
61
  # Search using ActiveRecord models with statistics
53
62
  # Try enhanced search first, fall back to original if it fails
54
63
  begin
@@ -81,13 +90,15 @@ module Ragdoll
81
90
  }
82
91
  end
83
92
 
93
+ search_type = keywords.any? ? "semantic_with_keywords" : "semantic"
94
+
84
95
  Ragdoll::Search.record_search(
85
96
  query: query_string,
86
97
  query_embedding: query_embedding,
87
98
  results: search_results,
88
- search_type: "semantic",
99
+ search_type: search_type,
89
100
  filters: filters,
90
- options: { limit: limit, threshold: threshold },
101
+ options: { limit: limit, threshold: threshold, keywords: keywords },
91
102
  execution_time_ms: execution_time,
92
103
  session_id: session_id,
93
104
  user_id: user_id
@@ -1,8 +1,5 @@
1
1
  class EnablePostgresqlExtensions < ActiveRecord::Migration[7.0]
2
2
  def up
3
- # This migration is now handled by the db:create rake task
4
- # Just ensure required extensions are available
5
-
6
3
  # Vector similarity search (required for embeddings)
7
4
  execute "CREATE EXTENSION IF NOT EXISTS vector"
8
5
 
@@ -15,9 +12,11 @@ class EnablePostgresqlExtensions < ActiveRecord::Migration[7.0]
15
12
  end
16
13
 
17
14
  def down
18
- execute <<-SQL
19
- DROP DATABASE IF EXISTS ragdoll_development;
20
- DROP ROLE IF EXISTS ragdoll;
21
- SQL
15
+ # Extensions are typically not dropped as they might be used by other databases
16
+ # If you really need to drop them, uncomment the following:
17
+ # execute "DROP EXTENSION IF EXISTS vector"
18
+ # execute "DROP EXTENSION IF EXISTS unaccent"
19
+ # execute "DROP EXTENSION IF EXISTS pg_trgm"
20
+ # execute "DROP EXTENSION IF EXISTS \"uuid-ossp\""
22
21
  end
23
- end
22
+ end
@@ -0,0 +1,117 @@
1
+ class CreateRagdollDocuments < ActiveRecord::Migration[7.0]
2
+ # For concurrent index creation (PostgreSQL)
3
+ disable_ddl_transaction!
4
+
5
+ def up
6
+ create_table :ragdoll_documents,
7
+ comment: "Core documents table with LLM-generated structured metadata" do |t|
8
+
9
+ t.string :location, null: false,
10
+ comment: "Source location of document (file path, URL, or identifier)"
11
+
12
+ t.string :title, null: false,
13
+ comment: "Human-readable document title for display and search"
14
+
15
+ t.text :summary, null: false, default: "",
16
+ comment: "LLM-generated summary of document content"
17
+
18
+ t.string :document_type, null: false, default: "text",
19
+ comment: "Document format type"
20
+
21
+ t.string :status, null: false, default: "pending",
22
+ comment: "Document processing status"
23
+
24
+ t.json :metadata, default: {},
25
+ comment: "LLM-generated structured metadata about the file"
26
+
27
+ t.timestamp :file_modified_at, null: false, default: -> { "CURRENT_TIMESTAMP" },
28
+ comment: "Timestamp when the source file was last modified"
29
+
30
+ t.timestamps null: false,
31
+ comment: "Standard creation and update timestamps"
32
+
33
+ # Add tsvector column for full-text search
34
+ t.tsvector :search_vector
35
+
36
+ # Add keywords as array column
37
+ t.text :keywords, array: true, default: []
38
+ end
39
+
40
+ ###########
41
+ # Indexes #
42
+ ###########
43
+
44
+ add_index :ragdoll_documents, :location, unique: true,
45
+ comment: "Unique index for document source lookup"
46
+
47
+ add_index :ragdoll_documents, :title,
48
+ comment: "Index for title-based search"
49
+
50
+ add_index :ragdoll_documents, :document_type,
51
+ comment: "Index for filtering by document type"
52
+
53
+ add_index :ragdoll_documents, :status,
54
+ comment: "Index for filtering by processing status"
55
+
56
+ add_index :ragdoll_documents, :created_at,
57
+ comment: "Index for chronological sorting"
58
+
59
+ add_index :ragdoll_documents, [:document_type, :status],
60
+ comment: "Composite index for type+status filtering"
61
+
62
+ # Full-text search index
63
+ execute <<-SQL
64
+ CREATE INDEX CONCURRENTLY index_ragdoll_documents_on_fulltext_search
65
+ ON ragdoll_documents
66
+ USING gin(to_tsvector('english',
67
+ COALESCE(title, '') || ' ' ||
68
+ COALESCE(metadata->>'summary', '') || ' ' ||
69
+ COALESCE(metadata->>'keywords', '') || ' ' ||
70
+ COALESCE(metadata->>'description', '')
71
+ ))
72
+ SQL
73
+
74
+ add_index :ragdoll_documents, "(metadata->>'document_type')",
75
+ name: "index_ragdoll_documents_on_metadata_type",
76
+ comment: "Index for filtering by document type"
77
+
78
+ add_index :ragdoll_documents, "(metadata->>'classification')",
79
+ name: "index_ragdoll_documents_on_metadata_classification",
80
+ comment: "Index for filtering by document classification"
81
+
82
+ # GIN index on search_vector
83
+ add_index :ragdoll_documents, :search_vector, using: :gin, algorithm: :concurrently
84
+
85
+ # GIN index on keywords array
86
+ add_index :ragdoll_documents, :keywords, using: :gin,
87
+ name: 'index_ragdoll_documents_on_keywords_gin'
88
+
89
+ # Trigger to keep search_vector up to date on INSERT/UPDATE
90
+ execute <<-SQL
91
+ CREATE FUNCTION ragdoll_documents_vector_update() RETURNS trigger AS $$
92
+ BEGIN
93
+ NEW.search_vector := to_tsvector('english',
94
+ COALESCE(NEW.title, '') || ' ' ||
95
+ COALESCE(NEW.metadata->>'summary', '') || ' ' ||
96
+ COALESCE(NEW.metadata->>'keywords', '') || ' ' ||
97
+ COALESCE(NEW.metadata->>'description', '')
98
+ );
99
+ RETURN NEW;
100
+ END
101
+ $$ LANGUAGE plpgsql;
102
+
103
+ CREATE TRIGGER ragdoll_search_vector_update
104
+ BEFORE INSERT OR UPDATE ON ragdoll_documents
105
+ FOR EACH ROW EXECUTE FUNCTION ragdoll_documents_vector_update();
106
+ SQL
107
+ end
108
+
109
+ def down
110
+ execute <<-SQL
111
+ DROP TRIGGER IF EXISTS ragdoll_search_vector_update ON ragdoll_documents;
112
+ DROP FUNCTION IF EXISTS ragdoll_documents_vector_update();
113
+ SQL
114
+
115
+ drop_table :ragdoll_documents
116
+ end
117
+ end
@@ -3,7 +3,7 @@ class CreateRagdollEmbeddings < ActiveRecord::Migration[7.0]
3
3
  create_table :ragdoll_embeddings,
4
4
  comment: "Polymorphic vector embeddings storage for semantic similarity search" do |t|
5
5
 
6
- t.references :embeddable, polymorphic: true, null: false,
6
+ t.references :embeddable, polymorphic: true, null: false,
7
7
  comment: "Polymorphic reference to embeddable content"
8
8
 
9
9
  t.text :content, null: false, default: "",
@@ -26,16 +26,19 @@ class CreateRagdollEmbeddings < ActiveRecord::Migration[7.0]
26
26
 
27
27
  t.timestamps null: false,
28
28
  comment: "Standard creation and update timestamps"
29
+ end
29
30
 
30
- ###########
31
- # Indexes #
32
- ###########
31
+ ###########
32
+ # Indexes #
33
+ ###########
33
34
 
34
- t.index %i[embeddable_type embeddable_id],
35
- comment: "Index for finding embeddings by embeddable content"
35
+ add_index :ragdoll_embeddings, [:embeddable_type, :embeddable_id],
36
+ comment: "Index for finding embeddings by embeddable content"
36
37
 
37
- t.index :embedding_vector, using: :ivfflat, opclass: :vector_cosine_ops, name: "index_ragdoll_embeddings_on_embedding_vector_cosine",
38
- comment: "IVFFlat index for fast cosine similarity search"
39
- end
38
+ add_index :ragdoll_embeddings, :embedding_vector,
39
+ using: :ivfflat,
40
+ opclass: :vector_cosine_ops,
41
+ name: "index_ragdoll_embeddings_on_embedding_vector_cosine",
42
+ comment: "IVFFlat index for fast cosine similarity search"
40
43
  end
41
- end
44
+ end
@@ -29,19 +29,22 @@ class CreateRagdollContents < ActiveRecord::Migration[7.0]
29
29
 
30
30
  t.timestamps null: false,
31
31
  comment: "Standard creation and update timestamps"
32
+ end
32
33
 
33
- ###########
34
- # Indexes #
35
- ###########
34
+ ###########
35
+ # Indexes #
36
+ ###########
36
37
 
37
- t.index :embedding_model,
38
- comment: "Index for filtering by embedding model"
38
+ add_index :ragdoll_contents, :embedding_model,
39
+ comment: "Index for filtering by embedding model"
39
40
 
40
- t.index :type,
41
- comment: "Index for filtering by content type"
41
+ add_index :ragdoll_contents, :type,
42
+ comment: "Index for filtering by content type"
42
43
 
43
- t.index "to_tsvector('english', COALESCE(content, ''))", using: :gin, name: "index_ragdoll_contents_on_fulltext_search",
44
- comment: "Full-text search index for text content"
45
- end
44
+ execute <<-SQL
45
+ CREATE INDEX index_ragdoll_contents_on_fulltext_search
46
+ ON ragdoll_contents
47
+ USING gin(to_tsvector('english', COALESCE(content, '')))
48
+ SQL
46
49
  end
47
- end
50
+ end
@@ -41,33 +41,37 @@ class CreateRagdollSearches < ActiveRecord::Migration[7.0]
41
41
 
42
42
  t.timestamps null: false,
43
43
  comment: "Standard creation and update timestamps"
44
+ end
44
45
 
45
- ###########
46
- # Indexes #
47
- ###########
46
+ ###########
47
+ # Indexes #
48
+ ###########
48
49
 
49
- t.index :query_embedding, using: :ivfflat, opclass: :vector_cosine_ops,
50
- name: "index_ragdoll_searches_on_query_embedding_cosine",
51
- comment: "IVFFlat index for finding similar search queries"
50
+ add_index :ragdoll_searches, :query_embedding,
51
+ using: :ivfflat,
52
+ opclass: :vector_cosine_ops,
53
+ name: "index_ragdoll_searches_on_query_embedding_cosine",
54
+ comment: "IVFFlat index for finding similar search queries"
52
55
 
53
- t.index :search_type,
54
- comment: "Index for filtering by search type"
56
+ add_index :ragdoll_searches, :search_type,
57
+ comment: "Index for filtering by search type"
55
58
 
56
- t.index :session_id,
57
- comment: "Index for grouping searches by session"
59
+ add_index :ragdoll_searches, :session_id,
60
+ comment: "Index for grouping searches by session"
58
61
 
59
- t.index :user_id,
60
- comment: "Index for filtering searches by user"
62
+ add_index :ragdoll_searches, :user_id,
63
+ comment: "Index for filtering searches by user"
61
64
 
62
- t.index :created_at,
63
- comment: "Index for chronological search history"
65
+ add_index :ragdoll_searches, :created_at,
66
+ comment: "Index for chronological search history"
64
67
 
65
- t.index :results_count,
66
- comment: "Index for analyzing search effectiveness"
68
+ add_index :ragdoll_searches, :results_count,
69
+ comment: "Index for analyzing search effectiveness"
67
70
 
68
- t.index "to_tsvector('english', query)", using: :gin,
69
- name: "index_ragdoll_searches_on_fulltext_query",
70
- comment: "Full-text search index for finding searches by query text"
71
- end
71
+ execute <<-SQL
72
+ CREATE INDEX index_ragdoll_searches_on_fulltext_query
73
+ ON ragdoll_searches
74
+ USING gin(to_tsvector('english', query))
75
+ SQL
72
76
  end
73
77
  end
@@ -24,26 +24,26 @@ class CreateRagdollSearchResults < ActiveRecord::Migration[7.0]

  t.timestamps null: false,
  comment: "Standard creation and update timestamps"
+ end

- ###########
- # Indexes #
- ###########
+ ###########
+ # Indexes #
+ ###########

- t.index [:search_id, :result_rank],
- name: "idx_search_results_search_rank",
- comment: "Index for retrieving results in ranked order"
+ add_index :ragdoll_search_results, [:search_id, :result_rank],
+ name: "idx_search_results_search_rank",
+ comment: "Index for retrieving results in ranked order"

- t.index [:embedding_id, :similarity_score],
- name: "idx_search_results_embedding_score",
- comment: "Index for analyzing embedding performance"
+ add_index :ragdoll_search_results, [:embedding_id, :similarity_score],
+ name: "idx_search_results_embedding_score",
+ comment: "Index for analyzing embedding performance"

- t.index :similarity_score,
- name: "idx_search_results_similarity",
- comment: "Index for similarity score analysis"
+ add_index :ragdoll_search_results, :similarity_score,
+ name: "idx_search_results_similarity",
+ comment: "Index for similarity score analysis"

- t.index [:clicked, :clicked_at],
- name: "idx_search_results_clicks",
- comment: "Index for click-through analysis"
- end
+ add_index :ragdoll_search_results, [:clicked, :clicked_at],
+ name: "idx_search_results_clicks",
+ comment: "Index for click-through analysis"
  end
  end
@@ -90,10 +90,10 @@ module Ragdoll
  # Drop all tables in correct order (respecting foreign key constraints)
  # Order: dependent tables first, then parent tables
  tables_to_drop = %w[
+ ragdoll_search_results
+ ragdoll_searches
  ragdoll_embeddings
- ragdoll_text_contents
- ragdoll_image_contents
- ragdoll_audio_contents
+ ragdoll_contents
  ragdoll_documents
  schema_migrations
  ]
@@ -109,6 +109,11 @@ module Ragdoll
  end
  end

+ # Also drop any functions/triggers that might exist
+ if ActiveRecord::Base.connection.adapter_name.downcase.include?("postgresql")
+ ActiveRecord::Base.connection.execute("DROP FUNCTION IF EXISTS ragdoll_documents_vector_update() CASCADE")
+ end
+
  migrate!
  end

@@ -3,6 +3,6 @@

  module Ragdoll
  module Core
- VERSION = "0.1.9"
+ VERSION = "0.1.10"
  end
  end
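The `tables_to_drop` list in the reset hunk above is ordered by hand so that dependent tables are dropped before the tables they reference. As a rough sketch only, such an order could also be derived from a dependency map with Ruby's standard `TSort` module. The `FOREIGN_KEYS` map and the `drop_order` helper below are hypothetical; the foreign keys they encode are assumptions for illustration and are not shown in this diff.

```ruby
require "tsort"

# Hypothetical map: table => tables it references through foreign keys.
# These relationships are assumed for illustration, not taken from ragdoll.
FOREIGN_KEYS = {
  "ragdoll_documents"      => [],
  "ragdoll_contents"       => ["ragdoll_documents"],
  "ragdoll_embeddings"     => ["ragdoll_contents"],
  "ragdoll_searches"       => [],
  "ragdoll_search_results" => ["ragdoll_searches", "ragdoll_embeddings"]
}.freeze

# Returns the tables ordered so that every dependent table comes before
# the tables it references (a safe order for DROP TABLE).
def drop_order(foreign_keys)
  each_node  = ->(&block) { foreign_keys.each_key(&block) }
  each_child = ->(table, &block) { foreign_keys.fetch(table).each(&block) }

  # TSort.tsort lists referenced tables first (a safe creation order),
  # so the reverse is a safe drop order.
  TSort.tsort(each_node, each_child).reverse
end

puts drop_order(FOREIGN_KEYS)
```

A hand-written list, as in the diff, is simpler for five tables; a derived order mainly helps once the schema grows or the relationships change often.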
data/lib/tasks/db.rake CHANGED
@@ -25,22 +25,17 @@ namespace :db do
  )

  # Run individual SQL commands to avoid transaction block issues
- begin
- ActiveRecord::Base.connection.execute("DROP DATABASE IF EXISTS ragdoll_development")
- rescue => e
- puts "Note: #{e.message}" if e.message.include?("does not exist")
- end
-
- begin
- ActiveRecord::Base.connection.execute("DROP ROLE IF EXISTS ragdoll")
- rescue => e
- puts "Note: #{e.message}" if e.message.include?("does not exist")
- end
+ # Note: Removed the DROP DATABASE/ROLE here since that should be done via db:drop task

  begin
  ActiveRecord::Base.connection.execute("CREATE ROLE ragdoll WITH LOGIN CREATEDB")
+ puts "Role 'ragdoll' created successfully"
  rescue => e
- puts "Note: Role already exists, continuing..." if e.message.include?("already exists")
+ if e.message.include?("already exists")
+ puts "Note: Role 'ragdoll' already exists, continuing..."
+ else
+ raise e
+ end
  end

  begin
@@ -50,8 +45,16 @@ namespace :db do
  ENCODING = 'UTF8'
  CONNECTION LIMIT = -1
  SQL
+ puts "Database 'ragdoll_development' created successfully"
  rescue => e
- puts "Note: Database already exists, continuing..." if e.message.include?("already exists")
+ if e.message.include?("already exists")
+ puts "ERROR: Database 'ragdoll_development' already exists!"
+ puts "Please run 'rake db:drop' first to remove the existing database, then run 'rake db:create' again."
+ puts "Or use 'rake db:reset' to drop, create, and migrate in one step."
+ exit 1
+ else
+ raise e
+ end
  end

  ActiveRecord::Base.connection.execute("GRANT ALL PRIVILEGES ON DATABASE ragdoll_development TO ragdoll")
@@ -97,8 +100,53 @@ namespace :db do
  puts "Dropping database with config: #{config.database.inspect}"

  case config.database[:adapter]
- when "postgresql", "mysql2"
- puts "For #{config.database[:adapter]}, please drop the database manually on your server"
+ when "postgresql"
+ puts "PostgreSQL database drop - running as superuser to drop database and role..."
+
+ # Connect as superuser to drop database and role
+ ActiveRecord::Base.establish_connection(
+ adapter: 'postgresql',
+ database: 'postgres', # Connect to postgres database initially
+ username: ENV.fetch('POSTGRES_SUPERUSER', 'postgres'),
+ password: ENV['POSTGRES_SUPERUSER_PASSWORD'],
+ host: config.database[:host] || 'localhost',
+ port: config.database[:port] || 5432
+ )
+
+ # Drop the database if it exists
+ begin
+ ActiveRecord::Base.connection.execute("DROP DATABASE IF EXISTS ragdoll_development")
+ puts "Database 'ragdoll_development' dropped successfully"
+ rescue => e
+ puts "Error dropping database: #{e.message}"
+ end
+
+ # Optionally drop the role (commented out by default to preserve user)
+ # begin
+ # ActiveRecord::Base.connection.execute("DROP ROLE IF EXISTS ragdoll")
+ # puts "Role 'ragdoll' dropped successfully"
+ # rescue => e
+ # puts "Error dropping role: #{e.message}"
+ # end
+
+ when "mysql2"
+ puts "MySQL database drop - connecting to drop database..."
+
+ # Connect without specifying database
+ ActiveRecord::Base.establish_connection(
+ adapter: 'mysql2',
+ username: config.database[:username],
+ password: config.database[:password],
+ host: config.database[:host] || 'localhost',
+ port: config.database[:port] || 3306
+ )
+
+ begin
+ ActiveRecord::Base.connection.execute("DROP DATABASE IF EXISTS #{config.database[:database]}")
+ puts "Database '#{config.database[:database]}' dropped successfully"
+ rescue => e
+ puts "Error dropping database: #{e.message}"
+ end
  end

  puts "Database drop completed"
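The `db:create` hunks above replace one-line `rescue` modifiers, which silently swallowed every error that did not match, with an explicit branch that tolerates "already exists" and re-raises anything else. A minimal sketch of that pattern as a reusable helper follows. `tolerate_existing` is a hypothetical name and is not part of ragdoll, and the block passed to it stands in for the real `ActiveRecord::Base.connection.execute` call.

```ruby
# Hypothetical helper (not part of ragdoll): runs the block and treats an
# "already exists" failure as a no-op, while re-raising every other error.
def tolerate_existing(label)
  yield
  puts "#{label} created successfully"
  :created
rescue StandardError => e
  # Anything other than a duplicate-object error is a real failure.
  raise unless e.message.include?("already exists")

  puts "Note: #{label} already exists, continuing..."
  :exists
end

# Usage with a stand-in block that simulates the duplicate-role error.
tolerate_existing("Role 'ragdoll'") { raise 'role "ragdoll" already exists' }
```

Matching on the message text mirrors what the rake task does; matching on a specific exception class would be stricter, but the diff does not show which classes the adapters raise here.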
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: ragdoll
  version: !ruby/object:Gem::Version
- version: 0.1.9
+ version: 0.1.10
  platform: ruby
  authors:
  - Dewayne VanHoozer
@@ -406,12 +406,12 @@ files:
  - app/services/ragdoll/search_engine.rb
  - app/services/ragdoll/text_chunker.rb
  - app/services/ragdoll/text_generation_service.rb
- - db/migrate/001_enable_postgresql_extensions.rb
- - db/migrate/004_create_ragdoll_documents.rb
- - db/migrate/005_create_ragdoll_embeddings.rb
- - db/migrate/006_create_ragdoll_contents.rb
- - db/migrate/007_create_ragdoll_searches.rb
- - db/migrate/008_create_ragdoll_search_results.rb
+ - db/migrate/20250815234901_enable_postgresql_extensions.rb
+ - db/migrate/20250815234902_create_ragdoll_documents.rb
+ - db/migrate/20250815234903_create_ragdoll_embeddings.rb
+ - db/migrate/20250815234904_create_ragdoll_contents.rb
+ - db/migrate/20250815234905_create_ragdoll_searches.rb
+ - db/migrate/20250815234906_create_ragdoll_search_results.rb
  - lib/ragdoll-core.rb
  - lib/ragdoll.rb
  - lib/ragdoll/core.rb
@@ -1,70 +0,0 @@
- class CreateRagdollDocuments < ActiveRecord::Migration[7.0]
- def change
- create_table :ragdoll_documents,
- comment: "Core documents table with LLM-generated structured metadata" do |t|
-
- t.string :location, null: false,
- comment: "Source location of document (file path, URL, or identifier)"
-
- t.string :title, null: false,
- comment: "Human-readable document title for display and search"
-
- t.text :summary, null: false, default: "",
- comment: "LLM-generated summary of document content"
-
- t.text :keywords , null: false, default: "",
- comment: "LLM-generated comma-separated keywords of document"
-
- t.string :document_type, null: false, default: "text",
- comment: "Document format type"
-
- t.string :status, null: false, default: "pending",
- comment: "Document processing status"
-
- t.json :metadata, default: {},
- comment: "LLM-generated structured metadata about the file"
-
- t.timestamp :file_modified_at, null: false, default: -> { "CURRENT_TIMESTAMP" },
- comment: "Timestamp when the source file was last modified"
-
- t.timestamps null: false,
- comment: "Standard creation and update timestamps"
-
- ###########
- # Indexes #
- ###########
-
- t.index :location, unique: true,
- comment: "Unique index for document source lookup"
-
- t.index :title,
- comment: "Index for title-based search"
-
- t.index :document_type,
- comment: "Index for filtering by document type"
-
- t.index :status,
- comment: "Index for filtering by processing status"
-
- t.index :created_at,
- comment: "Index for chronological sorting"
-
- t.index %i[document_type status],
- comment: "Composite index for type+status filtering"
-
- t.index "to_tsvector('english', COALESCE(title, '') ||
- ' ' ||
- COALESCE(metadata->>'summary', '') ||
- ' ' || COALESCE(metadata->>'keywords', '') ||
- ' ' || COALESCE(metadata->>'description', ''))",
- using: :gin, name: "index_ragdoll_documents_on_fulltext_search",
- comment: "Full-text search across title and metadata fields"
-
- t.index "(metadata->>'document_type')", name: "index_ragdoll_documents_on_metadata_type",
- comment: "Index for filtering by document type"
-
- t.index "(metadata->>'classification')", name: "index_ragdoll_documents_on_metadata_classification",
- comment: "Index for filtering by document classification"
- end
- end
- end