documentrix 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: c45e75b570207ac77d9a04e95939050c33d8f5e0645c7ebac96c862de4e252f3
- data.tar.gz: 18b0f526ec16115483de74c027a7d8eaa8c8ba2461ac41d4695f53f1ceff32c7
+ metadata.gz: 68c409476101e4632597c139494f3fc1fe67bc0af7a655e5e5b3e4ebbd58f58c
+ data.tar.gz: fcd07ca7694b3fbed81c3a25f3fa9d9a675d120e5b82b43a13f5f71970012f3a
  SHA512:
- metadata.gz: dddf96ef71ab25c35c6872905cea070c8a27036caf841d97626fadc1db172e496a3b8f13748000cbb9a096b66ae59f459bae145fc2a4a06abffad7de267362a9
- data.tar.gz: cab52c41f1749fe0ff56538cd01e4525787921e700dca91e44f10d939ec942afc59478c4f8256033fdbf5ef4a38ac649d80eadfbc711042bccb85e38b056a569
+ metadata.gz: cf1e95f2d994bb130b89bff9d2e43062621b849b5d0f7fb4543092b1f491bab4b9dd9e51047c3dfb216090a4f9bc853c08ba35719dea7409c34a15758070fcba
+ data.tar.gz: 59aa3347c07c2521f5661326a55d73733470ef30c62e86fc7d1ff53a4acb6cbc2da06e4c7814896b59ca16bdfab8933a7018f3f41ab30ce29593c5fac480f342
data/CHANGES.md CHANGED
@@ -1,5 +1,74 @@
  # Changes
 
+ ## 2026-05-17 v0.3.0
+
+ ### New Features
+
+ - **Source Tracking & Versioning**:
+   - Introduced `Documentrix::Utils::Digests` for SHA256 hashing of strings
+     and files, including an `mtime`-based cache.
+   - Implemented source-based document management in `Documentrix::Documents`
+     via `normalize_source`, `source_exist?`, `source_modified?`,
+     `source_update`, and `source_remove`.
+   - Updated `Documentrix::Documents#add` and
+     `Documentrix::Documents#source_update` to support `digest` for version
+     tracking.
+ - **Text Splitting**:
+   - Added `Documentrix::Documents::Splitters::Common` to implement
+     `force_split` behavior.
+   - Integrated `force` splitting into `Character`, `RecursiveCharacter`, and
+     `Semantic` splitters.
+ - **Cache Enhancements**:
+   - Added `each_source` to `Documentrix::Documents::Cache::Common` and an
+     optimized `SELECT DISTINCT source` implementation in
+     `Documentrix::Documents::Cache::SQLiteCache`.
+   - Added a SQLite trigger `delete_embedding_after_record` to automatically
+     clean the `embeddings` table.
+
+ ### Improvements & Refactorings
+
+ - **Search & Retrieval**:
+   - Added `min_similarity` parameter to `Documentrix::Documents#find`,
+     `Documentrix::Documents::Cache::Common#find_records`, and
+     `Documentrix::Documents::Cache::SQLiteCache#find_records`.
+   - Optimized `Documentrix::Documents::Cache::SQLiteCache#find_records` by
+     moving similarity calculations into the SQL query using `1 -
+     vec_distance_cosine`.
+   - Simplified `Documentrix::Documents#find_where` by streamlining
+     `take_while` logic and utilizing `opts[:max_records]`.
+ - **Cache Implementations**:
+   - Made `object_class` a required keyword argument in
+     `Documentrix::Documents::RedisCache#initialize`.
+   - Refactored `Documentrix::Documents::Cache::Common#clear_by_source` and
+     `Documentrix::Documents::Cache::Common#source_exist?` to use ternary
+     operators.
+   - Improved `Documentrix::Documents::Cache::SQLiteCache#each_source` and
+     `Documentrix::Documents::Cache::SQLiteCache#find_records` for better
+     robustness and formatting.
+ - **Documentation & Tooling**:
+   - Expanded YARD documentation for
+     `Documentrix::Documents::Splitters::Character`, `RecursiveCharacter`,
+     `Semantic`, and `Documentrix::Utils::ColorizeTexts`.
+   - Centralized RSpec configuration via a `.rspec` file.
+
+ ### Bug Fixes
+
+ - Fixed an issue in `Documentrix::Documents#find` where `max_records` was
+   hardcoded to `nil` when calling the cache.
+ - Adjusted default handling of `min_similarity` in
+   `Documentrix::Documents#find` to use `min_similarity ||= -1` within the
+   method body.
+
+ ### Testing
+
+ - Significantly expanded test suites for `SQLiteCache`, `MemoryCache`, and
+   `RedisCache`, specifically covering `each_source`, `tags`, `clear_for_tags`,
+   and digest-based checks.
+ - Added new test cases in `spec/documents_spec.rb` for source management and
+   `Documentrix::Documents#source_update`.
+ - Added `spec/utils/digests_spec.rb` and updated splitter specs to verify
+   `force` splitting behavior.
+
  ## 2026-05-12 v0.2.0
 
  ### Added
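The changelog above mentions SHA256 hashing of strings and files with an `mtime`-based cache, but the body of `Documentrix::Utils::Digests` is not part of this diff. The following is only a minimal sketch of that idea using Ruby's standard `digest` library; the module name `MtimeCachedDigest`, its methods, and the cache layout are assumptions made for illustration and are not the gem's API.

```ruby
require 'digest'

# Hypothetical helper (not the gem's implementation): SHA256 digests for
# strings and files, with file digests cached per path and mtime.
module MtimeCachedDigest
  # path => { mtime: Time, digest: String }
  CACHE = {}

  module_function

  # Returns the SHA256 hex digest of a string.
  def string_digest(string)
    Digest::SHA256.hexdigest(string)
  end

  # Returns the SHA256 hex digest of a file. The file is only re-read when
  # its modification time differs from the cached one.
  def file_digest(path)
    mtime  = File.mtime(path)
    cached = CACHE[path]
    return cached[:digest] if cached && cached[:mtime] == mtime

    digest = Digest::SHA256.file(path).hexdigest
    CACHE[path] = { mtime: mtime, digest: digest }
    digest
  end
end
```

Keying the cache on `mtime` avoids re-hashing unchanged files, at the cost of missing edits that leave the modification time untouched.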
data/documentrix.gemspec CHANGED
@@ -1,9 +1,9 @@
  # -*- encoding: utf-8 -*-
- # stub: documentrix 0.2.0 ruby lib
+ # stub: documentrix 0.3.0 ruby lib
 
  Gem::Specification.new do |s|
  s.name = "documentrix".freeze
- s.version = "0.2.0".freeze
+ s.version = "0.3.0".freeze
 
  s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
  s.require_paths = ["lib".freeze]
@@ -11,15 +11,15 @@ Gem::Specification.new do |s|
  s.date = "1980-01-02"
  s.description = "The Ruby library, Documentrix, is designed to provide a way to build and\nquery vector databases for applications in natural language processing\n(NLP) and large language models (LLMs). It allows users to store and\nretrieve dense vector embeddings for text strings.\n".freeze
  s.email = "flori@ping.de".freeze
- s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
- s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
+ s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
+ s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
  s.homepage = "https://github.com/flori/documentrix".freeze
  s.licenses = ["MIT".freeze]
  s.rdoc_options = ["--title".freeze, "Documentrix - Ruby library for embedding vector database".freeze, "--main".freeze, "README.md".freeze]
  s.required_ruby_version = Gem::Requirement.new(">= 3.1".freeze)
  s.rubygems_version = "4.0.10".freeze
  s.summary = "Ruby library for embedding vector database".freeze
- s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
+ s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
 
  s.specification_version = 4
 
@@ -12,6 +12,7 @@
  # memory, Redis, and SQLite.
  module Documentrix::Documents::Cache::Common
  include Documentrix::Utils::Math
+ include Documentrix::Utils::Digests
  include Enumerable
 
  # The initialize method sets up the Documentrix::Documents::Cache instance's
  # The initialize method sets up the Documentrix::Documents::Cache instance's
@@ -62,27 +63,29 @@ module Documentrix::Documents::Cache::Common
  # @param needle [ Array ] an array containing the embedding vector
  # @param tags [ String, Array ] a string or array of strings representing the tags to search for
  # @param max_records [ Integer ] the maximum number of records to return
+ # @param min_similarity [ Float ] the minimum similarity score required for a record to be returned (defaults to -1)
  #
- # @yield [ record ]
- #
- # @return [ Array<Documentrix::Documents::Records> ] an array containing the matching records
- def find_records(needle, tags: nil, max_records: nil)
+ # @return [ Array<Documentrix::Documents::Record> ] an array containing the matching records
+ def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
  tags = Documentrix::Utils::Tags.new(Array(tags)).to_a
  records = self
  if tags.present?
  records = records.select { |_key, record| (tags & record.tags).size >= 1 }
  end
+
  needle_norm = norm(needle)
- records = records.sort_by { |key, record|
+ records = records.map do |key, record|
  record.key = key
  record.similarity = cosine_similarity(
- a: needle,
- b: record.embedding,
+ a: needle,
+ b: record.embedding,
  a_norm: needle_norm,
  b_norm: record.norm,
  )
- }
- records.transpose.last&.reverse.to_a
+ record
+ end.sort_by(&:similarity).reverse.select { _1.similarity >= min_similarity }
+
+ max_records ? records.take(max_records) : records
  end
 
  # Returns a set of unique tags found in the cache records.
@@ -116,19 +119,68 @@ module Documentrix::Documents::Cache::Common
  self
  end
 
+ # Yields each unique, full source present in the cache records.
+ #
+ # @yield [source] the full source string
+ # @return [Enumerator] an enumerator if no block is given, nil otherwise.
+ def each_source(&block)
+ block or return enum_for(__method__)
+ seen = {}
+ each do |_key, record|
+ source = record.source.full? or next
+ seen.key?(source) and next
+ seen[source] = true
+ block.(source)
+ end
+ nil
+ end
+
  # The clear_by_source method removes all records from the cache that
  # have a source matching the given source.
  #
  # @param source [String] the source to filter records by
+ # @param digest [String, nil] the SHA256 hexadecimal digest of the source.
+ # @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
  #
  # @return [self] self
- def clear_by_source(source)
+ def clear_by_source(source, digest: nil, operator: ?=)
+ operator = operator == '=' ? '==' : '!='
+
  each do |key, record|
- delete(unpre(key)) if record.source == source
+ next unless record.source == source
+ if digest
+ should_delete = record.digest.send(operator, digest)
+ delete(unpre(key)) if should_delete
+ else
+ delete(unpre(key))
+ end
  end
  self
  end
 
+ # Checks if any records associated with the given source exist in the cache.
+ #
+ # @param source [String] the source to check for existence
+ # @param digest [String, nil] the SHA256 hexadecimal digest to compare against
+ # @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
+ #
+ # @return [Boolean] true if a matching record is found, false otherwise.
+ def source_exist?(source, digest: nil, operator: ?=)
+ operator = operator == '=' ? '==' : '!='
+
+ each do |_, record|
+ next unless record.source == source
+ if digest
+ if record.digest.send(operator, digest)
+ return true
+ end
+ else
+ return true
+ end
+ end
+ false
+ end
+
  # The clear method removes cached records based on the provided tags or
  # clears all records with the current prefix.
  #
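The `find_records` hunk above scores every record, sorts by similarity in descending order, drops records below `min_similarity`, and then applies `max_records`. A standalone sketch of that pipeline follows; the `Rec` struct, the free-standing `cosine_similarity` helper, and the sample vectors are stand-ins chosen for illustration, not the gem's record class or its `Documentrix::Utils::Math` helpers.

```ruby
# Stand-in record type; the gem's own record class is not shown in this diff.
Rec = Struct.new(:text, :embedding, :similarity)

# Plain cosine similarity of two equally sized vectors.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  a_norm = Math.sqrt(a.sum { |x| x * x })
  b_norm = Math.sqrt(b.sum { |x| x * x })
  dot / (a_norm * b_norm)
end

# Score, sort best-first, filter by threshold, then cap the result size.
def find_records(records, needle, max_records: nil, min_similarity: -1)
  scored = records.map do |record|
    record.similarity = cosine_similarity(needle, record.embedding)
    record
  end
  scored = scored.sort_by(&:similarity).reverse.
    select { |record| record.similarity >= min_similarity }
  max_records ? scored.take(max_records) : scored
end

records = [
  Rec.new('same direction', [ 1.0, 0.0 ]),
  Rec.new('orthogonal',     [ 0.0, 1.0 ]),
  Rec.new('opposite',       [ -1.0, 0.0 ]),
]

hits = find_records(records, [ 1.0, 0.0 ], min_similarity: 0.5)
puts hits.map(&:text).inspect
```

Because cosine similarity ranges from -1 to 1, a default of `-1` keeps every record, so the threshold only takes effect when a caller raises it.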
@@ -27,7 +27,7 @@ module Documentrix::Documents::Cache::Records
  # The to_s method returns a string representation of the object.
  #
  # @return [String] A string containing the text and tags of the record,
- # along with its similarity score.
+ # along with its similarity score.
  def to_s
  my_tags = tags_set
  my_tags.empty? or my_tags = " #{my_tags}"
@@ -23,7 +23,7 @@ class Documentrix::Documents::RedisCache
  # @param [String] prefix the string to be used as the prefix for this cache
  # @param [String] url the URL of the Redis server (default: ENV['REDIS_URL'])
  # @param [Class] object_class the class of objects stored in Redis (default: nil)
- def initialize(prefix:, url: ENV['REDIS_URL'], object_class: nil)
+ def initialize(prefix:, url: ENV['REDIS_URL'], object_class:)
  super(prefix:)
  url or raise ArgumentError, 'require redis url'
  @url, @object_class = url, object_class
@@ -46,7 +46,7 @@ class Documentrix::Documents::RedisCache
  def [](key)
  value = redis.get(pre(key))
  unless value.nil?
- object_class ? JSON.parse(value, object_class:) : JSON.parse(value)
+ JSON.parse(value, object_class:)
  end
  end
 
@@ -153,7 +153,7 @@ class Documentrix::Documents::RedisCache
 
  redis.scan_each(match: prefix + ?*) do |key|
  value = redis.get(key) or next
- value = object_class ? JSON.parse(value, object_class:) : JSON.parse(value)
+ value = JSON.parse(value, object_class:)
  block.(key, value)
  end
  end
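With `object_class:` now a required keyword, the Redis hunks above always pass it to `JSON.parse`. The sketch below shows how that option behaves in Ruby's standard `json` library. `StoredRecord` and `load_record` are hypothetical names, and the use of a `Hash` subclass is an assumption made so that the example runs without the gem's record class.

```ruby
require 'json'

# Hypothetical record class. JSON.parse instantiates object_class with .new
# and fills it through []=, so a Hash subclass is sufficient here.
class StoredRecord < Hash
  def text
    self['text']
  end
end

# Mirrors the shape of the new call: object_class is a required keyword.
def load_record(value, object_class:)
  JSON.parse(value, object_class: object_class)
end

payload = JSON.generate('text' => 'hello', 'tags' => %w[ a b ])
record  = load_record(payload, object_class: StoredRecord)
puts record.class
puts record.text
```

Omitting the keyword in this sketch raises `ArgumentError` at the call site, which is the usual effect of turning an optional keyword into a required one.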
@@ -46,17 +46,17 @@ class Documentrix::Documents::Cache::SQLiteCache
  result = execute(
  %{
  SELECT records.key, records.text, records.norm, records.source,
- records.tags, embeddings.embedding
+ records.digest, records.tags, embeddings.embedding
  FROM records
  INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
  WHERE records.key = ?
  },
  pre(key)
  )&.first or return
- key, text, norm, source, tags, embedding = *result
+ key, text, norm, source, digest, tags, embedding = *result
  embedding = embedding.unpack("f*")
  tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
- convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+ convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
  end
 
  # The []= method sets the value for a given key by inserting it into the
62
62
  # The []= method sets the value for a given key by inserting it into the
@@ -66,15 +66,16 @@ class Documentrix::Documents::Cache::SQLiteCache
66
66
  # @param [Hash, Documentrix::Documents::Record] value the hash or record
67
67
  # containing the text, embedding, and other metadata
68
68
  def []=(key, value)
69
- value = convert_value_to_record(value)
69
+ value = convert_value_to_record(value)
70
+ digest = compute_file_digest(value.source)
70
71
  embedding = value.embedding.pack("f*")
71
72
  execute(%{BEGIN})
72
73
  execute(%{INSERT INTO embeddings(embedding) VALUES(?)}, [ embedding ])
73
74
  embedding_id, = execute(%{ SELECT last_insert_rowid() }).flatten
74
75
  execute(%{
75
- INSERT INTO records(key,text,embedding_id,norm,source,tags)
76
- VALUES(?,?,?,?,?,?)
77
- }, [ pre(key), value.text, embedding_id, value.norm, value.source, JSON(value.tags) ])
76
+ INSERT INTO records(key,text,embedding_id,norm,source,digest,tags)
77
+ VALUES(?,?,?,?,?,?,?)
78
+ }, [ pre(key), value.text, embedding_id, value.norm, value.source, digest, JSON(value.tags) ])
78
79
  execute(%{COMMIT})
79
80
  end
80
81
 
@@ -157,17 +158,70 @@ class Documentrix::Documents::Cache::SQLiteCache
  self
  end
 
- # The clear_by_source method removes all records from the cache that
- # have a source matching the given source.
+ # Removes all records associated with the specified source from the cache.
  #
- # @param source [String] the source to filter records by
+ # If a digest is provided, the method will only remove records that do NOT
+ # match this digest. This allows for updating a source by wiping old versions
+ # while preserving records that are already up-to-date.
  #
- # @return [Documentrix::Documents::Cache::SQLiteCache] self
- def clear_by_source(source)
- execute(%{DELETE FROM records WHERE source = ?}, [ source ])
+ # @param source [String] the source identifier used to filter records
+ # @param digest [String, nil] the SHA256 hexadecimal digest of the source.
+ # Records matching this digest will be preserved.
+ #
+ # @return [self] the cache instance for method chaining
+ def clear_by_source(source, digest: nil, operator: ?=)
+ operator = '!=' if operator != ?=
+ if digest
+ execute(%{DELETE FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ])
+ else
+ execute(%{DELETE FROM records WHERE source = ?}, [ source ])
+ end
  self
  end
 
+ # The source_exist? method checks if any records associated with the given
+ # source exist in the cache. If a digest is provided, it verifies if the
+ # source exists and matches the specified digest using the provided operator.
+ #
+ # @param source [#to_s] the source to check for existence
+ # @param digest [String, nil] the SHA256 hexadecimal digest to compare
+ # against the stored source digest (optional)
+ # @param operator [String] the operator to use for comparison ('=' or '!=').
+ # Defaults to '='.
+ #
+ # @return [Boolean] true if the source exists (and matches the digest
+ # condition if provided), false otherwise.
+ def source_exist?(source, digest: nil, operator: ?=)
+ operator = '!=' if operator != ?=
+ if digest
+ !!execute(%{SELECT 1 FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ]).first
+ else
+ !!execute(%{SELECT 1 FROM records WHERE source = ?}, [ source ]).first
+ end
+ end
+
+ # Yields each unique, full source present in the cache records.
+ #
+ # This is a high-performance override for SQLite that avoids loading
+ # embeddings and parsing JSON for every record.
+ #
+ # @yield [source] the full source string
+ # @return [Enumerator] an enumerator if no block is given, nil otherwise.
+ def each_source(&block)
+ block or return enum_for(__method__)
+
+ execute(%{
+ SELECT DISTINCT source
+ FROM records
+ WHERE key LIKE ? AND source IS NOT NULL
+ }, [ "#@prefix%" ]).each do |source,|
+ source = source.full? or next
+
+ block.(source)
+ end
+ nil
+ end
+
  # Move a key prefix in the cache.
  #
  # This operation updates every record whose key starts with +old_prefix+,
@@ -208,14 +262,14 @@ class Documentrix::Documents::Cache::SQLiteCache
 
  execute(%{
  SELECT records.key, records.text, records.norm, records.source,
- records.tags, embeddings.embedding
+ records.digest, records.tags, embeddings.embedding
  FROM records
  INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
  WHERE records.key LIKE ?
- }, [ prefix ]).each do |key, text, norm, source, tags, embedding|
+ }, [ prefix ]).each do |key, text, norm, source, digest, tags, embedding|
  embedding = embedding.unpack("f*")
  tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
- value = convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+ value = convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
  block.(key, value)
  end
  self
@@ -275,34 +329,40 @@ class Documentrix::Documents::Cache::SQLiteCache
  # @param needle [ Array ] the embedding vector
  # @param tags [ Array ] the list of tags to filter by (optional)
  # @param max_records [ Integer ] the maximum number of records to return (optional)
+ # @param min_similarity [ Float ] the minimum similarity score to include (defaults to -1)
  #
  # @yield [ key, value ]
  #
  # @raise [ ArgumentError ] if needle size does not match embedding length
  #
  # @example
- # documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ])
+ # documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ], min_similarity: 0.7)
  #
  # @return [ Array<Documentrix::Documents::Record> ] the list of matching records
- def find_records(needle, tags: nil, max_records: nil)
+ def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
  needle.size != @embedding_length and
  raise ArgumentError, "needle embedding length != %s" % @embedding_length
  needle_binary = needle.pack("f*")
  max_records = [ max_records, size, 4_096 ].compact.min
  records = find_records_for_tags(tags)
  rowids_where = '(%s)' % records.transpose.last&.join(?,)
- execute(%{
- SELECT records.key, records.text, records.norm, records.source,
- records.tags, embeddings.embedding
- FROM records
- INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
- WHERE embeddings.rowid IN #{rowids_where}
- AND embeddings.embedding MATCH ? AND embeddings.k = ?
- }, [ needle_binary, max_records ]).map do |key, text, norm, source, tags, embedding|
+ execute(
+ %{
+ SELECT records.key, records.text, records.norm, records.source,
+ records.digest, records.tags, embeddings.embedding,
+ 1 - vec_distance_cosine(?, vec_f32(embeddings.embedding)) AS similarity
+ FROM records
+ INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
+ WHERE embeddings.rowid IN #{rowids_where}
+ AND embeddings.embedding MATCH ? AND similarity >= ?
+ AND embeddings.k = ?
+ ORDER BY similarity DESC
+ }, [ needle_binary, needle_binary, min_similarity, max_records ]
+ ).map do |key, text, norm, source, digest, tags, embedding, similarity|
  key = unpre(key)
  embedding = embedding.unpack("f*")
  tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
- convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+ convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:, similarity:)
  end
  end
 
@@ -362,10 +422,18 @@ class Documentrix::Documents::Cache::SQLiteCache
  embedding_id integer,
  norm float NOT NULL DEFAULT 0.0,
  source text,
+ digest text,
  tags json NOT NULL DEFAULT [],
  FOREIGN KEY(embedding_id) REFERENCES embeddings(id) ON DELETE CASCADE
  )
  }
+ execute %{
+ CREATE TRIGGER IF NOT EXISTS delete_embedding_after_record AFTER DELETE ON records
+ FOR EACH ROW
+ BEGIN
+ DELETE FROM embeddings WHERE rowid = OLD.embedding_id;
+ END
+ }
  nil
  end
 
@@ -1,15 +1,38 @@
  module Documentrix::Documents::Splitters
+ # The Character class provides basic text splitting based on a single
+ # separator and bundles the resulting segments into chunks of a maximum size.
+ #
+ # It allows for the preservation of separators and uses a combining string
+ # to join segments back together into chunks.
  class Character
+ include Documentrix::Documents::Splitters::Common
+
+ # The default regex used to identify paragraph boundaries.
+ # It matches two or more consecutive newline characters (CRLF or LF).
+ #
+ # @return [Regexp]
  DEFAULT_SEPARATOR = /(?:\r?\n){2,}/
 
- def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
- @separator, @include_separator, @combining_string, @chunk_size =
- separator, include_separator, combining_string, chunk_size
+ # Initializes a new Character splitter.
+ #
+ # @param separator [Regexp] the regex used to split the text (defaults to DEFAULT_SEPARATOR)
+ # @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
+ # @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
+ # @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
+ # @param force [Boolean] whether to force-split the final chunk if it exceeds `chunk_size` (defaults to false)
+ def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false)
+ @separator, @include_separator, @combining_string, @chunk_size, @force =
+ separator, include_separator, combining_string, chunk_size, force
  if include_separator
  @separator = Regexp.new("(#@separator)")
  end
  end
 
+ # Splits the given text into chunks based on the configured separator and
+ # size limit.
+ #
+ # @param text [String] the text to be split
+ # @return [Array<String>] an array of text chunks
  def split(text)
  texts = []
  text.split(@separator) do |t|
@@ -29,12 +52,27 @@ module Documentrix::Documents::Splitters
  current_text = t
  end
  end
- current_text.empty? or result << current_text
+ result.concat force_split(current_text)
  result
  end
  end
 
+ # The RecursiveCharacter class implements a hierarchical splitting strategy.
+ #
+ # It attempts to split text using a priority list of separators. If a
+ # resulting chunk is still larger than the specified chunk_size, it
+ # recursively applies the next separator in the list until the size limit is
+ # met or all separators have been exhausted.
  class RecursiveCharacter
+ include Documentrix::Documents::Splitters::Common
+
+ # The default priority list of regexes used for recursive splitting.
+ # The strategy is to split by the coarsest grain first (paragraphs)
+ # and move toward the finest grain (individual characters) as needed.
+ #
+ # Order: Paragraphs -> Newlines -> Word Boundaries -> Characters
+ #
+ # @return [Array<Regexp>]
  DEFAULT_SEPARATORS = [
  /(?:\r?\n){2,}/,
  /\r?\n/,
@@ -42,13 +80,27 @@ module Documentrix::Documents::Splitters
  //,
  ].freeze
 
+ # Initializes a new RecursiveCharacter splitter.
+ #
+ # @param separators [Array<Regexp>] a priority list of regexes to use for splitting (defaults to DEFAULT_SEPARATORS)
+ # @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
+ # @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
+ # @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
+ # @raise [ArgumentError] if the separators array is empty
  def initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
  separators.empty? and
  raise ArgumentError, "non-empty array of separators required"
  @separators, @include_separator, @combining_string, @chunk_size =
  separators, include_separator, combining_string, chunk_size
+ @force = separators.last == //
  end
 
+ # Recursively splits the given text into chunks using the list of
+ # separators.
+ #
+ # @param text [String] the text to be split
+ # @param separators [Array<Regexp>] the list of separators to use (defaults to @separators)
+ # @return [Array<String>] an array of text chunks
  def split(text, separators: @separators)
  separators.empty? and return [ text ]
  separators = separators.dup
@@ -0,0 +1,38 @@
+ # A shared utility module for text splitters that provides consistent
+ # handling of chunk size constraints.
+ #
+ # This module is intended to be included in splitter classes that
+ # implement a maximum chunk size limit. It expects the including class
+ # to provide the following attributes:
+ # - `force` [Boolean]: Whether to hard-split chunks that exceed the limit.
+ # - `chunk_size` [Integer]: The maximum allowed size for a single chunk.
+ module Documentrix::Documents::Splitters::Common
+ private
+
+ # Whether to force-split chunks that exceed the chunk size limit.
+ # @return [Boolean]
+ attr_reader :force
+
+ # The maximum allowed size for a single chunk.
+ # @return [Integer]
+ attr_reader :chunk_size
+
+ # Ensures text respects the chunk size limit if force splitting is enabled.
+ #
+ # If the `force` attribute is true and the provided text exceeds the
+ # `chunk_size`, the text is hard-split into fixed-size chunks using a
+ # regular expression. If `force` is false or the text is within the
+ # limit, the text is returned wrapped in a single-element array to
+ # maintain return-type consistency (Array<String>).
+ #
+ # @param text [String, nil] the text to potentially split
+ # @return [Array<String>] the resulting chunk(s), or an empty array if text is nil/empty
+ def force_split(text)
+ text&.empty? and return []
+ if force && text.size > chunk_size
+ text.scan(/.{1,#{chunk_size}}/)
+ else
+ Array(text)
+ end
+ end
+ end
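The hard-split step in `force_split` above relies on `String#scan` with a bounded quantifier. The standalone sketch below restates that step outside the module so it can be run directly. `hard_split` is a hypothetical name, and unlike the method in the diff it takes `chunk_size` and `force` as arguments and returns early for `nil` as well as for empty strings.

```ruby
# Hypothetical standalone variant of the hard-split step, for illustration.
def hard_split(text, chunk_size:, force: true)
  return [] if text.nil? || text.empty?

  if force && text.size > chunk_size
    # Greedily take runs of 1..chunk_size characters. Without the /m flag
    # `.` does not match "\n", so a newline ends the current run and is
    # itself left out of the result.
    text.scan(/.{1,#{chunk_size}}/)
  else
    [ text ]
  end
end

p hard_split('abcdefghij', chunk_size: 4)               # => ["abcd", "efgh", "ij"]
p hard_split("ab\ncdef", chunk_size: 4)                 # => ["ab", "cdef"]
p hard_split('abcdefghij', chunk_size: 4, force: false) # => ["abcdefghij"]
```

Since the pattern is the same one used in the diff, the newline behaviour shown in the second call is worth keeping in mind when the forced chunks need to be rejoined into the original text.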