documentrix 0.1.1 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: ef2b0d8150c99cc22cc17d7d29d0331729b06d116e593fd32289852eb7739228
-   data.tar.gz: 386b85971a711fd6ba73aad746e98df33c14109c5524b48dd8fa20f3059fc875
+   metadata.gz: 68c409476101e4632597c139494f3fc1fe67bc0af7a655e5e5b3e4ebbd58f58c
+   data.tar.gz: fcd07ca7694b3fbed81c3a25f3fa9d9a675d120e5b82b43a13f5f71970012f3a
  SHA512:
-   metadata.gz: 7b1abe342b523199e58724d1b9ec66ce56654389dc010fc59c2847647825876b8c03e1511c5b007dd97ee8f19f4e79caf132760dbb549906d5febdc9ac403bc1
-   data.tar.gz: 5b638e7af884173350e48dcdb83a81d2adeb6c6a4f90dc20243b690d888f083d95edcd34e00fd3bffc9435aa8a29db151b0c1c6311f696d39379e397379ef9df
+   metadata.gz: cf1e95f2d994bb130b89bff9d2e43062621b849b5d0f7fb4543092b1f491bab4b9dd9e51047c3dfb216090a4f9bc853c08ba35719dea7409c34a15758070fcba
+   data.tar.gz: 59aa3347c07c2521f5661326a55d73733470ef30c62e86fc7d1ff53a4acb6cbc2da06e4c7814896b59ca16bdfab8933a7018f3f41ab30ce29593c5fac480f342
data/CHANGES.md CHANGED
@@ -1,5 +1,97 @@
  # Changes

+ ## 2026-05-17 v0.3.0
+
+ ### New Features
+
+ - **Source Tracking & Versioning**:
+   - Introduced `Documentrix::Utils::Digests` for SHA256 hashing of strings
+     and files, including an `mtime`-based cache.
+   - Implemented source-based document management in `Documentrix::Documents`
+     via `normalize_source`, `source_exist?`, `source_modified?`,
+     `source_update`, and `source_remove`.
+   - Updated `Documentrix::Documents#add` and
+     `Documentrix::Documents#source_update` to support `digest` for version
+     tracking.
+ - **Text Splitting**:
+   - Added `Documentrix::Documents::Splitters::Common` to implement
+     `force_split` behavior.
+   - Integrated `force` splitting into `Character`, `RecursiveCharacter`, and
+     `Semantic` splitters.
+ - **Cache Enhancements**:
+   - Added `each_source` to `Documentrix::Documents::Cache::Common` and an
+     optimized `SELECT DISTINCT source` implementation in
+     `Documentrix::Documents::Cache::SQLiteCache`.
+   - Added a SQLite trigger `delete_embedding_after_record` to automatically
+     clean the `embeddings` table.
+
+ ### Improvements & Refactorings
+
+ - **Search & Retrieval**:
+   - Added `min_similarity` parameter to `Documentrix::Documents#find`,
+     `Documentrix::Documents::Cache::Common#find_records`, and
+     `Documentrix::Documents::Cache::SQLiteCache#find_records`.
+   - Optimized `Documentrix::Documents::Cache::SQLiteCache#find_records` by
+     moving similarity calculations into the SQL query using `1 -
+     vec_distance_cosine`.
+   - Simplified `Documentrix::Documents#find_where` by streamlining
+     `take_while` logic and utilizing `opts[:max_records]`.
+ - **Cache Implementations**:
+   - Made `object_class` a required keyword argument in
+     `Documentrix::Documents::RedisCache#initialize`.
+   - Refactored `Documentrix::Documents::Cache::Common#clear_by_source` and
+     `Documentrix::Documents::Cache::Common#source_exist?` to use ternary
+     operators.
+   - Improved `Documentrix::Documents::Cache::SQLiteCache#each_source` and
+     `Documentrix::Documents::Cache::SQLiteCache#find_records` for better
+     robustness and formatting.
+ - **Documentation & Tooling**:
+   - Expanded YARD documentation for
+     `Documentrix::Documents::Splitters::Character`, `RecursiveCharacter`,
+     `Semantic`, and `Documentrix::Utils::ColorizeTexts`.
+   - Centralized RSpec configuration via a `.rspec` file.
+
+ ### Bug Fixes
+
+ - Fixed an issue in `Documentrix::Documents#find` where `max_records` was
+   hardcoded to `nil` when calling the cache.
+ - Adjusted default handling of `min_similarity` in
+   `Documentrix::Documents#find` to use `min_similarity ||= -1` within the
+   method body.
+
+ ### Testing
+
+ - Significantly expanded test suites for `SQLiteCache`, `MemoryCache`, and
+   `RedisCache`, specifically covering `each_source`, `tags`, `clear_for_tags`,
+   and digest-based checks.
+ - Added new test cases in `spec/documents_spec.rb` for source management and
+   `Documentrix::Documents#source_update`.
+ - Added `spec/utils/digests_spec.rb` and updated splitter specs to verify
+   `force` splitting behavior.
+
+ ## 2026-05-12 v0.2.0
+
+ ### Added
+
+ - Implemented source-based document removal by adding the `remove` method to
+   `Documentrix::Documents`.
+ - Added `clear_by_source` to `Documentrix::Documents::Cache::Common` as the
+   default cache implementation.
+ - Added an optimized `clear_by_source` override in
+   `Documentrix::Documents::Cache::SQLiteCache` utilizing a direct SQL `DELETE`
+   query.
+
+ ### Changed
+
+ - Updated `documentrix.gemspec` to use `rubygems_version` **4.0.10**.
+ - Updated `gem_hadar` dependency to **2.17.1**.
+
+ ### Testing
+
+ - Expanded test coverage in `spec/documents_spec.rb`,
+   `spec/documentrix/documents/cache/interface_spec.rb`, and all specific cache
+   specs.
+
  ## 2026-03-31 v0.1.1

  - Improved compatibility and reliability by ensuring the gem uses a stable,
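The changelog entry for `Documentrix::Utils::Digests` mentions SHA256 hashing of strings and files with an `mtime`-based cache, but the module itself is not part of this diff. The following is only a minimal sketch of that idea under stated assumptions: `DigestSketch`, `CACHE`, `string_digest`, and `file_digest` are hypothetical names, not the gem's API.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper illustrating an mtime-based digest cache; this is
# NOT the gem's Documentrix::Utils::Digests implementation.
module DigestSketch
  # Maps a file path to [ mtime, hexdigest ].
  CACHE = {}

  # Returns the SHA256 hex digest of a string.
  def self.string_digest(string)
    Digest::SHA256.hexdigest(string)
  end

  # Returns the SHA256 hex digest of a file, re-reading the file only
  # when its modification time differs from the cached one.
  def self.file_digest(path)
    mtime = File.mtime(path)
    cached_mtime, cached_digest = CACHE[path]
    return cached_digest if cached_mtime == mtime
    digest = Digest::SHA256.file(path).hexdigest
    CACHE[path] = [ mtime, digest ]
    digest
  end
end

file = Tempfile.new('digest-sketch')
file.write('hello')
file.close # keeps the file on disk until it is unlinked

first  = DigestSketch.file_digest(file.path) # computed from the file
second = DigestSketch.file_digest(file.path) # served from the cache
```

Keying the cache on the modification time avoids re-hashing unchanged files, at the cost of missing edits that leave the `mtime` untouched.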
data/Rakefile CHANGED
@@ -33,7 +33,7 @@ GemHadar do
    dependency 'infobar', '~> 0.9'
    dependency 'json', '~> 2.0'
    dependency 'tins', '~> 1.34'
-   dependency 'sqlite-vec', '>= 0.1.8'
+   dependency 'sqlite-vec', '>= 0.1.9'
    dependency 'sqlite3', '~> 2.0', '>= 2.0.1'
    dependency 'kramdown-ansi', '~> 0.0', '>= 0.0.1'
    dependency 'numo-narray-alt', '~> 0.9'
data/documentrix.gemspec CHANGED
@@ -1,9 +1,9 @@
  # -*- encoding: utf-8 -*-
- # stub: documentrix 0.1.1 ruby lib
+ # stub: documentrix 0.3.0 ruby lib

  Gem::Specification.new do |s|
  s.name = "documentrix".freeze
- s.version = "0.1.1".freeze
+ s.version = "0.3.0".freeze

  s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
  s.require_paths = ["lib".freeze]
@@ -11,19 +11,19 @@ Gem::Specification.new do |s|
  s.date = "1980-01-02"
  s.description = "The Ruby library, Documentrix, is designed to provide a way to build and\nquery vector databases for applications in natural language processing\n(NLP) and large language models (LLMs). It allows users to store and\nretrieve dense vector embeddings for text strings.\n".freeze
  s.email = "flori@ping.de".freeze
- s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
- s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
+ s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
+ s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
  s.homepage = "https://github.com/flori/documentrix".freeze
  s.licenses = ["MIT".freeze]
  s.rdoc_options = ["--title".freeze, "Documentrix - Ruby library for embedding vector database".freeze, "--main".freeze, "README.md".freeze]
  s.required_ruby_version = Gem::Requirement.new(">= 3.1".freeze)
- s.rubygems_version = "4.0.8".freeze
+ s.rubygems_version = "4.0.10".freeze
  s.summary = "Ruby library for embedding vector database".freeze
- s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
+ s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]

  s.specification_version = 4

- s.add_development_dependency(%q<gem_hadar>.freeze, [">= 2.17.0".freeze])
+ s.add_development_dependency(%q<gem_hadar>.freeze, [">= 2.17.1".freeze])
  s.add_development_dependency(%q<all_images>.freeze, ["~> 0.12".freeze])
  s.add_development_dependency(%q<rspec>.freeze, ["~> 3.2".freeze])
  s.add_development_dependency(%q<kramdown>.freeze, ["~> 2.0".freeze])
@@ -32,7 +32,7 @@ Gem::Specification.new do |s|
  s.add_runtime_dependency(%q<infobar>.freeze, ["~> 0.9".freeze])
  s.add_runtime_dependency(%q<json>.freeze, ["~> 2.0".freeze])
  s.add_runtime_dependency(%q<tins>.freeze, ["~> 1.34".freeze])
- s.add_runtime_dependency(%q<sqlite-vec>.freeze, [">= 0.1.8".freeze])
+ s.add_runtime_dependency(%q<sqlite-vec>.freeze, [">= 0.1.9".freeze])
  s.add_runtime_dependency(%q<sqlite3>.freeze, ["~> 2.0".freeze, ">= 2.0.1".freeze])
  s.add_runtime_dependency(%q<kramdown-ansi>.freeze, ["~> 0.0".freeze, ">= 0.0.1".freeze])
  s.add_runtime_dependency(%q<numo-narray-alt>.freeze, ["~> 0.9".freeze])
data/lib/documentrix/documents/cache/common.rb CHANGED
@@ -12,6 +12,7 @@
  # memory, Redis, and SQLite.
  module Documentrix::Documents::Cache::Common
    include Documentrix::Utils::Math
+   include Documentrix::Utils::Digests
    include Enumerable

    # The initialize method sets up the Documentrix::Documents::Cache instance's
@@ -62,27 +63,29 @@ module Documentrix::Documents::Cache::Common
    # @param needle [ Array ] an array containing the embedding vector
    # @param tags [ String, Array ] a string or array of strings representing the tags to search for
    # @param max_records [ Integer ] the maximum number of records to return
+   # @param min_similarity [ Float ] the minimum similarity score required for a record to be returned (defaults to -1)
    #
-   # @yield [ record ]
-   #
-   # @return [ Array<Documentrix::Documents::Records> ] an array containing the matching records
-   def find_records(needle, tags: nil, max_records: nil)
+   # @return [ Array<Documentrix::Documents::Record> ] an array containing the matching records
+   def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
      tags = Documentrix::Utils::Tags.new(Array(tags)).to_a
      records = self
      if tags.present?
        records = records.select { |_key, record| (tags & record.tags).size >= 1 }
      end
+
      needle_norm = norm(needle)
-     records = records.sort_by { |key, record|
+     records = records.map do |key, record|
        record.key = key
        record.similarity = cosine_similarity(
-         a: needle,
-         b: record.embedding,
+         a: needle,
+         b: record.embedding,
          a_norm: needle_norm,
          b_norm: record.norm,
        )
-     }
-     records.transpose.last&.reverse.to_a
+       record
+     end.sort_by(&:similarity).reverse.select { _1.similarity >= min_similarity }
+
+     max_records ? records.take(max_records) : records
    end

  # Returns a set of unique tags found in the cache records.
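The reworked `find_records` above scores every record, sorts by descending similarity, drops records below `min_similarity`, and only then applies `max_records`. A self-contained sketch of that pipeline follows; `SketchRecord`, this standalone `cosine_similarity`, and the free-standing `find_records` are illustrative stand-ins, not the gem's classes or signatures.

```ruby
# Hypothetical record type; the gem's own record class carries more fields.
SketchRecord = Struct.new(:text, :embedding, :similarity)

# Plain cosine similarity of two equally sized numeric arrays.
def cosine_similarity(a, b)
  dot  = a.zip(b).sum { |x, y| x * y }
  norm = ->(v) { Math.sqrt(v.sum { |x| x * x }) }
  dot / (norm.(a) * norm.(b))
end

# Scores all records, sorts best-first, filters by min_similarity and
# finally truncates to max_records (if given).
def find_records(records, needle, max_records: nil, min_similarity: -1)
  found = records.map { |record|
    record.similarity = cosine_similarity(needle, record.embedding)
    record
  }.sort_by(&:similarity).reverse.select { |r| r.similarity >= min_similarity }
  max_records ? found.take(max_records) : found
end

records = [
  SketchRecord.new('same',       [  1.0, 0.0 ]),
  SketchRecord.new('orthogonal', [  0.0, 1.0 ]),
  SketchRecord.new('opposite',   [ -1.0, 0.0 ]),
]

all   = find_records(records, [ 1.0, 0.0 ])
close = find_records(records, [ 1.0, 0.0 ], min_similarity: 0.5)
top2  = find_records(records, [ 1.0, 0.0 ], max_records: 2)
```

Because cosine similarity ranges from -1 to 1, the default of `-1` keeps every record, which matches the default shown in the diff.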
@@ -116,6 +119,68 @@ module Documentrix::Documents::Cache::Common
      self
    end

+   # Yields each unique, full source present in the cache records.
+   #
+   # @yield [source] the full source string
+   # @return [Enumerator] an enumerator if no block is given, nil otherwise.
+   def each_source(&block)
+     block or return enum_for(__method__)
+     seen = {}
+     each do |_key, record|
+       source = record.source.full? or next
+       seen.key?(source) and next
+       seen[source] = true
+       block.(source)
+     end
+     nil
+   end
+
+   # The clear_by_source method removes all records from the cache that
+   # have a source matching the given source.
+   #
+   # @param source [String] the source to filter records by
+   # @param digest [String, nil] the SHA256 hexadecimal digest of the source.
+   # @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
+   #
+   # @return [self] self
+   def clear_by_source(source, digest: nil, operator: ?=)
+     operator = operator == '=' ? '==' : '!='
+
+     each do |key, record|
+       next unless record.source == source
+       if digest
+         should_delete = record.digest.send(operator, digest)
+         delete(unpre(key)) if should_delete
+       else
+         delete(unpre(key))
+       end
+     end
+     self
+   end
+
+   # Checks if any records associated with the given source exist in the cache.
+   #
+   # @param source [String] the source to check for existence
+   # @param digest [String, nil] the SHA256 hexadecimal digest to compare against
+   # @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
+   #
+   # @return [Boolean] true if a matching record is found, false otherwise.
+   def source_exist?(source, digest: nil, operator: ?=)
+     operator = operator == '=' ? '==' : '!='
+
+     each do |_, record|
+       next unless record.source == source
+       if digest
+         if record.digest.send(operator, digest)
+           return true
+         end
+       else
+         return true
+       end
+     end
+     false
+   end
+
    # The clear method removes cached records based on the provided tags or
    # clears all records with the current prefix.
  #
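The `digest:` and `operator:` keywords added to `clear_by_source` and `source_exist?` in this hunk make it possible to target only stale records of a source (digest differs) or only current ones (digest matches). The in-memory sketch below illustrates those semantics under assumptions: `SourceRecord` and `SourceStore` are hypothetical, and it uses `delete_if` rather than the gem's `each`/`delete` loop.

```ruby
# Hypothetical record and store used only to illustrate digest/operator
# filtering; they are not part of Documentrix.
SourceRecord = Struct.new(:source, :digest)

class SourceStore
  def initialize
    @records = {}
  end

  def []=(key, record)
    @records[key] = record
  end

  def size
    @records.size
  end

  # Without a digest every record of the source is removed; with a digest
  # only records whose digest compares as requested ('=' or '!=') go away.
  def clear_by_source(source, digest: nil, operator: '=')
    @records.delete_if { |_key, record| match?(record, source, digest, operator) }
    self
  end

  # True if at least one record of the source satisfies the digest condition.
  def source_exist?(source, digest: nil, operator: '=')
    @records.each_value.any? { |record| match?(record, source, digest, operator) }
  end

  private

  def match?(record, source, digest, operator)
    return false unless record.source == source
    return true if digest.nil?
    compare = operator.to_s == '=' ? :== : :!=
    record.digest.public_send(compare, digest)
  end
end

store = SourceStore.new
store[:a] = SourceRecord.new('doc.md', 'old')
store[:b] = SourceRecord.new('doc.md', 'new')

# Are there records of doc.md whose digest is NOT 'new'?
stale = store.source_exist?('doc.md', digest: 'new', operator: '!=')
# Drop exactly those stale records and keep the up-to-date one.
store.clear_by_source('doc.md', digest: 'new', operator: '!=')
```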
data/lib/documentrix/documents/cache/records.rb CHANGED
@@ -27,7 +27,7 @@ module Documentrix::Documents::Cache::Records
    # The to_s method returns a string representation of the object.
    #
    # @return [String] A string containing the text and tags of the record,
-   # along with its similarity score.
+   # along with its similarity score.
    def to_s
      my_tags = tags_set
      my_tags.empty? or my_tags = " #{my_tags}"
data/lib/documentrix/documents/cache/redis_cache.rb CHANGED
@@ -23,7 +23,7 @@ class Documentrix::Documents::RedisCache
    # @param [String] prefix the string to be used as the prefix for this cache
    # @param [String] url the URL of the Redis server (default: ENV['REDIS_URL'])
    # @param [Class] object_class the class of objects stored in Redis (default: nil)
-   def initialize(prefix:, url: ENV['REDIS_URL'], object_class: nil)
+   def initialize(prefix:, url: ENV['REDIS_URL'], object_class:)
      super(prefix:)
      url or raise ArgumentError, 'require redis url'
      @url, @object_class = url, object_class
@@ -46,7 +46,7 @@ class Documentrix::Documents::RedisCache
    def [](key)
      value = redis.get(pre(key))
      unless value.nil?
-       object_class ? JSON.parse(value, object_class:) : JSON.parse(value)
+       JSON.parse(value, object_class:)
      end
    end

@@ -153,7 +153,7 @@ class Documentrix::Documents::RedisCache

      redis.scan_each(match: prefix + ?*) do |key|
        value = redis.get(key) or next
-       value = object_class ? JSON.parse(value, object_class:) : JSON.parse(value)
+       value = JSON.parse(value, object_class:)
        block.(key, value)
      end
  end
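With `object_class:` now a required keyword, `RedisCache` always passes it to `JSON.parse`, so stored values are deserialized into that class instead of a plain `Hash`. The sketch below shows only the `JSON.parse(value, object_class:)` part with the standard `json` library; `SketchRecord` is a hypothetical class, and no Redis connection is involved.

```ruby
require 'json'

# Hypothetical record class. JSON.parse instantiates object_class and
# fills it via []=, so a Hash subclass is sufficient for this sketch.
class SketchRecord < Hash
  def text
    self['text']
  end
end

# What a cache might have stored as a serialized value.
value = JSON.generate('text' => 'hello', 'tags' => %w[ a b ])

# Same keyword shorthand as in the diff: object_class: object_class
object_class = SketchRecord
record = JSON.parse(value, object_class:)
```

Callers that previously relied on the `object_class: nil` default would need to pass the keyword explicitly after this change, since Ruby raises `ArgumentError` for a missing required keyword.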
data/lib/documentrix/documents/cache/sqlite_cache.rb CHANGED
@@ -46,17 +46,17 @@ class Documentrix::Documents::Cache::SQLiteCache
      result = execute(
        %{
          SELECT records.key, records.text, records.norm, records.source,
-           records.tags, embeddings.embedding
+           records.digest, records.tags, embeddings.embedding
          FROM records
          INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
          WHERE records.key = ?
        },
        pre(key)
      )&.first or return
-     key, text, norm, source, tags, embedding = *result
+     key, text, norm, source, digest, tags, embedding = *result
      embedding = embedding.unpack("f*")
      tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-     convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+     convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
    end

    # The []= method sets the value for a given key by inserting it into the
@@ -66,15 +66,16 @@ class Documentrix::Documents::Cache::SQLiteCache
    # @param [Hash, Documentrix::Documents::Record] value the hash or record
    # containing the text, embedding, and other metadata
    def []=(key, value)
-     value = convert_value_to_record(value)
+     value = convert_value_to_record(value)
+     digest = compute_file_digest(value.source)
      embedding = value.embedding.pack("f*")
      execute(%{BEGIN})
      execute(%{INSERT INTO embeddings(embedding) VALUES(?)}, [ embedding ])
      embedding_id, = execute(%{ SELECT last_insert_rowid() }).flatten
      execute(%{
-       INSERT INTO records(key,text,embedding_id,norm,source,tags)
-       VALUES(?,?,?,?,?,?)
-     }, [ pre(key), value.text, embedding_id, value.norm, value.source, JSON(value.tags) ])
+       INSERT INTO records(key,text,embedding_id,norm,source,digest,tags)
+       VALUES(?,?,?,?,?,?,?)
+     }, [ pre(key), value.text, embedding_id, value.norm, value.source, digest, JSON(value.tags) ])
      execute(%{COMMIT})
    end

@@ -157,6 +158,70 @@ class Documentrix::Documents::Cache::SQLiteCache
      self
    end

+   # Removes all records associated with the specified source from the cache.
+   #
+   # If a digest is provided, the method will only remove records that do NOT
+   # match this digest. This allows for updating a source by wiping old versions
+   # while preserving records that are already up-to-date.
+   #
+   # @param source [String] the source identifier used to filter records
+   # @param digest [String, nil] the SHA256 hexadecimal digest of the source.
+   # Records matching this digest will be preserved.
+   #
+   # @return [self] the cache instance for method chaining
+   def clear_by_source(source, digest: nil, operator: ?=)
+     operator = '!=' if operator != ?=
+     if digest
+       execute(%{DELETE FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ])
+     else
+       execute(%{DELETE FROM records WHERE source = ?}, [ source ])
+     end
+     self
+   end
+
+   # The source_exist? method checks if any records associated with the given
+   # source exist in the cache. If a digest is provided, it verifies if the
+   # source exists and matches the specified digest using the provided operator.
+   #
+   # @param source [#to_s] the source to check for existence
+   # @param digest [String, nil] the SHA256 hexadecimal digest to compare
+   # against the stored source digest (optional)
+   # @param operator [String] the operator to use for comparison ('=' or '!=').
+   # Defaults to '='.
+   #
+   # @return [Boolean] true if the source exists (and matches the digest
+   # condition if provided), false otherwise.
+   def source_exist?(source, digest: nil, operator: ?=)
+     operator = '!=' if operator != ?=
+     if digest
+       !!execute(%{SELECT 1 FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ]).first
+     else
+       !!execute(%{SELECT 1 FROM records WHERE source = ?}, [ source ]).first
+     end
+   end
+
+   # Yields each unique, full source present in the cache records.
+   #
+   # This is a high-performance override for SQLite that avoids loading
+   # embeddings and parsing JSON for every record.
+   #
+   # @yield [source] the full source string
+   # @return [Enumerator] an enumerator if no block is given, nil otherwise.
+   def each_source(&block)
+     block or return enum_for(__method__)
+
+     execute(%{
+       SELECT DISTINCT source
+       FROM records
+       WHERE key LIKE ? AND source IS NOT NULL
+     }, [ "#@prefix%" ]).each do |source,|
+       source = source.full? or next
+
+       block.(source)
+     end
+     nil
+   end
+
    # Move a key prefix in the cache.
    #
  # This operation updates every record whose key starts with +old_prefix+,
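Both `each_source` implementations in this diff follow the same Ruby idiom: return `enum_for(__method__)` when no block is given, otherwise yield each distinct non-empty source and return `nil`. A stand-alone sketch of that idiom is shown below; `SourceList` is hypothetical, and a plain nil/empty check stands in for the `full?` helper used in the gem's code.

```ruby
# Hypothetical container demonstrating the enum_for(__method__) idiom.
class SourceList
  def initialize(sources)
    @sources = sources
  end

  # Yields each unique, non-empty source once. Returns an Enumerator when
  # called without a block and nil when called with one.
  def each_source(&block)
    block or return enum_for(__method__)
    seen = {}
    @sources.each do |source|
      next if source.nil? || source.empty? # stand-in for `full?`
      next if seen.key?(source)
      seen[source] = true
      block.(source)
    end
    nil
  end
end

list = SourceList.new([ 'a.md', 'b.md', 'a.md', nil, '' ])

unique  = list.each_source.to_a          # Enumerator form
yielded = []
result  = list.each_source { |source| yielded << source } # block form
```

The SQLite override gets the same effect from `SELECT DISTINCT source`, which pushes the de-duplication into the database instead of a Ruby-side `seen` hash.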
@@ -197,14 +262,14 @@ class Documentrix::Documents::Cache::SQLiteCache

      execute(%{
        SELECT records.key, records.text, records.norm, records.source,
-         records.tags, embeddings.embedding
+         records.digest, records.tags, embeddings.embedding
        FROM records
        INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
        WHERE records.key LIKE ?
-     }, [ prefix ]).each do |key, text, norm, source, tags, embedding|
+     }, [ prefix ]).each do |key, text, norm, source, digest, tags, embedding|
        embedding = embedding.unpack("f*")
        tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-       value = convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+       value = convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
        block.(key, value)
      end
      self
@@ -264,34 +329,40 @@ class Documentrix::Documents::Cache::SQLiteCache
    # @param needle [ Array ] the embedding vector
    # @param tags [ Array ] the list of tags to filter by (optional)
    # @param max_records [ Integer ] the maximum number of records to return (optional)
+   # @param min_similarity [ Float ] the minimum similarity score to include (defaults to -1)
    #
    # @yield [ key, value ]
    #
    # @raise [ ArgumentError ] if needle size does not match embedding length
    #
    # @example
-   # documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ])
+   # documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ], min_similarity: 0.7)
    #
    # @return [ Array<Documentrix::Documents::Record> ] the list of matching records
-   def find_records(needle, tags: nil, max_records: nil)
+   def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
      needle.size != @embedding_length and
        raise ArgumentError, "needle embedding length != %s" % @embedding_length
      needle_binary = needle.pack("f*")
      max_records = [ max_records, size, 4_096 ].compact.min
      records = find_records_for_tags(tags)
      rowids_where = '(%s)' % records.transpose.last&.join(?,)
-     execute(%{
-       SELECT records.key, records.text, records.norm, records.source,
-         records.tags, embeddings.embedding
-       FROM records
-       INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
-       WHERE embeddings.rowid IN #{rowids_where}
-         AND embeddings.embedding MATCH ? AND embeddings.k = ?
-     }, [ needle_binary, max_records ]).map do |key, text, norm, source, tags, embedding|
+     execute(
+       %{
+         SELECT records.key, records.text, records.norm, records.source,
+           records.digest, records.tags, embeddings.embedding,
+           1 - vec_distance_cosine(?, vec_f32(embeddings.embedding)) AS similarity
+         FROM records
+         INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
+         WHERE embeddings.rowid IN #{rowids_where}
+           AND embeddings.embedding MATCH ? AND similarity >= ?
+           AND embeddings.k = ?
+         ORDER BY similarity DESC
+       }, [ needle_binary, needle_binary, min_similarity, max_records ]
+     ).map do |key, text, norm, source, digest, tags, embedding, similarity|
        key = unpre(key)
        embedding = embedding.unpack("f*")
        tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-       convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+       convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:, similarity:)
      end
    end

@@ -351,10 +422,18 @@ class Documentrix::Documents::Cache::SQLiteCache
          embedding_id integer,
          norm float NOT NULL DEFAULT 0.0,
          source text,
+         digest text,
          tags json NOT NULL DEFAULT [],
          FOREIGN KEY(embedding_id) REFERENCES embeddings(id) ON DELETE CASCADE
        )
      }
+     execute %{
+       CREATE TRIGGER IF NOT EXISTS delete_embedding_after_record AFTER DELETE ON records
+       FOR EACH ROW
+       BEGIN
+         DELETE FROM embeddings WHERE rowid = OLD.embedding_id;
+       END
+     }
      nil
    end

data/lib/documentrix/documents/splitters/character.rb CHANGED
@@ -1,15 +1,38 @@
  module Documentrix::Documents::Splitters
+   # The Character class provides basic text splitting based on a single
+   # separator and bundles the resulting segments into chunks of a maximum size.
+   #
+   # It allows for the preservation of separators and uses a combining string
+   # to join segments back together into chunks.
    class Character
+     include Documentrix::Documents::Splitters::Common
+
+     # The default regex used to identify paragraph boundaries.
+     # It matches two or more consecutive newline characters (CRLF or LF).
+     #
+     # @return [Regexp]
      DEFAULT_SEPARATOR = /(?:\r?\n){2,}/

-     def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
-       @separator, @include_separator, @combining_string, @chunk_size =
-         separator, include_separator, combining_string, chunk_size
+     # Initializes a new Character splitter.
+     #
+     # @param separator [Regexp] the regex used to split the text (defaults to DEFAULT_SEPARATOR)
+     # @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
+     # @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
+     # @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
+     # @param force [Boolean] whether to force-split the final chunk if it exceeds `chunk_size` (defaults to false)
+     def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false)
+       @separator, @include_separator, @combining_string, @chunk_size, @force =
+         separator, include_separator, combining_string, chunk_size, force
        if include_separator
          @separator = Regexp.new("(#@separator)")
        end
      end

+     # Splits the given text into chunks based on the configured separator and
+     # size limit.
+     #
+     # @param text [String] the text to be split
+     # @return [Array<String>] an array of text chunks
      def split(text)
        texts = []
        text.split(@separator) do |t|
@@ -29,12 +52,27 @@ module Documentrix::Documents::Splitters
            current_text = t
          end
        end
-       current_text.empty? or result << current_text
+       result.concat force_split(current_text)
        result
      end
    end

+   # The RecursiveCharacter class implements a hierarchical splitting strategy.
+   #
+   # It attempts to split text using a priority list of separators. If a
+   # resulting chunk is still larger than the specified chunk_size, it
+   # recursively applies the next separator in the list until the size limit is
+   # met or all separators have been exhausted.
    class RecursiveCharacter
+     include Documentrix::Documents::Splitters::Common
+
+     # The default priority list of regexes used for recursive splitting.
+     # The strategy is to split by the coarsest grain first (paragraphs)
+     # and move toward the finest grain (individual characters) as needed.
+     #
+     # Order: Paragraphs -> Newlines -> Word Boundaries -> Characters
+     #
+     # @return [Array<Regexp>]
      DEFAULT_SEPARATORS = [
        /(?:\r?\n){2,}/,
  /\r?\n/,
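In the `Character#split` hunk above, the final `current_text.empty? or result << current_text` is replaced by `result.concat force_split(current_text)`, where `force_split` comes from `Splitters::Common`, a file not shown in this diff. The sketch below is therefore only a guess at compatible behavior, based on the documented `force` parameter: `ForceSplitSketch` and `TinySplitter` are hypothetical, and hard-cutting by character count is an assumption.

```ruby
# Hypothetical mixin; the real Documentrix::Documents::Splitters::Common
# is not part of this diff and may behave differently.
module ForceSplitSketch
  # Returns [] for empty text. When @force is set and the text exceeds
  # @chunk_size, cuts it into pieces of at most @chunk_size characters;
  # otherwise returns the text as a single chunk.
  def force_split(text)
    return [] if text.empty?
    return [ text ] unless @force && text.size > @chunk_size
    text.scan(/.{1,#{@chunk_size}}/m)
  end
end

class TinySplitter
  include ForceSplitSketch

  def initialize(chunk_size:, force: false)
    @chunk_size, @force = chunk_size, force
  end
end

forced   = TinySplitter.new(chunk_size: 4, force: true).force_split('abcdefghij')
unforced = TinySplitter.new(chunk_size: 4).force_split('abcdefghij')
empty    = TinySplitter.new(chunk_size: 4, force: true).force_split('')
```

Returning an array in every case is what lets the caller use `result.concat` without the earlier `empty?` guard.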
@@ -42,13 +80,27 @@ module Documentrix::Documents::Splitters
        //,
      ].freeze

+     # Initializes a new RecursiveCharacter splitter.
+     #
+     # @param separators [Array<Regexp>] a priority list of regexes to use for splitting (defaults to DEFAULT_SEPARATORS)
+     # @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
+     # @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
+     # @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
+     # @raise [ArgumentError] if the separators array is empty
      def initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
        separators.empty? and
          raise ArgumentError, "non-empty array of separators required"
        @separators, @include_separator, @combining_string, @chunk_size =
          separators, include_separator, combining_string, chunk_size
+       @force = separators.last == //
      end

+     # Recursively splits the given text into chunks using the list of
+     # separators.
+     #
+     # @param text [String] the text to be split
+     # @param separators [Array<Regexp>] the list of separators to use (defaults to @separators)
+     # @return [Array<String>] an array of text chunks
      def split(text, separators: @separators)
        separators.empty? and return [ text ]
        separators = separators.dup