documentrix 0.2.0 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGES.md +69 -0
- data/documentrix.gemspec +5 -5
- data/lib/documentrix/documents/cache/common.rb +63 -11
- data/lib/documentrix/documents/cache/records.rb +1 -1
- data/lib/documentrix/documents/cache/redis_cache.rb +3 -3
- data/lib/documentrix/documents/cache/sqlite_cache.rb +95 -27
- data/lib/documentrix/documents/splitters/character.rb +56 -4
- data/lib/documentrix/documents/splitters/common.rb +38 -0
- data/lib/documentrix/documents/splitters/semantic.rb +67 -8
- data/lib/documentrix/documents.rb +133 -29
- data/lib/documentrix/utils/colorize_texts.rb +25 -21
- data/lib/documentrix/utils/digests.rb +78 -0
- data/lib/documentrix/utils.rb +1 -0
- data/lib/documentrix/version.rb +1 -1
- data/spec/documentrix/documents/cache/interface_spec.rb +16 -3
- data/spec/documentrix/documents/cache/memory_cache_spec.rb +64 -2
- data/spec/documentrix/documents/cache/redis_cache_spec.rb +68 -19
- data/spec/documentrix/documents/cache/sqlite_cache_spec.rb +128 -2
- data/spec/documentrix/documents/splitters/character_spec.rb +20 -2
- data/spec/documentrix/documents/splitters/semantic_spec.rb +17 -5
- data/spec/documents_spec.rb +59 -3
- data/spec/utils/colorize_texts_spec.rb +0 -2
- data/spec/utils/digests_spec.rb +97 -0
- data/spec/utils/tags_spec.rb +0 -2
- metadata +7 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 68c409476101e4632597c139494f3fc1fe67bc0af7a655e5e5b3e4ebbd58f58c
+  data.tar.gz: fcd07ca7694b3fbed81c3a25f3fa9d9a675d120e5b82b43a13f5f71970012f3a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cf1e95f2d994bb130b89bff9d2e43062621b849b5d0f7fb4543092b1f491bab4b9dd9e51047c3dfb216090a4f9bc853c08ba35719dea7409c34a15758070fcba
+  data.tar.gz: 59aa3347c07c2521f5661326a55d73733470ef30c62e86fc7d1ff53a4acb6cbc2da06e4c7814896b59ca16bdfab8933a7018f3f41ab30ce29593c5fac480f342
|
data/CHANGES.md
CHANGED
|
@@ -1,5 +1,74 @@
 # Changes
 
+## 2026-05-17 v0.3.0
+
+### New Features
+
+- **Source Tracking & Versioning**:
+  - Introduced `Documentrix::Utils::Digests` for SHA256 hashing of strings
+    and files, including an `mtime`-based cache.
+  - Implemented source-based document management in `Documentrix::Documents`
+    via `normalize_source`, `source_exist?`, `source_modified?`,
+    `source_update`, and `source_remove`.
+  - Updated `Documentrix::Documents#add` and
+    `Documentrix::Documents#source_update` to support `digest` for version
+    tracking.
+- **Text Splitting**:
+  - Added `Documentrix::Documents::Splitters::Common` to implement
+    `force_split` behavior.
+  - Integrated `force` splitting into `Character`, `RecursiveCharacter`, and
+    `Semantic` splitters.
+- **Cache Enhancements**:
+  - Added `each_source` to `Documentrix::Documents::Cache::Common` and an
+    optimized `SELECT DISTINCT source` implementation in
+    `Documentrix::Documents::Cache::SQLiteCache`.
+  - Added a SQLite trigger `delete_embedding_after_record` to automatically
+    clean the `embeddings` table.
+
+### Improvements & Refactorings
+
+- **Search & Retrieval**:
+  - Added `min_similarity` parameter to `Documentrix::Documents#find`,
+    `Documentrix::Documents::Cache::Common#find_records`, and
+    `Documentrix::Documents::Cache::SQLiteCache#find_records`.
+  - Optimized `Documentrix::Documents::Cache::SQLiteCache#find_records` by
+    moving similarity calculations into the SQL query using `1 -
+    vec_distance_cosine`.
+  - Simplified `Documentrix::Documents#find_where` by streamlining
+    `take_while` logic and utilizing `opts[:max_records]`.
+- **Cache Implementations**:
+  - Made `object_class` a required keyword argument in
+    `Documentrix::Documents::RedisCache#initialize`.
+  - Refactored `Documentrix::Documents::Cache::Common#clear_by_source` and
+    `Documentrix::Documents::Cache::Common#source_exist?` to use ternary
+    operators.
+  - Improved `Documentrix::Documents::Cache::SQLiteCache#each_source` and
+    `Documentrix::Documents::Cache::SQLiteCache#find_records` for better
+    robustness and formatting.
+- **Documentation & Tooling**:
+  - Expanded YARD documentation for
+    `Documentrix::Documents::Splitters::Character`, `RecursiveCharacter`,
+    `Semantic`, and `Documentrix::Utils::ColorizeTexts`.
+  - Centralized RSpec configuration via a `.rspec` file.
+
+### Bug Fixes
+
+- Fixed an issue in `Documentrix::Documents#find` where `max_records` was
+  hardcoded to `nil` when calling the cache.
+- Adjusted default handling of `min_similarity` in
+  `Documentrix::Documents#find` to use `min_similarity ||= -1` within the
+  method body.
+
+### Testing
+
+- Significantly expanded test suites for `SQLiteCache`, `MemoryCache`, and
+  `RedisCache`, specifically covering `each_source`, `tags`, `clear_for_tags`,
+  and digest-based checks.
+- Added new test cases in `spec/documents_spec.rb` for source management and
+  `Documentrix::Documents#source_update`.
+- Added `spec/utils/digests_spec.rb` and updated splitter specs to verify
+  `force` splitting behavior.
+
 ## 2026-05-12 v0.2.0
 
 ### Added
|
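The changelog entry for `Documentrix::Utils::Digests` describes SHA256 hashing of strings and files with an `mtime`-based cache. The sketch below only illustrates that idea under stated assumptions: the module name `DigestSketch` and its methods `string_digest` and `file_digest` are hypothetical and are not the gem's API.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper, not the gem's implementation: SHA256 digests for
# strings and files, with file digests cached by path and modification time.
module DigestSketch
  CACHE = {}

  module_function

  # Hex digest of an in-memory string.
  def string_digest(string)
    Digest::SHA256.hexdigest(string)
  end

  # Hex digest of a file's content. The file is only re-read when its mtime
  # differs from the one stored alongside the cached digest.
  def file_digest(path)
    mtime  = File.mtime(path)
    cached = CACHE[path]
    return cached[:digest] if cached && cached[:mtime] == mtime
    digest = Digest::SHA256.file(path).hexdigest
    CACHE[path] = { mtime: mtime, digest: digest }
    digest
  end
end

Tempfile.create('digest-sketch') do |file|
  file.write('hello')
  file.flush
  first  = DigestSketch.file_digest(file.path)
  second = DigestSketch.file_digest(file.path) # answered from the cache
  puts first == second
  puts first == DigestSketch.string_digest('hello')
end
```

Keying the cache on `mtime` avoids re-reading unchanged files, at the cost of missing a change that lands within the filesystem's timestamp resolution.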
data/documentrix.gemspec
CHANGED
|
@@ -1,9 +1,9 @@
|
|
|
1
1
|
# -*- encoding: utf-8 -*-
|
|
2
|
-
# stub: documentrix 0.
|
|
2
|
+
# stub: documentrix 0.3.0 ruby lib
|
|
3
3
|
|
|
4
4
|
Gem::Specification.new do |s|
|
|
5
5
|
s.name = "documentrix".freeze
|
|
6
|
-
s.version = "0.
|
|
6
|
+
s.version = "0.3.0".freeze
|
|
7
7
|
|
|
8
8
|
s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
|
|
9
9
|
s.require_paths = ["lib".freeze]
|
|
@@ -11,15 +11,15 @@ Gem::Specification.new do |s|
|
|
|
11
11
|
s.date = "1980-01-02"
|
|
12
12
|
s.description = "The Ruby library, Documentrix, is designed to provide a way to build and\nquery vector databases for applications in natural language processing\n(NLP) and large language models (LLMs). It allows users to store and\nretrieve dense vector embeddings for text strings.\n".freeze
|
|
13
13
|
s.email = "flori@ping.de".freeze
|
|
14
|
-
s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
|
|
15
|
-
s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
|
|
14
|
+
s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
|
|
15
|
+
s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
|
|
16
16
|
s.homepage = "https://github.com/flori/documentrix".freeze
|
|
17
17
|
s.licenses = ["MIT".freeze]
|
|
18
18
|
s.rdoc_options = ["--title".freeze, "Documentrix - Ruby library for embedding vector database".freeze, "--main".freeze, "README.md".freeze]
|
|
19
19
|
s.required_ruby_version = Gem::Requirement.new(">= 3.1".freeze)
|
|
20
20
|
s.rubygems_version = "4.0.10".freeze
|
|
21
21
|
s.summary = "Ruby library for embedding vector database".freeze
|
|
22
|
-
s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
|
|
22
|
+
s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
|
|
23
23
|
|
|
24
24
|
s.specification_version = 4
|
|
25
25
|
|
data/lib/documentrix/documents/cache/common.rb
CHANGED
|
@@ -12,6 +12,7 @@
|
|
|
12
12
|
# memory, Redis, and SQLite.
|
|
13
13
|
module Documentrix::Documents::Cache::Common
|
|
14
14
|
include Documentrix::Utils::Math
|
|
15
|
+
include Documentrix::Utils::Digests
|
|
15
16
|
include Enumerable
|
|
16
17
|
|
|
17
18
|
# The initialize method sets up the Documentrix::Documents::Cache instance's
|
|
@@ -62,27 +63,29 @@ module Documentrix::Documents::Cache::Common
|
|
|
62
63
|
# @param needle [ Array ] an array containing the embedding vector
|
|
63
64
|
# @param tags [ String, Array ] a string or array of strings representing the tags to search for
|
|
64
65
|
# @param max_records [ Integer ] the maximum number of records to return
|
|
66
|
+
# @param min_similarity [ Float ] the minimum similarity score required for a record to be returned (defaults to -1)
|
|
65
67
|
#
|
|
66
|
-
# @
|
|
67
|
-
|
|
68
|
-
# @return [ Array<Documentrix::Documents::Records> ] an array containing the matching records
|
|
69
|
-
def find_records(needle, tags: nil, max_records: nil)
|
|
68
|
+
# @return [ Array<Documentrix::Documents::Record> ] an array containing the matching records
|
|
69
|
+
def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
|
|
70
70
|
tags = Documentrix::Utils::Tags.new(Array(tags)).to_a
|
|
71
71
|
records = self
|
|
72
72
|
if tags.present?
|
|
73
73
|
records = records.select { |_key, record| (tags & record.tags).size >= 1 }
|
|
74
74
|
end
|
|
75
|
+
|
|
75
76
|
needle_norm = norm(needle)
|
|
76
|
-
records = records.
|
|
77
|
+
records = records.map do |key, record|
|
|
77
78
|
record.key = key
|
|
78
79
|
record.similarity = cosine_similarity(
|
|
79
|
-
a:
|
|
80
|
-
b:
|
|
80
|
+
a: needle,
|
|
81
|
+
b: record.embedding,
|
|
81
82
|
a_norm: needle_norm,
|
|
82
83
|
b_norm: record.norm,
|
|
83
84
|
)
|
|
84
|
-
|
|
85
|
-
|
|
85
|
+
record
|
|
86
|
+
end.sort_by(&:similarity).reverse.select { _1.similarity >= min_similarity }
|
|
87
|
+
|
|
88
|
+
max_records ? records.take(max_records) : records
|
|
86
89
|
end
|
|
87
90
|
|
|
88
91
|
# Returns a set of unique tags found in the cache records.
|
|
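The `find_records` hunk above scores every record by cosine similarity, sorts in descending order, filters by `min_similarity`, and then applies `max_records`. A minimal standalone sketch of that flow follows; the `Rec` struct and the free-standing helper methods are assumptions for illustration and do not reproduce the gem's record class or its `Utils::Math` module. Because cosine similarity lies in [-1, 1], the default of `-1` keeps every record.

```ruby
# Hypothetical record type for illustration; the gem's records carry more fields.
Rec = Struct.new(:text, :embedding, :similarity)

# Euclidean norm of a vector.
def norm(vector)
  Math.sqrt(vector.sum { |x| x * x })
end

# Cosine similarity: dot product divided by the product of the norms.
def cosine_similarity(a, b)
  dot = a.zip(b).sum { |x, y| x * y }
  dot / (norm(a) * norm(b))
end

# Scores each record against the needle, sorts by descending similarity,
# drops records below min_similarity and optionally truncates the result.
def find_records(records, needle, max_records: nil, min_similarity: -1)
  scored = records.map do |record|
    record.similarity = cosine_similarity(needle, record.embedding)
    record
  end
  scored = scored.sort_by(&:similarity).reverse.select { |r| r.similarity >= min_similarity }
  max_records ? scored.take(max_records) : scored
end

records = [
  Rec.new('same direction', [1.0, 0.0]),
  Rec.new('orthogonal',     [0.0, 1.0]),
  Rec.new('opposite',       [-1.0, 0.0]),
]
p find_records(records, [2.0, 0.0], min_similarity: 0.5).map(&:text)
# => ["same direction"]
```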
@@ -116,19 +119,68 @@ module Documentrix::Documents::Cache::Common
|
|
|
116
119
|
self
|
|
117
120
|
end
|
|
118
121
|
|
|
122
|
+
# Yields each unique, full source present in the cache records.
|
|
123
|
+
#
|
|
124
|
+
# @yield [source] the full source string
|
|
125
|
+
# @return [Enumerator] an enumerator if no block is given, nil otherwise.
|
|
126
|
+
def each_source(&block)
|
|
127
|
+
block or return enum_for(__method__)
|
|
128
|
+
seen = {}
|
|
129
|
+
each do |_key, record|
|
|
130
|
+
source = record.source.full? or next
|
|
131
|
+
seen.key?(source) and next
|
|
132
|
+
seen[source] = true
|
|
133
|
+
block.(source)
|
|
134
|
+
end
|
|
135
|
+
nil
|
|
136
|
+
end
|
|
137
|
+
|
|
119
138
|
# The clear_by_source method removes all records from the cache that
|
|
120
139
|
# have a source matching the given source.
|
|
121
140
|
#
|
|
122
141
|
# @param source [String] the source to filter records by
|
|
142
|
+
# @param digest [String, nil] the SHA256 hexadecimal digest of the source.
|
|
143
|
+
# @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
|
|
123
144
|
#
|
|
124
145
|
# @return [self] self
|
|
125
|
-
def clear_by_source(source)
|
|
146
|
+
def clear_by_source(source, digest: nil, operator: ?=)
|
|
147
|
+
operator = operator == '=' ? '==' : '!='
|
|
148
|
+
|
|
126
149
|
each do |key, record|
|
|
127
|
-
|
|
150
|
+
next unless record.source == source
|
|
151
|
+
if digest
|
|
152
|
+
should_delete = record.digest.send(operator, digest)
|
|
153
|
+
delete(unpre(key)) if should_delete
|
|
154
|
+
else
|
|
155
|
+
delete(unpre(key))
|
|
156
|
+
end
|
|
128
157
|
end
|
|
129
158
|
self
|
|
130
159
|
end
|
|
131
160
|
|
|
161
|
+
# Checks if any records associated with the given source exist in the cache.
|
|
162
|
+
#
|
|
163
|
+
# @param source [String] the source to check for existence
|
|
164
|
+
# @param digest [String, nil] the SHA256 hexadecimal digest to compare against
|
|
165
|
+
# @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
|
|
166
|
+
#
|
|
167
|
+
# @return [Boolean] true if a matching record is found, false otherwise.
|
|
168
|
+
def source_exist?(source, digest: nil, operator: ?=)
|
|
169
|
+
operator = operator == '=' ? '==' : '!='
|
|
170
|
+
|
|
171
|
+
each do |_, record|
|
|
172
|
+
next unless record.source == source
|
|
173
|
+
if digest
|
|
174
|
+
if record.digest.send(operator, digest)
|
|
175
|
+
return true
|
|
176
|
+
end
|
|
177
|
+
else
|
|
178
|
+
return true
|
|
179
|
+
end
|
|
180
|
+
end
|
|
181
|
+
false
|
|
182
|
+
end
|
|
183
|
+
|
|
132
184
|
# The clear method removes cached records based on the provided tags or
|
|
133
185
|
# clears all records with the current prefix.
|
|
134
186
|
#
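The `each_source`, `clear_by_source`, and `source_exist?` additions above share one selection rule: a record must belong to the given source and, if a digest is passed, its digest must satisfy the `=` or `!=` operator. The sketch below restates that rule with a plain in-memory store. `SourceStore` and `Entry` are hypothetical names, and the nil/empty check stands in for the `full?` call used in the diff.

```ruby
# Hypothetical in-memory stand-in for a cache; only `source` and `digest`
# of each record matter here.
Entry = Struct.new(:source, :digest)

class SourceStore
  def initialize
    @records = {}
  end

  def []=(key, entry)
    @records[key] = entry
  end

  def size
    @records.size
  end

  # Yields every distinct, non-empty source once.
  def each_source
    return enum_for(__method__) unless block_given?
    seen = {}
    @records.each_value do |entry|
      source = entry.source
      next if source.nil? || source.empty? || seen.key?(source)
      seen[source] = true
      yield source
    end
    nil
  end

  # Without a digest, any record of the source counts. With a digest, the
  # record's digest must satisfy the operator ('=' or '!=').
  def source_exist?(source, digest: nil, operator: '=')
    @records.each_value.any? { |entry| match?(entry, source, digest, operator) }
  end

  # Deletes the records selected by the same rule as source_exist?.
  def clear_by_source(source, digest: nil, operator: '=')
    @records.delete_if { |_key, entry| match?(entry, source, digest, operator) }
    self
  end

  private

  def match?(entry, source, digest, operator)
    return false unless entry.source == source
    return true if digest.nil?
    operator == '=' ? entry.digest == digest : entry.digest != digest
  end
end

store = SourceStore.new
store['a'] = Entry.new('doc.txt', 'old')
store['b'] = Entry.new('doc.txt', 'new')
store['c'] = Entry.new('other.txt', 'x')
store.clear_by_source('doc.txt', digest: 'new', operator: '!=') # drops stale 'a'
p store.each_source.to_a
p store.source_exist?('doc.txt', digest: 'new')
```

With the default `=` operator and a digest, the records whose digest matches are the ones removed. Passing `!=` removes the stale versions and keeps the current one.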
|
data/lib/documentrix/documents/cache/records.rb
CHANGED
|
@@ -27,7 +27,7 @@ module Documentrix::Documents::Cache::Records
|
|
|
27
27
|
# The to_s method returns a string representation of the object.
|
|
28
28
|
#
|
|
29
29
|
# @return [String] A string containing the text and tags of the record,
|
|
30
|
-
#
|
|
30
|
+
# along with its similarity score.
|
|
31
31
|
def to_s
|
|
32
32
|
my_tags = tags_set
|
|
33
33
|
my_tags.empty? or my_tags = " #{my_tags}"
|
data/lib/documentrix/documents/cache/redis_cache.rb
CHANGED
|
@@ -23,7 +23,7 @@ class Documentrix::Documents::RedisCache
|
|
|
23
23
|
# @param [String] prefix the string to be used as the prefix for this cache
|
|
24
24
|
# @param [String] url the URL of the Redis server (default: ENV['REDIS_URL'])
|
|
25
25
|
# @param [Class] object_class the class of objects stored in Redis (default: nil)
|
|
26
|
-
def initialize(prefix:, url: ENV['REDIS_URL'], object_class:
|
|
26
|
+
def initialize(prefix:, url: ENV['REDIS_URL'], object_class:)
|
|
27
27
|
super(prefix:)
|
|
28
28
|
url or raise ArgumentError, 'require redis url'
|
|
29
29
|
@url, @object_class = url, object_class
|
|
@@ -46,7 +46,7 @@ class Documentrix::Documents::RedisCache
|
|
|
46
46
|
def [](key)
|
|
47
47
|
value = redis.get(pre(key))
|
|
48
48
|
unless value.nil?
|
|
49
|
-
|
|
49
|
+
JSON.parse(value, object_class:)
|
|
50
50
|
end
|
|
51
51
|
end
|
|
52
52
|
|
|
@@ -153,7 +153,7 @@ class Documentrix::Documents::RedisCache
|
|
|
153
153
|
|
|
154
154
|
redis.scan_each(match: prefix + ?*) do |key|
|
|
155
155
|
value = redis.get(key) or next
|
|
156
|
-
value =
|
|
156
|
+
value = JSON.parse(value, object_class:)
|
|
157
157
|
block.(key, value)
|
|
158
158
|
end
|
|
159
159
|
end
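The Redis cache hunks above deserialize stored values with `JSON.parse(value, object_class:)`. As a small illustration of what the `object_class` option does in Ruby's standard `json` library, the sketch below parses into a Hash subclass. `SketchRecord` is a hypothetical class and not the gem's record type.

```ruby
require 'json'

# Hypothetical record class; JSON.parse fills object_class instances via []=,
# so a Hash subclass is the simplest choice for a sketch.
class SketchRecord < Hash
  def text
    self['text']
  end
end

value  = JSON.generate('text' => 'hello', 'tags' => %w[ test ])
record = JSON.parse(value, object_class: SketchRecord)
p record.class
p record.text
```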
|
data/lib/documentrix/documents/cache/sqlite_cache.rb
CHANGED
|
@@ -46,17 +46,17 @@ class Documentrix::Documents::Cache::SQLiteCache
|
|
|
46
46
|
result = execute(
|
|
47
47
|
%{
|
|
48
48
|
SELECT records.key, records.text, records.norm, records.source,
|
|
49
|
-
records.tags, embeddings.embedding
|
|
49
|
+
records.digest, records.tags, embeddings.embedding
|
|
50
50
|
FROM records
|
|
51
51
|
INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
|
|
52
52
|
WHERE records.key = ?
|
|
53
53
|
},
|
|
54
54
|
pre(key)
|
|
55
55
|
)&.first or return
|
|
56
|
-
key, text, norm, source, tags, embedding = *result
|
|
56
|
+
key, text, norm, source, digest, tags, embedding = *result
|
|
57
57
|
embedding = embedding.unpack("f*")
|
|
58
58
|
tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
|
|
59
|
-
convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
|
|
59
|
+
convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
|
|
60
60
|
end
|
|
61
61
|
|
|
62
62
|
# The []= method sets the value for a given key by inserting it into the
|
|
@@ -66,15 +66,16 @@ class Documentrix::Documents::Cache::SQLiteCache
|
|
|
66
66
|
# @param [Hash, Documentrix::Documents::Record] value the hash or record
|
|
67
67
|
# containing the text, embedding, and other metadata
|
|
68
68
|
def []=(key, value)
|
|
69
|
-
value
|
|
69
|
+
value = convert_value_to_record(value)
|
|
70
|
+
digest = compute_file_digest(value.source)
|
|
70
71
|
embedding = value.embedding.pack("f*")
|
|
71
72
|
execute(%{BEGIN})
|
|
72
73
|
execute(%{INSERT INTO embeddings(embedding) VALUES(?)}, [ embedding ])
|
|
73
74
|
embedding_id, = execute(%{ SELECT last_insert_rowid() }).flatten
|
|
74
75
|
execute(%{
|
|
75
|
-
INSERT INTO records(key,text,embedding_id,norm,source,tags)
|
|
76
|
-
VALUES(
|
|
77
|
-
}, [ pre(key), value.text, embedding_id, value.norm, value.source, JSON(value.tags) ])
|
|
76
|
+
INSERT INTO records(key,text,embedding_id,norm,source,digest,tags)
|
|
77
|
+
VALUES(?,?,?,?,?,?,?)
|
|
78
|
+
}, [ pre(key), value.text, embedding_id, value.norm, value.source, digest, JSON(value.tags) ])
|
|
78
79
|
execute(%{COMMIT})
|
|
79
80
|
end
|
|
80
81
|
|
|
@@ -157,17 +158,70 @@ class Documentrix::Documents::Cache::SQLiteCache
|
|
|
157
158
|
self
|
|
158
159
|
end
|
|
159
160
|
|
|
160
|
-
#
|
|
161
|
-
# have a source matching the given source.
|
|
161
|
+
# Removes all records associated with the specified source from the cache.
|
|
162
162
|
#
|
|
163
|
-
#
|
|
163
|
+
# If a digest is provided, the method will only remove records that do NOT
|
|
164
|
+
# match this digest. This allows for updating a source by wiping old versions
|
|
165
|
+
# while preserving records that are already up-to-date.
|
|
164
166
|
#
|
|
165
|
-
# @
|
|
166
|
-
|
|
167
|
-
|
|
167
|
+
# @param source [String] the source identifier used to filter records
|
|
168
|
+
# @param digest [String, nil] the SHA256 hexadecimal digest of the source.
|
|
169
|
+
# Records matching this digest will be preserved.
|
|
170
|
+
#
|
|
171
|
+
# @return [self] the cache instance for method chaining
|
|
172
|
+
def clear_by_source(source, digest: nil, operator: ?=)
|
|
173
|
+
operator = '!=' if operator != ?=
|
|
174
|
+
if digest
|
|
175
|
+
execute(%{DELETE FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ])
|
|
176
|
+
else
|
|
177
|
+
execute(%{DELETE FROM records WHERE source = ?}, [ source ])
|
|
178
|
+
end
|
|
168
179
|
self
|
|
169
180
|
end
|
|
170
181
|
|
|
182
|
+
# The source_exist? method checks if any records associated with the given
|
|
183
|
+
# source exist in the cache. If a digest is provided, it verifies if the
|
|
184
|
+
# source exists and matches the specified digest using the provided operator.
|
|
185
|
+
#
|
|
186
|
+
# @param source [#to_s] the source to check for existence
|
|
187
|
+
# @param digest [String, nil] the SHA256 hexadecimal digest to compare
|
|
188
|
+
# against the stored source digest (optional)
|
|
189
|
+
# @param operator [String] the operator to use for comparison ('=' or '!=').
|
|
190
|
+
# Defaults to '='.
|
|
191
|
+
#
|
|
192
|
+
# @return [Boolean] true if the source exists (and matches the digest
|
|
193
|
+
# condition if provided), false otherwise.
|
|
194
|
+
def source_exist?(source, digest: nil, operator: ?=)
|
|
195
|
+
operator = '!=' if operator != ?=
|
|
196
|
+
if digest
|
|
197
|
+
!!execute(%{SELECT 1 FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ]).first
|
|
198
|
+
else
|
|
199
|
+
!!execute(%{SELECT 1 FROM records WHERE source = ?}, [ source ]).first
|
|
200
|
+
end
|
|
201
|
+
end
|
|
202
|
+
|
|
203
|
+
# Yields each unique, full source present in the cache records.
|
|
204
|
+
#
|
|
205
|
+
# This is a high-performance override for SQLite that avoids loading
|
|
206
|
+
# embeddings and parsing JSON for every record.
|
|
207
|
+
#
|
|
208
|
+
# @yield [source] the full source string
|
|
209
|
+
# @return [Enumerator] an enumerator if no block is given, nil otherwise.
|
|
210
|
+
def each_source(&block)
|
|
211
|
+
block or return enum_for(__method__)
|
|
212
|
+
|
|
213
|
+
execute(%{
|
|
214
|
+
SELECT DISTINCT source
|
|
215
|
+
FROM records
|
|
216
|
+
WHERE key LIKE ? AND source IS NOT NULL
|
|
217
|
+
}, [ "#@prefix%" ]).each do |source,|
|
|
218
|
+
source = source.full? or next
|
|
219
|
+
|
|
220
|
+
block.(source)
|
|
221
|
+
end
|
|
222
|
+
nil
|
|
223
|
+
end
|
|
224
|
+
|
|
171
225
|
# Move a key prefix in the cache.
|
|
172
226
|
#
|
|
173
227
|
# This operation updates every record whose key starts with +old_prefix+,
|
|
@@ -208,14 +262,14 @@ class Documentrix::Documents::Cache::SQLiteCache
|
|
|
208
262
|
|
|
209
263
|
execute(%{
|
|
210
264
|
SELECT records.key, records.text, records.norm, records.source,
|
|
211
|
-
records.tags, embeddings.embedding
|
|
265
|
+
records.digest, records.tags, embeddings.embedding
|
|
212
266
|
FROM records
|
|
213
267
|
INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
|
|
214
268
|
WHERE records.key LIKE ?
|
|
215
|
-
}, [ prefix ]).each do |key, text, norm, source, tags, embedding|
|
|
269
|
+
}, [ prefix ]).each do |key, text, norm, source, digest, tags, embedding|
|
|
216
270
|
embedding = embedding.unpack("f*")
|
|
217
271
|
tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
|
|
218
|
-
value = convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
|
|
272
|
+
value = convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
|
|
219
273
|
block.(key, value)
|
|
220
274
|
end
|
|
221
275
|
self
|
|
@@ -275,34 +329,40 @@ class Documentrix::Documents::Cache::SQLiteCache
|
|
|
275
329
|
# @param needle [ Array ] the embedding vector
|
|
276
330
|
# @param tags [ Array ] the list of tags to filter by (optional)
|
|
277
331
|
# @param max_records [ Integer ] the maximum number of records to return (optional)
|
|
332
|
+
# @param min_similarity [ Float ] the minimum similarity score to include (defaults to -1)
|
|
278
333
|
#
|
|
279
334
|
# @yield [ key, value ]
|
|
280
335
|
#
|
|
281
336
|
# @raise [ ArgumentError ] if needle size does not match embedding length
|
|
282
337
|
#
|
|
283
338
|
# @example
|
|
284
|
-
# documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ])
|
|
339
|
+
# documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ], min_similarity: 0.7)
|
|
285
340
|
#
|
|
286
341
|
# @return [ Array<Documentrix::Documents::Record> ] the list of matching records
|
|
287
|
-
def find_records(needle, tags: nil, max_records: nil)
|
|
342
|
+
def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
|
|
288
343
|
needle.size != @embedding_length and
|
|
289
344
|
raise ArgumentError, "needle embedding length != %s" % @embedding_length
|
|
290
345
|
needle_binary = needle.pack("f*")
|
|
291
346
|
max_records = [ max_records, size, 4_096 ].compact.min
|
|
292
347
|
records = find_records_for_tags(tags)
|
|
293
348
|
rowids_where = '(%s)' % records.transpose.last&.join(?,)
|
|
294
|
-
execute(
|
|
295
|
-
|
|
296
|
-
records.
|
|
297
|
-
|
|
298
|
-
|
|
299
|
-
|
|
300
|
-
|
|
301
|
-
|
|
349
|
+
execute(
|
|
350
|
+
%{
|
|
351
|
+
SELECT records.key, records.text, records.norm, records.source,
|
|
352
|
+
records.digest, records.tags, embeddings.embedding,
|
|
353
|
+
1 - vec_distance_cosine(?, vec_f32(embeddings.embedding)) AS similarity
|
|
354
|
+
FROM records
|
|
355
|
+
INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
|
|
356
|
+
WHERE embeddings.rowid IN #{rowids_where}
|
|
357
|
+
AND embeddings.embedding MATCH ? AND similarity >= ?
|
|
358
|
+
AND embeddings.k = ?
|
|
359
|
+
ORDER BY similarity DESC
|
|
360
|
+
}, [ needle_binary, needle_binary, min_similarity, max_records ]
|
|
361
|
+
).map do |key, text, norm, source, digest, tags, embedding, similarity|
|
|
302
362
|
key = unpre(key)
|
|
303
363
|
embedding = embedding.unpack("f*")
|
|
304
364
|
tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
|
|
305
|
-
convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
|
|
365
|
+
convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:, similarity:)
|
|
306
366
|
end
|
|
307
367
|
end
|
|
308
368
|
|
|
@@ -362,10 +422,18 @@ class Documentrix::Documents::Cache::SQLiteCache
|
|
|
362
422
|
embedding_id integer,
|
|
363
423
|
norm float NOT NULL DEFAULT 0.0,
|
|
364
424
|
source text,
|
|
425
|
+
digest text,
|
|
365
426
|
tags json NOT NULL DEFAULT [],
|
|
366
427
|
FOREIGN KEY(embedding_id) REFERENCES embeddings(id) ON DELETE CASCADE
|
|
367
428
|
)
|
|
368
429
|
}
|
|
430
|
+
execute %{
|
|
431
|
+
CREATE TRIGGER IF NOT EXISTS delete_embedding_after_record AFTER DELETE ON records
|
|
432
|
+
FOR EACH ROW
|
|
433
|
+
BEGIN
|
|
434
|
+
DELETE FROM embeddings WHERE rowid = OLD.embedding_id;
|
|
435
|
+
END
|
|
436
|
+
}
|
|
369
437
|
nil
|
|
370
438
|
end
|
|
371
439
|
|
data/lib/documentrix/documents/splitters/character.rb
CHANGED
|
@@ -1,15 +1,38 @@
|
|
|
1
1
|
module Documentrix::Documents::Splitters
|
|
2
|
+
# The Character class provides basic text splitting based on a single
|
|
3
|
+
# separator and bundles the resulting segments into chunks of a maximum size.
|
|
4
|
+
#
|
|
5
|
+
# It allows for the preservation of separators and uses a combining string
|
|
6
|
+
# to join segments back together into chunks.
|
|
2
7
|
class Character
|
|
8
|
+
include Documentrix::Documents::Splitters::Common
|
|
9
|
+
|
|
10
|
+
# The default regex used to identify paragraph boundaries.
|
|
11
|
+
# It matches two or more consecutive newline characters (CRLF or LF).
|
|
12
|
+
#
|
|
13
|
+
# @return [Regexp]
|
|
3
14
|
DEFAULT_SEPARATOR = /(?:\r?\n){2,}/
|
|
4
15
|
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
16
|
+
# Initializes a new Character splitter.
|
|
17
|
+
#
|
|
18
|
+
# @param separator [Regexp] the regex used to split the text (defaults to DEFAULT_SEPARATOR)
|
|
19
|
+
# @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
|
|
20
|
+
# @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
|
|
21
|
+
# @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
|
|
22
|
+
# @param force [Boolean] whether to force-split the final chunk if it exceeds `chunk_size` (defaults to false)
|
|
23
|
+
def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false)
|
|
24
|
+
@separator, @include_separator, @combining_string, @chunk_size, @force =
|
|
25
|
+
separator, include_separator, combining_string, chunk_size, force
|
|
8
26
|
if include_separator
|
|
9
27
|
@separator = Regexp.new("(#@separator)")
|
|
10
28
|
end
|
|
11
29
|
end
|
|
12
30
|
|
|
31
|
+
# Splits the given text into chunks based on the configured separator and
|
|
32
|
+
# size limit.
|
|
33
|
+
#
|
|
34
|
+
# @param text [String] the text to be split
|
|
35
|
+
# @return [Array<String>] an array of text chunks
|
|
13
36
|
def split(text)
|
|
14
37
|
texts = []
|
|
15
38
|
text.split(@separator) do |t|
|
|
@@ -29,12 +52,27 @@ module Documentrix::Documents::Splitters
|
|
|
29
52
|
current_text = t
|
|
30
53
|
end
|
|
31
54
|
end
|
|
32
|
-
|
|
55
|
+
result.concat force_split(current_text)
|
|
33
56
|
result
|
|
34
57
|
end
|
|
35
58
|
end
|
|
36
59
|
|
|
60
|
+
# The RecursiveCharacter class implements a hierarchical splitting strategy.
|
|
61
|
+
#
|
|
62
|
+
# It attempts to split text using a priority list of separators. If a
|
|
63
|
+
# resulting chunk is still larger than the specified chunk_size, it
|
|
64
|
+
# recursively applies the next separator in the list until the size limit is
|
|
65
|
+
# met or all separators have been exhausted.
|
|
37
66
|
class RecursiveCharacter
|
|
67
|
+
include Documentrix::Documents::Splitters::Common
|
|
68
|
+
|
|
69
|
+
# The default priority list of regexes used for recursive splitting.
|
|
70
|
+
# The strategy is to split by the coarsest grain first (paragraphs)
|
|
71
|
+
# and move toward the finest grain (individual characters) as needed.
|
|
72
|
+
#
|
|
73
|
+
# Order: Paragraphs -> Newlines -> Word Boundaries -> Characters
|
|
74
|
+
#
|
|
75
|
+
# @return [Array<Regexp>]
|
|
38
76
|
DEFAULT_SEPARATORS = [
|
|
39
77
|
/(?:\r?\n){2,}/,
|
|
40
78
|
/\r?\n/,
|
|
@@ -42,13 +80,27 @@ module Documentrix::Documents::Splitters
|
|
|
42
80
|
//,
|
|
43
81
|
].freeze
|
|
44
82
|
|
|
83
|
+
# Initializes a new RecursiveCharacter splitter.
|
|
84
|
+
#
|
|
85
|
+
# @param separators [Array<Regexp>] a priority list of regexes to use for splitting (defaults to DEFAULT_SEPARATORS)
|
|
86
|
+
# @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
|
|
87
|
+
# @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
|
|
88
|
+
# @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
|
|
89
|
+
# @raise [ArgumentError] if the separators array is empty
|
|
45
90
|
def initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
|
|
46
91
|
separators.empty? and
|
|
47
92
|
raise ArgumentError, "non-empty array of separators required"
|
|
48
93
|
@separators, @include_separator, @combining_string, @chunk_size =
|
|
49
94
|
separators, include_separator, combining_string, chunk_size
|
|
95
|
+
@force = separators.last == //
|
|
50
96
|
end
|
|
51
97
|
|
|
98
|
+
# Recursively splits the given text into chunks using the list of
|
|
99
|
+
# separators.
|
|
100
|
+
#
|
|
101
|
+
# @param text [String] the text to be split
|
|
102
|
+
# @param separators [Array<Regexp>] the list of separators to use (defaults to @separators)
|
|
103
|
+
# @return [Array<String>] an array of text chunks
|
|
52
104
|
def split(text, separators: @separators)
|
|
53
105
|
separators.empty? and return [ text ]
|
|
54
106
|
separators = separators.dup
|
|
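The splitter changes above rely on two plain Ruby behaviours: `String#split` keeps the separators when the pattern contains a capture group (which is what the `Regexp.new("(#@separator)")` wrapping achieves), and `Regexp#==` treats two empty patterns as equal (which is what the `separators.last == //` check depends on). The following is a minimal stand-alone sketch of both; it does not load documentrix, and `separator`, `text`, and the other local names are sample values chosen for illustration only.

```ruby
# Stand-alone sketch of two Ruby behaviours the splitters above lean on.
# Nothing here loads documentrix; all local names are illustrative.
separator = /(?:\r?\n){2,}/
text = "first\n\nsecond\r\n\r\nthird"

# Without a capture group, String#split discards the matched separators.
plain = text.split(separator)

# Wrapping the pattern in a capture group makes String#split return the
# separators as array elements of their own.
kept = text.split(Regexp.new("(#{separator})"))

# Regexp#== compares source and options, so an empty pattern equals //.
ends_with_char_split = [/\r?\n/, //].last == //

p plain
p kept
p ends_with_char_split
```

With the separators kept as their own elements, a caller can decide whether to re-attach them to the neighbouring segment or to replace them with a combining string.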
data/lib/documentrix/documents/splitters/common.rb
|
@@ -0,0 +1,38 @@
|
|
|
1
|
+
# A shared utility module for text splitters that provides consistent
|
|
2
|
+
# handling of chunk size constraints.
|
|
3
|
+
#
|
|
4
|
+
# This module is intended to be included in splitter classes that
|
|
5
|
+
# implement a maximum chunk size limit. It expects the including class
|
|
6
|
+
# to provide the following attributes:
|
|
7
|
+
# - `force` [Boolean]: Whether to hard-split chunks that exceed the limit.
|
|
8
|
+
# - `chunk_size` [Integer]: The maximum allowed size for a single chunk.
|
|
9
|
+
module Documentrix::Documents::Splitters::Common
|
|
10
|
+
private
|
|
11
|
+
|
|
12
|
+
# Whether to force-split chunks that exceed the chunk size limit.
|
|
13
|
+
# @return [Boolean]
|
|
14
|
+
attr_reader :force
|
|
15
|
+
|
|
16
|
+
# The maximum allowed size for a single chunk.
|
|
17
|
+
# @return [Integer]
|
|
18
|
+
attr_reader :chunk_size
|
|
19
|
+
|
|
20
|
+
# Ensures text respects the chunk size limit if force splitting is enabled.
|
|
21
|
+
#
|
|
22
|
+
# If the `force` attribute is true and the provided text exceeds the
|
|
23
|
+
# `chunk_size`, the text is hard-split into fixed-size chunks using a
|
|
24
|
+
# regular expression. If `force` is false or the text is within the
|
|
25
|
+
# limit, the text is returned wrapped in a single-element array to
|
|
26
|
+
# maintain return-type consistency (Array<String>).
|
|
27
|
+
#
|
|
28
|
+
# @param text [String, nil] the text to potentially split
|
|
29
|
+
# @return [Array<String>] the resulting chunk(s), or an empty array if text is nil/empty
|
|
30
|
+
def force_split(text)
|
|
31
|
+
text&.empty? and return []
|
|
32
|
+
if force && text.size > chunk_size
|
|
33
|
+
text.scan(/.{1,#{chunk_size}}/)
|
|
34
|
+
else
|
|
35
|
+
Array(text)
|
|
36
|
+
end
|
|
37
|
+
end
|
|
38
|
+
end
|
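The hard-split behaviour documented for `force_split` can be sketched outside the gem as follows. This is an illustration under stated assumptions, not the library's code: the helper name `hard_split` and its keyword arguments are hypothetical. Two details differ deliberately from the diff. The sketch guards against `nil` explicitly before calling `size`, and it adds the `m` flag to the pattern, because in Ruby `.` does not match a newline unless that flag is set, so a pattern without it ends each match at a newline and leaves the newline out of the result.

```ruby
# Hypothetical stand-alone helper; the name `hard_split` and its keyword
# arguments are illustrative and are not part of documentrix.
def hard_split(text, chunk_size:, force:)
  # Nothing to split: return an empty list for nil or empty input.
  return [] if text.nil? || text.empty?
  # Within the limit, or forcing disabled: keep the text as one chunk.
  return [text] unless force && text.size > chunk_size

  # The m flag lets `.` match newlines, so no characters are dropped.
  text.scan(/.{1,#{chunk_size}}/m)
end

p hard_split("abcdefghij", chunk_size: 4, force: true)
p hard_split("abcdefghij", chunk_size: 4, force: false)
p hard_split("ab\ncd", chunk_size: 3, force: true)
p hard_split(nil, chunk_size: 4, force: true)
```

Returning an array in every branch keeps the call site simple: the result can always be appended with `concat`, as the `Character#split` change does.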