documentrix 0.1.1 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGES.md +92 -0
- data/Rakefile +1 -1
- data/documentrix.gemspec +8 -8
- data/lib/documentrix/documents/cache/common.rb +74 -9
- data/lib/documentrix/documents/cache/records.rb +1 -1
- data/lib/documentrix/documents/cache/redis_cache.rb +3 -3
- data/lib/documentrix/documents/cache/sqlite_cache.rb +100 -21
- data/lib/documentrix/documents/splitters/character.rb +56 -4
- data/lib/documentrix/documents/splitters/common.rb +38 -0
- data/lib/documentrix/documents/splitters/semantic.rb +67 -8
- data/lib/documentrix/documents.rb +139 -25
- data/lib/documentrix/utils/colorize_texts.rb +25 -21
- data/lib/documentrix/utils/digests.rb +78 -0
- data/lib/documentrix/utils.rb +1 -0
- data/lib/documentrix/version.rb +1 -1
- data/spec/documentrix/documents/cache/interface_spec.rb +25 -3
- data/spec/documentrix/documents/cache/memory_cache_spec.rb +75 -2
- data/spec/documentrix/documents/cache/redis_cache_spec.rb +82 -19
- data/spec/documentrix/documents/cache/sqlite_cache_spec.rb +142 -2
- data/spec/documentrix/documents/splitters/character_spec.rb +20 -2
- data/spec/documentrix/documents/splitters/semantic_spec.rb +17 -5
- data/spec/documents_spec.rb +76 -2
- data/spec/utils/colorize_texts_spec.rb +0 -2
- data/spec/utils/digests_spec.rb +97 -0
- data/spec/utils/tags_spec.rb +0 -2
- metadata +12 -6
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 68c409476101e4632597c139494f3fc1fe67bc0af7a655e5e5b3e4ebbd58f58c
+  data.tar.gz: fcd07ca7694b3fbed81c3a25f3fa9d9a675d120e5b82b43a13f5f71970012f3a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: cf1e95f2d994bb130b89bff9d2e43062621b849b5d0f7fb4543092b1f491bab4b9dd9e51047c3dfb216090a4f9bc853c08ba35719dea7409c34a15758070fcba
+  data.tar.gz: 59aa3347c07c2521f5661326a55d73733470ef30c62e86fc7d1ff53a4acb6cbc2da06e4c7814896b59ca16bdfab8933a7018f3f41ab30ce29593c5fac480f342
data/CHANGES.md
CHANGED
@@ -1,5 +1,97 @@
 # Changes
 
+## 2026-05-17 v0.3.0
+
+### New Features
+
+- **Source Tracking & Versioning**:
+  - Introduced `Documentrix::Utils::Digests` for SHA256 hashing of strings
+    and files, including an `mtime`-based cache.
+  - Implemented source-based document management in `Documentrix::Documents`
+    via `normalize_source`, `source_exist?`, `source_modified?`,
+    `source_update`, and `source_remove`.
+  - Updated `Documentrix::Documents#add` and
+    `Documentrix::Documents#source_update` to support `digest` for version
+    tracking.
+- **Text Splitting**:
+  - Added `Documentrix::Documents::Splitters::Common` to implement
+    `force_split` behavior.
+  - Integrated `force` splitting into `Character`, `RecursiveCharacter`, and
+    `Semantic` splitters.
+- **Cache Enhancements**:
+  - Added `each_source` to `Documentrix::Documents::Cache::Common` and an
+    optimized `SELECT DISTINCT source` implementation in
+    `Documentrix::Documents::Cache::SQLiteCache`.
+  - Added a SQLite trigger `delete_embedding_after_record` to automatically
+    clean the `embeddings` table.
+
+### Improvements & Refactorings
+
+- **Search & Retrieval**:
+  - Added `min_similarity` parameter to `Documentrix::Documents#find`,
+    `Documentrix::Documents::Cache::Common#find_records`, and
+    `Documentrix::Documents::Cache::SQLiteCache#find_records`.
+  - Optimized `Documentrix::Documents::Cache::SQLiteCache#find_records` by
+    moving similarity calculations into the SQL query using `1 -
+    vec_distance_cosine`.
+  - Simplified `Documentrix::Documents#find_where` by streamlining
+    `take_while` logic and utilizing `opts[:max_records]`.
+- **Cache Implementations**:
+  - Made `object_class` a required keyword argument in
+    `Documentrix::Documents::RedisCache#initialize`.
+  - Refactored `Documentrix::Documents::Cache::Common#clear_by_source` and
+    `Documentrix::Documents::Cache::Common#source_exist?` to use ternary
+    operators.
+  - Improved `Documentrix::Documents::Cache::SQLiteCache#each_source` and
+    `Documentrix::Documents::Cache::SQLiteCache#find_records` for better
+    robustness and formatting.
+- **Documentation & Tooling**:
+  - Expanded YARD documentation for
+    `Documentrix::Documents::Splitters::Character`, `RecursiveCharacter`,
+    `Semantic`, and `Documentrix::Utils::ColorizeTexts`.
+  - Centralized RSpec configuration via a `.rspec` file.
+
+### Bug Fixes
+
+- Fixed an issue in `Documentrix::Documents#find` where `max_records` was
+  hardcoded to `nil` when calling the cache.
+- Adjusted default handling of `min_similarity` in
+  `Documentrix::Documents#find` to use `min_similarity ||= -1` within the
+  method body.
+
+### Testing
+
+- Significantly expanded test suites for `SQLiteCache`, `MemoryCache`, and
+  `RedisCache`, specifically covering `each_source`, `tags`, `clear_for_tags`,
+  and digest-based checks.
+- Added new test cases in `spec/documents_spec.rb` for source management and
+  `Documentrix::Documents#source_update`.
+- Added `spec/utils/digests_spec.rb` and updated splitter specs to verify
+  `force` splitting behavior.
+
+## 2026-05-12 v0.2.0
+
+### Added
+
+- Implemented source-based document removal by adding the `remove` method to
+  `Documentrix::Documents`.
+- Added `clear_by_source` to `Documentrix::Documents::Cache::Common` as the
+  default cache implementation.
+- Added an optimized `clear_by_source` override in
+  `Documentrix::Documents::Cache::SQLiteCache` utilizing a direct SQL `DELETE`
+  query.
+
+### Changed
+
+- Updated `documentrix.gemspec` to use `rubygems_version` **4.0.10**.
+- Updated `gem_hadar` dependency to **2.17.1**.
+
+### Testing
+
+- Expanded test coverage in `spec/documents_spec.rb`,
+  `spec/documentrix/documents/cache/interface_spec.rb`, and all specific cache
+  specs.
+
 ## 2026-03-31 v0.1.1
 
 - Improved compatibility and reliability by ensuring the gem uses a stable,
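The changelog mentions SHA256 hashing of strings and files with an `mtime`-based cache, but the `Documentrix::Utils::Digests` source is not part of this excerpt. The following is only a minimal sketch of how such a cache could work, using Ruby's standard `digest` library. The class name `MtimeDigestCache` and its methods are hypothetical and are not the gem's API.

```ruby
require 'digest'

# Hypothetical helper for illustration only; this is NOT the gem's
# Documentrix::Utils::Digests module.
class MtimeDigestCache
  def initialize
    @cache = {} # path => [ mtime, hex digest ]
  end

  # Returns the SHA256 hex digest of the file at +path+. The file is only
  # re-read when its modification time differs from the cached one.
  def file_digest(path)
    mtime = File.mtime(path)
    cached_mtime, digest = @cache[path]
    return digest if cached_mtime == mtime

    digest = Digest::SHA256.file(path).hexdigest
    @cache[path] = [ mtime, digest ]
    digest
  end

  # Returns the SHA256 hex digest of an in-memory string.
  def string_digest(string)
    Digest::SHA256.hexdigest(string)
  end
end
```

Keying the cache on `mtime` avoids rehashing unchanged files, at the cost of missing edits that leave the modification time untouched.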
data/Rakefile
CHANGED
@@ -33,7 +33,7 @@ GemHadar do
   dependency 'infobar', '~> 0.9'
   dependency 'json', '~> 2.0'
   dependency 'tins', '~> 1.34'
-  dependency 'sqlite-vec', '>= 0.1.
+  dependency 'sqlite-vec', '>= 0.1.9'
   dependency 'sqlite3', '~> 2.0', '>= 2.0.1'
   dependency 'kramdown-ansi', '~> 0.0', '>= 0.0.1'
   dependency 'numo-narray-alt', '~> 0.9'
data/documentrix.gemspec
CHANGED
@@ -1,9 +1,9 @@
 # -*- encoding: utf-8 -*-
-# stub: documentrix 0.
+# stub: documentrix 0.3.0 ruby lib
 
 Gem::Specification.new do |s|
   s.name = "documentrix".freeze
-  s.version = "0.
+  s.version = "0.3.0".freeze
 
   s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
   s.require_paths = ["lib".freeze]
@@ -11,19 +11,19 @@ Gem::Specification.new do |s|
   s.date = "1980-01-02"
   s.description = "The Ruby library, Documentrix, is designed to provide a way to build and\nquery vector databases for applications in natural language processing\n(NLP) and large language models (LLMs). It allows users to store and\nretrieve dense vector embeddings for text strings.\n".freeze
   s.email = "flori@ping.de".freeze
-  s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
-  s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
+  s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
+  s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
   s.homepage = "https://github.com/flori/documentrix".freeze
   s.licenses = ["MIT".freeze]
   s.rdoc_options = ["--title".freeze, "Documentrix - Ruby library for embedding vector database".freeze, "--main".freeze, "README.md".freeze]
   s.required_ruby_version = Gem::Requirement.new(">= 3.1".freeze)
-  s.rubygems_version = "4.0.
+  s.rubygems_version = "4.0.10".freeze
   s.summary = "Ruby library for embedding vector database".freeze
-  s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
+  s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
 
   s.specification_version = 4
 
-  s.add_development_dependency(%q<gem_hadar>.freeze, [">= 2.17.
+  s.add_development_dependency(%q<gem_hadar>.freeze, [">= 2.17.1".freeze])
   s.add_development_dependency(%q<all_images>.freeze, ["~> 0.12".freeze])
   s.add_development_dependency(%q<rspec>.freeze, ["~> 3.2".freeze])
   s.add_development_dependency(%q<kramdown>.freeze, ["~> 2.0".freeze])
@@ -32,7 +32,7 @@ Gem::Specification.new do |s|
   s.add_runtime_dependency(%q<infobar>.freeze, ["~> 0.9".freeze])
   s.add_runtime_dependency(%q<json>.freeze, ["~> 2.0".freeze])
   s.add_runtime_dependency(%q<tins>.freeze, ["~> 1.34".freeze])
-  s.add_runtime_dependency(%q<sqlite-vec>.freeze, [">= 0.1.
+  s.add_runtime_dependency(%q<sqlite-vec>.freeze, [">= 0.1.9".freeze])
   s.add_runtime_dependency(%q<sqlite3>.freeze, ["~> 2.0".freeze, ">= 2.0.1".freeze])
   s.add_runtime_dependency(%q<kramdown-ansi>.freeze, ["~> 0.0".freeze, ">= 0.0.1".freeze])
   s.add_runtime_dependency(%q<numo-narray-alt>.freeze, ["~> 0.9".freeze])
data/lib/documentrix/documents/cache/common.rb
CHANGED
@@ -12,6 +12,7 @@
 # memory, Redis, and SQLite.
 module Documentrix::Documents::Cache::Common
   include Documentrix::Utils::Math
+  include Documentrix::Utils::Digests
   include Enumerable
 
   # The initialize method sets up the Documentrix::Documents::Cache instance's
@@ -62,27 +63,29 @@ module Documentrix::Documents::Cache::Common
   # @param needle [ Array ] an array containing the embedding vector
   # @param tags [ String, Array ] a string or array of strings representing the tags to search for
   # @param max_records [ Integer ] the maximum number of records to return
+  # @param min_similarity [ Float ] the minimum similarity score required for a record to be returned (defaults to -1)
   #
-  # @
-
-  # @return [ Array<Documentrix::Documents::Records> ] an array containing the matching records
-  def find_records(needle, tags: nil, max_records: nil)
+  # @return [ Array<Documentrix::Documents::Record> ] an array containing the matching records
+  def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
     tags = Documentrix::Utils::Tags.new(Array(tags)).to_a
     records = self
     if tags.present?
       records = records.select { |_key, record| (tags & record.tags).size >= 1 }
     end
+
     needle_norm = norm(needle)
-    records = records.
+    records = records.map do |key, record|
       record.key = key
       record.similarity = cosine_similarity(
-        a:
-        b:
+        a: needle,
+        b: record.embedding,
         a_norm: needle_norm,
         b_norm: record.norm,
       )
-
-
+      record
+    end.sort_by(&:similarity).reverse.select { _1.similarity >= min_similarity }
+
+    max_records ? records.take(max_records) : records
   end
 
   # Returns a set of unique tags found in the cache records.
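The hunk above ranks records by cosine similarity, drops those below `min_similarity`, and then caps the result at `max_records`. As a rough standalone illustration of that pipeline, the sketch below uses a hypothetical `Rec` struct and a free-standing `rank` method. Neither is part of the gem, and the gem's own `cosine_similarity` takes precomputed norms as keyword arguments, which this sketch does not reproduce.

```ruby
# Hypothetical stand-in for the gem's record class.
Rec = Struct.new(:text, :embedding, :similarity)

# Plain cosine similarity; assumes non-zero vectors of equal length.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Scores every record against +needle+, sorts by descending similarity,
# keeps records at or above +min_similarity+, then applies +max_records+.
def rank(records, needle, max_records: nil, min_similarity: -1)
  ranked = records.map do |record|
    record.similarity = cosine_similarity(needle, record.embedding)
    record
  end.sort_by(&:similarity).reverse.select { |r| r.similarity >= min_similarity }

  max_records ? ranked.take(max_records) : ranked
end
```

Since cosine similarity ranges from -1 to 1, the default of `-1` keeps every record, so callers that do not pass `min_similarity` see no filtering.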
@@ -116,6 +119,68 @@ module Documentrix::Documents::Cache::Common
     self
   end
 
+  # Yields each unique, full source present in the cache records.
+  #
+  # @yield [source] the full source string
+  # @return [Enumerator] an enumerator if no block is given, nil otherwise.
+  def each_source(&block)
+    block or return enum_for(__method__)
+    seen = {}
+    each do |_key, record|
+      source = record.source.full? or next
+      seen.key?(source) and next
+      seen[source] = true
+      block.(source)
+    end
+    nil
+  end
+
+  # The clear_by_source method removes all records from the cache that
+  # have a source matching the given source.
+  #
+  # @param source [String] the source to filter records by
+  # @param digest [String, nil] the SHA256 hexadecimal digest of the source.
+  # @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
+  #
+  # @return [self] self
+  def clear_by_source(source, digest: nil, operator: ?=)
+    operator = operator == '=' ? '==' : '!='
+
+    each do |key, record|
+      next unless record.source == source
+      if digest
+        should_delete = record.digest.send(operator, digest)
+        delete(unpre(key)) if should_delete
+      else
+        delete(unpre(key))
+      end
+    end
+    self
+  end
+
+  # Checks if any records associated with the given source exist in the cache.
+  #
+  # @param source [String] the source to check for existence
+  # @param digest [String, nil] the SHA256 hexadecimal digest to compare against
+  # @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
+  #
+  # @return [Boolean] true if a matching record is found, false otherwise.
+  def source_exist?(source, digest: nil, operator: ?=)
+    operator = operator == '=' ? '==' : '!='
+
+    each do |_, record|
+      next unless record.source == source
+      if digest
+        if record.digest.send(operator, digest)
+          return true
+        end
+      else
+        return true
+      end
+    end
+    false
+  end
+
   # The clear method removes cached records based on the provided tags or
   # clears all records with the current prefix.
   #
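The `operator` keyword in `clear_by_source` decides whether records whose digest equals the given one, or records whose digest differs from it, are removed. The sketch below illustrates that behavior on a plain Hash. The store layout (`key => { source:, digest: }`) and the free-standing method are assumptions made for illustration and are not the gem's cache interface.

```ruby
# Hypothetical in-memory store: key => { source: String, digest: String }.
# Deletes matching entries in place and returns the store.
def clear_by_source(store, source, digest: nil, operator: '=')
  # Map the SQL-style operator onto the Ruby comparison method.
  comparison = operator == '=' ? :== : :!=

  store.delete_if do |_key, record|
    next false unless record[:source] == source
    # Without a digest every record of the source is removed; with one,
    # only records satisfying the comparison are removed.
    digest ? record[:digest].public_send(comparison, digest) : true
  end
end
```

Calling it with `operator: '!='` and the current digest removes only stale versions of a source, which is the pattern a source update would need.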
data/lib/documentrix/documents/cache/records.rb
CHANGED
@@ -27,7 +27,7 @@ module Documentrix::Documents::Cache::Records
   # The to_s method returns a string representation of the object.
   #
   # @return [String] A string containing the text and tags of the record,
-  #
+  # along with its similarity score.
   def to_s
     my_tags = tags_set
     my_tags.empty? or my_tags = " #{my_tags}"
data/lib/documentrix/documents/cache/redis_cache.rb
CHANGED
@@ -23,7 +23,7 @@ class Documentrix::Documents::RedisCache
   # @param [String] prefix the string to be used as the prefix for this cache
   # @param [String] url the URL of the Redis server (default: ENV['REDIS_URL'])
   # @param [Class] object_class the class of objects stored in Redis (default: nil)
-  def initialize(prefix:, url: ENV['REDIS_URL'], object_class:
+  def initialize(prefix:, url: ENV['REDIS_URL'], object_class:)
     super(prefix:)
     url or raise ArgumentError, 'require redis url'
     @url, @object_class = url, object_class
@@ -46,7 +46,7 @@ class Documentrix::Documents::RedisCache
   def [](key)
     value = redis.get(pre(key))
     unless value.nil?
-
+      JSON.parse(value, object_class:)
     end
   end
 
@@ -153,7 +153,7 @@ class Documentrix::Documents::RedisCache
 
     redis.scan_each(match: prefix + ?*) do |key|
       value = redis.get(key) or next
-      value =
+      value = JSON.parse(value, object_class:)
       block.(key, value)
     end
   end
data/lib/documentrix/documents/cache/sqlite_cache.rb
CHANGED
@@ -46,17 +46,17 @@ class Documentrix::Documents::Cache::SQLiteCache
     result = execute(
       %{
         SELECT records.key, records.text, records.norm, records.source,
-          records.tags, embeddings.embedding
+          records.digest, records.tags, embeddings.embedding
         FROM records
         INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
         WHERE records.key = ?
       },
       pre(key)
     )&.first or return
-    key, text, norm, source, tags, embedding = *result
+    key, text, norm, source, digest, tags, embedding = *result
     embedding = embedding.unpack("f*")
     tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-    convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+    convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
   end
 
   # The []= method sets the value for a given key by inserting it into the
@@ -66,15 +66,16 @@ class Documentrix::Documents::Cache::SQLiteCache
   # @param [Hash, Documentrix::Documents::Record] value the hash or record
   # containing the text, embedding, and other metadata
   def []=(key, value)
-    value
+    value = convert_value_to_record(value)
+    digest = compute_file_digest(value.source)
     embedding = value.embedding.pack("f*")
     execute(%{BEGIN})
     execute(%{INSERT INTO embeddings(embedding) VALUES(?)}, [ embedding ])
     embedding_id, = execute(%{ SELECT last_insert_rowid() }).flatten
     execute(%{
-      INSERT INTO records(key,text,embedding_id,norm,source,tags)
-      VALUES(
-    }, [ pre(key), value.text, embedding_id, value.norm, value.source, JSON(value.tags) ])
+      INSERT INTO records(key,text,embedding_id,norm,source,digest,tags)
+      VALUES(?,?,?,?,?,?,?)
+    }, [ pre(key), value.text, embedding_id, value.norm, value.source, digest, JSON(value.tags) ])
     execute(%{COMMIT})
   end
 
@@ -157,6 +158,70 @@ class Documentrix::Documents::Cache::SQLiteCache
     self
   end
 
+  # Removes all records associated with the specified source from the cache.
+  #
+  # If a digest is provided, the method will only remove records that do NOT
+  # match this digest. This allows for updating a source by wiping old versions
+  # while preserving records that are already up-to-date.
+  #
+  # @param source [String] the source identifier used to filter records
+  # @param digest [String, nil] the SHA256 hexadecimal digest of the source.
+  # Records matching this digest will be preserved.
+  #
+  # @return [self] the cache instance for method chaining
+  def clear_by_source(source, digest: nil, operator: ?=)
+    operator = '!=' if operator != ?=
+    if digest
+      execute(%{DELETE FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ])
+    else
+      execute(%{DELETE FROM records WHERE source = ?}, [ source ])
+    end
+    self
+  end
+
+  # The source_exist? method checks if any records associated with the given
+  # source exist in the cache. If a digest is provided, it verifies if the
+  # source exists and matches the specified digest using the provided operator.
+  #
+  # @param source [#to_s] the source to check for existence
+  # @param digest [String, nil] the SHA256 hexadecimal digest to compare
+  # against the stored source digest (optional)
+  # @param operator [String] the operator to use for comparison ('=' or '!=').
+  # Defaults to '='.
+  #
+  # @return [Boolean] true if the source exists (and matches the digest
+  # condition if provided), false otherwise.
+  def source_exist?(source, digest: nil, operator: ?=)
+    operator = '!=' if operator != ?=
+    if digest
+      !!execute(%{SELECT 1 FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ]).first
+    else
+      !!execute(%{SELECT 1 FROM records WHERE source = ?}, [ source ]).first
+    end
+  end
+
+  # Yields each unique, full source present in the cache records.
+  #
+  # This is a high-performance override for SQLite that avoids loading
+  # embeddings and parsing JSON for every record.
+  #
+  # @yield [source] the full source string
+  # @return [Enumerator] an enumerator if no block is given, nil otherwise.
+  def each_source(&block)
+    block or return enum_for(__method__)
+
+    execute(%{
+      SELECT DISTINCT source
+      FROM records
+      WHERE key LIKE ? AND source IS NOT NULL
+    }, [ "#@prefix%" ]).each do |source,|
+      source = source.full? or next
+
+      block.(source)
+    end
+    nil
+  end
+
   # Move a key prefix in the cache.
   #
   # This operation updates every record whose key starts with +old_prefix+,
@@ -197,14 +262,14 @@ class Documentrix::Documents::Cache::SQLiteCache
 
     execute(%{
       SELECT records.key, records.text, records.norm, records.source,
-        records.tags, embeddings.embedding
+        records.digest, records.tags, embeddings.embedding
       FROM records
       INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
       WHERE records.key LIKE ?
-    }, [ prefix ]).each do |key, text, norm, source, tags, embedding|
+    }, [ prefix ]).each do |key, text, norm, source, digest, tags, embedding|
       embedding = embedding.unpack("f*")
       tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-      value = convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+      value = convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
       block.(key, value)
     end
     self
@@ -264,34 +329,40 @@ class Documentrix::Documents::Cache::SQLiteCache
   # @param needle [ Array ] the embedding vector
   # @param tags [ Array ] the list of tags to filter by (optional)
   # @param max_records [ Integer ] the maximum number of records to return (optional)
+  # @param min_similarity [ Float ] the minimum similarity score to include (defaults to -1)
   #
   # @yield [ key, value ]
   #
   # @raise [ ArgumentError ] if needle size does not match embedding length
   #
   # @example
-  # documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ])
+  # documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ], min_similarity: 0.7)
   #
   # @return [ Array<Documentrix::Documents::Record> ] the list of matching records
-  def find_records(needle, tags: nil, max_records: nil)
+  def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
     needle.size != @embedding_length and
       raise ArgumentError, "needle embedding length != %s" % @embedding_length
     needle_binary = needle.pack("f*")
     max_records = [ max_records, size, 4_096 ].compact.min
     records = find_records_for_tags(tags)
     rowids_where = '(%s)' % records.transpose.last&.join(?,)
-    execute(
-
-      records.
-
-
-
-
-
+    execute(
+      %{
+        SELECT records.key, records.text, records.norm, records.source,
+          records.digest, records.tags, embeddings.embedding,
+          1 - vec_distance_cosine(?, vec_f32(embeddings.embedding)) AS similarity
+        FROM records
+        INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
+        WHERE embeddings.rowid IN #{rowids_where}
+          AND embeddings.embedding MATCH ? AND similarity >= ?
+          AND embeddings.k = ?
+        ORDER BY similarity DESC
+      }, [ needle_binary, needle_binary, min_similarity, max_records ]
+    ).map do |key, text, norm, source, digest, tags, embedding, similarity|
       key = unpre(key)
       embedding = embedding.unpack("f*")
       tags = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-      convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+      convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:, similarity:)
     end
   end
 
@@ -351,10 +422,18 @@ class Documentrix::Documents::Cache::SQLiteCache
         embedding_id integer,
         norm float NOT NULL DEFAULT 0.0,
         source text,
+        digest text,
         tags json NOT NULL DEFAULT [],
         FOREIGN KEY(embedding_id) REFERENCES embeddings(id) ON DELETE CASCADE
       )
     }
+    execute %{
+      CREATE TRIGGER IF NOT EXISTS delete_embedding_after_record AFTER DELETE ON records
+      FOR EACH ROW
+      BEGIN
+        DELETE FROM embeddings WHERE rowid = OLD.embedding_id;
+      END
+    }
     nil
   end
 
@@ -1,15 +1,38 @@
|
|
|
1
1
|
module Documentrix::Documents::Splitters
|
|
2
|
+
# The Character class provides basic text splitting based on a single
|
|
3
|
+
# separator and bundles the resulting segments into chunks of a maximum size.
|
|
4
|
+
#
|
|
5
|
+
# It allows for the preservation of separators and uses a combining string
|
|
6
|
+
# to join segments back together into chunks.
|
|
2
7
|
class Character
|
|
8
|
+
include Documentrix::Documents::Splitters::Common
|
|
9
|
+
|
|
10
|
+
# The default regex used to identify paragraph boundaries.
|
|
11
|
+
# It matches two or more consecutive newline characters (CRLF or LF).
|
|
12
|
+
#
|
|
13
|
+
# @return [Regexp]
|
|
3
14
|
DEFAULT_SEPARATOR = /(?:\r?\n){2,}/
|
|
4
15
|
|
|
5
|
-
|
|
6
|
-
|
|
7
|
-
|
|
16
|
+
# Initializes a new Character splitter.
|
|
17
|
+
#
|
|
18
|
+
# @param separator [Regexp] the regex used to split the text (defaults to DEFAULT_SEPARATOR)
|
|
19
|
+
# @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
|
|
20
|
+
# @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
|
|
21
|
+
# @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
|
|
22
|
+
# @param force [Boolean] whether to force-split the final chunk if it exceeds `chunk_size` (defaults to false)
|
|
23
|
+
def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false)
|
|
24
|
+
@separator, @include_separator, @combining_string, @chunk_size, @force =
|
|
25
|
+
separator, include_separator, combining_string, chunk_size, force
|
|
8
26
|
if include_separator
|
|
9
27
|
@separator = Regexp.new("(#@separator)")
|
|
10
28
|
end
|
|
11
29
|
end
|
|
12
30
|
|
|
31
|
+
# Splits the given text into chunks based on the configured separator and
|
|
32
|
+
# size limit.
|
|
33
|
+
#
|
|
34
|
+
# @param text [String] the text to be split
|
|
35
|
+
# @return [Array<String>] an array of text chunks
|
|
13
36
|
def split(text)
|
|
14
37
|
texts = []
|
|
15
38
|
text.split(@separator) do |t|
|
|
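Note on the hunk above: the doc comments describe splitting on a separator, optionally keeping the separators, and bundling segments into chunks of at most `chunk_size`. The sketch below illustrates that behavior under stated assumptions. The `bundle` method is a hypothetical greedy packer written for illustration. It is not the gem's implementation, whose body is only partly visible in this diff.

```ruby
separator = /(?:\r?\n){2,}/
text      = "alpha\n\nbeta\r\n\r\ngamma"

# Plain split drops the separators.
without_separators = text.split(separator)

# Wrapping the regex in a capture group, as the initializer does when
# include_separator is true, makes String#split return the separators too.
with_separators = text.split(Regexp.new("(#{separator})"))

# Hypothetical greedy bundling: append segments to the current chunk until
# adding the next one would exceed chunk_size, then start a new chunk.
def bundle(segments, combining_string: "\n\n", chunk_size: 4096)
  chunks  = []
  current = +""
  segments.each do |segment|
    candidate = current.empty? ? segment : current + combining_string + segment
    if current.empty? || candidate.size <= chunk_size
      current = candidate
    else
      chunks << current
      current = segment
    end
  end
  chunks << current unless current.empty?
  chunks
end

bundled   = bundle(%w[aaa bbb ccc], combining_string: " ", chunk_size: 7)
oversized = bundle(["x" * 10], chunk_size: 4)
```

In this sketch a single segment longer than `chunk_size` is passed through unchanged. That is the case the new `force` option and the `force_split` call in the following hunk appear to address.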
@@ -29,12 +52,27 @@ module Documentrix::Documents::Splitters
|
|
|
29
52
|
current_text = t
|
|
30
53
|
end
|
|
31
54
|
end
|
|
32
|
-
|
|
55
|
+
result.concat force_split(current_text)
|
|
33
56
|
result
|
|
34
57
|
end
|
|
35
58
|
end
|
|
36
59
|
|
|
60
|
+
# The RecursiveCharacter class implements a hierarchical splitting strategy.
|
|
61
|
+
#
|
|
62
|
+
# It attempts to split text using a priority list of separators. If a
|
|
63
|
+
# resulting chunk is still larger than the specified chunk_size, it
|
|
64
|
+
# recursively applies the next separator in the list until the size limit is
|
|
65
|
+
# met or all separators have been exhausted.
|
|
37
66
|
class RecursiveCharacter
|
|
67
|
+
include Documentrix::Documents::Splitters::Common
|
|
68
|
+
|
|
69
|
+
# The default priority list of regexes used for recursive splitting.
|
|
70
|
+
# The strategy is to split by the coarsest grain first (paragraphs)
|
|
71
|
+
# and move toward the finest grain (individual characters) as needed.
|
|
72
|
+
#
|
|
73
|
+
# Order: Paragraphs -> Newlines -> Word Boundaries -> Characters
|
|
74
|
+
#
|
|
75
|
+
# @return [Array<Regexp>]
|
|
38
76
|
DEFAULT_SEPARATORS = [
|
|
39
77
|
/(?:\r?\n){2,}/,
|
|
40
78
|
/\r?\n/,
|
|
@@ -42,13 +80,27 @@ module Documentrix::Documents::Splitters
|
|
|
42
80
|
//,
|
|
43
81
|
].freeze
|
|
44
82
|
|
|
83
|
+
# Initializes a new RecursiveCharacter splitter.
|
|
84
|
+
#
|
|
85
|
+
# @param separators [Array<Regexp>] a priority list of regexes to use for splitting (defaults to DEFAULT_SEPARATORS)
|
|
86
|
+
# @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
|
|
87
|
+
# @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
|
|
88
|
+
# @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
|
|
89
|
+
# @raise [ArgumentError] if the separators array is empty
|
|
45
90
|
def initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
|
|
46
91
|
separators.empty? and
|
|
47
92
|
raise ArgumentError, "non-empty array of separators required"
|
|
48
93
|
@separators, @include_separator, @combining_string, @chunk_size =
|
|
49
94
|
separators, include_separator, combining_string, chunk_size
|
|
95
|
+
@force = separators.last == //
|
|
50
96
|
end
|
|
51
97
|
|
|
98
|
+
# Recursively splits the given text into chunks using the list of
|
|
99
|
+
# separators.
|
|
100
|
+
#
|
|
101
|
+
# @param text [String] the text to be split
|
|
102
|
+
# @param separators [Array<Regexp>] the list of separators to use (defaults to @separators)
|
|
103
|
+
# @return [Array<String>] an array of text chunks
|
|
52
104
|
def split(text, separators: @separators)
|
|
53
105
|
separators.empty? and return [ text ]
|
|
54
106
|
separators = separators.dup
|