RubyGems - documentrix - Versions diffs - 0.2.0 → 0.3.0 - Mend

documentrix 0.2.0 → 0.3.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (26)

checksums.yaml +4 -4
data/CHANGES.md +69 -0
data/documentrix.gemspec +5 -5
data/lib/documentrix/documents/cache/common.rb +63 -11
data/lib/documentrix/documents/cache/records.rb +1 -1
data/lib/documentrix/documents/cache/redis_cache.rb +3 -3
data/lib/documentrix/documents/cache/sqlite_cache.rb +95 -27
data/lib/documentrix/documents/splitters/character.rb +56 -4
data/lib/documentrix/documents/splitters/common.rb +38 -0
data/lib/documentrix/documents/splitters/semantic.rb +67 -8
data/lib/documentrix/documents.rb +133 -29
data/lib/documentrix/utils/colorize_texts.rb +25 -21
data/lib/documentrix/utils/digests.rb +78 -0
data/lib/documentrix/utils.rb +1 -0
data/lib/documentrix/version.rb +1 -1
data/spec/documentrix/documents/cache/interface_spec.rb +16 -3
data/spec/documentrix/documents/cache/memory_cache_spec.rb +64 -2
data/spec/documentrix/documents/cache/redis_cache_spec.rb +68 -19
data/spec/documentrix/documents/cache/sqlite_cache_spec.rb +128 -2
data/spec/documentrix/documents/splitters/character_spec.rb +20 -2
data/spec/documentrix/documents/splitters/semantic_spec.rb +17 -5
data/spec/documents_spec.rb +59 -3
data/spec/utils/colorize_texts_spec.rb +0 -2
data/spec/utils/digests_spec.rb +97 -0
data/spec/utils/tags_spec.rb +0 -2
metadata +7 -1

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: c45e75b570207ac77d9a04e95939050c33d8f5e0645c7ebac96c862de4e252f3
-  data.tar.gz: 18b0f526ec16115483de74c027a7d8eaa8c8ba2461ac41d4695f53f1ceff32c7
+  metadata.gz: 68c409476101e4632597c139494f3fc1fe67bc0af7a655e5e5b3e4ebbd58f58c
+  data.tar.gz: fcd07ca7694b3fbed81c3a25f3fa9d9a675d120e5b82b43a13f5f71970012f3a
 SHA512:
-  metadata.gz: dddf96ef71ab25c35c6872905cea070c8a27036caf841d97626fadc1db172e496a3b8f13748000cbb9a096b66ae59f459bae145fc2a4a06abffad7de267362a9
-  data.tar.gz: cab52c41f1749fe0ff56538cd01e4525787921e700dca91e44f10d939ec942afc59478c4f8256033fdbf5ef4a38ac649d80eadfbc711042bccb85e38b056a569
+  metadata.gz: cf1e95f2d994bb130b89bff9d2e43062621b849b5d0f7fb4543092b1f491bab4b9dd9e51047c3dfb216090a4f9bc853c08ba35719dea7409c34a15758070fcba
+  data.tar.gz: 59aa3347c07c2521f5661326a55d73733470ef30c62e86fc7d1ff53a4acb6cbc2da06e4c7814896b59ca16bdfab8933a7018f3f41ab30ce29593c5fac480f342

data/CHANGES.md CHANGED

@@ -1,5 +1,74 @@
 # Changes
+## 2026-05-17 v0.3.0
+### New Features
+- **Source Tracking & Versioning**:
+    - Introduced `Documentrix::Utils::Digests` for SHA256 hashing of strings
+      and files, including an `mtime`-based cache.
+    - Implemented source-based document management in `Documentrix::Documents`
+      via `normalize_source`, `source_exist?`, `source_modified?`,
+      `source_update`, and `source_remove`.
+    - Updated `Documentrix::Documents#add` and
+      `Documentrix::Documents#source_update` to support `digest` for version
+      tracking.
+- **Text Splitting**:
+    - Added `Documentrix::Documents::Splitters::Common` to implement
+      `force_split` behavior.
+    - Integrated `force` splitting into `Character`, `RecursiveCharacter`, and
+      `Semantic` splitters.
+- **Cache Enhancements**:
+    - Added `each_source` to `Documentrix::Documents::Cache::Common` and an
+      optimized `SELECT DISTINCT source` implementation in
+      `Documentrix::Documents::Cache::SQLiteCache`.
+    - Added a SQLite trigger `delete_embedding_after_record` to automatically
+      clean the `embeddings` table.
+### Improvements & Refactorings
+- **Search & Retrieval**:
+    - Added `min_similarity` parameter to `Documentrix::Documents#find`,
+      `Documentrix::Documents::Cache::Common#find_records`, and
+      `Documentrix::Documents::Cache::SQLiteCache#find_records`.
+    - Optimized `Documentrix::Documents::Cache::SQLiteCache#find_records` by
+      moving similarity calculations into the SQL query using `1 -
+      vec_distance_cosine`.
+    - Simplified `Documentrix::Documents#find_where` by streamlining
+      `take_while` logic and utilizing `opts[:max_records]`.
+- **Cache Implementations**:
+    - Made `object_class` a required keyword argument in
+      `Documentrix::Documents::RedisCache#initialize`.
+    - Refactored `Documentrix::Documents::Cache::Common#clear_by_source` and
+      `Documentrix::Documents::Cache::Common#source_exist?` to use ternary
+      operators.
+    - Improved `Documentrix::Documents::Cache::SQLiteCache#each_source` and
+      `Documentrix::Documents::Cache::SQLiteCache#find_records` for better
+      robustness and formatting.
+- **Documentation & Tooling**:
+    - Expanded YARD documentation for
+      `Documentrix::Documents::Splitters::Character`, `RecursiveCharacter`,
+      `Semantic`, and `Documentrix::Utils::ColorizeTexts`.
+    - Centralized RSpec configuration via a `.rspec` file.
+### Bug Fixes
+- Fixed an issue in `Documentrix::Documents#find` where `max_records` was
+  hardcoded to `nil` when calling the cache.
+- Adjusted default handling of `min_similarity` in
+  `Documentrix::Documents#find` to use `min_similarity ||= -1` within the
+  method body.
+### Testing
+- Significantly expanded test suites for `SQLiteCache`, `MemoryCache`, and
+  `RedisCache`, specifically covering `each_source`, `tags`, `clear_for_tags`,
+  and digest-based checks.
+- Added new test cases in `spec/documents_spec.rb` for source management and
+  `Documentrix::Documents#source_update`.
+- Added `spec/utils/digests_spec.rb` and updated splitter specs to verify
+  `force` splitting behavior.
 ## 2026-05-12 v0.2.0
 ### Added
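
The changelog entry above mentions SHA256 hashing of strings and files with an `mtime`-based cache, but `digests.rb` itself is not shown in this part of the diff. The following is a minimal sketch of that idea using only Ruby's standard `digest` library. The class `FileDigestSketch` and its methods `string_digest`, `file_digest`, and `computations` are hypothetical names chosen for illustration and are not the gem's `Documentrix::Utils::Digests` API.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper for illustration; not the gem's Digests module.
class FileDigestSketch
  # Counts how often a file was actually hashed.
  attr_reader :computations

  def initialize
    @cache        = {} # path => [ mtime, hexdigest ]
    @computations = 0
  end

  # SHA256 hex digest of a string.
  def string_digest(string)
    Digest::SHA256.hexdigest(string)
  end

  # SHA256 hex digest of a file, recomputed only when its mtime changes.
  def file_digest(path)
    mtime = File.mtime(path)
    cached_mtime, cached_digest = @cache[path]
    return cached_digest if cached_mtime == mtime
    @computations += 1
    digest = Digest::SHA256.file(path).hexdigest
    @cache[path] = [ mtime, digest ]
    digest
  end
end

Tempfile.create('source') do |file|
  file.write('hello')
  file.flush
  digests = FileDigestSketch.new
  first   = digests.file_digest(file.path)
  second  = digests.file_digest(file.path) # answered from the cache
  puts first == second
  puts digests.computations
end
```

A cache keyed on `mtime` trades accuracy for speed: a file rewritten without its modification time changing would keep its old digest in this sketch.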

data/documentrix.gemspec CHANGED

@@ -1,9 +1,9 @@
 # -*- encoding: utf-8 -*-
-# stub: documentrix 0.2.0 ruby lib
+# stub: documentrix 0.3.0 ruby lib
 Gem::Specification.new do |s|
   s.name = "documentrix".freeze
-  s.version = "0.2.0".freeze
+  s.version = "0.3.0".freeze
   s.required_rubygems_version = Gem::Requirement.new(">= 0".freeze) if s.respond_to? :required_rubygems_version=
   s.require_paths = ["lib".freeze]
@@ -11,15 +11,15 @@ Gem::Specification.new do |s|
   s.date = "1980-01-02"
   s.description = "The Ruby library, Documentrix, is designed to provide a way to build and\nquery vector databases for applications in natural language processing\n(NLP) and large language models (LLMs). It allows users to store and\nretrieve dense vector embeddings for text strings.\n".freeze
   s.email = "flori@ping.de".freeze
-  s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
-  s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
+  s.extra_rdoc_files = ["README.md".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze]
+  s.files = [".envrc".freeze, ".utilsrc".freeze, ".yardopts".freeze, "CHANGES.md".freeze, "Gemfile".freeze, "LICENSE".freeze, "README.md".freeze, "Rakefile".freeze, "docker-compose.yml".freeze, "documentrix.gemspec".freeze, "lib/documentrix.rb".freeze, "lib/documentrix/documents.rb".freeze, "lib/documentrix/documents/cache/common.rb".freeze, "lib/documentrix/documents/cache/memory_cache.rb".freeze, "lib/documentrix/documents/cache/records.rb".freeze, "lib/documentrix/documents/cache/redis_cache.rb".freeze, "lib/documentrix/documents/cache/sqlite_cache.rb".freeze, "lib/documentrix/documents/splitters/character.rb".freeze, "lib/documentrix/documents/splitters/common.rb".freeze, "lib/documentrix/documents/splitters/semantic.rb".freeze, "lib/documentrix/utils.rb".freeze, "lib/documentrix/utils/colorize_texts.rb".freeze, "lib/documentrix/utils/digests.rb".freeze, "lib/documentrix/utils/math.rb".freeze, "lib/documentrix/utils/tags.rb".freeze, "lib/documentrix/version.rb".freeze, "redis/redis.conf".freeze, "spec/assets/embeddings.json".freeze, "spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
   s.homepage = "https://github.com/flori/documentrix".freeze
   s.licenses = ["MIT".freeze]
   s.rdoc_options = ["--title".freeze, "Documentrix - Ruby library for embedding vector database".freeze, "--main".freeze, "README.md".freeze]
   s.required_ruby_version = Gem::Requirement.new(">= 3.1".freeze)
   s.rubygems_version = "4.0.10".freeze
   s.summary = "Ruby library for embedding vector database".freeze
-  s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
+  s.test_files = ["spec/documentrix/documents/cache/interface_spec.rb".freeze, "spec/documentrix/documents/cache/memory_cache_spec.rb".freeze, "spec/documentrix/documents/cache/redis_cache_spec.rb".freeze, "spec/documentrix/documents/cache/sqlite_cache_spec.rb".freeze, "spec/documentrix/documents/splitters/character_spec.rb".freeze, "spec/documentrix/documents/splitters/semantic_spec.rb".freeze, "spec/documents_spec.rb".freeze, "spec/spec_helper.rb".freeze, "spec/utils/colorize_texts_spec.rb".freeze, "spec/utils/digests_spec.rb".freeze, "spec/utils/tags_spec.rb".freeze]
   s.specification_version = 4

data/lib/documentrix/documents/cache/common.rb CHANGED

@@ -12,6 +12,7 @@
 # memory, Redis, and SQLite.
 module Documentrix::Documents::Cache::Common
   include Documentrix::Utils::Math
+  include Documentrix::Utils::Digests
   include Enumerable
   # The initialize method sets up the Documentrix::Documents::Cache instance's
@@ -62,27 +63,29 @@ module Documentrix::Documents::Cache::Common
   # @param needle [ Array ] an array containing the embedding vector
   # @param tags [ String, Array ] a string or array of strings representing the tags to search for
   # @param max_records [ Integer ] the maximum number of records to return
+  # @param min_similarity [ Float ] the minimum similarity score required for a record to be returned (defaults to -1)
   #
-  # @yield [ record ]
-  #
-  # @return [ Array<Documentrix::Documents::Records> ] an array containing the matching records
-  def find_records(needle, tags: nil, max_records: nil)
+  # @return [ Array<Documentrix::Documents::Record> ] an array containing the matching records
+  def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
     tags    = Documentrix::Utils::Tags.new(Array(tags)).to_a
     records = self
     if tags.present?
       records = records.select { |_key, record| (tags & record.tags).size >= 1 }
     end
     needle_norm = norm(needle)
-    records     = records.sort_by { |key, record|
+    records     = records.map do |key, record|
       record.key        = key
       record.similarity = cosine_similarity(
-        a: needle,
-        b: record.embedding,
+        a:      needle,
+        b:      record.embedding,
         a_norm: needle_norm,
         b_norm: record.norm,
       )
-    }
-    records.transpose.last&.reverse.to_a
+      record
+    end.sort_by(&:similarity).reverse.select { _1.similarity >= min_similarity }
+    max_records ? records.take(max_records) : records
   end
   # Returns a set of unique tags found in the cache records.
@@ -116,19 +119,68 @@ module Documentrix::Documents::Cache::Common
     self
   end
+  # Yields each unique, full source present in the cache records.
+  #
+  # @yield [source] the full source string
+  # @return [Enumerator] an enumerator if no block is given, nil otherwise.
+  def each_source(&block)
+    block or return enum_for(__method__)
+    seen = {}
+    each do |_key, record|
+      source = record.source.full? or next
+      seen.key?(source) and next
+      seen[source] = true
+      block.(source)
+    end
+    nil
+  end
   # The clear_by_source method removes all records from the cache that
   # have a source matching the given source.
   #
   # @param source [String] the source to filter records by
+  # @param digest [String, nil] the SHA256 hexadecimal digest of the source.
+  # @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
   #
   # @return [self] self
-  def clear_by_source(source)
+  def clear_by_source(source, digest: nil, operator: ?=)
+    operator = operator == '=' ? '==' : '!='
     each do |key, record|
-      delete(unpre(key)) if record.source == source
+      next unless record.source == source
+      if digest
+        should_delete = record.digest.send(operator, digest)
+        delete(unpre(key)) if should_delete
+      else
+        delete(unpre(key))
+      end
     end
     self
   end
+  # Checks if any records associated with the given source exist in the cache.
+  #
+  # @param source [String] the source to check for existence
+  # @param digest [String, nil] the SHA256 hexadecimal digest to compare against
+  # @param operator [Symbol, String] the operator to compare the digest with ('=' or '!=')
+  #
+  # @return [Boolean] true if a matching record is found, false otherwise.
+  def source_exist?(source, digest: nil, operator: ?=)
+    operator = operator == '=' ? '==' : '!='
+    each do |_, record|
+      next unless record.source == source
+      if digest
+        if record.digest.send(operator, digest)
+          return true
+        end
+      else
+        return true
+      end
+    end
+    false
+  end
   # The clear method removes cached records based on the provided tags or
   # clears all records with the current prefix.
   #
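
The reworked `find_records` above scores every record, sorts by similarity in descending order, drops records below `min_similarity`, and then applies `max_records`. A standalone sketch of that pipeline may make the ordering of the steps easier to follow. `Record`, `cosine_similarity`, and `rank` here are hypothetical stand-ins written for this example, not the gem's own classes or helpers, and norms are recomputed on each call rather than cached as in the diff.

```ruby
# Hypothetical stand-in for the gem's record objects.
Record = Struct.new(:text, :embedding, :similarity)

# Plain cosine similarity of two equally sized, non-zero vectors.
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

# Score, sort descending, filter by threshold, then limit.
def rank(records, needle, max_records: nil, min_similarity: -1)
  ranked = records.map { |record|
    record.similarity = cosine_similarity(needle, record.embedding)
    record
  }.sort_by(&:similarity).reverse.select { |r| r.similarity >= min_similarity }
  max_records ? ranked.take(max_records) : ranked
end

records = [
  Record.new('orthogonal',     [ 0.0, 1.0 ]),
  Record.new('same direction', [ 1.0, 0.0 ]),
  Record.new('opposite',       [ -1.0, 0.0 ]),
]

rank(records, [ 1.0, 0.0 ], min_similarity: 0.5).each do |record|
  puts "#{record.text}: #{record.similarity}"
end
```

Because cosine similarity lies in the range -1 to 1, the default `min_similarity` of `-1` keeps every record, so the threshold only takes effect when a caller raises it.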

data/lib/documentrix/documents/cache/records.rb CHANGED

@@ -27,7 +27,7 @@ module Documentrix::Documents::Cache::Records
     # The to_s method returns a string representation of the object.
     #
     # @return [String] A string containing the text and tags of the record,
-    # along with its similarity score.
+    #   along with its similarity score.
     def to_s
       my_tags = tags_set
       my_tags.empty? or my_tags = " #{my_tags}"

data/lib/documentrix/documents/cache/redis_cache.rb CHANGED

@@ -23,7 +23,7 @@ class Documentrix::Documents::RedisCache
   # @param [String] prefix the string to be used as the prefix for this cache
   # @param [String] url the URL of the Redis server (default: ENV['REDIS_URL'])
   # @param [Class] object_class the class of objects stored in Redis (default: nil)
-  def initialize(prefix:, url: ENV['REDIS_URL'], object_class: nil)
+  def initialize(prefix:, url: ENV['REDIS_URL'], object_class:)
     super(prefix:)
     url or raise ArgumentError, 'require redis url'
     @url, @object_class = url, object_class
@@ -46,7 +46,7 @@ class Documentrix::Documents::RedisCache
   def [](key)
     value = redis.get(pre(key))
     unless value.nil?
-      object_class ? JSON.parse(value, object_class:) : JSON.parse(value)
+      JSON.parse(value, object_class:)
     end
   end
@@ -153,7 +153,7 @@ class Documentrix::Documents::RedisCache
     redis.scan_each(match: prefix + ?*) do |key|
       value = redis.get(key) or next
-      value = object_class ? JSON.parse(value, object_class:) : JSON.parse(value)
+      value = JSON.parse(value, object_class:)
       block.(key, value)
     end
   end
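
With `object_class:` now a required keyword, the conditional fallback to a bare `JSON.parse(value)` is gone and every stored value is parsed into the configured class. The sketch below illustrates both effects without a Redis server. `SketchCache` and `SketchRecord` are hypothetical names for this example, and a plain Hash replaces the Redis connection. The only library call relied on is the `object_class` option of `JSON.parse`.

```ruby
require 'json'

# Hypothetical record type; JSON.parse fills it through Hash#[]=.
class SketchRecord < Hash
  def text
    self['text']
  end
end

# Hypothetical cache backed by a plain Hash instead of Redis.
class SketchCache
  # object_class has no default, so omitting it raises ArgumentError.
  def initialize(prefix:, object_class:)
    @prefix, @object_class, @store = prefix, object_class, {}
  end

  def []=(key, value)
    @store[@prefix + key] = JSON.generate(value)
  end

  def [](key)
    value = @store[@prefix + key] or return
    JSON.parse(value, object_class: @object_class)
  end
end

cache = SketchCache.new(prefix: 'docs-', object_class: SketchRecord)
cache['greeting'] = { 'text' => 'hello' }
puts cache['greeting'].class
puts cache['greeting'].text
```

Callers that previously relied on the `nil` default would, under this reading of the diff, need to pass a class explicitly after upgrading.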

data/lib/documentrix/documents/cache/sqlite_cache.rb CHANGED

@@ -46,17 +46,17 @@ class Documentrix::Documents::Cache::SQLiteCache
     result = execute(
       %{
         SELECT records.key, records.text, records.norm, records.source,
-          records.tags, embeddings.embedding
+          records.digest, records.tags, embeddings.embedding
         FROM records
         INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
         WHERE records.key = ?
       },
       pre(key)
     )&.first or return
-    key, text, norm, source, tags, embedding = *result
+    key, text, norm, source, digest, tags, embedding = *result
     embedding = embedding.unpack("f*")
     tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-    convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+    convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
   end
   # The []= method sets the value for a given key by inserting it into the
@@ -66,15 +66,16 @@ class Documentrix::Documents::Cache::SQLiteCache
   # @param [Hash, Documentrix::Documents::Record] value the hash or record
   #        containing the text, embedding, and other metadata
   def []=(key, value)
-    value = convert_value_to_record(value)
+    value     = convert_value_to_record(value)
+    digest    = compute_file_digest(value.source)
     embedding = value.embedding.pack("f*")
     execute(%{BEGIN})
     execute(%{INSERT INTO embeddings(embedding) VALUES(?)}, [ embedding ])
     embedding_id, = execute(%{ SELECT last_insert_rowid() }).flatten
     execute(%{
-      INSERT INTO records(key,text,embedding_id,norm,source,tags)
-      VALUES(?,?,?,?,?,?)
-    }, [ pre(key), value.text, embedding_id, value.norm, value.source, JSON(value.tags) ])
+      INSERT INTO records(key,text,embedding_id,norm,source,digest,tags)
+      VALUES(?,?,?,?,?,?,?)
+    }, [ pre(key), value.text, embedding_id, value.norm, value.source, digest, JSON(value.tags) ])
     execute(%{COMMIT})
   end
@@ -157,17 +158,70 @@ class Documentrix::Documents::Cache::SQLiteCache
     self
   end
-  # The clear_by_source method removes all records from the cache that
-  # have a source matching the given source.
+  # Removes all records associated with the specified source from the cache.
   #
-  # @param source [String] the source to filter records by
+  # If a digest is provided, the method will only remove records that do NOT
+  # match this digest. This allows for updating a source by wiping old versions
+  # while preserving records that are already up-to-date.
   #
-  # @return [Documentrix::Documents::Cache::SQLiteCache] self
-  def clear_by_source(source)
-    execute(%{DELETE FROM records WHERE source = ?}, [ source ])
+  # @param source [String] the source identifier used to filter records
+  # @param digest [String, nil] the SHA256 hexadecimal digest of the source.
+  #   Records matching this digest will be preserved.
+  #
+  # @return [self] the cache instance for method chaining
+  def clear_by_source(source, digest: nil, operator: ?=)
+    operator = '!=' if operator != ?=
+    if digest
+      execute(%{DELETE FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ])
+    else
+      execute(%{DELETE FROM records WHERE source = ?}, [ source ])
+    end
     self
   end
+  # The source_exist? method checks if any records associated with the given
+  # source exist in the cache. If a digest is provided, it verifies if the
+  # source exists and matches the specified digest using the provided operator.
+  #
+  # @param source [#to_s] the source to check for existence
+  # @param digest [String, nil] the SHA256 hexadecimal digest to compare
+  #   against the stored source digest (optional)
+  # @param operator [String] the operator to use for comparison ('=' or '!=').
+  #   Defaults to '='.
+  #
+  # @return [Boolean] true if the source exists (and matches the digest
+  #   condition if provided), false otherwise.
+  def source_exist?(source, digest: nil, operator: ?=)
+    operator = '!=' if operator != ?=
+    if digest
+      !!execute(%{SELECT 1 FROM records WHERE source = ? AND digest #{operator} ? }, [ source, digest ]).first
+    else
+      !!execute(%{SELECT 1 FROM records WHERE source = ?}, [ source ]).first
+    end
+  end
+  # Yields each unique, full source present in the cache records.
+  #
+  # This is a high-performance override for SQLite that avoids loading
+  # embeddings and parsing JSON for every record.
+  #
+  # @yield [source] the full source string
+  # @return [Enumerator] an enumerator if no block is given, nil otherwise.
+  def each_source(&block)
+    block or return enum_for(__method__)
+    execute(%{
+      SELECT DISTINCT source
+      FROM records
+      WHERE key LIKE ? AND source IS NOT NULL
+    }, [ "#@prefix%" ]).each do |source,|
+      source = source.full? or next
+      block.(source)
+    end
+    nil
+  end
   # Move a key prefix in the cache.
   #
   # This operation updates every record whose key starts with +old_prefix+,
@@ -208,14 +262,14 @@ class Documentrix::Documents::Cache::SQLiteCache
     execute(%{
       SELECT records.key, records.text, records.norm, records.source,
-        records.tags, embeddings.embedding
+        records.digest, records.tags, embeddings.embedding
       FROM records
       INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
       WHERE records.key LIKE ?
-    }, [ prefix ]).each do |key, text, norm, source, tags, embedding|
+    }, [ prefix ]).each do |key, text, norm, source, digest, tags, embedding|
       embedding = embedding.unpack("f*")
       tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-      value     = convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+      value     = convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:)
       block.(key, value)
     end
     self
@@ -275,34 +329,40 @@ class Documentrix::Documents::Cache::SQLiteCache
   # @param needle [ Array ] the embedding vector
   # @param tags [ Array ] the list of tags to filter by (optional)
   # @param max_records [ Integer ] the maximum number of records to return (optional)
+  # @param min_similarity [ Float ] the minimum similarity score to include (defaults to -1)
   #
   # @yield [ key, value ]
   #
   # @raise [ ArgumentError ] if needle size does not match embedding length
   #
   # @example
-  #   documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ])
+  #   documents.find_records([ 0.1 ] * 1_024, tags: %w[ test ], min_similarity: 0.7)
   #
   # @return [ Array<Documentrix::Documents::Record> ] the list of matching records
-  def find_records(needle, tags: nil, max_records: nil)
+  def find_records(needle, tags: nil, max_records: nil, min_similarity: -1)
     needle.size != @embedding_length and
       raise ArgumentError, "needle embedding length != %s" % @embedding_length
     needle_binary = needle.pack("f*")
     max_records   = [ max_records, size, 4_096 ].compact.min
     records = find_records_for_tags(tags)
     rowids_where = '(%s)' % records.transpose.last&.join(?,)
-    execute(%{
-      SELECT records.key, records.text, records.norm, records.source,
-        records.tags, embeddings.embedding
-      FROM records
-      INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
-      WHERE embeddings.rowid IN #{rowids_where}
-        AND embeddings.embedding MATCH ? AND embeddings.k = ?
-    }, [ needle_binary, max_records ]).map do |key, text, norm, source, tags, embedding|
+    execute(
+      %{
+        SELECT records.key, records.text, records.norm, records.source,
+          records.digest, records.tags, embeddings.embedding,
+          1 - vec_distance_cosine(?, vec_f32(embeddings.embedding)) AS similarity
+        FROM records
+        INNER JOIN embeddings ON records.embedding_id = embeddings.rowid
+        WHERE embeddings.rowid IN #{rowids_where}
+          AND embeddings.embedding MATCH ? AND similarity >= ?
+          AND embeddings.k = ?
+        ORDER BY similarity DESC
+      }, [ needle_binary, needle_binary, min_similarity, max_records ]
+    ).map do |key, text, norm, source, digest, tags, embedding, similarity|
       key       = unpre(key)
       embedding = embedding.unpack("f*")
       tags      = Documentrix::Utils::Tags.new(JSON(tags.to_s).to_a, source:)
-      convert_value_to_record(key:, text:, norm:, source:, tags:, embedding:)
+      convert_value_to_record(key:, text:, norm:, source:, digest:, tags:, embedding:, similarity:)
     end
   end
@@ -362,10 +422,18 @@ class Documentrix::Documents::Cache::SQLiteCache
         embedding_id integer,
         norm         float NOT NULL DEFAULT 0.0,
         source       text,
+        digest       text,
         tags         json NOT NULL DEFAULT [],
         FOREIGN KEY(embedding_id) REFERENCES embeddings(id) ON DELETE CASCADE
       )
     }
+    execute %{
+      CREATE TRIGGER IF NOT EXISTS delete_embedding_after_record AFTER DELETE ON records
+      FOR EACH ROW
+      BEGIN
+        DELETE FROM embeddings WHERE rowid = OLD.embedding_id;
+      END
+    }
     nil
   end
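
The `digest:` and `operator:` arguments of `clear_by_source` and `source_exist?` above amount to the predicate `source = ? AND digest <operator> ?`. Read literally, the default operator `=` makes the `DELETE` target records whose digest equals the given one, whereas the YARD comment on `clear_by_source` describes preserving them, which corresponds to passing `operator: '!='`. The sketch below models the predicate over in-memory rows to show an update check in that style. `Row`, `digest_match?`, and the free-standing `clear_by_source` and `source_exist?` functions are hypothetical and do not touch SQLite. One known difference: in SQL, a row whose `digest` is `NULL` satisfies neither `=` nor `!=`, while Ruby's `nil != 'x'` is true.

```ruby
# Hypothetical in-memory rows standing in for the `records` table.
Row = Struct.new(:key, :source, :digest)

# True when `row` is selected by the source/digest predicate.
def digest_match?(row, source, digest, operator)
  return false unless row.source == source
  return true  if digest.nil?
  operator == '=' ? row.digest == digest : row.digest != digest
end

# Removes every selected row and returns the remaining rows.
def clear_by_source(rows, source, digest: nil, operator: '=')
  rows.reject! { |row| digest_match?(row, source, digest, operator) }
  rows
end

# Reports whether at least one row is selected.
def source_exist?(rows, source, digest: nil, operator: '=')
  rows.any? { |row| digest_match?(row, source, digest, operator) }
end

rows = [
  Row.new('a-1', 'a.txt', 'old'),
  Row.new('a-2', 'a.txt', 'new'),
  Row.new('b-1', 'b.txt', 'other'),
]

# Is anything stored for a.txt that does not carry the current digest?
stale = source_exist?(rows, 'a.txt', digest: 'new', operator: '!=')
# Drop only the stale a.txt rows, keeping the up-to-date ones.
clear_by_source(rows, 'a.txt', digest: 'new', operator: '!=') if stale
p rows.map(&:key)
```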

data/lib/documentrix/documents/splitters/character.rb CHANGED

@@ -1,15 +1,38 @@
 module Documentrix::Documents::Splitters
+  # The Character class provides basic text splitting based on a single
+  # separator and bundles the resulting segments into chunks of a maximum size.
+  #
+  # It allows for the preservation of separators and uses a combining string
+  # to join segments back together into chunks.
   class Character
+    include Documentrix::Documents::Splitters::Common
+    # The default regex used to identify paragraph boundaries.
+    # It matches two or more consecutive newline characters (CRLF or LF).
+    #
+    # @return [Regexp]
     DEFAULT_SEPARATOR = /(?:\r?\n){2,}/
-    def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
-      @separator, @include_separator, @combining_string, @chunk_size =
-        separator, include_separator, combining_string, chunk_size
+    # Initializes a new Character splitter.
+    #
+    # @param separator [Regexp] the regex used to split the text (defaults to DEFAULT_SEPARATOR)
+    # @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
+    # @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
+    # @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
+    # @param force [Boolean] whether to force-split the final chunk if it exceeds `chunk_size` (defaults to false)
+    def initialize(separator: DEFAULT_SEPARATOR, include_separator: false, combining_string: "\n\n", chunk_size: 4096, force: false)
+      @separator, @include_separator, @combining_string, @chunk_size, @force =
+        separator, include_separator, combining_string, chunk_size, force
       if include_separator
         @separator = Regexp.new("(#@separator)")
       end
     end
+    # Splits the given text into chunks based on the configured separator and
+    # size limit.
+    #
+    # @param text [String] the text to be split
+    # @return [Array<String>] an array of text chunks
     def split(text)
       texts = []
       text.split(@separator) do |t|
@@ -29,12 +52,27 @@ module Documentrix::Documents::Splitters
           current_text = t
         end
       end
-      current_text.empty? or result << current_text
+      result.concat force_split(current_text)
       result
     end
   end
+  # The RecursiveCharacter class implements a hierarchical splitting strategy.
+  #
+  # It attempts to split text using a priority list of separators. If a
+  # resulting chunk is still larger than the specified chunk_size, it
+  # recursively applies the next separator in the list until the size limit is
+  # met or all separators have been exhausted.
   class RecursiveCharacter
+    include Documentrix::Documents::Splitters::Common
+    # The default priority list of regexes used for recursive splitting.
+    # The strategy is to split by the coarsest grain first (paragraphs)
+    # and move toward the finest grain (individual characters) as needed.
+    #
+    # Order: Paragraphs -> Newlines -> Word Boundaries -> Characters
+    #
+    # @return [Array<Regexp>]
     DEFAULT_SEPARATORS = [
       /(?:\r?\n){2,}/,
       /\r?\n/,
@@ -42,13 +80,27 @@ module Documentrix::Documents::Splitters
       //,
     ].freeze
+    # Initializes a new RecursiveCharacter splitter.
+    #
+    # @param separators [Array<Regexp>] a priority list of regexes to use for splitting (defaults to DEFAULT_SEPARATORS)
+    # @param include_separator [Boolean] whether to include the separator in the resulting chunks (defaults to false)
+    # @param combining_string [String] the string used to join segments into chunks (defaults to "\n\n")
+    # @param chunk_size [Integer] the maximum size of each resulting chunk (defaults to 4096)
+    # @raise [ArgumentError] if the separators array is empty
     def initialize(separators: DEFAULT_SEPARATORS, include_separator: false, combining_string: "\n\n", chunk_size: 4096)
       separators.empty? and
         raise ArgumentError, "non-empty array of separators required"
       @separators, @include_separator, @combining_string, @chunk_size =
         separators, include_separator, combining_string, chunk_size
+      @force = separators.last == //
     end
+    # Recursively splits the given text into chunks using the list of
+    # separators.
+    #
+    # @param text [String] the text to be split
+    # @param separators [Array<Regexp>] the list of separators to use (defaults to @separators)
+    # @return [Array<String>] an array of text chunks
     def split(text, separators: @separators)
       separators.empty? and return [ text ]
       separators = separators.dup

data/lib/documentrix/documents/splitters/common.rb ADDED

@@ -0,0 +1,38 @@
+# A shared utility module for text splitters that provides consistent
+# handling of chunk size constraints.
+#
+# This module is intended to be included in splitter classes that
+# implement a maximum chunk size limit. It expects the including class
+# to provide the following attributes:
+# - `force` [Boolean]: Whether to hard-split chunks that exceed the limit.
+# - `chunk_size` [Integer]: The maximum allowed size for a single chunk.
+module Documentrix::Documents::Splitters::Common
+  private
+  # Whether to force-split chunks that exceed the chunk size limit.
+  # @return [Boolean]
+  attr_reader :force
+  # The maximum allowed size for a single chunk.
+  # @return [Integer]
+  attr_reader :chunk_size
+  # Ensures text respects the chunk size limit if force splitting is enabled.
+  #
+  # If the `force` attribute is true and the provided text exceeds the
+  # `chunk_size`, the text is hard-split into fixed-size chunks using a
+  # regular expression. If `force` is false or the text is within the
+  # limit, the text is returned wrapped in a single-element array to
+  # maintain return-type consistency (Array<String>).
+  #
+  # @param text [String, nil] the text to potentially split
+  # @return [Array<String>] the resulting chunk(s), or an empty array if text is nil/empty
+  def force_split(text)
+    text&.empty? and return []
+    if force && text.size > chunk_size
+      text.scan(/.{1,#{chunk_size}}/)
+    else
+      Array(text)
+    end
+  end
+end
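
The hard-split in `force_split` above relies on `String#scan` with a bounded repetition. A standalone sketch of the same technique follows, written as a free function with explicit `chunk_size:` and `force:` keywords instead of the module's private readers; that signature is an assumption made for this example. The sketch also deviates in two ways that are worth naming: it uses the `/m` flag so that `.` also matches newlines, and it returns early for `nil`. Without `/m`, as in the diff, `scan` breaks chunks at newlines and omits the newline characters from the result.

```ruby
# Hypothetical free-function variant of the hard-split technique.
def force_split(text, chunk_size:, force:)
  return [] if text.nil? || text.empty?
  if force && text.size > chunk_size
    # /m lets `.` match newlines, so no characters are dropped.
    text.scan(/.{1,#{chunk_size}}/m)
  else
    [ text ]
  end
end

p force_split('abcdefghij', chunk_size: 4, force: true)
p force_split('abcdefghij', chunk_size: 4, force: false)

# Same pattern without /m: the newline is skipped and ends a chunk early.
p "ab\ncd".scan(/.{1,3}/)
p "ab\ncd".scan(/.{1,3}/m)
```

Whether dropping newlines during a forced split is intended by the gem is not stated in the diff, so the `/m` variant is offered only as a point of comparison.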