semantic_chunker 0.6.3 → 0.6.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -0
- data/README.md +70 -6
- data/bin/semantic_chunker +5 -1
- data/lib/semantic_chunker/adapters/base.rb +7 -0
- data/lib/semantic_chunker/adapters/hugging_face_adapter.rb +48 -21
- data/lib/semantic_chunker/adapters/openai_adapter.rb +24 -8
- data/lib/semantic_chunker/adapters/test_adapter.rb +13 -2
- data/lib/semantic_chunker/chunker.rb +95 -22
- data/lib/semantic_chunker/version.rb +6 -1
- data/lib/semantic_chunker.rb +18 -1
- metadata +59 -3
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ec9d3bbf399db8e906944a579296b65cae0119b2690a6d267007b2444477494b
+  data.tar.gz: 1ac963e323fa5a4bbd9ce7c86e39627518e3dd8add656f11a304fc3ae6f1d61b
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9be15aaeb5b4b55b6fec0a0c1c42ff659df19a9af969e70626f2d82dacfbadd10a0ad798e06e821c419dc554d29fc09b1c9308de38439942fef8d9e4731321ec
+  data.tar.gz: 525daebf70c85895c24d17f3d2ff718ea1874b0987c89e63e0813a61c9322692ee2a11ccf047ae2a5b46e54a8efabf7a7c4259e6dbeb0c8167cb9d22fe8e0612
data/CHANGELOG.md
CHANGED

@@ -1,6 +1,20 @@
 # Changelog
 
 All notable changes to this project will be documented in this file.
+## [0.6.4] - 2026-01-15
+### Added
+- **Anchor-Sentence Drift Protection**: New optional feature to prevent topic bleed.
+- **CLI Drift Flag**: Added `-d` and `--drift` flags to the command-line interface.
+- **Drift Validation**: Added input validation to ensure thresholds stay within semantic bounds (-1.0 to 1.0).
+
+### Fixed
+- Improved internal centroid calculation efficiency for long chunks.
+
+## [0.6.3] - 2026-01-10
+### Added
+- **YARD Documentation**: Added YARD documentation to all classes and methods in the `lib` directory.
+- **Rake Task**: Added a `yard` rake task to generate documentation.
+
 [0.6.2] - 2026-01-07
 ----------------------
 
data/README.md
CHANGED

@@ -240,6 +240,20 @@ You can set a hard limit on the character length of a chunk using `max_chunk_siz
 chunker = SemanticChunker::Chunker.new(max_chunk_size: 1000)
 ```
 
+### Anchor-Sentence Drift Protection (v0.6.4)
+While standard semantic chunking looks at the "local" flow of sentences, long chunks can sometimes suffer from **Topic Drift** (the "Telephone Game" effect).
+
+By setting a `drift_threshold`, the chunker compares every new sentence not just to its neighbors, but also to the **Anchor** (the very first sentence of the chunk). If the meaning wanders too far from the starting topic, a split is forced.
+
+**Usage:**
+```ruby
+# Recommended drift_threshold is 0.75 to 0.80
+chunker = SemanticChunker::Chunker.new(drift_threshold: 0.75)
+```
+```bash
+semantic_chunker -t auto --drift 0.75 document.txt
+```
+
 ### Adapters
 
 The gem is designed to be extensible with different embedding providers. It currently ships with:
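The anchor comparison this README hunk describes can be sketched as two standalone helpers. The helper names (`cosine_similarity`, `drifted?`) and the toy 2-D vectors are illustrative, not part of the gem's API; only `Vector` from Ruby's stdlib `matrix` library matches the gem's actual implementation.

```ruby
# Sketch of the anchor-drift check: a chunk keeps the embedding of its
# first sentence as the "anchor"; a new sentence forces a split when its
# cosine similarity to that anchor falls below drift_threshold.
require 'matrix'

def cosine_similarity(v1, v2)
  denom = v1.magnitude * v2.magnitude
  return 0.0 if denom.zero?

  v1.inner_product(v2) / denom
end

def drifted?(anchor, candidate, drift_threshold)
  cosine_similarity(anchor, candidate) < drift_threshold
end

anchor    = Vector[1.0, 0.0]
on_topic  = Vector[0.9, 0.1]  # similarity ~0.99: stays in the chunk
off_topic = Vector[0.0, 1.0]  # similarity 0.0: forces a split

drifted?(anchor, on_topic, 0.75)   # => false
drifted?(anchor, off_topic, 0.75)  # => true
```

This is why the recommended range sits fairly high (0.75 to 0.80): cosine similarity between related sentences is usually well above it, so only genuine topic changes trip the check.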
@@ -280,6 +294,15 @@ $ OPENAI_API_KEY="your-key" bundle exec ruby test_integration.rb
 $ HUGGING_FACE_API_KEY="your-key" bundle exec ruby test_hugging_face.rb
 ```
 
+### Documentation
+
+This project uses YARD for documentation. To generate the documentation, run:
+```bash
+bundle exec rake yard
+```
+
+This will generate the documentation in the `docs` directory.
+
 ### Security Note: Handling API Keys
 
 When using an adapter that requires an API key, **never hardcode your API keys** directly into your source code. To keep your application secure (especially if you are working on public repositories), use one of the following methods:
@@ -413,12 +436,53 @@ The Hugging Face adapter is built for production-grade reliability:
 - **Auto-Wait**: Uses the `X-Wait-For-Model` header to ensure stable results on the Inference API.
 
 
-##
-
-
-
-
-
+### Roadmap to v1.0.0
+
+#### **v0.6.x: Stability & Core Logic**
+
+* [x] **Adaptive Dynamic Thresholding:** Core semantic splitting logic.
+
+* [x] **CLI with JSON output:** Global execution and piping support.
+
+* [x] **Robust Error Handling:** API retry logic and validation.
+
+* [x] **Anchor-Sentence Drift Protection:** Prevents topic bleed by comparing current sentences against the chunk's starting "anchor."
+
+* [ ] **Multiple Breakpoint Strategies:** Support for Percentile, StandardDeviation, and Interquartile range splitting.
+
+
+#### **v0.7.x: Performance & Efficiency**
+
+* [ ] **Local Embedding Cache:** SQLite or file-based cache to store sentence embeddings (saves money and speeds up repeated runs).
+
+* [ ] **Batch Processing:** Support for batching multiple sentences into a single API call to HuggingFace/OpenAI.
+
+* [ ] **Progress Indicators:** CLI progress bars for large document processing.
+
+
+#### **v0.8.x: RAG & Enterprise Features**
+
+* [ ] **Rich Metadata Support:** Return Chunk objects instead of raw strings, including source, index, and token counts.
+
+* [ ] **Contextual Overlap:** Support for "sliding window" overlap between chunks to preserve context.
+
+* [ ] **PII Sanitization Hook:** Integration point for masking sensitive data before it hits the provider API.
+
+
+#### **v0.9.x: Ecosystem & Adapters**
+
+* [ ] **Provider Expansion:** Add native adapters for Cohere and local Transformers (via the informers gem).
+
+* [ ] **Gem Documentation:** Full API documentation and a "Best Practices" guide for threshold tuning.
+
+* [ ] **LangChain.rb Integration:** Provide a standard interface for use within the Ruby AI ecosystem.
+
+
+#### **v1.0.0: Production Ready**
+
+* [ ] **Benchmark Suite:** Comparative performance tests against character-based chunkers.
+
+* [ ] **Stable Public API:** Finalizing the class structure for long-term compatibility.
 
 ## Contributing
 
data/bin/semantic_chunker
CHANGED

@@ -24,6 +24,9 @@ OptionParser.new do |opts|
     puts SemanticChunker::VERSION
     exit
   end
+  opts.on("-d", "--drift FLOAT", Float, "Anchor-sentence drift threshold (e.g., 0.75)") do |d|
+    options[:drift_threshold] = d
+  end
 end.parse!
 
 input_file = ARGV[0]
@@ -48,7 +51,8 @@ chunker = SemanticChunker::Chunker.new(
   embedding_provider: provider,
   threshold: options[:threshold],
   max_chunk_size: options[:max_size],
-  buffer_size: options[:buffer]
+  buffer_size: options[:buffer],
+  drift_threshold: options[:drift_threshold]
 )
 
 chunks = chunker.chunks_for(text)
data/lib/semantic_chunker/adapters/base.rb
CHANGED

@@ -1,7 +1,14 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/adapters/base.rb
 module SemanticChunker
   module Adapters
+    # The Base class for all adapters.
     class Base
+      # This method should be implemented by subclasses to generate embeddings for the given sentences.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @raise [NotImplementedError] If the method is not implemented by a subclass.
       def embed(sentences)
         raise NotImplementedError, "#{self.class} must implement #embed"
       end
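The `Base` contract shown in this hunk is all a custom provider has to satisfy. A minimal sketch, with the `Base` class reproduced so the snippet runs standalone; the `ConstantAdapter` subclass and its zero vectors are hypothetical, not shipped by the gem:

```ruby
module SemanticChunker
  module Adapters
    # Reproduced from the diff above: subclasses must implement #embed.
    class Base
      def embed(sentences)
        raise NotImplementedError, "#{self.class} must implement #embed"
      end
    end

    # Hypothetical adapter: returns the same 3-dimensional zero vector
    # for every sentence. Useful as a stub when no API is available.
    class ConstantAdapter < Base
      def embed(sentences)
        sentences.map { [0.0, 0.0, 0.0] }
      end
    end
  end
end

adapter = SemanticChunker::Adapters::ConstantAdapter.new
adapter.embed(%w[a b]).size  # => 2
```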
data/lib/semantic_chunker/adapters/hugging_face_adapter.rb
CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/adapters/hugging_face_adapter.rb
 require 'net/http'
 require 'json'
@@ -5,29 +7,40 @@ require 'uri'
 
 module SemanticChunker
   module Adapters
+    # The HuggingFaceAdapter class is responsible for fetching embeddings from the Hugging Face API.
     class HuggingFaceAdapter < Base
-
-
-
+      # The base URL for the Hugging Face API.
+      BASE_URL = 'https://router.huggingface.co/hf-inference/models/%<model>s'
+
+      # The maximum number of retries for transient errors.
       MAX_RETRIES = 3
+      # The initial backoff time in seconds for retries.
       INITIAL_BACKOFF = 2 # seconds
+      # The timeout for opening a connection in seconds.
       OPEN_TIMEOUT = 5 # seconds to open connection
+      # The timeout for reading the response in seconds.
       READ_TIMEOUT = 60 # seconds to wait for embeddings
 
+      # Initializes a new HuggingFaceAdapter.
+      #
+      # @param api_key [String] The Hugging Face API key.
+      # @param model [String] The name of the model to use, e.g. 'sentence-transformers/all-MiniLM-L6-v2' or 'BAAI/bge-small-en-v1.5'.
       def initialize(api_key:, model: 'intfloat/multilingual-e5-large')
         @api_key = api_key
         @model = model
-        # @model = 'sentence-transformers/all-MiniLM-L6-v2'
-        # @model = 'BAAI/bge-small-en-v1.5'
       end
 
+      # Fetches embeddings for the given sentences from the Hugging Face API.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @return [Array<Array<Float>>] An array of embeddings.
       def embed(sentences)
         retry_count = 0
-
+
         begin
           response = post_request(sentences)
           handle_response(response)
-        rescue => e
+        rescue StandardError => e
           if retryable?(e, retry_count)
             wait_time = INITIAL_BACKOFF * (2**retry_count)
             puts "HuggingFace: Transient error (#{e.message}). Retrying in #{wait_time}s..."
@@ -41,16 +54,20 @@ module SemanticChunker
 
       private
 
+      # Sends a POST request to the Hugging Face API.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @return [Net::HTTPResponse] The HTTP response.
       def post_request(sentences)
-        uri = URI(BASE_URL
+        uri = URI(format(BASE_URL, { model: @model }))
         request = Net::HTTP::Post.new(uri)
-
-        request[
-        request[
-        request[
-
+
+        request['Authorization'] = "Bearer #{@api_key}"
+        request['Content-Type'] = 'application/json'
+        request['X-Wait-For-Model'] = 'true' # Tells HF to wait for model load
+
         request.body = { inputs: sentences }.to_json
-
+
         Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
           http.open_timeout = OPEN_TIMEOUT
           http.read_timeout = READ_TIMEOUT
@@ -58,8 +75,13 @@ module SemanticChunker
         end
       end
 
+      # Handles the HTTP response from the Hugging Face API.
+      #
+      # @param response [Net::HTTPResponse] The HTTP response.
+      # @raise [StandardError] If the response is not successful.
+      # @return [Array<Array<Float>>] The parsed embeddings.
       def handle_response(response)
-        unless response.content_type ==
+        unless response.content_type == 'application/json'
           raise "HuggingFace Error: Expected JSON, got #{response.content_type}."
         end
 
@@ -67,21 +89,26 @@ module SemanticChunker
 
         if response.is_a?(Net::HTTPSuccess)
           parsed
-        elsif parsed.is_a?(Hash) && parsed[
+        elsif parsed.is_a?(Hash) && parsed['error']&.include?('loading')
           # This specifically triggers a retry for model warmups
-          raise
+          raise 'Model is still loading'
         else
           raise "HuggingFace API Error: #{parsed['error'] || response.body}"
         end
       end
 
+      # Checks if an error is retryable.
+      #
+      # @param error [StandardError] The error to check.
+      # @param count [Integer] The current retry count.
+      # @return [Boolean] True if the error is retryable, false otherwise.
       def retryable?(error, count)
         return false if count >= MAX_RETRIES
-
+
         # Retry on timeouts, loading errors, or 5xx server errors
-        error.message.include?(
-
-
+        error.message.include?('loading') ||
+          error.is_a?(Net::ReadTimeout) ||
+          error.is_a?(Net::OpenTimeout)
       end
     end
   end
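The retry schedule defined by the constants in this hunk is plain arithmetic and can be checked in isolation: `INITIAL_BACKOFF * (2**retry_count)` doubles the wait on each attempt, capped at `MAX_RETRIES` attempts.

```ruby
# Exponential backoff schedule from the HuggingFaceAdapter constants
# above (no network involved, just the wait-time arithmetic).
INITIAL_BACKOFF = 2 # seconds
MAX_RETRIES = 3

waits = (0...MAX_RETRIES).map { |retry_count| INITIAL_BACKOFF * (2**retry_count) }
waits  # => [2, 4, 8]
```

So a request that keeps hitting transient errors (model loading, timeouts) waits at most 2 + 4 + 8 = 14 seconds across its three retries before the error propagates.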
data/lib/semantic_chunker/adapters/openai_adapter.rb
CHANGED

@@ -1,18 +1,30 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/adapters/openai_adapter.rb
-require
-require
-require
+require 'net/http'
+require 'json'
+require 'uri'
 
 module SemanticChunker
   module Adapters
+    # The OpenAIAdapter class is responsible for fetching embeddings from the OpenAI API.
     class OpenAIAdapter < Base
-
+      # The endpoint for the OpenAI API.
+      ENDPOINT = 'https://api.openai.com/v1/embeddings'
 
-
+      # Initializes a new OpenAIAdapter.
+      #
+      # @param api_key [String] The OpenAI API key.
+      # @param model [String] The name of the model to use.
       def initialize(api_key:, model: 'text-embedding-3-small')
         @api_key = api_key
         @model = model
       end
 
+      # Fetches embeddings for the given sentences from the OpenAI API.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @return [Array<Array<Float>>] An array of embeddings.
       def embed(sentences)
         response = post_request(sentences)
         parsed = JSON.parse(response.body)
@@ -20,7 +32,7 @@ module SemanticChunker
         if response.is_a?(Net::HTTPSuccess)
           # OpenAI returns data in the same order as input
           # We extract just the embedding arrays
-          parsed[
+          parsed['data'].map { |entry| entry['embedding'] }
         else
           raise "OpenAI Error: #{parsed.dig('error', 'message') || response.code}"
         end
@@ -28,11 +40,15 @@ module SemanticChunker
 
       private
 
+      # Sends a POST request to the OpenAI API.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @return [Net::HTTPResponse] The HTTP response.
       def post_request(sentences)
         uri = URI(ENDPOINT)
         request = Net::HTTP::Post.new(uri)
-        request[
-        request[
+        request['Authorization'] = "Bearer #{@api_key}"
+        request['Content-Type'] = 'application/json'
         request.body = { input: sentences, model: @model }.to_json
 
         Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
data/lib/semantic_chunker/adapters/test_adapter.rb
CHANGED

@@ -1,14 +1,25 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/adapters/test_adapter.rb
 module SemanticChunker
   module Adapters
+    # The TestAdapter class is a dummy adapter for testing purposes.
+    # It returns predefined or random vectors.
     class TestAdapter < Base
-      #
+      # Initializes a new TestAdapter.
+      #
+      # @param predefined_vectors [Array<Array<Float>>, nil] A list of vectors to return.
+      #   If nil, random vectors will be generated.
       def initialize(predefined_vectors = nil)
         @predefined_vectors = predefined_vectors
       end
 
+      # Returns predefined or random embeddings for the given sentences.
+      #
+      # @param sentences [Array<String>] An array of sentences.
+      # @return [Array<Array<Float>>] An array of embeddings.
       def embed(sentences)
-        # If we have specific vectors, use them;
+        # If we have specific vectors, use them;
         # otherwise, return random vectors for each sentence
         @predefined_vectors || sentences.map { [rand, rand, rand] }
       end
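The two behaviors of the `TestAdapter` in this hunk (predefined vectors vs. random 3-D fallback) can be exercised directly. A minimal sketch with the class body reproduced from the diff so it runs standalone:

```ruby
# Reproduced from the TestAdapter diff above, minus the Base superclass
# so the snippet is self-contained.
class TestAdapter
  def initialize(predefined_vectors = nil)
    @predefined_vectors = predefined_vectors
  end

  def embed(sentences)
    # If we have specific vectors, use them;
    # otherwise, return random vectors for each sentence
    @predefined_vectors || sentences.map { [rand, rand, rand] }
  end
end

fixed = [[1.0, 0.0], [0.0, 1.0]]
TestAdapter.new(fixed).embed(%w[a b])  # => [[1.0, 0.0], [0.0, 1.0]]
TestAdapter.new.embed(['a']).first.size  # => 3
```

Passing predefined vectors makes chunking deterministic in specs; note that the predefined list is returned as-is, so it should match the number of sentences the chunker will embed.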
data/lib/semantic_chunker/chunker.rb
CHANGED

@@ -1,36 +1,56 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/chunker.rb
 require 'matrix'
 require 'pragmatic_segmenter'
 
 module SemanticChunker
+  # The Chunker class is responsible for splitting text into semantic chunks.
   class Chunker
+    # The default threshold for cosine similarity.
    DEFAULT_THRESHOLD = 0.82
+    # The default buffer size.
    DEFAULT_BUFFER = 1
+    # The default maximum size of a chunk in characters.
    DEFAULT_MAX_SIZE = 1500 # Characters
 
-
+    # Initializes a new Chunker.
+    #
+    # @param embedding_provider [Object] The provider for generating embeddings.
+    # @param threshold [Float, Symbol] The cosine similarity threshold or :auto.
+    # @param buffer_size [Integer, Symbol] The buffer size or :auto.
+    # @param max_chunk_size [Integer] The maximum size of a chunk in characters.
+    # @param drift_threshold [Float] The threshold to detect semantic drift from the beginning of a chunk.
+    # @param segmenter_options [Hash] Options for the PragmaticSegmenter.
    def initialize(embedding_provider: nil, threshold: DEFAULT_THRESHOLD, buffer_size: DEFAULT_BUFFER, max_chunk_size: DEFAULT_MAX_SIZE, drift_threshold: nil, segmenter_options: {})
      @provider = embedding_provider || SemanticChunker.configuration&.provider
      @threshold = threshold
      @buffer_size = buffer_size
      @max_chunk_size = max_chunk_size
+      @drift_threshold = validate_drift_threshold(drift_threshold || SemanticChunker.configuration&.drift_threshold)
      @segmenter_options = segmenter_options # e.g., { language: 'hy', doc_type: 'pdf' }
 
-      raise ArgumentError,
+      raise ArgumentError, 'A provider must be configured' if @provider.nil?
    end
 
+    # Splits the given text into semantic chunks.
+    #
+    # @param text [String] The text to chunk.
+    # @return [Array<String>] An array of semantic chunks.
    def chunks_for(text)
      return [] if text.nil? || text.strip.empty?
+
      sentences = split_sentences(text)
 
      # Step 1: Logic to determine the best buffer window
      effective_buffer = determine_buffer(sentences)
-
+
      # Step 2: Create overlapping "context groups" for more stable embeddings
      context_groups = build_context_groups(sentences, effective_buffer)
-
+
      # Step 3: Embed the groups, not the raw sentences
      group_embeddings = @provider.embed(context_groups)
-
+
      # Resolve the threshold dynamically if requested
      resolved_threshold = resolve_threshold(group_embeddings)
@@ -39,12 +59,25 @@ module SemanticChunker
 
    private
 
-
+    def validate_drift_threshold(val)
+      return nil if val.nil? # Keep it off by default
+
+      unless val.is_a?(Numeric) && val.between?(-1.0, 1.0)
+        raise ArgumentError, "drift_threshold must be a Numeric between -1.0 and 1.0 (received #{val.inspect})"
+      end
+
+      val
+    end
+
+    # Determines the buffer size based on the average sentence length.
+    #
+    # @param sentences [Array<String>] An array of sentences.
+    # @return [Integer] The buffer size.
    def determine_buffer(sentences)
      return @buffer_size unless @buffer_size == :auto
 
      avg_length = sentences.map(&:length).sum / sentences.size.to_f
-
+
      # Strategy: If sentences are very short (< 50 chars), we need more context.
      # If they are long (> 150 chars), they are likely self-contained.
      case avg_length
@@ -54,38 +87,66 @@ module SemanticChunker
      end
    end
 
+    # Builds overlapping context groups from sentences.
+    #
+    # @param sentences [Array<String>] An array of sentences.
+    # @param buffer [Integer] The buffer size.
+    # @return [Array<String>] An array of context groups.
    def build_context_groups(sentences, buffer)
      sentences.each_with_index.map do |_, i|
        start_idx = [0, i - buffer].max
        end_idx = [sentences.size - 1, i + buffer].min
-        sentences[start_idx..end_idx].join(
+        sentences[start_idx..end_idx].join(' ')
      end
    end
 
+    # Splits the given text into sentences.
+    #
+    # @param text [String] The text to split.
+    # @return [Array<String>] An array of sentences.
    def split_sentences(text)
      options = @segmenter_options.merge(text: text)
      ps = PragmaticSegmenter::Segmenter.new(**options)
      ps.segment
    end
 
+    # Calculates the semantic groups from sentences and embeddings.
+    #
+    # @param sentences [Array<String>] An array of sentences.
+    # @param embeddings [Array<Array<Float>>] An array of embeddings.
+    # @param resolved_threshold [Float] The cosine similarity threshold.
+    # @return [Array<String>] An array of semantic chunks.
    def calculate_groups(sentences, embeddings, resolved_threshold)
      chunks = []
      current_chunk_text = [sentences[0]]
-
+      # The Anchor is the first vector of the new chunk
+      anchor_vector = Vector[*embeddings[0]]
+      current_chunk_vectors = [anchor_vector]
 
      (1...sentences.size).each do |i|
        new_sentence = sentences[i]
        new_vec = Vector[*embeddings[i]]
-
+
+        # 1. Similarity to Centroid (Current behavior)
        centroid = current_chunk_vectors.inject(:+) / current_chunk_vectors.size.to_f
-
+        centroid_sim = cosine_similarity(centroid, new_vec)
+
+        # 2. Similarity to Anchor (New behavior)
+        # We only check this if @drift_threshold is configured
+        drifted = false
+        if @drift_threshold
+          anchor_sim = cosine_similarity(anchor_vector, new_vec)
+          drifted = anchor_sim < @drift_threshold
+        end
 
-        potential_size = current_chunk_text.join(" ").length + new_sentence.length + 1
-
-
-
+        potential_size = current_chunk_text.join(' ').length + new_sentence.length + 1
+
+        # Logic: Split if Centroid similarity is low OR it drifted from Anchor OR max size reached
+        if centroid_sim < resolved_threshold || drifted || potential_size > @max_chunk_size
+          chunks << current_chunk_text.join(' ')
          current_chunk_text = [new_sentence]
+          anchor_vector = new_vec # Reset Anchor for the new chunk
          current_chunk_vectors = [new_vec]
        else
          current_chunk_text << new_sentence
@@ -93,41 +154,53 @@ module SemanticChunker
        end
      end
 
-      chunks << current_chunk_text.join(
+      chunks << current_chunk_text.join(' ')
      chunks
    end
+
+    # Calculates the cosine similarity between two vectors.
+    #
+    # @param v1 [Vector, Array<Float>] The first vector.
+    # @param v2 [Vector, Array<Float>] The second vector.
+    # @return [Float] The cosine similarity.
    def cosine_similarity(v1, v2)
      # Ensure we are working with Vectors
      v1 = Vector[*v1] unless v1.is_a?(Vector)
      v2 = Vector[*v2] unless v2.is_a?(Vector)
-
+
      mag1 = v1.magnitude
      mag2 = v2.magnitude
-
+
      return 0.0 if mag1.zero? || mag2.zero?
+
      v1.inner_product(v2) / (mag1 * mag2)
    end
+
+    # Resolves the threshold dynamically based on the embeddings.
+    #
+    # @param embeddings [Array<Array<Float>>] An array of embeddings.
+    # @return [Float] The resolved threshold.
    def resolve_threshold(embeddings)
      return @threshold if @threshold.is_a?(Numeric)
      return DEFAULT_THRESHOLD if embeddings.size < 2
 
      similarities = []
      (0...embeddings.size - 1).each do |i|
-        # Note: We wrap them here, but ensure cosine_similarity
+        # Note: We wrap them here, but ensure cosine_similarity
        # doesn't re-wrap them if they are already Vectors.
        v1 = Vector[*embeddings[i]]
-        v2 = Vector[*embeddings[i+1]]
+        v2 = Vector[*embeddings[i + 1]]
        similarities << cosine_similarity(v1, v2)
      end
 
      return DEFAULT_THRESHOLD if similarities.empty?
 
      percentile_val = @threshold.is_a?(Hash) ? @threshold[:percentile] : 20
-
+
      # Use (size - 1) for the index to avoid "out of bounds" on small lists
      sorted_sims = similarities.sort
      index = ((sorted_sims.size - 1) * (percentile_val / 100.0)).round
-
+
      dynamic_val = sorted_sims[index]
 
      # Guardrail: Clamp to prevent hyper-splitting or never-splitting
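The centroid step in `calculate_groups` above is a plain vector average over the chunk's embeddings, done with Ruby's stdlib `matrix`. A standalone sketch of that arithmetic with toy 2-D vectors (the values are illustrative, not from the gem):

```ruby
# Average the chunk's vectors into a centroid, then compare a candidate
# sentence vector against it, exactly as calculate_groups does.
require 'matrix'

def cosine_similarity(v1, v2)
  mag1 = v1.magnitude
  mag2 = v2.magnitude
  return 0.0 if mag1.zero? || mag2.zero?

  v1.inner_product(v2) / (mag1 * mag2)
end

chunk_vectors = [Vector[1.0, 0.0], Vector[0.8, 0.2]]
centroid = chunk_vectors.inject(:+) / chunk_vectors.size.to_f
centroid  # => Vector[0.9, 0.1]

new_vec = Vector[0.0, 1.0]
# Low similarity to the centroid means the default threshold (0.82)
# would force a split here.
cosine_similarity(centroid, new_vec) < 0.82  # => true
```

Averaging `Vector` objects with `inject(:+)` and a float division is what keeps the centroid a `Vector`, so `cosine_similarity` never needs to re-wrap it.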
data/lib/semantic_chunker.rb
CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker.rb
 # 1. Require dependencies
 require 'matrix'
@@ -13,21 +15,36 @@ require_relative 'semantic_chunker/adapters/openai_adapter'
 require_relative 'semantic_chunker/adapters/test_adapter'
 require_relative 'semantic_chunker/chunker'
 require_relative 'semantic_chunker/adapters/hugging_face_adapter'
+
+# YARD documentation for the SemanticChunker module
+# This module provides an interface to chunk text semantically.
+#
+# @!attribute [rw] configuration
+#   @return [Configuration] The configuration object for the gem.
 module SemanticChunker
   class << self
     attr_accessor :configuration
   end
 
+  # Configures the SemanticChunker gem.
+  #
+  # @yield [configuration] The configuration object.
+  # @yieldparam configuration [Configuration] The configuration object.
   def self.configure
     self.configuration ||= Configuration.new
     yield(configuration)
   end
 
+  # The Configuration class for the SemanticChunker gem.
+  # This class holds the configuration for the gem.
   class Configuration
-
+    # @!attribute [rw] provider
+    #   @return [Symbol] The provider to use for semantic chunking.
    attr_accessor :provider, :drift_threshold
 
    def initialize
      @provider = nil # User must set this
+      @drift_threshold = nil
    end
  end
 end
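The `configure` block and `Configuration` class in this hunk give the gem a global default provider and drift threshold. A usage sketch, with the module reproduced minimally so it runs standalone; `:my_adapter` is a placeholder (normally an adapter instance, not a symbol):

```ruby
# Minimal reproduction of the configuration plumbing from the diff above.
module SemanticChunker
  class << self
    attr_accessor :configuration
  end

  def self.configure
    self.configuration ||= Configuration.new
    yield(configuration)
  end

  class Configuration
    attr_accessor :provider, :drift_threshold

    def initialize
      @provider = nil # User must set this
      @drift_threshold = nil
    end
  end
end

# Set gem-wide defaults once; Chunker.new falls back to these when no
# embedding_provider or drift_threshold argument is given.
SemanticChunker.configure do |config|
  config.provider = :my_adapter # placeholder for an adapter instance
  config.drift_threshold = 0.75
end

SemanticChunker.configuration.drift_threshold  # => 0.75
```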
metadata
CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: semantic_chunker
 version: !ruby/object:Gem::Version
-  version: 0.6.
+  version: 0.6.4
 platform: ruby
 authors:
 - Daniele Frisanco
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2026-01-
+date: 2026-01-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pragmatic_segmenter
@@ -94,6 +94,62 @@ dependencies:
     - - ">="
     - !ruby/object:Gem::Version
       version: '0'
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+- !ruby/object:Gem::Dependency
+  name: rubocop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+- !ruby/object:Gem::Dependency
+  name: rubocop-performance
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: dotenv
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.1'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.1'
 description: A powerful tool for RAG (Retrieval-Augmented Generation) that splits
   text into chunks based on semantic meaning rather than just character counts. Supports
   sliding windows, adaptive buffering, and dynamic percentile-based thresholding.
@@ -122,7 +178,7 @@ metadata:
   source_code_uri: https://github.com/danielefrisanco/semantic_chunker
   changelog_uri: https://github.com/danielefrisanco/semantic_chunker/blob/main/CHANGELOG.md
   bug_tracker_uri: https://github.com/danielefrisanco/semantic_chunker/issues
-  documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.
+  documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.4
   allowed_push_host: https://rubygems.org
 post_install_message:
 rdoc_options: []