semantic_chunker 0.6.3 → 0.6.4

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: 9eb71c3c0285ded52be28cc83f18c2590040bcac489bf07963134b9b877c7dd6
- data.tar.gz: 2cbb2ad4565519fd068b3a1eb6805e7989bd7d78886884b311faa6c4109193bc
+ metadata.gz: ec9d3bbf399db8e906944a579296b65cae0119b2690a6d267007b2444477494b
+ data.tar.gz: 1ac963e323fa5a4bbd9ce7c86e39627518e3dd8add656f11a304fc3ae6f1d61b
  SHA512:
- metadata.gz: 2210e23a05cc4ed601528f0c1b877d02c13d9b3da77486a0fa5845a799fd02815f21f9ecd9cd52bf1c4bcb87aff2c733914016e3efba9c1454df992ea0161fa7
- data.tar.gz: 2541b7fcb705b410444e109979a95cc202d660ec9febc0d1bd4e1116173b99a0d1b75a4a88c204a4e590d2a20e84888c3e75bd8174f43e18b2a19201098b756e
+ metadata.gz: 9be15aaeb5b4b55b6fec0a0c1c42ff659df19a9af969e70626f2d82dacfbadd10a0ad798e06e821c419dc554d29fc09b1c9308de38439942fef8d9e4731321ec
+ data.tar.gz: 525daebf70c85895c24d17f3d2ff718ea1874b0987c89e63e0813a61c9322692ee2a11ccf047ae2a5b46e54a8efabf7a7c4259e6dbeb0c8167cb9d22fe8e0612
data/CHANGELOG.md CHANGED
@@ -1,6 +1,20 @@
  # Changelog
 
  All notable changes to this project will be documented in this file.
+ ## [0.6.4] - 2026-01-15
+ ### Added
+ - **Anchor-Sentence Drift Protection**: New optional feature to prevent topic bleed.
+ - **CLI Drift Flag**: Added `-d` and `--drift` flags to the command-line interface.
+ - **Drift Validation**: Added input validation to ensure thresholds stay within semantic bounds (-1.0 to 1.0).
+
+ ### Fixed
+ - Improved internal centroid calculation efficiency for long chunks.
+
+ ## [0.6.3] - 2026-01-10
+ ### Added
+ - **YARD Documentation**: Added YARD documentation to all classes and methods in the `lib` directory.
+ - **Rake Task**: Added a `yard` rake task to generate documentation.
+
  [0.6.2] - 2026-01-07
  ----------------------
 
data/README.md CHANGED
@@ -240,6 +240,20 @@ You can set a hard limit on the character length of a chunk using `max_chunk_siz
  chunker = SemanticChunker::Chunker.new(max_chunk_size: 1000)
  ```
 
+ ### Anchor-Sentence Drift Protection (v0.6.4)
+ While standard semantic chunking looks at the "local" flow of sentences, long chunks can sometimes suffer from **Topic Drift** (the "Telephone Game" effect).
+
+ By setting a `drift_threshold`, the chunker compares every new sentence not just to its neighbors, but also to the **Anchor** (the very first sentence of the chunk). If the meaning wanders too far from the starting topic, a split is forced.
+
+ **Usage:**
+ ```ruby
+ # Recommended drift_threshold is 0.75 to 0.80
+ chunker = SemanticChunker::Chunker.new(drift_threshold: 0.75)
+ ```
+ ```bash
+ semantic_chunker -t auto --drift 0.75 document.txt
+ ```
+
  ### Adapters
 
  The gem is designed to be extensible with different embedding providers. It currently ships with:
@@ -280,6 +294,15 @@ $ OPENAI_API_KEY="your-key" bundle exec ruby test_integration.rb
  $ HUGGING_FACE_API_KEY="your-key" bundle exec ruby test_hugging_face.rb
  ```
 
+ ### Documentation
+
+ This project uses YARD for documentation. To generate the documentation, run:
+ ```bash
+ bundle exec rake yard
+ ```
+
+ This will generate the documentation in the `docs` directory.
+
  ### Security Note: Handling API Keys
 
  When using an adapter that requires an API key, **never hardcode your API keys** directly into your source code. To keep your application secure (especially if you are working on public repositories), use one of the following methods:
@@ -413,12 +436,38 @@ The Hugging Face adapter is built for production-grade reliability:
  - **Auto-Wait**: Uses the `X-Wait-For-Model` header to ensure stable results on the Inference API.
 
 
- ## 🚀 Roadmap to v1.0.0
- - [x] Adaptive Dynamic Thresholding
- - [x] CLI with JSON output
- - [x] Robust error handling and retries
- - [ ] **Next:** Local embedding cache (reduce API costs)
- - [ ] **Next:** Drift protection (Anchor-sentence comparison)
+ ## 🚀 Roadmap to v1.0.0
+
+ #### **v0.6.x: Stability & Core Logic**
+
+ * [x] **Adaptive Dynamic Thresholding:** Core semantic splitting logic.
+ * [x] **CLI with JSON output:** Global execution and piping support.
+ * [x] **Robust Error Handling:** API retry logic and validation.
+ * [x] **Anchor-Sentence Drift Protection:** Prevents topic bleed by comparing current sentences against the chunk's starting "anchor."
+ * [ ] **Multiple Breakpoint Strategies:** Support for Percentile, StandardDeviation, and Interquartile range splitting.
+
+ #### **v0.7.x: Performance & Efficiency**
+
+ * [ ] **Local Embedding Cache:** SQLite or file-based cache to store sentence embeddings (saves API costs and speeds up repeated runs).
+ * [ ] **Batch Processing:** Support for batching multiple sentences into a single API call to HuggingFace/OpenAI.
+ * [ ] **Progress Indicators:** CLI progress bars for large document processing.
+
+ #### **v0.8.x: RAG & Enterprise Features**
+
+ * [ ] **Rich Metadata Support:** Return Chunk objects instead of raw strings, including source, index, and token counts.
+ * [ ] **Contextual Overlap:** Support for "sliding window" overlap between chunks to preserve context.
+ * [ ] **PII Sanitization Hook:** Integration point for masking sensitive data before it hits the provider API.
+
+ #### **v0.9.x: Ecosystem & Adapters**
+
+ * [ ] **Provider Expansion:** Add native adapters for Cohere and local Transformers (via the informers gem).
+ * [ ] **Gem Documentation:** Full API documentation and a "Best Practices" guide for threshold tuning.
+ * [ ] **LangChain.rb Integration:** Provide a standard interface for use within the Ruby AI ecosystem.
+
+ #### **v1.0.0: Production Ready**
+
+ * [ ] **Benchmark Suite:** Comparative performance tests against character-based chunkers.
+ * [ ] **Stable Public API:** Finalizing the class structure for long-term compatibility.
 
  ## Contributing
 
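As an editor's aside for readers of this diff, the anchor-drift rule the roadmap and README describe can be sketched in plain Ruby with the stdlib `matrix` library. The method names here are illustrative, not the gem's API; the gem itself does this inside `Chunker#calculate_groups`.

```ruby
require 'matrix'

# Cosine similarity between two embedding vectors,
# guarding against zero-magnitude vectors as the gem does.
def cosine(v1, v2)
  denom = v1.magnitude * v2.magnitude
  return 0.0 if denom.zero?

  v1.inner_product(v2) / denom
end

# A sentence "drifts" when its similarity to the chunk's first
# (anchor) vector falls below the configured drift threshold.
def drifted?(anchor, new_vec, drift_threshold)
  cosine(anchor, new_vec) < drift_threshold
end

anchor    = Vector[1.0, 0.0]
on_topic  = Vector[0.9, 0.1]  # nearly parallel to the anchor
off_topic = Vector[0.0, 1.0]  # orthogonal to the anchor

drifted?(anchor, on_topic, 0.75)   # => false (stays in the chunk)
drifted?(anchor, off_topic, 0.75)  # => true  (forces a split)
```

Because cosine similarity ranges from -1.0 to 1.0, the validation added in 0.6.4 rejects thresholds outside those bounds.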
data/bin/semantic_chunker CHANGED
@@ -24,6 +24,9 @@ OptionParser.new do |opts|
  puts SemanticChunker::VERSION
  exit
  end
+ opts.on("-d", "--drift FLOAT", Float, "Anchor-sentence drift threshold (e.g., 0.75)") do |d|
+   options[:drift_threshold] = d
+ end
  end.parse!
 
  input_file = ARGV[0]
@@ -48,7 +51,8 @@ chunker = SemanticChunker::Chunker.new(
  embedding_provider: provider,
  threshold: options[:threshold],
  max_chunk_size: options[:max_size],
- buffer_size: options[:buffer]
+ buffer_size: options[:buffer],
+ drift_threshold: options[:drift_threshold]
  )
 
  chunks = chunker.chunks_for(text)
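The `Float` argument passed to `opts.on` above makes `OptionParser` coerce the flag's value before the block runs, so `options[:drift_threshold]` arrives as a number, not a string. A standalone sketch of that behavior (mirroring the new flag, outside the executable):

```ruby
require 'optparse'

options = {}
parser = OptionParser.new do |opts|
  # Same declaration as the bin/semantic_chunker addition:
  # OptionParser coerces the raw argument string to a Float.
  opts.on("-d", "--drift FLOAT", Float, "Anchor-sentence drift threshold (e.g., 0.75)") do |d|
    options[:drift_threshold] = d
  end
end

parser.parse!(["--drift", "0.75"])
options[:drift_threshold]  # => 0.75 (a Float, ready for the Chunker)
```

A non-numeric value such as `--drift abc` raises `OptionParser::InvalidArgument` before the Chunker's own range validation ever runs.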
data/lib/semantic_chunker/adapters/base.rb CHANGED
@@ -1,7 +1,14 @@
+ # frozen_string_literal: true
+
  # lib/semantic_chunker/adapters/base.rb
  module SemanticChunker
  module Adapters
+ # The Base class for all adapters.
  class Base
+ # Subclasses must implement this method to generate embeddings for the given sentences.
+ #
+ # @param sentences [Array<String>] An array of sentences to embed.
+ # @raise [NotImplementedError] If the method is not implemented by a subclass.
  def embed(sentences)
  raise NotImplementedError, "#{self.class} must implement #embed"
  end
data/lib/semantic_chunker/adapters/hugging_face_adapter.rb CHANGED
@@ -1,3 +1,5 @@
+ # frozen_string_literal: true
+
  # lib/semantic_chunker/adapters/hugging_face_adapter.rb
  require 'net/http'
  require 'json'
@@ -5,29 +7,40 @@ require 'uri'
 
  module SemanticChunker
  module Adapters
+ # The HuggingFaceAdapter class is responsible for fetching embeddings from the Hugging Face API.
  class HuggingFaceAdapter < Base
- BASE_URL = "https://router.huggingface.co/hf-inference/models/%{model}"
-
- # Configuration for reliability
+ # The base URL for the Hugging Face API.
+ BASE_URL = 'https://router.huggingface.co/hf-inference/models/%<model>s'
+
+ # The maximum number of retries for transient errors.
  MAX_RETRIES = 3
+ # The initial backoff time in seconds for retries.
  INITIAL_BACKOFF = 2 # seconds
+ # The timeout for opening a connection in seconds.
  OPEN_TIMEOUT = 5 # seconds to open connection
+ # The timeout for reading the response in seconds.
  READ_TIMEOUT = 60 # seconds to wait for embeddings
 
+ # Initializes a new HuggingFaceAdapter.
+ #
+ # @param api_key [String] The Hugging Face API key.
+ # @param model [String] The name of the model to use, e.g. 'sentence-transformers/all-MiniLM-L6-v2' or 'BAAI/bge-small-en-v1.5'.
  def initialize(api_key:, model: 'intfloat/multilingual-e5-large')
  @api_key = api_key
  @model = model
- # @model = 'sentence-transformers/all-MiniLM-L6-v2'
- # @model = 'BAAI/bge-small-en-v1.5'
  end
 
+ # Fetches embeddings for the given sentences from the Hugging Face API.
+ #
+ # @param sentences [Array<String>] An array of sentences to embed.
+ # @return [Array<Array<Float>>] An array of embeddings.
  def embed(sentences)
  retry_count = 0
-
+
  begin
  response = post_request(sentences)
  handle_response(response)
- rescue => e
+ rescue StandardError => e
  if retryable?(e, retry_count)
  wait_time = INITIAL_BACKOFF * (2**retry_count)
  puts "HuggingFace: Transient error (#{e.message}). Retrying in #{wait_time}s..."
@@ -41,16 +54,20 @@ module SemanticChunker
 
  private
 
+ # Sends a POST request to the Hugging Face API.
+ #
+ # @param sentences [Array<String>] An array of sentences to embed.
+ # @return [Net::HTTPResponse] The HTTP response.
  def post_request(sentences)
- uri = URI(BASE_URL % { model: @model })
+ uri = URI(format(BASE_URL, { model: @model }))
  request = Net::HTTP::Post.new(uri)
-
- request["Authorization"] = "Bearer #{@api_key}"
- request["Content-Type"] = "application/json"
- request["X-Wait-For-Model"] = "true" # Tells HF to wait for model load
-
+
+ request['Authorization'] = "Bearer #{@api_key}"
+ request['Content-Type'] = 'application/json'
+ request['X-Wait-For-Model'] = 'true' # Tells HF to wait for model load
+
  request.body = { inputs: sentences }.to_json
-
+
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
  http.open_timeout = OPEN_TIMEOUT
  http.read_timeout = READ_TIMEOUT
@@ -58,8 +75,13 @@ module SemanticChunker
  end
  end
 
+ # Handles the HTTP response from the Hugging Face API.
+ #
+ # @param response [Net::HTTPResponse] The HTTP response.
+ # @raise [StandardError] If the response is not successful.
+ # @return [Array<Array<Float>>] The parsed embeddings.
  def handle_response(response)
- unless response.content_type == "application/json"
+ unless response.content_type == 'application/json'
  raise "HuggingFace Error: Expected JSON, got #{response.content_type}."
  end
 
@@ -67,21 +89,26 @@ module SemanticChunker
 
  if response.is_a?(Net::HTTPSuccess)
  parsed
- elsif parsed.is_a?(Hash) && parsed["error"]&.include?("loading")
+ elsif parsed.is_a?(Hash) && parsed['error']&.include?('loading')
  # This specifically triggers a retry for model warmups
- raise "Model is still loading"
+ raise 'Model is still loading'
  else
  raise "HuggingFace API Error: #{parsed['error'] || response.body}"
  end
  end
 
+ # Checks if an error is retryable.
+ #
+ # @param error [StandardError] The error to check.
+ # @param count [Integer] The current retry count.
+ # @return [Boolean] True if the error is retryable, false otherwise.
  def retryable?(error, count)
  return false if count >= MAX_RETRIES
-
+
  # Retry on timeouts, loading errors, or 5xx server errors
- error.message.include?("loading") ||
- error.is_a?(Net::ReadTimeout) ||
- error.is_a?(Net::OpenTimeout)
+ error.message.include?('loading') ||
+ error.is_a?(Net::ReadTimeout) ||
+ error.is_a?(Net::OpenTimeout)
  end
  end
  end
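The adapter's retry path computes its delay as `INITIAL_BACKOFF * (2**retry_count)`, a standard exponential backoff. As a quick sanity check of the schedule those constants produce across the allowed retries:

```ruby
# Same constants the HuggingFaceAdapter defines.
INITIAL_BACKOFF = 2 # seconds
MAX_RETRIES = 3

# One wait per retry attempt: 2s, then 4s, then 8s.
waits = (0...MAX_RETRIES).map { |retry_count| INITIAL_BACKOFF * (2**retry_count) }
waits  # => [2, 4, 8]
```

So a request that keeps hitting transient errors spends at most 14 seconds sleeping before the error is re-raised to the caller.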
data/lib/semantic_chunker/adapters/openai_adapter.rb CHANGED
@@ -1,18 +1,30 @@
+ # frozen_string_literal: true
+
  # lib/semantic_chunker/adapters/openai_adapter.rb
- require "net/http"
- require "json"
- require "uri"
+ require 'net/http'
+ require 'json'
+ require 'uri'
 
  module SemanticChunker
  module Adapters
+ # The OpenAIAdapter class is responsible for fetching embeddings from the OpenAI API.
  class OpenAIAdapter < Base
- ENDPOINT = "https://api.openai.com/v1/embeddings"
+ # The endpoint for the OpenAI API.
+ ENDPOINT = 'https://api.openai.com/v1/embeddings'
 
- def initialize(api_key:, model: "text-embedding-3-small")
+ # Initializes a new OpenAIAdapter.
+ #
+ # @param api_key [String] The OpenAI API key.
+ # @param model [String] The name of the model to use.
+ def initialize(api_key:, model: 'text-embedding-3-small')
  @api_key = api_key
  @model = model
  end
 
+ # Fetches embeddings for the given sentences from the OpenAI API.
+ #
+ # @param sentences [Array<String>] An array of sentences to embed.
+ # @return [Array<Array<Float>>] An array of embeddings.
  def embed(sentences)
  response = post_request(sentences)
  parsed = JSON.parse(response.body)
@@ -20,7 +32,7 @@ module SemanticChunker
  if response.is_a?(Net::HTTPSuccess)
  # OpenAI returns data in the same order as input
  # We extract just the embedding arrays
- parsed["data"].map { |entry| entry["embedding"] }
+ parsed['data'].map { |entry| entry['embedding'] }
  else
  raise "OpenAI Error: #{parsed.dig('error', 'message') || response.code}"
  end
@@ -28,11 +40,15 @@ module SemanticChunker
 
  private
 
+ # Sends a POST request to the OpenAI API.
+ #
+ # @param sentences [Array<String>] An array of sentences to embed.
+ # @return [Net::HTTPResponse] The HTTP response.
  def post_request(sentences)
  uri = URI(ENDPOINT)
  request = Net::HTTP::Post.new(uri)
- request["Authorization"] = "Bearer #{@api_key}"
- request["Content-Type"] = "application/json"
+ request['Authorization'] = "Bearer #{@api_key}"
+ request['Content-Type'] = 'application/json'
  request.body = { input: sentences, model: @model }.to_json
 
  Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
data/lib/semantic_chunker/adapters/test_adapter.rb CHANGED
@@ -1,14 +1,25 @@
+ # frozen_string_literal: true
+
  # lib/semantic_chunker/adapters/test_adapter.rb
  module SemanticChunker
  module Adapters
+ # The TestAdapter class is a dummy adapter for testing purposes.
+ # It returns predefined or random vectors.
  class TestAdapter < Base
- # We can pass specific vectors to simulate "topics"
+ # Initializes a new TestAdapter.
+ #
+ # @param predefined_vectors [Array<Array<Float>>, nil] A list of vectors to return.
+ #   If nil, random vectors will be generated.
  def initialize(predefined_vectors = nil)
  @predefined_vectors = predefined_vectors
  end
 
+ # Returns predefined or random embeddings for the given sentences.
+ #
+ # @param sentences [Array<String>] An array of sentences.
+ # @return [Array<Array<Float>>] An array of embeddings.
  def embed(sentences)
- # If we have specific vectors, use them;
+ # If we have specific vectors, use them;
  # otherwise, return random vectors for each sentence
  @predefined_vectors || sentences.map { [rand, rand, rand] }
  end
data/lib/semantic_chunker/chunker.rb CHANGED
@@ -1,36 +1,56 @@
+ # frozen_string_literal: true
+
  # lib/semantic_chunker/chunker.rb
  require 'matrix'
  require 'pragmatic_segmenter'
 
  module SemanticChunker
+ # The Chunker class is responsible for splitting text into semantic chunks.
  class Chunker
+ # The default threshold for cosine similarity.
  DEFAULT_THRESHOLD = 0.82
+ # The default buffer size.
  DEFAULT_BUFFER = 1
+ # The default maximum size of a chunk in characters.
  DEFAULT_MAX_SIZE = 1500 # Characters
 
- def initialize(embedding_provider: nil, threshold: DEFAULT_THRESHOLD, buffer_size: DEFAULT_BUFFER, max_chunk_size: DEFAULT_MAX_SIZE, segmenter_options: {})
+ # Initializes a new Chunker.
+ #
+ # @param embedding_provider [Object] The provider for generating embeddings.
+ # @param threshold [Float, Symbol] The cosine similarity threshold or :auto.
+ # @param buffer_size [Integer, Symbol] The buffer size or :auto.
+ # @param max_chunk_size [Integer] The maximum size of a chunk in characters.
+ # @param drift_threshold [Float] The threshold used to detect semantic drift from the beginning of a chunk.
+ # @param segmenter_options [Hash] Options for the PragmaticSegmenter.
+ def initialize(embedding_provider: nil, threshold: DEFAULT_THRESHOLD, buffer_size: DEFAULT_BUFFER, max_chunk_size: DEFAULT_MAX_SIZE, drift_threshold: nil, segmenter_options: {})
  @provider = embedding_provider || SemanticChunker.configuration&.provider
  @threshold = threshold
  @buffer_size = buffer_size
  @max_chunk_size = max_chunk_size
+ @drift_threshold = validate_drift_threshold(drift_threshold || SemanticChunker.configuration&.drift_threshold)
  @segmenter_options = segmenter_options # e.g., { language: 'hy', doc_type: 'pdf' }
 
- raise ArgumentError, "A provider must be configured" if @provider.nil?
+ raise ArgumentError, 'A provider must be configured' if @provider.nil?
  end
 
+ # Splits the given text into semantic chunks.
+ #
+ # @param text [String] The text to chunk.
+ # @return [Array<String>] An array of semantic chunks.
  def chunks_for(text)
  return [] if text.nil? || text.strip.empty?
+
  sentences = split_sentences(text)
 
  # Step 1: Logic to determine the best buffer window
  effective_buffer = determine_buffer(sentences)
-
+
  # Step 2: Create overlapping "context groups" for more stable embeddings
  context_groups = build_context_groups(sentences, effective_buffer)
-
+
  # Step 3: Embed the groups, not the raw sentences
  group_embeddings = @provider.embed(context_groups)
-
+
  # Resolve the threshold dynamically if requested
  resolved_threshold = resolve_threshold(group_embeddings)
 
@@ -39,12 +59,25 @@ module SemanticChunker
 
  private
 
- # Selects buffer based on average sentence length if user passes :auto
+ def validate_drift_threshold(val)
+ return nil if val.nil? # Keep it off by default
+
+ unless val.is_a?(Numeric) && val.between?(-1.0, 1.0)
+ raise ArgumentError, "drift_threshold must be a Numeric between -1.0 and 1.0 (received #{val.inspect})"
+ end
+
+ val
+ end
+
+ # Determines the buffer size based on the average sentence length if the user passes :auto.
+ #
+ # @param sentences [Array<String>] An array of sentences.
+ # @return [Integer] The buffer size.
  def determine_buffer(sentences)
  return @buffer_size unless @buffer_size == :auto
 
  avg_length = sentences.map(&:length).sum / sentences.size.to_f
-
+
  # Strategy: If sentences are very short (< 50 chars), we need more context.
  # If they are long (> 150 chars), they are likely self-contained.
  case avg_length
@@ -54,38 +87,66 @@ module SemanticChunker
  end
  end
 
+ # Builds overlapping context groups from sentences.
+ #
+ # @param sentences [Array<String>] An array of sentences.
+ # @param buffer [Integer] The buffer size.
+ # @return [Array<String>] An array of context groups.
  def build_context_groups(sentences, buffer)
  sentences.each_with_index.map do |_, i|
  start_idx = [0, i - buffer].max
  end_idx = [sentences.size - 1, i + buffer].min
- sentences[start_idx..end_idx].join(" ")
+ sentences[start_idx..end_idx].join(' ')
  end
  end
 
+ # Splits the given text into sentences.
+ #
+ # @param text [String] The text to split.
+ # @return [Array<String>] An array of sentences.
  def split_sentences(text)
  options = @segmenter_options.merge(text: text)
  ps = PragmaticSegmenter::Segmenter.new(**options)
  ps.segment
  end
 
+ # Calculates the semantic groups from sentences and embeddings.
+ #
+ # @param sentences [Array<String>] An array of sentences.
+ # @param embeddings [Array<Array<Float>>] An array of embeddings.
+ # @param resolved_threshold [Float] The cosine similarity threshold.
+ # @return [Array<String>] An array of semantic chunks.
  def calculate_groups(sentences, embeddings, resolved_threshold)
  chunks = []
  current_chunk_text = [sentences[0]]
- current_chunk_vectors = [Vector[*embeddings[0]]]
+ # The Anchor is the first vector of the new chunk
+ anchor_vector = Vector[*embeddings[0]]
+ current_chunk_vectors = [anchor_vector]
 
  (1...sentences.size).each do |i|
  new_sentence = sentences[i]
  new_vec = Vector[*embeddings[i]]
-
+
+ # 1. Similarity to Centroid (current behavior)
  centroid = current_chunk_vectors.inject(:+) / current_chunk_vectors.size.to_f
- sim = cosine_similarity(centroid, new_vec)
+ centroid_sim = cosine_similarity(centroid, new_vec)
+
+ # 2. Similarity to Anchor (new behavior)
+ # We only check this if @drift_threshold is configured
+ drifted = false
+ if @drift_threshold
+ anchor_sim = cosine_similarity(anchor_vector, new_vec)
+ drifted = anchor_sim < @drift_threshold
+ end
 
- potential_size = current_chunk_text.join(" ").length + new_sentence.length + 1
-
- # Use the resolved_threshold instead of @threshold
- if sim < resolved_threshold || potential_size > @max_chunk_size
- chunks << current_chunk_text.join(" ")
+ potential_size = current_chunk_text.join(' ').length + new_sentence.length + 1
+
+ # Logic: Split if Centroid similarity is low OR it drifted from Anchor OR max size reached
+ if centroid_sim < resolved_threshold || drifted || potential_size > @max_chunk_size
+ chunks << current_chunk_text.join(' ')
  current_chunk_text = [new_sentence]
+ anchor_vector = new_vec # Reset Anchor for the new chunk
  current_chunk_vectors = [new_vec]
  else
  current_chunk_text << new_sentence
@@ -93,41 +154,53 @@ module SemanticChunker
  end
  end
 
- chunks << current_chunk_text.join(" ")
+ chunks << current_chunk_text.join(' ')
  chunks
  end
+
+ # Calculates the cosine similarity between two vectors.
+ #
+ # @param v1 [Vector, Array<Float>] The first vector.
+ # @param v2 [Vector, Array<Float>] The second vector.
+ # @return [Float] The cosine similarity.
  def cosine_similarity(v1, v2)
  # Ensure we are working with Vectors
  v1 = Vector[*v1] unless v1.is_a?(Vector)
  v2 = Vector[*v2] unless v2.is_a?(Vector)
-
+
  mag1 = v1.magnitude
  mag2 = v2.magnitude
-
+
  return 0.0 if mag1.zero? || mag2.zero?
+
  v1.inner_product(v2) / (mag1 * mag2)
  end
+
+ # Resolves the threshold dynamically based on the embeddings.
+ #
+ # @param embeddings [Array<Array<Float>>] An array of embeddings.
+ # @return [Float] The resolved threshold.
  def resolve_threshold(embeddings)
  return @threshold if @threshold.is_a?(Numeric)
  return DEFAULT_THRESHOLD if embeddings.size < 2
 
  similarities = []
  (0...embeddings.size - 1).each do |i|
- # Note: We wrap them here, but ensure cosine_similarity
+ # Note: We wrap them here, but ensure cosine_similarity
  # doesn't re-wrap them if they are already Vectors.
  v1 = Vector[*embeddings[i]]
- v2 = Vector[*embeddings[i+1]]
+ v2 = Vector[*embeddings[i + 1]]
  similarities << cosine_similarity(v1, v2)
  end
 
  return DEFAULT_THRESHOLD if similarities.empty?
 
  percentile_val = @threshold.is_a?(Hash) ? @threshold[:percentile] : 20
-
+
  # Use (size - 1) for the index to avoid "out of bounds" on small lists
  sorted_sims = similarities.sort
  index = ((sorted_sims.size - 1) * (percentile_val / 100.0)).round
-
+
  dynamic_val = sorted_sims[index]
 
  # Guardrail: Clamp to prevent hyper-splitting or never-splitting
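The percentile step in `resolve_threshold` is compact enough to check by hand. A standalone sketch (the helper name is illustrative, not part of the gem): sort the adjacent-pair similarities, then index at `(size - 1) * percentile / 100`, rounded, which is the clamped-index form the method uses with its default 20th percentile.

```ruby
# Illustrative re-statement of the percentile lookup in resolve_threshold.
def percentile_threshold(similarities, percentile = 20)
  sorted = similarities.sort
  # (size - 1) keeps the index in bounds even for tiny lists.
  index = ((sorted.size - 1) * (percentile / 100.0)).round
  sorted[index]
end

sims = [0.95, 0.60, 0.88, 0.91, 0.70, 0.85]
# sorted => [0.60, 0.70, 0.85, 0.88, 0.91, 0.95]; index = (5 * 0.2).round = 1
percentile_threshold(sims)  # => 0.7
```

Picking a low percentile means roughly the weakest 20% of sentence transitions become split points, which is then clamped by the guardrail below to avoid hyper-splitting or never-splitting.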
data/lib/semantic_chunker/version.rb CHANGED
@@ -1,3 +1,8 @@
+ # frozen_string_literal: true
+
+ # The SemanticChunker module namespace.
+ # This file defines the version of the gem.
  module SemanticChunker
- VERSION = "0.6.3"
+ # The current version of the gem
+ VERSION = "0.6.4"
  end
data/lib/semantic_chunker.rb CHANGED
@@ -1,3 +1,5 @@
+ # frozen_string_literal: true
+
  # lib/semantic_chunker.rb
  # 1. Require dependencies
  require 'matrix'
@@ -13,21 +15,36 @@ require_relative 'semantic_chunker/adapters/openai_adapter'
  require_relative 'semantic_chunker/adapters/test_adapter'
  require_relative 'semantic_chunker/chunker'
  require_relative 'semantic_chunker/adapters/hugging_face_adapter'
+
+ # The SemanticChunker module provides an interface to chunk text semantically.
+ #
+ # @!attribute [rw] configuration
+ #   @return [Configuration] The configuration object for the gem.
  module SemanticChunker
  class << self
  attr_accessor :configuration
  end
 
+ # Configures the SemanticChunker gem.
+ #
+ # @yield [configuration] The configuration object.
+ # @yieldparam configuration [Configuration] The configuration object.
  def self.configure
  self.configuration ||= Configuration.new
  yield(configuration)
  end
 
+ # The Configuration class holds the global settings for the gem.
  class Configuration
- attr_accessor :provider
+ # @!attribute [rw] provider
+ #   @return [Object] The embedding provider adapter to use.
+ # @!attribute [rw] drift_threshold
+ #   @return [Float, nil] The default anchor-drift threshold.
+ attr_accessor :provider, :drift_threshold
 
  def initialize
  @provider = nil # User must set this
+ @drift_threshold = nil
  end
  end
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: semantic_chunker
  version: !ruby/object:Gem::Version
- version: 0.6.3
+ version: 0.6.4
  platform: ruby
  authors:
  - Daniele Frisanco
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2026-01-08 00:00:00.000000000 Z
+ date: 2026-01-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: pragmatic_segmenter
@@ -94,6 +94,62 @@ dependencies:
  - - ">="
  - !ruby/object:Gem::Version
  version: '0'
+ - !ruby/object:Gem::Dependency
+ name: yard
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.9'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '0.9'
+ - !ruby/object:Gem::Dependency
+ name: rubocop
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '1.0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '1.0'
+ - !ruby/object:Gem::Dependency
+ name: rubocop-performance
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - ">="
+ - !ruby/object:Gem::Version
+ version: '0'
+ - !ruby/object:Gem::Dependency
+ name: dotenv
+ requirement: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '3.1'
+ type: :development
+ prerelease: false
+ version_requirements: !ruby/object:Gem::Requirement
+ requirements:
+ - - "~>"
+ - !ruby/object:Gem::Version
+ version: '3.1'
  description: A powerful tool for RAG (Retrieval-Augmented Generation) that splits
  text into chunks based on semantic meaning rather than just character counts. Supports
  sliding windows, adaptive buffering, and dynamic percentile-based thresholding.
@@ -122,7 +178,7 @@ metadata:
  source_code_uri: https://github.com/danielefrisanco/semantic_chunker
  changelog_uri: https://github.com/danielefrisanco/semantic_chunker/blob/main/CHANGELOG.md
  bug_tracker_uri: https://github.com/danielefrisanco/semantic_chunker/issues
- documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.3
+ documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.4
  allowed_push_host: https://rubygems.org
  post_install_message:
  rdoc_options: []