semantic_chunker 0.6.3 → 0.6.4
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +14 -0
- data/README.md +70 -6
- data/bin/semantic_chunker +5 -1
- data/lib/semantic_chunker/adapters/base.rb +7 -0
- data/lib/semantic_chunker/adapters/hugging_face_adapter.rb +48 -21
- data/lib/semantic_chunker/adapters/openai_adapter.rb +24 -8
- data/lib/semantic_chunker/adapters/test_adapter.rb +13 -2
- data/lib/semantic_chunker/chunker.rb +95 -22
- data/lib/semantic_chunker/version.rb +6 -1
- data/lib/semantic_chunker.rb +18 -1
- metadata +59 -3
checksums.yaml
CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: ec9d3bbf399db8e906944a579296b65cae0119b2690a6d267007b2444477494b
+  data.tar.gz: 1ac963e323fa5a4bbd9ce7c86e39627518e3dd8add656f11a304fc3ae6f1d61b
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9be15aaeb5b4b55b6fec0a0c1c42ff659df19a9af969e70626f2d82dacfbadd10a0ad798e06e821c419dc554d29fc09b1c9308de38439942fef8d9e4731321ec
+  data.tar.gz: 525daebf70c85895c24d17f3d2ff718ea1874b0987c89e63e0813a61c9322692ee2a11ccf047ae2a5b46e54a8efabf7a7c4259e6dbeb0c8167cb9d22fe8e0612
data/CHANGELOG.md
CHANGED

@@ -1,6 +1,20 @@
 # Changelog
 
 All notable changes to this project will be documented in this file.
+## [0.6.4] - 2026-01-15
+### Added
+- **Anchor-Sentence Drift Protection**: New optional feature to prevent topic bleed.
+- **CLI Drift Flag**: Added `-d` and `--drift` flags to the command-line interface.
+- **Drift Validation**: Added input validation to ensure thresholds stay within semantic bounds (-1.0 to 1.0).
+
+### Fixed
+- Improved internal centroid calculation efficiency for long chunks.
+
+## [0.6.3] - 2026-01-10
+### Added
+- **YARD Documentation**: Added YARD documentation to all classes and methods in the `lib` directory.
+- **Rake Task**: Added a `yard` rake task to generate documentation.
+
 [0.6.2] - 2026-01-07
 ----------------------
 
data/README.md
CHANGED

@@ -240,6 +240,20 @@ You can set a hard limit on the character length of a chunk using `max_chunk_siz
 chunker = SemanticChunker::Chunker.new(max_chunk_size: 1000)
 ```
 
+### Anchor-Sentence Drift Protection (v0.6.4)
+While standard semantic chunking looks at the "local" flow of sentences, long chunks can sometimes suffer from **Topic Drift** (the "Telephone Game" effect).
+
+By setting a `drift_threshold`, the chunker compares every new sentence not just to its neighbors, but also to the **Anchor** (the very first sentence of the chunk). If the meaning wanders too far from the starting topic, a split is forced.
+
+**Usage:**
+```ruby
+# Recommended drift_threshold is 0.75 to 0.80
+chunker = SemanticChunker::Chunker.new(drift_threshold: 0.75)
+```
+```bash
+semantic_chunker -t auto --drift 0.75 document.txt
+```
+
 ### Adapters
 
 The gem is designed to be extensible with different embedding providers. It currently ships with:
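The anchor comparison this README hunk describes can be sketched as two standalone helpers. The helper names (`cosine_similarity`, `drifted?`) and the toy 2-D vectors are illustrative, not part of the gem's API; only `Vector` from Ruby's stdlib `matrix` library matches the gem's actual implementation.

```ruby
# Sketch of the anchor-drift check: a chunk keeps the embedding of its
# first sentence as the "anchor"; a new sentence forces a split when its
# cosine similarity to that anchor falls below drift_threshold.
require 'matrix'

def cosine_similarity(v1, v2)
  denom = v1.magnitude * v2.magnitude
  return 0.0 if denom.zero?

  v1.inner_product(v2) / denom
end

def drifted?(anchor, candidate, drift_threshold)
  cosine_similarity(anchor, candidate) < drift_threshold
end

anchor    = Vector[1.0, 0.0]
on_topic  = Vector[0.9, 0.1]  # similarity ~0.99: stays in the chunk
off_topic = Vector[0.0, 1.0]  # similarity 0.0: forces a split

drifted?(anchor, on_topic, 0.75)   # => false
drifted?(anchor, off_topic, 0.75)  # => true
```

This is why the recommended range sits fairly high (0.75 to 0.80): cosine similarity between related sentences is usually well above it, so only genuine topic changes trip the check.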
@@ -280,6 +294,15 @@ $ OPENAI_API_KEY="your-key" bundle exec ruby test_integration.rb
 $ HUGGING_FACE_API_KEY="your-key" bundle exec ruby test_hugging_face.rb
 ```
 
+### Documentation
+
+This project uses YARD for documentation. To generate the documentation, run:
+```bash
+bundle exec rake yard
+```
+
+This will generate the documentation in the `docs` directory.
+
 ### Security Note: Handling API Keys
 
 When using an adapter that requires an API key, **never hardcode your API keys** directly into your source code. To keep your application secure (especially if you are working on public repositories), use one of the following methods:
@@ -413,12 +436,53 @@ The Hugging Face adapter is built for production-grade reliability:
 - **Auto-Wait**: Uses the `X-Wait-For-Model` header to ensure stable results on the Inference API.
 
 
-##
-
-
-
-
-
+### Roadmap to v1.0.0
+
+#### **v0.6.x: Stability & Core Logic**
+
+* [x] **Adaptive Dynamic Thresholding:** Core semantic splitting logic.
+
+* [x] **CLI with JSON output:** Global execution and piping support.
+
+* [x] **Robust Error Handling:** API retry logic and validation.
+
+* [x] **Anchor-Sentence Drift Protection:** Prevents topic bleed by comparing current sentences against the chunk's starting "anchor."
+
+* [ ] **Multiple Breakpoint Strategies:** Support for Percentile, StandardDeviation, and Interquartile range splitting.
+
+
+#### **v0.7.x: Performance & Efficiency**
+
+* [ ] **Local Embedding Cache:** SQLite or file-based cache to store sentence embeddings (saves money and speeds up repeated runs).
+
+* [ ] **Batch Processing:** Support for batching multiple sentences into a single API call to HuggingFace/OpenAI.
+
+* [ ] **Progress Indicators:** CLI progress bars for large document processing.
+
+
+#### **v0.8.x: RAG & Enterprise Features**
+
+* [ ] **Rich Metadata Support:** Return Chunk objects instead of raw strings, including source, index, and token counts.
+
+* [ ] **Contextual Overlap:** Support for "sliding window" overlap between chunks to preserve context.
+
+* [ ] **PII Sanitization Hook:** Integration point for masking sensitive data before it hits the provider API.
+
+
+#### **v0.9.x: Ecosystem & Adapters**
+
+* [ ] **Provider Expansion:** Add native adapters for Cohere and local Transformers (via the informers gem).
+
+* [ ] **Gem Documentation:** Full API documentation and a "Best Practices" guide for threshold tuning.
+
+* [ ] **LangChain.rb Integration:** Provide a standard interface for use within the Ruby AI ecosystem.
+
+
+#### **v1.0.0: Production Ready**
+
+* [ ] **Benchmark Suite:** Comparative performance tests against character-based chunkers.
+
+* [ ] **Stable Public API:** Finalizing the class structure for long-term compatibility.
 
 ## Contributing
 
data/bin/semantic_chunker
CHANGED

@@ -24,6 +24,9 @@ OptionParser.new do |opts|
     puts SemanticChunker::VERSION
     exit
   end
+  opts.on("-d", "--drift FLOAT", Float, "Anchor-sentence drift threshold (e.g., 0.75)") do |d|
+    options[:drift_threshold] = d
+  end
 end.parse!
 
 input_file = ARGV[0]
@@ -48,7 +51,8 @@ chunker = SemanticChunker::Chunker.new(
   embedding_provider: provider,
   threshold: options[:threshold],
   max_chunk_size: options[:max_size],
-  buffer_size: options[:buffer]
+  buffer_size: options[:buffer],
+  drift_threshold: options[:drift_threshold]
 )
 
 chunks = chunker.chunks_for(text)
data/lib/semantic_chunker/adapters/base.rb
CHANGED

@@ -1,7 +1,14 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/adapters/base.rb
 module SemanticChunker
   module Adapters
+    # The Base class for all adapters.
     class Base
+      # This method should be implemented by subclasses to generate embeddings for the given sentences.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @raise [NotImplementedError] If the method is not implemented by a subclass.
       def embed(sentences)
         raise NotImplementedError, "#{self.class} must implement #embed"
       end
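The `Base` contract shown in this hunk is all a custom provider has to satisfy. A minimal sketch, with the `Base` class reproduced so the snippet runs standalone; the `ConstantAdapter` subclass and its zero vectors are hypothetical, not shipped by the gem:

```ruby
module SemanticChunker
  module Adapters
    # Reproduced from the diff above: subclasses must implement #embed.
    class Base
      def embed(sentences)
        raise NotImplementedError, "#{self.class} must implement #embed"
      end
    end

    # Hypothetical adapter: returns the same 3-dimensional zero vector
    # for every sentence. Useful as a stub when no API is available.
    class ConstantAdapter < Base
      def embed(sentences)
        sentences.map { [0.0, 0.0, 0.0] }
      end
    end
  end
end

adapter = SemanticChunker::Adapters::ConstantAdapter.new
adapter.embed(%w[a b]).size  # => 2
```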
data/lib/semantic_chunker/adapters/hugging_face_adapter.rb
CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/adapters/hugging_face_adapter.rb
 require 'net/http'
 require 'json'
@@ -5,29 +7,40 @@ require 'uri'
 
 module SemanticChunker
   module Adapters
+    # The HuggingFaceAdapter class is responsible for fetching embeddings from the Hugging Face API.
     class HuggingFaceAdapter < Base
-
-
-
+      # The base URL for the Hugging Face API.
+      BASE_URL = 'https://router.huggingface.co/hf-inference/models/%<model>s'
+
+      # The maximum number of retries for transient errors.
       MAX_RETRIES = 3
+      # The initial backoff time in seconds for retries.
       INITIAL_BACKOFF = 2 # seconds
+      # The timeout for opening a connection in seconds.
       OPEN_TIMEOUT = 5 # seconds to open connection
+      # The timeout for reading the response in seconds.
       READ_TIMEOUT = 60 # seconds to wait for embeddings
 
+      # Initializes a new HuggingFaceAdapter.
+      #
+      # @param api_key [String] The Hugging Face API key.
+      # @param model [String] The name of the model to use, e.g. 'sentence-transformers/all-MiniLM-L6-v2' or 'BAAI/bge-small-en-v1.5'.
       def initialize(api_key:, model: 'intfloat/multilingual-e5-large')
         @api_key = api_key
         @model = model
-        # @model = 'sentence-transformers/all-MiniLM-L6-v2'
-        # @model = 'BAAI/bge-small-en-v1.5'
       end
 
+      # Fetches embeddings for the given sentences from the Hugging Face API.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @return [Array<Array<Float>>] An array of embeddings.
       def embed(sentences)
         retry_count = 0
-
+
         begin
           response = post_request(sentences)
           handle_response(response)
-        rescue => e
+        rescue StandardError => e
           if retryable?(e, retry_count)
             wait_time = INITIAL_BACKOFF * (2**retry_count)
             puts "HuggingFace: Transient error (#{e.message}). Retrying in #{wait_time}s..."
@@ -41,16 +54,20 @@ module SemanticChunker
 
       private
 
+      # Sends a POST request to the Hugging Face API.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @return [Net::HTTPResponse] The HTTP response.
       def post_request(sentences)
-        uri = URI(BASE_URL
+        uri = URI(format(BASE_URL, { model: @model }))
         request = Net::HTTP::Post.new(uri)
-
-        request[
-        request[
-        request[
-
+
+        request['Authorization'] = "Bearer #{@api_key}"
+        request['Content-Type'] = 'application/json'
+        request['X-Wait-For-Model'] = 'true' # Tells HF to wait for model load
+
         request.body = { inputs: sentences }.to_json
-
+
         Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
           http.open_timeout = OPEN_TIMEOUT
           http.read_timeout = READ_TIMEOUT
@@ -58,8 +75,13 @@ module SemanticChunker
         end
       end
 
+      # Handles the HTTP response from the Hugging Face API.
+      #
+      # @param response [Net::HTTPResponse] The HTTP response.
+      # @raise [StandardError] If the response is not successful.
+      # @return [Array<Array<Float>>] The parsed embeddings.
       def handle_response(response)
-        unless response.content_type ==
+        unless response.content_type == 'application/json'
           raise "HuggingFace Error: Expected JSON, got #{response.content_type}."
         end
 
@@ -67,21 +89,26 @@ module SemanticChunker
 
         if response.is_a?(Net::HTTPSuccess)
           parsed
-        elsif parsed.is_a?(Hash) && parsed[
+        elsif parsed.is_a?(Hash) && parsed['error']&.include?('loading')
           # This specifically triggers a retry for model warmups
-          raise
+          raise 'Model is still loading'
         else
           raise "HuggingFace API Error: #{parsed['error'] || response.body}"
         end
       end
 
+      # Checks if an error is retryable.
+      #
+      # @param error [StandardError] The error to check.
+      # @param count [Integer] The current retry count.
+      # @return [Boolean] True if the error is retryable, false otherwise.
       def retryable?(error, count)
         return false if count >= MAX_RETRIES
-
+
         # Retry on timeouts, loading errors, or 5xx server errors
-        error.message.include?(
-
-
+        error.message.include?('loading') ||
+          error.is_a?(Net::ReadTimeout) ||
+          error.is_a?(Net::OpenTimeout)
       end
     end
   end
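The retry schedule defined by the constants in this hunk is plain arithmetic and can be checked in isolation: `INITIAL_BACKOFF * (2**retry_count)` doubles the wait on each attempt, capped at `MAX_RETRIES` attempts.

```ruby
# Exponential backoff schedule from the HuggingFaceAdapter constants
# above (no network involved, just the wait-time arithmetic).
INITIAL_BACKOFF = 2 # seconds
MAX_RETRIES = 3

waits = (0...MAX_RETRIES).map { |retry_count| INITIAL_BACKOFF * (2**retry_count) }
waits  # => [2, 4, 8]
```

So a request that keeps hitting transient errors (model loading, timeouts) waits at most 2 + 4 + 8 = 14 seconds across its three retries before the error propagates.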
data/lib/semantic_chunker/adapters/openai_adapter.rb
CHANGED

@@ -1,18 +1,30 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/adapters/openai_adapter.rb
-require
-require
-require
+require 'net/http'
+require 'json'
+require 'uri'
 
 module SemanticChunker
   module Adapters
+    # The OpenAIAdapter class is responsible for fetching embeddings from the OpenAI API.
     class OpenAIAdapter < Base
-
+      # The endpoint for the OpenAI API.
+      ENDPOINT = 'https://api.openai.com/v1/embeddings'
 
-
+      # Initializes a new OpenAIAdapter.
+      #
+      # @param api_key [String] The OpenAI API key.
+      # @param model [String] The name of the model to use.
       def initialize(api_key:, model: 'text-embedding-3-small')
         @api_key = api_key
         @model = model
       end
 
+      # Fetches embeddings for the given sentences from the OpenAI API.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @return [Array<Array<Float>>] An array of embeddings.
       def embed(sentences)
         response = post_request(sentences)
         parsed = JSON.parse(response.body)
@@ -20,7 +32,7 @@ module SemanticChunker
         if response.is_a?(Net::HTTPSuccess)
           # OpenAI returns data in the same order as input
           # We extract just the embedding arrays
-          parsed[
+          parsed['data'].map { |entry| entry['embedding'] }
         else
           raise "OpenAI Error: #{parsed.dig('error', 'message') || response.code}"
         end
@@ -28,11 +40,15 @@ module SemanticChunker
 
       private
 
+      # Sends a POST request to the OpenAI API.
+      #
+      # @param sentences [Array<String>] An array of sentences to embed.
+      # @return [Net::HTTPResponse] The HTTP response.
       def post_request(sentences)
         uri = URI(ENDPOINT)
         request = Net::HTTP::Post.new(uri)
-        request[
-        request[
+        request['Authorization'] = "Bearer #{@api_key}"
+        request['Content-Type'] = 'application/json'
         request.body = { input: sentences, model: @model }.to_json
 
         Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
data/lib/semantic_chunker/adapters/test_adapter.rb
CHANGED

@@ -1,14 +1,25 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/adapters/test_adapter.rb
 module SemanticChunker
   module Adapters
+    # The TestAdapter class is a dummy adapter for testing purposes.
+    # It returns predefined or random vectors.
     class TestAdapter < Base
-      #
+      # Initializes a new TestAdapter.
+      #
+      # @param predefined_vectors [Array<Array<Float>>, nil] A list of vectors to return.
+      #   If nil, random vectors will be generated.
       def initialize(predefined_vectors = nil)
         @predefined_vectors = predefined_vectors
       end
 
+      # Returns predefined or random embeddings for the given sentences.
+      #
+      # @param sentences [Array<String>] An array of sentences.
+      # @return [Array<Array<Float>>] An array of embeddings.
       def embed(sentences)
-        # If we have specific vectors, use them;
+        # If we have specific vectors, use them;
         # otherwise, return random vectors for each sentence
         @predefined_vectors || sentences.map { [rand, rand, rand] }
       end
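The two behaviors of the `TestAdapter` in this hunk (predefined vectors vs. random 3-D fallback) can be exercised directly. A minimal sketch with the class body reproduced from the diff so it runs standalone:

```ruby
# Reproduced from the TestAdapter diff above, minus the Base superclass
# so the snippet is self-contained.
class TestAdapter
  def initialize(predefined_vectors = nil)
    @predefined_vectors = predefined_vectors
  end

  def embed(sentences)
    # If we have specific vectors, use them;
    # otherwise, return random vectors for each sentence
    @predefined_vectors || sentences.map { [rand, rand, rand] }
  end
end

fixed = [[1.0, 0.0], [0.0, 1.0]]
TestAdapter.new(fixed).embed(%w[a b])  # => [[1.0, 0.0], [0.0, 1.0]]
TestAdapter.new.embed(['a']).first.size  # => 3
```

Passing predefined vectors makes chunking deterministic in specs; note that the predefined list is returned as-is, so it should match the number of sentences the chunker will embed.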
data/lib/semantic_chunker/chunker.rb
CHANGED

@@ -1,36 +1,56 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker/chunker.rb
 require 'matrix'
 require 'pragmatic_segmenter'
 
 module SemanticChunker
+  # The Chunker class is responsible for splitting text into semantic chunks.
   class Chunker
+    # The default threshold for cosine similarity.
    DEFAULT_THRESHOLD = 0.82
+    # The default buffer size.
    DEFAULT_BUFFER = 1
+    # The default maximum size of a chunk in characters.
    DEFAULT_MAX_SIZE = 1500 # Characters
 
-
+    # Initializes a new Chunker.
+    #
+    # @param embedding_provider [Object] The provider for generating embeddings.
+    # @param threshold [Float, Symbol] The cosine similarity threshold or :auto.
+    # @param buffer_size [Integer, Symbol] The buffer size or :auto.
+    # @param max_chunk_size [Integer] The maximum size of a chunk in characters.
+    # @param drift_threshold [Float] The threshold to detect semantic drift from the beginning of a chunk.
+    # @param segmenter_options [Hash] Options for the PragmaticSegmenter.
    def initialize(embedding_provider: nil, threshold: DEFAULT_THRESHOLD, buffer_size: DEFAULT_BUFFER, max_chunk_size: DEFAULT_MAX_SIZE, drift_threshold: nil, segmenter_options: {})
      @provider = embedding_provider || SemanticChunker.configuration&.provider
      @threshold = threshold
      @buffer_size = buffer_size
      @max_chunk_size = max_chunk_size
+      @drift_threshold = validate_drift_threshold(drift_threshold || SemanticChunker.configuration&.drift_threshold)
      @segmenter_options = segmenter_options # e.g., { language: 'hy', doc_type: 'pdf' }
 
-      raise ArgumentError,
+      raise ArgumentError, 'A provider must be configured' if @provider.nil?
    end
 
+    # Splits the given text into semantic chunks.
+    #
+    # @param text [String] The text to chunk.
+    # @return [Array<String>] An array of semantic chunks.
    def chunks_for(text)
      return [] if text.nil? || text.strip.empty?
+
      sentences = split_sentences(text)
 
      # Step 1: Logic to determine the best buffer window
      effective_buffer = determine_buffer(sentences)
-
+
      # Step 2: Create overlapping "context groups" for more stable embeddings
      context_groups = build_context_groups(sentences, effective_buffer)
-
+
      # Step 3: Embed the groups, not the raw sentences
      group_embeddings = @provider.embed(context_groups)
-
+
      # Resolve the threshold dynamically if requested
      resolved_threshold = resolve_threshold(group_embeddings)
@@ -39,12 +59,25 @@ module SemanticChunker
 
    private
 
-
+    def validate_drift_threshold(val)
+      return nil if val.nil? # Keep it off by default
+
+      unless val.is_a?(Numeric) && val.between?(-1.0, 1.0)
+        raise ArgumentError, "drift_threshold must be a Numeric between -1.0 and 1.0 (received #{val.inspect})"
+      end
+
+      val
+    end
+
+    # Determines the buffer size based on the average sentence length.
+    #
+    # @param sentences [Array<String>] An array of sentences.
+    # @return [Integer] The buffer size.
    def determine_buffer(sentences)
      return @buffer_size unless @buffer_size == :auto
 
      avg_length = sentences.map(&:length).sum / sentences.size.to_f
-
+
      # Strategy: If sentences are very short (< 50 chars), we need more context.
      # If they are long (> 150 chars), they are likely self-contained.
      case avg_length
@@ -54,38 +87,66 @@ module SemanticChunker
      end
    end
 
+    # Builds overlapping context groups from sentences.
+    #
+    # @param sentences [Array<String>] An array of sentences.
+    # @param buffer [Integer] The buffer size.
+    # @return [Array<String>] An array of context groups.
    def build_context_groups(sentences, buffer)
      sentences.each_with_index.map do |_, i|
        start_idx = [0, i - buffer].max
        end_idx = [sentences.size - 1, i + buffer].min
-        sentences[start_idx..end_idx].join(
+        sentences[start_idx..end_idx].join(' ')
      end
    end
 
+    # Splits the given text into sentences.
+    #
+    # @param text [String] The text to split.
+    # @return [Array<String>] An array of sentences.
    def split_sentences(text)
      options = @segmenter_options.merge(text: text)
      ps = PragmaticSegmenter::Segmenter.new(**options)
      ps.segment
    end
 
+    # Calculates the semantic groups from sentences and embeddings.
+    #
+    # @param sentences [Array<String>] An array of sentences.
+    # @param embeddings [Array<Array<Float>>] An array of embeddings.
+    # @param resolved_threshold [Float] The cosine similarity threshold.
+    # @return [Array<String>] An array of semantic chunks.
    def calculate_groups(sentences, embeddings, resolved_threshold)
      chunks = []
      current_chunk_text = [sentences[0]]
-
+      # The Anchor is the first vector of the new chunk
+      anchor_vector = Vector[*embeddings[0]]
+      current_chunk_vectors = [anchor_vector]
 
      (1...sentences.size).each do |i|
        new_sentence = sentences[i]
        new_vec = Vector[*embeddings[i]]
-
+
+        # 1. Similarity to Centroid (Current behavior)
        centroid = current_chunk_vectors.inject(:+) / current_chunk_vectors.size.to_f
-
+        centroid_sim = cosine_similarity(centroid, new_vec)
+
+        # 2. Similarity to Anchor (New behavior)
+        # We only check this if @drift_threshold is configured
+        drifted = false
+        if @drift_threshold
+          anchor_sim = cosine_similarity(anchor_vector, new_vec)
+          drifted = anchor_sim < @drift_threshold
+        end
 
-        potential_size = current_chunk_text.join(" ").length + new_sentence.length + 1
-
-
-
+        potential_size = current_chunk_text.join(' ').length + new_sentence.length + 1
+
+        # Logic: Split if Centroid similarity is low OR it drifted from Anchor OR max size reached
+        if centroid_sim < resolved_threshold || drifted || potential_size > @max_chunk_size
+          chunks << current_chunk_text.join(' ')
          current_chunk_text = [new_sentence]
+          anchor_vector = new_vec # Reset Anchor for the new chunk
          current_chunk_vectors = [new_vec]
        else
          current_chunk_text << new_sentence
@@ -93,41 +154,53 @@ module SemanticChunker
        end
      end
 
-      chunks << current_chunk_text.join(
+      chunks << current_chunk_text.join(' ')
      chunks
    end
+
+    # Calculates the cosine similarity between two vectors.
+    #
+    # @param v1 [Vector, Array<Float>] The first vector.
+    # @param v2 [Vector, Array<Float>] The second vector.
+    # @return [Float] The cosine similarity.
    def cosine_similarity(v1, v2)
      # Ensure we are working with Vectors
      v1 = Vector[*v1] unless v1.is_a?(Vector)
      v2 = Vector[*v2] unless v2.is_a?(Vector)
-
+
      mag1 = v1.magnitude
      mag2 = v2.magnitude
-
+
      return 0.0 if mag1.zero? || mag2.zero?
+
      v1.inner_product(v2) / (mag1 * mag2)
    end
+
+    # Resolves the threshold dynamically based on the embeddings.
+    #
+    # @param embeddings [Array<Array<Float>>] An array of embeddings.
+    # @return [Float] The resolved threshold.
    def resolve_threshold(embeddings)
      return @threshold if @threshold.is_a?(Numeric)
      return DEFAULT_THRESHOLD if embeddings.size < 2
 
      similarities = []
      (0...embeddings.size - 1).each do |i|
-        # Note: We wrap them here, but ensure cosine_similarity
+        # Note: We wrap them here, but ensure cosine_similarity
        # doesn't re-wrap them if they are already Vectors.
        v1 = Vector[*embeddings[i]]
-        v2 = Vector[*embeddings[i+1]]
+        v2 = Vector[*embeddings[i + 1]]
        similarities << cosine_similarity(v1, v2)
      end
 
      return DEFAULT_THRESHOLD if similarities.empty?
 
      percentile_val = @threshold.is_a?(Hash) ? @threshold[:percentile] : 20
-
+
      # Use (size - 1) for the index to avoid "out of bounds" on small lists
      sorted_sims = similarities.sort
      index = ((sorted_sims.size - 1) * (percentile_val / 100.0)).round
-
+
      dynamic_val = sorted_sims[index]
 
      # Guardrail: Clamp to prevent hyper-splitting or never-splitting
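The centroid step in `calculate_groups` above is a plain vector average over the chunk's embeddings, done with Ruby's stdlib `matrix`. A standalone sketch of that arithmetic with toy 2-D vectors (the values are illustrative, not from the gem):

```ruby
# Average the chunk's vectors into a centroid, then compare a candidate
# sentence vector against it, exactly as calculate_groups does.
require 'matrix'

def cosine_similarity(v1, v2)
  mag1 = v1.magnitude
  mag2 = v2.magnitude
  return 0.0 if mag1.zero? || mag2.zero?

  v1.inner_product(v2) / (mag1 * mag2)
end

chunk_vectors = [Vector[1.0, 0.0], Vector[0.8, 0.2]]
centroid = chunk_vectors.inject(:+) / chunk_vectors.size.to_f
centroid  # => Vector[0.9, 0.1]

new_vec = Vector[0.0, 1.0]
# Low similarity to the centroid means the default threshold (0.82)
# would force a split here.
cosine_similarity(centroid, new_vec) < 0.82  # => true
```

Averaging `Vector` objects with `inject(:+)` and a float division is what keeps the centroid a `Vector`, so `cosine_similarity` never needs to re-wrap it.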
data/lib/semantic_chunker.rb
CHANGED

@@ -1,3 +1,5 @@
+# frozen_string_literal: true
+
 # lib/semantic_chunker.rb
 # 1. Require dependencies
 require 'matrix'
@@ -13,21 +15,36 @@ require_relative 'semantic_chunker/adapters/openai_adapter'
 require_relative 'semantic_chunker/adapters/test_adapter'
 require_relative 'semantic_chunker/chunker'
 require_relative 'semantic_chunker/adapters/hugging_face_adapter'
+
+# YARD documentation for the SemanticChunker module
+# This module provides an interface to chunk text semantically.
+#
+# @!attribute [rw] configuration
+#   @return [Configuration] The configuration object for the gem.
 module SemanticChunker
   class << self
     attr_accessor :configuration
   end
 
+  # Configures the SemanticChunker gem.
+  #
+  # @yield [configuration] The configuration object.
+  # @yieldparam configuration [Configuration] The configuration object.
   def self.configure
     self.configuration ||= Configuration.new
     yield(configuration)
   end
 
+  # The Configuration class for the SemanticChunker gem.
+  # This class holds the configuration for the gem.
   class Configuration
-
+    # @!attribute [rw] provider
+    #   @return [Symbol] The provider to use for semantic chunking.
    attr_accessor :provider, :drift_threshold
 
    def initialize
      @provider = nil # User must set this
+      @drift_threshold = nil
    end
  end
 end
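The `configure` block and `Configuration` class in this hunk give the gem a global default provider and drift threshold. A usage sketch, with the module reproduced minimally so it runs standalone; `:my_adapter` is a placeholder (normally an adapter instance, not a symbol):

```ruby
# Minimal reproduction of the configuration plumbing from the diff above.
module SemanticChunker
  class << self
    attr_accessor :configuration
  end

  def self.configure
    self.configuration ||= Configuration.new
    yield(configuration)
  end

  class Configuration
    attr_accessor :provider, :drift_threshold

    def initialize
      @provider = nil # User must set this
      @drift_threshold = nil
    end
  end
end

# Set gem-wide defaults once; Chunker.new falls back to these when no
# embedding_provider or drift_threshold argument is given.
SemanticChunker.configure do |config|
  config.provider = :my_adapter # placeholder for an adapter instance
  config.drift_threshold = 0.75
end

SemanticChunker.configuration.drift_threshold  # => 0.75
```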
metadata
CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: semantic_chunker
 version: !ruby/object:Gem::Version
-  version: 0.6.
+  version: 0.6.4
 platform: ruby
 authors:
 - Daniele Frisanco
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2026-01-
+date: 2026-01-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: pragmatic_segmenter
@@ -94,6 +94,62 @@ dependencies:
     - - ">="
     - !ruby/object:Gem::Version
       version: '0'
+- !ruby/object:Gem::Dependency
+  name: yard
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.9'
+- !ruby/object:Gem::Dependency
+  name: rubocop
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '1.0'
+- !ruby/object:Gem::Dependency
+  name: rubocop-performance
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - ">="
+      - !ruby/object:Gem::Version
+        version: '0'
+- !ruby/object:Gem::Dependency
+  name: dotenv
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.1'
+  type: :development
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '3.1'
 description: A powerful tool for RAG (Retrieval-Augmented Generation) that splits
   text into chunks based on semantic meaning rather than just character counts. Supports
   sliding windows, adaptive buffering, and dynamic percentile-based thresholding.
@@ -122,7 +178,7 @@ metadata:
   source_code_uri: https://github.com/danielefrisanco/semantic_chunker
   changelog_uri: https://github.com/danielefrisanco/semantic_chunker/blob/main/CHANGELOG.md
   bug_tracker_uri: https://github.com/danielefrisanco/semantic_chunker/issues
-  documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.
+  documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.4
   allowed_push_host: https://rubygems.org
 post_install_message:
 rdoc_options: []