semantic_chunker 0.5.3 → 0.6.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '0782ee5ae0f80488a985b3c12afad2eb95252ecd45849ed98a8446aef4dbfc66'
- data.tar.gz: 988a17459d404db90f460527d105e00579a44a442deabeba3c3e6462a4e440de
+ metadata.gz: 9eb71c3c0285ded52be28cc83f18c2590040bcac489bf07963134b9b877c7dd6
+ data.tar.gz: 2cbb2ad4565519fd068b3a1eb6805e7989bd7d78886884b311faa6c4109193bc
  SHA512:
- metadata.gz: 29b84d713798dabc248986ad2040da8bffe029c74dcba9e7f35aef783a8ac5b1b944c4ebdba1f1af50c79b78d92a9fb06c66c9f71a901a9c457032edd1508865
- data.tar.gz: 26a6d3ac345c0a6d88cbffc7877a2fd387293c75d81d2cb85771b2d8f66778645944c46608f9de405eb787a4c48546f6efe859790963c129d2dfdd1ded8dcd32
+ metadata.gz: 2210e23a05cc4ed601528f0c1b877d02c13d9b3da77486a0fa5845a799fd02815f21f9ecd9cd52bf1c4bcb87aff2c733914016e3efba9c1454df992ea0161fa7
+ data.tar.gz: 2541b7fcb705b410444e109979a95cc202d660ec9febc0d1bd4e1116173b99a0d1b75a4a88c204a4e590d2a20e84888c3e75bd8174f43e18b2a19201098b756e
data/CHANGELOG.md ADDED
@@ -0,0 +1,63 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ ## [0.6.2] - 2026-01-07
+
+ ### Added
+
+ * **Command Line Interface (CLI)**: Introduced `bin/semantic_chunker`, allowing users to chunk files or piped text directly from the terminal.
+
+ * **JSON Output**: Added a `--format json` flag to the CLI for easy integration with Python, Node.js, and other data pipelines.
+
+ * **Net::HTTP Timeouts**: Added `open_timeout` and `read_timeout` to the Hugging Face adapter to prevent application hangs during network instability.
+
+ * **Exponential Backoff**: Implemented a retry strategy for the Hugging Face API that waits progressively longer if the model is currently "loading" or "warming up."
+
+ * **Unit Testing Suite**: Established an RSpec test suite using **WebMock** to simulate API responses and verify retry/timeout logic without making real network calls.
+
+ ### Changed
+
+ * **Hugging Face Resilience**: Improved the adapter to handle transient 503 errors and "model cold start" scenarios more gracefully using the `X-Wait-For-Model` header.
+
+ * **CLI Development**: Added local load-path handling so the CLI can be run during development without the gem being installed globally.
+
+ ### Fixed
+
+ * **Unstable Network Hangs**: Fixed an issue where a slow response from the embedding provider could block the Ruby process indefinitely.
+
+ ## [0.6.0] - 2026-01-07
+
+ ### Added
+ - **Dynamic Thresholding**: Introduced model-agnostic splitting logic. The chunker now adapts to the specific "density" of a document's vector space.
+ - **Auto Mode**: Use `threshold: :auto` to automatically calculate the optimal split point based on the document's 15th percentile of similarity.
+ - **Percentile Mode**: Use `threshold: { percentile: 10 }` for fine-grained control over how sensitive the topic-shift detection should be.
+ - **Clamping Logic**: Added guardrails to dynamic thresholds (clamped between `0.3` and `0.95`) to prevent hyper-splitting in repetitive documents.
+
+ ### Fixed
+ - **Ruby 3.0 Compatibility**: Resolved CI/CD issues and Bundler version conflicts to ensure full support for Ruby 3.0.x.
+ - **Precision Indexing**: Improved percentile calculation using `round` logic to ensure accuracy in both short and long documents.
+
+ ### Summary of API Changes
+ The `threshold` parameter now accepts three types of input:
+
+ | Mode | Input | Best For... |
+ |------------|----------------------|----------------------------------------------------------------|
+ | **Static** | `0.82` (float) | Deterministic behavior with known models (e.g., OpenAI). |
+ | **Auto** | `:auto` | General purpose; handles E5/BGE/MiniLM models automatically. |
+ | **Percentile** | `{ percentile: 10 }` | Custom sensitivity; lower % = larger chunks, higher % = more splits. |
+
+ ---
+
+ ## [0.5.3] - 2025-10-08
+ ### Added
+ - **Pragmatic Segmenter Integration**: Replaced basic regex splitting with `pragmatic_segmenter` for multilingual and context-aware sentence boundary detection.
+ - **Language Support**: Added `segmenter_options` to allow users to specify document language (e.g., `hy`, `jp`, `en`) and type (e.g., `pdf`).
+
+ ## [0.2.0] - 2026-01-06
+ ### Added
+ - **Centroid Comparison:** Chunks now split based on the average semantic meaning of the entire current group rather than just the previous sentence.
+ - **Sliding Buffer Window:** Added `buffer_size` to enrich sentence embeddings with surrounding context.
+ - **Adaptive Buffering:** Introduced `:auto` mode for `buffer_size`.
+ - **Hard Size Limits:** Added `max_chunk_size` to force splits when a topic exceeds character limits.
data/README.md ADDED
@@ -0,0 +1,429 @@
+ # Semantic Chunker
+
+ [![Gem Version](https://badge.fury.io/rb/semantic_chunker.svg)](https://badge.fury.io/rb/semantic_chunker)
+
+ A Ruby gem for splitting long texts into semantically related chunks. This is useful for preparing text for language models, where you need to feed a model contextually relevant information.
+
+ ## What is Semantic Chunking?
+
+ Semantic chunking is a technique for splitting text based on meaning. Instead of splitting text by a fixed number of words or sentences, this gem groups sentences that are semantically related.
+
+ It works by:
+ 1. Splitting the text into individual sentences.
+ 2. Generating a vector embedding for each sentence using a configurable provider (e.g., OpenAI, Hugging Face).
+ 3. Comparing the new sentence's windowed embedding to the **centroid (average) of the current chunk's embeddings**.
+ 4. Starting a new chunk when the similarity between the new sentence and the chunk's centroid falls below a threshold, which prevents topic drift (sketched below).
+ 5. Enhancing the decision with a **buffer window**, which considers multiple sentences at a time to make more robust choices.
+
+ This results in chunks of text that are topically coherent.
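+
+ To make the split decision concrete, here is a minimal sketch of the centroid comparison in steps 3-4, using Ruby's `matrix` library. It illustrates the idea only; the gem's internal method names may differ:
+
+ ```ruby
+ require 'matrix'
+
+ # Embeddings of the sentences already in the current chunk (toy values)
+ chunk_vectors = [Vector[0.9, 0.1], Vector[0.8, 0.2]]
+ new_vector    = Vector[0.1, 0.9]
+
+ # Centroid: the element-wise average of the chunk's vectors
+ centroid = chunk_vectors.inject(:+) / chunk_vectors.size.to_f
+
+ # Cosine similarity between the centroid and the candidate sentence
+ similarity = centroid.inner_product(new_vector) / (centroid.magnitude * new_vector.magnitude)
+
+ threshold = 0.82
+ puts similarity < threshold ? "Start a new chunk" : "Keep grouping"
+ ```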
+
+ ## Compatibility
+
+ This gem requires Ruby 3.0 or higher.
+
+ ## Installation
+
+ This gem relies on two key dependencies for its logic:
+
+ 1. **`matrix`**: Used for vector calculations and centroid math.
+
+ 2. **`pragmatic_segmenter`**: Used for rule-based sentence boundary detection (handling abbreviations, initials, and citations).
+
+ Add these lines to your application's Gemfile:
+
+ ```ruby
+ # Required for Ruby 3.1+
+ gem 'matrix'
+
+ # Required for high-quality sentence splitting
+ gem 'pragmatic_segmenter'
+
+ gem 'semantic_chunker'
+ ```
+
+ And then execute:
+
+     $ bundle install
+
+ Or install it yourself as:
+
+     $ gem install semantic_chunker
+
+ ## Usage
+
+ Here is a basic example of how to use `semantic_chunker`:
+
+ ```ruby
+ require 'semantic_chunker'
+
+ # 1. Configure the provider.
+ # You can configure the provider globally; this is useful
+ # in a Rails initializer, for example.
+ SemanticChunker.configure do |config|
+   config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
+     api_key: ENV.fetch("HUGGING_FACE_API_KEY"),
+     model: "sentence-transformers/all-MiniLM-L6-v2"
+   )
+ end
+
+ # 2. Create a chunker and process your text
+ chunker = SemanticChunker::Chunker.new(
+   threshold: 0.8,
+   buffer_size: :auto,
+   max_chunk_size: 1000
+ )
+ text = "Your very long document text goes here. It can contain multiple paragraphs and topics. The chunker will split it into meaningful parts."
+ chunks = chunker.chunks_for(text)
+
+ # chunks is an array of strings that preserve the
+ # original formatting and whitespace.
+ chunks.each_with_index do |chunk, i|
+   puts "Chunk #{i + 1}:"
+   puts chunk
+   puts "---"
+ end
+ ```
+
+ ## Rails Integration
+
+ For Rails applications, here is a recommended setup:
+
+ ### 1. Initializer
+
+ Create an initializer to configure the gem globally. This is where you should set up your embedding provider using Rails credentials.
+
+ ```ruby
+ # config/initializers/semantic_chunker.rb
+ SemanticChunker.configure do |config|
+   config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
+     api_key: Rails.application.credentials.dig(:hugging_face, :api_key),
+     model: "sentence-transformers/all-MiniLM-L6-v2"
+   )
+ end
+ ```
+
+ ### 2. Model Usage
+
+ You can use the chunker within your models, for example, to chunk a document's content before saving or for indexing in a search engine.
+
+ ```ruby
+ # app/models/document.rb
+ class Document < ApplicationRecord
+   def semantic_chunks
+     chunker = SemanticChunker::Chunker.new
+     chunker.chunks_for(self.content)
+   end
+ end
+ ```
+
+ ### 3. Caching
+
+ To avoid re-embedding the same content, which can be slow and costly, consider implementing a caching strategy. You can cache the embeddings or the final chunks. Here is a simple example using `Rails.cache`:
+
+ ```ruby
+ # app/models/document.rb
+ class Document < ApplicationRecord
+   def semantic_chunks
+     Rails.cache.fetch("document_#{self.id}_chunks", expires_in: 12.hours) do
+       chunker = SemanticChunker::Chunker.new
+       chunker.chunks_for(self.content)
+     end
+   end
+ end
+ ```
+
+ ## Configuration
+
+ ### Sentence Splitting (Pragmatic Segmenter)
+
+ This gem uses `pragmatic_segmenter` for high-quality sentence splitting. You can pass options directly to it using the `segmenter_options` hash during chunker initialization. This is useful for handling different languages or document types.
+
+ The following options are available:
+ - `language`: Specifies the language of the text (e.g., `'en'` for English, `'hy'` for Armenian).
+ - `doc_type`: Optimizes segmentation for specific document formats (e.g., `'pdf'`).
+ - `clean`: When `false`, disables the preliminary text cleaning process.
+
+ **Examples:**
+
+ ```ruby
+ # Example 1: Processing an Armenian PDF
+ chunker = SemanticChunker::Chunker.new(
+   segmenter_options: { language: 'hy', doc_type: 'pdf' }
+ )
+
+ # Example 2: Disabling text cleaning to keep the raw input intact
+ chunker = SemanticChunker::Chunker.new(
+   segmenter_options: { clean: false }
+ )
+ ```
+
+ ### Global Configuration
+
+ You can configure the embedding provider globally, which is useful in frameworks like Rails.
+
+ ```ruby
+ # config/initializers/semantic_chunker.rb
+ SemanticChunker.configure do |config|
+   config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
+     api_key: ENV.fetch("HUGGING_FACE_API_KEY"),
+     model: "sentence-transformers/all-MiniLM-L6-v2"
+   )
+ end
+ ```
+
+ ### Per-instance Configuration
+
+ You can also pass a provider directly to the `Chunker` instance. This will override any global configuration.
+
+ ```ruby
+ provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(api_key: "your-key")
+ chunker = SemanticChunker::Chunker.new(embedding_provider: provider)
+ ```
+
+ ### Threshold
+
+ You can configure the similarity threshold. The default static value is `0.82`; since v0.6.0 the threshold can also be resolved dynamically (see below).
+
+ > **Note:** The default value is optimized for the `sentence-transformers/all-MiniLM-L6-v2` model. You may need to adjust this value significantly for other models, especially those with different embedding dimensions (e.g., OpenAI's `text-embedding-3-large`).
+
+ 1. A higher threshold (e.g., `0.95`) requires very high similarity to keep sentences together, resulting in more, smaller chunks.
+
+ 2. A lower threshold (e.g., `0.50`) is more forgiving, resulting in fewer, larger chunks.
+
+ ```ruby
+ # Lower threshold, fewer chunks
+ chunker = SemanticChunker::Chunker.new(threshold: 0.7)
+
+ # Higher threshold, more chunks
+ chunker = SemanticChunker::Chunker.new(threshold: 0.9)
+ ```
+
+ ### Dynamic Thresholding (v0.6.0)
+
+ With the introduction of **Dynamic Thresholding**, SemanticChunker is now model-agnostic. It automatically adapts to the vector density of different embedding models (e.g., OpenAI, E5, BGE, or Hugging Face).
+
+ ### Threshold Modes
+
+ | Mode | Syntax | Description |
+ | - | - | - |
+ | Static | `0.82` | Splits when similarity drops below a fixed number. Use this if you have a specific model tuned to a known threshold. |
+ | Auto | `:auto` | (Default) Calculates the 15th percentile of similarities in the document and splits at the "valleys." |
+ | Percentile | `{ percentile: 10 }` | Advanced control. A lower percentile creates fewer, larger chunks; a higher percentile creates more, smaller chunks. |
+
+ ### Which one should I use?
+
+ * **Use `:auto`** if you are swapping models frequently or using open-source models from Hugging Face. It prevents the "one giant chunk" bug that happens when models have low similarity ranges.
+
+ * **Use a static number** if you require strictly deterministic behavior across different documents and know your model's distribution. All three modes are shown side by side below.
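+
+ The three modes side by side, using the constructor arguments documented above:
+
+ ```ruby
+ # Static: split below a fixed similarity score
+ chunker = SemanticChunker::Chunker.new(threshold: 0.82)
+
+ # Auto: derive the split point from the document's own similarity distribution
+ chunker = SemanticChunker::Chunker.new(threshold: :auto)
+
+ # Percentile: tune sensitivity yourself (a lower value yields fewer, larger chunks)
+ chunker = SemanticChunker::Chunker.new(threshold: { percentile: 10 })
+ ```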
+
+ ### Buffer Windows (Buffer Size)
+
+ The `buffer_size` parameter defines a sliding "context window." Instead of embedding a single sentence in isolation, the chunker combines a sentence with its neighbors. This "semantic smoothing" prevents false splits caused by short sentences or pronouns (like "He" or "It") that lack context. A simplified sketch of the windowing follows the example below.
+
+ * **0**: No buffer. Each sentence is embedded exactly as written. Best for very long, self-contained paragraphs.
+ * **1 (Default)**: Looks 1 sentence back and 1 sentence forward. For sentence $i$, the embedding represents $S_{i-1} + S_i + S_{i+1}$.
+ * **2**: Looks 2 sentences back and 2 forward. This creates a large 5-sentence context for every comparison.
+ * **`:auto`**: The chunker analyzes the density of your text and automatically selects the best window:
+   * **Short sentences** (avg < 60 chars): Uses `buffer_size: 2` (captures conversational flow).
+   * **Medium sentences** (avg 60–150 chars): Uses `buffer_size: 1` (standard).
+   * **Long sentences** (avg > 150 chars): Uses `buffer_size: 0` (high precision).
+
+ ```ruby
+ chunker = SemanticChunker::Chunker.new(buffer_size: :auto)
+ ```
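+
+ For intuition, this is roughly how a window of size 1 groups sentences before they are embedded. This is a simplified sketch of the idea, not the gem's internal implementation:
+
+ ```ruby
+ sentences = ["Ruby is elegant.", "It is dynamic.", "Cats purr loudly."]
+ buffer_size = 1
+
+ groups = sentences.each_index.map do |i|
+   from = [i - buffer_size, 0].max
+   to   = [i + buffer_size, sentences.size - 1].min
+   sentences[from..to].join(" ")
+ end
+ # => ["Ruby is elegant. It is dynamic.",
+ #     "Ruby is elegant. It is dynamic. Cats purr loudly.",
+ #     "It is dynamic. Cats purr loudly."]
+ ```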
+
+ ### Max Chunk Size
+
+ You can set a hard limit on the character length of a chunk using `max_chunk_size`. This is useful for ensuring chunks do not exceed the context window of a language model. When the limit is reached, a split is forced even if the sentences are semantically related. The default is `1500`.
+
+ ```ruby
+ chunker = SemanticChunker::Chunker.new(max_chunk_size: 1000)
+ ```
+
+ ### Adapters
+
+ The gem is designed to be extensible with different embedding providers. It currently ships with:
+
+ - `SemanticChunker::Adapters::OpenAIAdapter`: For OpenAI's embedding models.
+ - `SemanticChunker::Adapters::HuggingFaceAdapter`: For Hugging Face's embedding models.
+ - `SemanticChunker::Adapters::TestAdapter`: A simple adapter for testing purposes.
+
+ You can create your own adapter by writing a class that inherits from `SemanticChunker::Adapters::Base` and implements an `embed(sentences)` method.
+
+ The `embed` method must return an `Array` of `Array`s, where each inner array is an embedding (a list of floats). The `Chunker` will automatically handle the conversion of these arrays into `Vector` objects for similarity calculations.
+
+ For consistency, it's recommended to place your custom adapter class within the `SemanticChunker::Adapters` namespace, although this is not a strict requirement.
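+
+ Here is a minimal sketch of a custom adapter. The class name and the toy embedding logic are hypothetical; only the `Base` superclass and the `embed(sentences)` contract come from the documentation above:
+
+ ```ruby
+ module SemanticChunker
+   module Adapters
+     # Hypothetical adapter returning toy embeddings.
+     # A real adapter would call an embedding API or a local model here.
+     class MyToyAdapter < Base
+       def embed(sentences)
+         # Must return one Array of Floats per input sentence.
+         sentences.map do |sentence|
+           [sentence.length.to_f, sentence.count("aeiou").to_f, 1.0]
+         end
+       end
+     end
+   end
+ end
+
+ chunker = SemanticChunker::Chunker.new(
+   embedding_provider: SemanticChunker::Adapters::MyToyAdapter.new
+ )
+ ```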
256
+
257
+ ## Development & Testing
258
+
259
+ To run the tests, you'll need to install the development dependencies:
260
+
261
+ $ bundle install
262
+
263
+ ### Unit Tests
264
+
265
+ Run the unit tests with:
266
+
267
+ $ bundle exec rspec
268
+
269
+ ### Integration Tests
270
+
271
+ The integration tests use third-party APIs and require API keys.
272
+
273
+ **OpenAI**
274
+ ```bash
275
+ $ OPENAI_API_KEY="your-key" bundle exec ruby test_integration.rb
276
+ ```
277
+
278
+ **Hugging Face**
279
+ ```bash
280
+ $ HUGGING_FACE_API_KEY="your-key" bundle exec ruby test_hugging_face.rb
281
+ ```
282
+
283
+ ### Security Note: Handling API Keys
284
+
285
+ When using an adapter that requires an API key, **never hardcode your API keys** directly into your source code. To keep your application secure (especially if you are working on public repositories), use one of the following methods:
286
+
287
+ #### Using Rails Credentials (Recommended for Rails)
288
+
289
+ Store your key in your encrypted credentials file:
290
+ ```bash
291
+ bin/rails credentials:edit
292
+ ```
293
+
294
+ Then reference it in your initializer:
295
+
296
+ ```ruby
297
+ SemanticChunker.configure do |config|
298
+ config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
299
+ api_key: Rails.application.credentials.dig(:hugging_face, :api_key)
300
+ )
301
+ end
302
+ ```
303
+
304
+
305
+ #### Using Environment Variables
306
+
307
+ Alternatively, use a gem like dotenv and fetch the key from the environment:
308
+
309
+ ```ruby
310
+ api_key = ENV.fetch("YOUR_API_KEY") { raise "Missing API Key" }
311
+ ```
312
+
313
+
+ ## Troubleshooting
+
+ ### Matrix Dependency (Ruby 3.1+)
+
+ Since Ruby 3.1, the `matrix` library has been distributed as a bundled gem rather than as part of the standard library.
+
+ * **If you are on Ruby 3.1, 3.2, or 3.3:** You must include `gem 'matrix'` in your Gemfile.
+
+ * **If you are on Ruby 3.0:** The library is built in. If you see a "duplicate dependency" error, ensure you are not manually adding `gem 'matrix'` to your Gemfile, as the system version will take precedence.
+
+ ### Hugging Face "Model Loading"
+
+ If you receive a `503 Service Unavailable` error when using the Hugging Face adapter, it usually means the model is being loaded onto the server for the first time.
+
+ * **Solution:** Wait 30 seconds and try again. The `HuggingFaceAdapter` is designed to be lightweight, but serverless endpoints require a "warm-up" period; as of v0.6.2 the adapter also retries automatically with exponential backoff.
+
+ ### Encoding Issues
+
+ If your text contains complex Unicode or non-UTF-8 characters, `pragmatic_segmenter` may behave unexpectedly.
+
+ * **Solution:** Ensure your input string is UTF-8 encoded: `text.encode('UTF-8', invalid: :replace, undef: :replace)`.
+
+ ## Command Line Interface (CLI)
+
+ SemanticChunker includes a powerful CLI that allows you to chunk files or piped text directly from your terminal. This is ideal for quick testing or integrating with non-Ruby applications.
+
+ ### Installation
+
+ The CLI is included when you install the gem:
+
+ ```bash
+ gem install semantic_chunker
+ ```
+
+ ### Usage
+
+ The CLI will automatically look for your `HUGGING_FACE_API_KEY` or `OPENAI_API_KEY` in your environment or a `.env` file.
+
+ ```bash
+ # Basic usage with automatic thresholding
+ semantic_chunker --threshold auto path/to/document.txt
+
+ # Specify a static threshold and max chunk size
+ semantic_chunker -t 0.85 -m 1000 document.txt
+
+ # Pipe text from another command
+ echo "Long text here..." | semantic_chunker -t auto
+ ```
+
+ ### JSON Output
+
+ For integration with other languages (Python, Node.js) or databases, you can output the result as structured JSON:
+
+ ```bash
+ semantic_chunker --format json document.txt
+ ```
+
+ **Example JSON Output:**
+
+ ```json
+ {
+   "metadata": {
+     "source": "document.txt",
+     "chunk_count": 2,
+     "threshold_used": "auto"
+   },
+   "chunks": [
+     {
+       "index": 0,
+       "content": "First semantic topic...",
+       "size": 245
+     },
+     {
+       "index": 1,
+       "content": "Second semantic topic...",
+       "size": 180
+     }
+   ]
+ }
+ ```
+
+ ### Options
+
+ | Flag | Long | Description | Default |
+ | - | - | - | - |
+ | `-t` | `--threshold` | Similarity threshold (float or `auto`) | `auto` |
+ | `-m` | `--max-size` | Hard limit for character count per chunk | `1500` |
+ | `-b` | `--buffer` | Context window size (int or `auto`) | `auto` |
+ | `-f` | `--format` | Output format (`text` or `json`) | `text` |
+ | `-v` | `--version` | Show version info | - |
+
+ ## Reliability & Resilience
+
+ The Hugging Face adapter is built for production-grade reliability:
+ - **Exponential Backoff**: Automatically retries requests if the model is warming up or the API is busy.
+ - **Smart Timeouts**: Includes connection and read timeouts to prevent your application from hanging.
+ - **Auto-Wait**: Uses the `X-Wait-For-Model` header to ensure stable results on the Inference API.
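+
+ The retry schedule doubles the wait on each attempt. Based on the adapter source included in this release (`INITIAL_BACKOFF = 2`, `MAX_RETRIES = 3`), the waits work out as follows:
+
+ ```ruby
+ # wait_time = INITIAL_BACKOFF * (2 ** retry_count)
+ 3.times.map { |retry_count| 2 * (2**retry_count) }
+ # => [2, 4, 8] seconds before the first, second, and third retry
+ ```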
+
+ ## 🚀 Roadmap to v1.0.0
+ - [x] Adaptive Dynamic Thresholding
+ - [x] CLI with JSON output
+ - [x] Robust error handling and retries
+ - [ ] **Next:** Local embedding cache (reduce API costs)
+ - [ ] **Next:** Drift protection (anchor-sentence comparison)
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at https://github.com/danielefrisanco/semantic_chunker.
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/bin/semantic_chunker ADDED
@@ -0,0 +1,70 @@
+ #!/usr/bin/env ruby
+
+ # Add the local lib directory to the load path
+ $LOAD_PATH.unshift(File.expand_path('../lib', __dir__))
+
+ require 'semantic_chunker'
+ require 'optparse'
+ require 'dotenv'
+ Dotenv.load
+
+ options = {
+   threshold: :auto,
+   max_size: 1500,
+   buffer: :auto
+ }
+
+ OptionParser.new do |opts|
+   opts.banner = "Usage: semantic_chunker [options] <file>"
+   opts.on("-t", "--threshold VAL", "Threshold (float, :auto)") { |v| options[:threshold] = v == 'auto' ? :auto : v.to_f }
+   opts.on("-m", "--max-size VAL", Integer, "Max character size") { |v| options[:max_size] = v }
+   opts.on("-f", "--format FORMAT", [:text, :json], "Output format (text, json)") { |v| options[:format] = v }
+   opts.on("-b", "--buffer VAL", "Buffer size (int, :auto)") { |v| options[:buffer] = v == 'auto' ? :auto : v.to_i }
+   opts.on("-v", "--version", "Show version") do
+     puts SemanticChunker::VERSION
+     exit
+   end
+ end.parse!
+
+ input_file = ARGV[0]
+ text = input_file ? File.read(input_file) : ARGF.read
+
+ if text.nil? || text.empty?
+   puts "Error: No input text provided."
+   exit 1
+ end
+
+ # Pick the embedding provider from whichever API key is present
+ provider = if ENV['HUGGING_FACE_API_KEY']
+   SemanticChunker::Adapters::HuggingFaceAdapter.new(api_key: ENV['HUGGING_FACE_API_KEY'])
+ elsif ENV['OPENAI_API_KEY']
+   SemanticChunker::Adapters::OpenAIAdapter.new(api_key: ENV['OPENAI_API_KEY'])
+ else
+   puts "Error: No API key found (HUGGING_FACE_API_KEY or OPENAI_API_KEY)."
+   exit 1
+ end
+
+ # Build the chunker from the parsed CLI options
+ chunker = SemanticChunker::Chunker.new(
+   embedding_provider: provider,
+   threshold: options[:threshold],
+   max_chunk_size: options[:max_size],
+   buffer_size: options[:buffer]
+ )
+
+ chunks = chunker.chunks_for(text)
+
+ if options[:format] == :json
+   puts JSON.pretty_generate({
+     metadata: {
+       source: input_file || "stdin",
+       chunk_count: chunks.size,
+       threshold_used: options[:threshold]
+     },
+     chunks: chunks.map.with_index { |c, i| { index: i, content: c, size: c.length } }
+   })
+ else
+   chunks.each_with_index do |chunk, i|
+     puts "--- Chunk #{i + 1} ---"
+     puts chunk
+     puts "\n"
+   end
+ end
data/lib/semantic_chunker/adapters/hugging_face_adapter.rb CHANGED
@@ -1,8 +1,18 @@
  # lib/semantic_chunker/adapters/hugging_face_adapter.rb
+ require 'net/http'
+ require 'json'
+ require 'uri'
+
  module SemanticChunker
    module Adapters
      class HuggingFaceAdapter < Base
        BASE_URL = "https://router.huggingface.co/hf-inference/models/%{model}"
+
+       # Configuration for reliability
+       MAX_RETRIES = 3
+       INITIAL_BACKOFF = 2 # seconds
+       OPEN_TIMEOUT = 5 # seconds to open a connection
+       READ_TIMEOUT = 60 # seconds to wait for embeddings

        def initialize(api_key:, model: 'intfloat/multilingual-e5-large')
          @api_key = api_key
@@ -12,23 +22,20 @@ module SemanticChunker
        end

        def embed(sentences)
-         response = post_request(sentences)
-
-         unless response.content_type == "application/json"
-           raise "HuggingFace Error: Expected JSON, got #{response.content_type}. Body: #{response.body}"
-         end
+         retry_count = 0

-         parsed = JSON.parse(response.body)
-
-         if response.is_a?(Net::HTTPSuccess)
-           parsed
-         else
-           if parsed.is_a?(Hash) && parsed["error"]&.include?("loading")
-             puts "Model warming up... retrying in 10s"
-             sleep 10
-             return embed(sentences)
+         begin
+           response = post_request(sentences)
+           handle_response(response)
+         rescue => e
+           if retryable?(e, retry_count)
+             wait_time = INITIAL_BACKOFF * (2**retry_count)
+             puts "HuggingFace: Transient error (#{e.message}). Retrying in #{wait_time}s..."
+             sleep wait_time
+             retry_count += 1
+             retry
            end
-           raise "HuggingFace Error: #{parsed['error'] || parsed}"
+           raise e
          end
        end

@@ -40,17 +47,42 @@ module SemanticChunker

          request["Authorization"] = "Bearer #{@api_key}"
          request["Content-Type"] = "application/json"
-         request["X-Wait-For-Model"] = "true"
+         request["X-Wait-For-Model"] = "true" # Tells HF to wait for the model to load

-         request.body = {
-           inputs: sentences
-         }.to_json
+         request.body = { inputs: sentences }.to_json

          Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
-           http.read_timeout = 60
+           http.open_timeout = OPEN_TIMEOUT
+           http.read_timeout = READ_TIMEOUT
            http.request(request)
          end
        end
+
+       def handle_response(response)
+         unless response.content_type == "application/json"
+           raise "HuggingFace Error: Expected JSON, got #{response.content_type}."
+         end
+
+         parsed = JSON.parse(response.body)
+
+         if response.is_a?(Net::HTTPSuccess)
+           parsed
+         elsif parsed.is_a?(Hash) && parsed["error"]&.include?("loading")
+           # This specifically triggers a retry for model warm-ups
+           raise "Model is still loading"
+         else
+           raise "HuggingFace API Error: #{parsed['error'] || response.body}"
+         end
+       end
+
+       def retryable?(error, count)
+         return false if count >= MAX_RETRIES
+
+         # Retry on model-loading errors and network timeouts
+         error.message.include?("loading") ||
+           error.is_a?(Net::ReadTimeout) ||
+           error.is_a?(Net::OpenTimeout)
+       end
      end
    end
- end
+ end
data/lib/semantic_chunker/chunker.rb CHANGED
@@ -31,7 +31,10 @@ module SemanticChunker
      # Step 3: Embed the groups, not the raw sentences
      group_embeddings = @provider.embed(context_groups)

-     calculate_groups(sentences, group_embeddings)
+     # Resolve the threshold dynamically if requested
+     resolved_threshold = resolve_threshold(group_embeddings)
+
+     calculate_groups(sentences, group_embeddings, resolved_threshold)
    end

    private
@@ -65,7 +68,7 @@ module SemanticChunker
      ps.segment
    end

-   def calculate_groups(sentences, embeddings)
+   def calculate_groups(sentences, embeddings, resolved_threshold)
      chunks = []
      current_chunk_text = [sentences[0]]
      current_chunk_vectors = [Vector[*embeddings[0]]]
@@ -74,22 +77,17 @@ module SemanticChunker
        new_sentence = sentences[i]
        new_vec = Vector[*embeddings[i]]

-       # 1. Calculate Centroid
        centroid = current_chunk_vectors.inject(:+) / current_chunk_vectors.size.to_f
        sim = cosine_similarity(centroid, new_vec)

-       # 2. Check Constraints: Similarity OR Size
-       # We calculate the potential size of the chunk if we added this sentence
        potential_size = current_chunk_text.join(" ").length + new_sentence.length + 1

-       if sim < @threshold || potential_size > @max_chunk_size
-         # Split if the topic changed OR the chunk is getting too fat
+       # Use the resolved threshold instead of @threshold
+       if sim < resolved_threshold || potential_size > @max_chunk_size
          chunks << current_chunk_text.join(" ")
-
          current_chunk_text = [new_sentence]
          current_chunk_vectors = [new_vec]
        else
-         # Keep grouping
          current_chunk_text << new_sentence
          current_chunk_vectors << new_vec
        end
@@ -98,10 +96,43 @@ module SemanticChunker
      chunks << current_chunk_text.join(" ")
      chunks
    end
-
    def cosine_similarity(v1, v2)
-     return 0.0 if v1.magnitude.zero? || v2.magnitude.zero?
-     v1.inner_product(v2) / (v1.magnitude * v2.magnitude)
+     # Ensure we are working with Vectors
+     v1 = Vector[*v1] unless v1.is_a?(Vector)
+     v2 = Vector[*v2] unless v2.is_a?(Vector)
+
+     mag1 = v1.magnitude
+     mag2 = v2.magnitude
+
+     return 0.0 if mag1.zero? || mag2.zero?
+     v1.inner_product(v2) / (mag1 * mag2)
+   end
+
+   def resolve_threshold(embeddings)
+     return @threshold if @threshold.is_a?(Numeric)
+     return DEFAULT_THRESHOLD if embeddings.size < 2
+
+     similarities = []
+     (0...embeddings.size - 1).each do |i|
+       # Wrap the raw arrays here; cosine_similarity leaves
+       # values that are already Vectors untouched.
+       v1 = Vector[*embeddings[i]]
+       v2 = Vector[*embeddings[i + 1]]
+       similarities << cosine_similarity(v1, v2)
+     end
+
+     return DEFAULT_THRESHOLD if similarities.empty?
+
+     percentile_val = @threshold.is_a?(Hash) ? @threshold[:percentile] : 20
+
+     # Use (size - 1) for the index to avoid going out of bounds on small lists
+     sorted_sims = similarities.sort
+     index = ((sorted_sims.size - 1) * (percentile_val / 100.0)).round
+
+     dynamic_val = sorted_sims[index]
+
+     # Guardrail: clamp to prevent hyper-splitting or never-splitting.
+     # 0.3 is a safe floor for "totally different"; 0.95 is a safe ceiling.
+     dynamic_val.clamp(0.3, 0.95)
    end
  end
end
data/lib/semantic_chunker/version.rb CHANGED
@@ -1,3 +1,3 @@
  module SemanticChunker
-   VERSION = "0.5.3"
+   VERSION = "0.6.3"
  end
data/lib/semantic_chunker.rb CHANGED
@@ -5,7 +5,7 @@ require 'json'
  require 'net/http'

  # 2. Require the version and base modules
- require_relative 'semantic_chunker/version' if File.exist?('lib/semantic_chunker/version.rb')
+ require_relative 'semantic_chunker/version'

  # 3. Require the internal logic
  require_relative 'semantic_chunker/adapters/base'
metadata CHANGED
@@ -1,15 +1,43 @@
  --- !ruby/object:Gem::Specification
  name: semantic_chunker
  version: !ruby/object:Gem::Version
-   version: 0.5.3
+   version: 0.6.3
  platform: ruby
  authors:
  - Daniele Frisanco
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2026-01-07 00:00:00.000000000 Z
+ date: 2026-01-08 00:00:00.000000000 Z
  dependencies:
+ - !ruby/object:Gem::Dependency
+   name: pragmatic_segmenter
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.3'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.3'
+ - !ruby/object:Gem::Dependency
+   name: matrix
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.4'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.4'
  - !ruby/object:Gem::Dependency
    name: rake
    requirement: !ruby/object:Gem::Requirement
@@ -53,40 +81,32 @@ dependencies:
      - !ruby/object:Gem::Version
        version: '0'
  - !ruby/object:Gem::Dependency
-   name: pragmatic_segmenter
+   name: webmock
    requirement: !ruby/object:Gem::Requirement
      requirements:
-     - - "~>"
-       - !ruby/object:Gem::Version
-         version: '0.3'
-   type: :runtime
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - "~>"
-       - !ruby/object:Gem::Version
-         version: '0.3'
- - !ruby/object:Gem::Dependency
-   name: matrix
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - "~>"
+     - - ">="
        - !ruby/object:Gem::Version
-         version: '0.4'
-   type: :runtime
+         version: '0'
+   type: :development
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
-     - - "~>"
+     - - ">="
        - !ruby/object:Gem::Version
-         version: '0.4'
- description: Split long text into chunks based on semantic meaning.
+         version: '0'
+ description: A powerful tool for RAG (Retrieval-Augmented Generation) that splits
+   text into chunks based on semantic meaning rather than just character counts. Supports
+   sliding windows, adaptive buffering, and dynamic percentile-based thresholding.
  email:
  - daniele.frisanco@gmail.com
- executables: []
+ executables:
+ - semantic_chunker
  extensions: []
  extra_rdoc_files: []
  files:
+ - CHANGELOG.md
+ - README.md
+ - bin/semantic_chunker
  - lib/semantic_chunker.rb
  - lib/semantic_chunker/adapters/base.rb
  - lib/semantic_chunker/adapters/hugging_face_adapter.rb
@@ -97,7 +117,13 @@ files:
  homepage: https://github.com/danielefrisanco/semantic_chunker
  licenses:
  - MIT
- metadata: {}
+ metadata:
+   homepage_uri: https://github.com/danielefrisanco/semantic_chunker
+   source_code_uri: https://github.com/danielefrisanco/semantic_chunker
+   changelog_uri: https://github.com/danielefrisanco/semantic_chunker/blob/main/CHANGELOG.md
+   bug_tracker_uri: https://github.com/danielefrisanco/semantic_chunker/issues
+   documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.3
+   allowed_push_host: https://rubygems.org
  post_install_message:
  rdoc_options: []
  require_paths:
@@ -106,7 +132,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
    - !ruby/object:Gem::Version
-     version: '0'
+     version: 3.0.0
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
@@ -116,5 +142,5 @@ requirements: []
  rubygems_version: 3.3.26
  signing_key:
  specification_version: 4
- summary: Split long text into chunks based on semantic meaning.
+ summary: Semantic text chunking using embeddings and dynamic thresholding.
  test_files: []