semantic_chunker 0.5.3 → 0.6.3
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/CHANGELOG.md +63 -0
- data/README.md +429 -0
- data/bin/semantic_chunker +70 -0
- data/lib/semantic_chunker/adapters/hugging_face_adapter.rb +53 -21
- data/lib/semantic_chunker/chunker.rb +43 -12
- data/lib/semantic_chunker/version.rb +1 -1
- data/lib/semantic_chunker.rb +1 -1
- metadata +53 -27
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 9eb71c3c0285ded52be28cc83f18c2590040bcac489bf07963134b9b877c7dd6
+  data.tar.gz: 2cbb2ad4565519fd068b3a1eb6805e7989bd7d78886884b311faa6c4109193bc
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 2210e23a05cc4ed601528f0c1b877d02c13d9b3da77486a0fa5845a799fd02815f21f9ecd9cd52bf1c4bcb87aff2c733914016e3efba9c1454df992ea0161fa7
+  data.tar.gz: 2541b7fcb705b410444e109979a95cc202d660ec9febc0d1bd4e1116173b99a0d1b75a4a88c204a4e590d2a20e84888c3e75bd8174f43e18b2a19201098b756e
data/CHANGELOG.md
ADDED
@@ -0,0 +1,63 @@
# Changelog

All notable changes to this project will be documented in this file.

## [0.6.2] - 2026-01-07

### Added

- **Command Line Interface (CLI)**: Introduced `bin/semantic_chunker`, allowing users to chunk files or piped text directly from the terminal.
- **JSON Output**: Added a `--format json` flag to the CLI for easy integration with Python, Node.js, and other data pipelines.
- **Net::HTTP Timeouts**: Added `open_timeout` and `read_timeout` to the Hugging Face adapter to prevent application hangs during network instability.
- **Exponential Backoff**: Implemented a retry strategy for the Hugging Face API that waits progressively longer if the model is currently "loading" or "warming up."
- **Unit Testing Suite**: Established an RSpec test suite using **WebMock** to simulate API responses and verify retry/timeout logic without making real network calls.

### Changed

- **Hugging Face Resilience**: Improved the adapter to handle transient 503 errors and "model cold start" scenarios more gracefully using the `X-Wait-For-Model` header.
- **CLI Performance**: Added local load path handling to allow running the CLI during development without requiring the gem to be installed globally.

### Fixed

- **Unstable Network Hangs**: Fixed an issue where a slow response from the embedding provider could block the Ruby process indefinitely.

## [0.6.0] - 2026-01-07

### Added
- **Dynamic Thresholding**: Introduced model-agnostic splitting logic. The chunker now adapts to the specific "density" of a document's vector space.
- **Auto Mode**: Use `threshold: :auto` to automatically calculate the optimal split point based on the document's 15th percentile of similarity.
- **Percentile Mode**: Use `threshold: { percentile: 10 }` for fine-grained control over how sensitive topic-shift detection should be.
- **Clamping Logic**: Added guardrails to dynamic thresholds (clamped between `0.3` and `0.95`) to prevent hyper-splitting in repetitive documents.

### Fixed
- **Ruby 3.0 Compatibility**: Resolved CI/CD issues and Bundler version conflicts to ensure full support for Ruby 3.0.x.
- **Precision Indexing**: Improved percentile calculation using `round` logic to ensure accuracy in both short and long documents.

### Summary of API Changes
The `threshold` parameter now accepts three types of input:

| Mode | Input | Best For... |
|------------|----------------------|----------------------------------------------------------------|
| **Static** | `0.82` (float) | Deterministic behavior with known models (e.g., OpenAI). |
| **Auto** | `:auto` | General purpose; handles E5/BGE/MiniLM models automatically. |
| **Percentile** | `{ percentile: 10 }` | Custom sensitivity; lower % = larger chunks, higher % = more splits. |

---

## [0.5.3] - 2025-10-08

### Added
- **Pragmatic Segmenter Integration**: Replaced basic regex splitting with `pragmatic_segmenter` for multilingual and context-aware sentence boundary detection.
- **Language Support**: Added `segmenter_options` to allow users to specify document language (e.g., `hy`, `jp`, `en`) and type (e.g., `pdf`).

## [0.2.0] - 2026-01-06

### Added
- **Centroid Comparison:** Chunks now split based on the average semantic meaning of the entire current group rather than just the previous sentence.
- **Sliding Buffer Window:** Added `buffer_size` to enrich sentence embeddings with surrounding context.
- **Adaptive Buffering:** Introduced `:auto` mode for `buffer_size`.
- **Hard Size Limits:** Added `max_chunk_size` to force splits when a topic exceeds character limits.
data/README.md
ADDED
@@ -0,0 +1,429 @@
# Semantic Chunker

[![Gem Version](https://badge.fury.io/rb/semantic_chunker.svg)](https://badge.fury.io/rb/semantic_chunker)

A Ruby gem for splitting long texts into semantically related chunks. This is useful for preparing text for language models, where you need to feed a model contextually relevant information.

## What is Semantic Chunking?

Semantic chunking is a technique for splitting text based on meaning. Instead of splitting text by a fixed number of words or sentences, this gem groups sentences that are semantically related.

It works by:
1. Splitting the text into individual sentences.
2. Generating a vector embedding for each sentence using a configurable provider (e.g., OpenAI, Hugging Face).
3. Comparing the new sentence's windowed embedding to the **centroid (average) of the current chunk's embeddings**.
4. Starting a new chunk if the similarity between the new sentence and the chunk's centroid falls below a certain threshold. This prevents topic drift.
5. Enhancing the decision with a **buffer window**, which considers multiple sentences at a time to make more robust choices.

This results in chunks of text that are topically coherent.
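The centroid comparison in steps 3–4 can be sketched with Ruby's `matrix` library. This is an illustration of the idea only, not the gem's internal code; the vectors and the `0.82` threshold here are made-up toy values.

```ruby
require 'matrix'

# Cosine similarity between two vectors (1.0 = identical direction).
def cosine_similarity(v1, v2)
  v1.inner_product(v2) / (v1.magnitude * v2.magnitude)
end

# Centroid (element-wise average) of the current chunk's embeddings.
chunk_vectors = [Vector[1.0, 0.0], Vector[0.9, 0.1]]
centroid = chunk_vectors.inject(:+) / chunk_vectors.size.to_f

related   = Vector[0.95, 0.05] # stays on topic
unrelated = Vector[0.0, 1.0]   # topic shift

threshold = 0.82
puts cosine_similarity(centroid, related) >= threshold   # keep grouping
puts cosine_similarity(centroid, unrelated) >= threshold # start a new chunk
```

A new chunk begins exactly when the similarity to the running centroid dips below the threshold, so one off-topic sentence ends the group rather than dragging its average.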
## Compatibility

This gem requires Ruby 3.0 or higher.

## Installation

This gem relies on two key dependencies for its logic:

1. **matrix**: Used for high-performance vector calculations and centroid math.
2. **pragmatic_segmenter**: Used for rule-based sentence boundary detection (handling abbreviations, initials, and citations).

Add these lines to your application's Gemfile:

```ruby
# Required for Ruby 3.1+
gem 'matrix'

# Required for high-quality sentence splitting
gem 'pragmatic_segmenter'

gem 'semantic_chunker'
```

And then execute:

    $ bundle install

Or install it yourself as:

    $ gem install semantic_chunker
## Usage

Here is a basic example of how to use `semantic_chunker`:

```ruby
require 'semantic_chunker'

# 1. Configure the provider
# You can configure the provider globally.
# This is useful in a Rails initializer, for example.
SemanticChunker.configure do |config|
  config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
    api_key: ENV.fetch("HUGGING_FACE_API_KEY"),
    model: "sentence-transformers/all-MiniLM-L6-v2"
  )
end

# 2. Create a chunker and process your text
chunker = SemanticChunker::Chunker.new(
  threshold: 0.8,
  buffer_size: :auto,
  max_chunk_size: 1000
)
text = "Your very long document text goes here. It can contain multiple paragraphs and topics. The chunker will split it into meaningful parts."
chunks = chunker.chunks_for(text)

# chunks will be an array of strings.
# The strings preserve the original formatting and whitespace.
chunks.each_with_index do |chunk, i|
  puts "Chunk #{i + 1}:"
  puts chunk
  puts "---"
end
```
## Rails Integration

For Rails applications, here is a recommended setup:

### 1. Initializer

Create an initializer to configure the gem globally. This is where you should set up your embedding provider using Rails credentials.

```ruby
# config/initializers/semantic_chunker.rb
SemanticChunker.configure do |config|
  config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
    api_key: Rails.application.credentials.dig(:hugging_face, :api_key),
    model: "sentence-transformers/all-MiniLM-L6-v2"
  )
end
```

### 2. Model Usage

You can use the chunker within your models, for example, to chunk a document's content before saving or for indexing in a search engine.

```ruby
# app/models/document.rb
class Document < ApplicationRecord
  def semantic_chunks
    chunker = SemanticChunker::Chunker.new
    chunker.chunks_for(self.content)
  end
end
```

### 3. Caching

To avoid re-embedding the same content, which can be slow and costly, consider implementing a caching strategy. You can cache the embeddings or the final chunks. Here is a simple example using `Rails.cache`:

```ruby
# app/models/document.rb
class Document < ApplicationRecord
  def semantic_chunks
    Rails.cache.fetch("document_#{self.id}_chunks", expires_in: 12.hours) do
      chunker = SemanticChunker::Chunker.new
      chunker.chunks_for(self.content)
    end
  end
end
```
## Configuration

### Sentence Splitting (Pragmatic Segmenter)

This gem uses `pragmatic_segmenter` for high-quality sentence splitting. You can pass options directly to it using the `segmenter_options` hash during chunker initialization. This is useful for handling different languages or document types.

The following options are available:
- `language`: Specifies the language of the text (e.g., `'en'` for English, `'hy'` for Armenian).
- `doc_type`: Optimizes segmentation for specific document formats (e.g., `'pdf'`).
- `clean`: When `false`, disables the preliminary text cleaning process.

**Examples:**

```ruby
# Example 1: Processing an Armenian PDF
chunker = SemanticChunker::Chunker.new(
  segmenter_options: { language: 'hy', doc_type: 'pdf' }
)

# Example 2: Disabling text cleaning for strict raw data
chunker = SemanticChunker::Chunker.new(
  segmenter_options: { clean: false }
)
```
### Global Configuration

You can configure the embedding provider globally, which is useful in frameworks like Rails.

```ruby
# config/initializers/semantic_chunker.rb
SemanticChunker.configure do |config|
  config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
    api_key: ENV.fetch("HUGGING_FACE_API_KEY"),
    model: "sentence-transformers/all-MiniLM-L6-v2"
  )
end
```

### Per-instance Configuration

You can also pass a provider directly to the `Chunker` instance. This will override any global configuration.

```ruby
provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(api_key: "your-key")
chunker = SemanticChunker::Chunker.new(embedding_provider: provider)
```
### Threshold

You can configure the similarity threshold. The default is `0.82`.

> **Note:** The default value is optimized for the `sentence-transformers/all-MiniLM-L6-v2` model. You may need to adjust this value significantly for other models, especially those with different embedding dimensions (e.g., OpenAI's `text-embedding-3-large`).

1. Higher threshold (e.g., 0.95): Requires very high similarity to keep sentences together, resulting in more, smaller chunks.
2. Lower threshold (e.g., 0.50): Is more "forgiving," resulting in fewer, larger chunks.

```ruby
# Lower threshold, fewer chunks
chunker = SemanticChunker::Chunker.new(threshold: 0.7)

# Higher threshold, more chunks
chunker = SemanticChunker::Chunker.new(threshold: 0.9)
```
### Dynamic Thresholding (v0.6.0)

With the introduction of **Dynamic Thresholding**, SemanticChunker is now model-agnostic. It automatically adapts to the vector density of different embedding models (e.g., OpenAI, E5, BGE, or Hugging Face).

### Threshold Modes

| Mode | Syntax | Description |
| - | - | - |
| Static | `0.82` | Splits when similarity drops below a fixed number. Use this if you have a specific model tuned to a known threshold. |
| Auto | `:auto` | (Default) Calculates the 15th percentile of similarities in the document and splits at the "valleys." |
| Percentile | `{ percentile: 10 }` | Advanced control. A lower percentile creates fewer, larger chunks; a higher percentile creates more, smaller chunks. |

### Which one should I use?

* **Use `:auto`** if you are swapping models frequently or using open-source models from Hugging Face. It prevents the "one giant chunk" bug that happens when a model's similarities cluster in a narrow range.

* **Use a static number** if you require strictly deterministic behavior across different documents and know your model's distribution.
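As a sketch of how a percentile-based threshold can be derived: this is illustrative only, not the gem's actual implementation, but the `round`-based index and the 0.3–0.95 clamp mirror what the changelog describes.

```ruby
# values: cosine similarities between consecutive sentence groups.
def percentile(values, pct)
  sorted = values.sort
  # Round-based indexing keeps short and long documents accurate.
  index = ((pct / 100.0) * (sorted.length - 1)).round
  sorted[index]
end

def dynamic_threshold(similarities, pct: 15)
  raw = percentile(similarities, pct)
  raw.clamp(0.3, 0.95) # guardrails against hyper-splitting
end

sims = [0.91, 0.88, 0.42, 0.90, 0.87, 0.35, 0.89]
puts dynamic_threshold(sims) # lands near the low "valleys" (0.42 here)
```

Because the threshold is taken from the document's own similarity distribution, a model whose similarities all sit near 0.9 still produces splits at its relative valleys.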
### Buffer Windows (Buffer Size)

The `buffer_size` parameter defines a sliding "context window." Instead of embedding a single sentence in isolation, the chunker combines a sentence with its neighbors. This "semantic smoothing" prevents false splits caused by short sentences or pronouns (like "He" or "It") that lack context.

* **0**: No buffer. Each sentence is embedded exactly as written. Best for very long, self-contained paragraphs.
* **1 (Default)**: Looks 1 sentence back and 1 sentence forward. For sentence `i`, the embedding represents sentences `i-1`, `i`, and `i+1` combined.
* **2**: Looks 2 sentences back and 2 forward. This creates a large 5-sentence context for every comparison.
* **:auto**: The chunker analyzes the density of your text and automatically selects the best window:
  * **Short sentences** (avg < 60 chars): Uses `buffer_size: 2` (captures conversational flow).
  * **Medium sentences** (avg 60–150 chars): Uses `buffer_size: 1` (standard).
  * **Long sentences** (avg > 150 chars): Uses `buffer_size: 0` (high precision).

```ruby
chunker = SemanticChunker::Chunker.new(buffer_size: :auto)
```
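The sliding window can be sketched as follows. This is an illustration of the idea only; the method and variable names are made up, not the gem's internals.

```ruby
# Build the "windowed" text that gets embedded for each sentence.
# With buffer_size 1, sentence i is embedded with its neighbours.
def context_groups(sentences, buffer_size)
  sentences.each_index.map do |i|
    from = [i - buffer_size, 0].max
    to   = [i + buffer_size, sentences.length - 1].min
    sentences[from..to].join(" ")
  end
end

sentences = ["He left.", "The meeting ended early.", "Stocks fell."]
groups = context_groups(sentences, 1)
puts groups[0] # "He left. The meeting ended early."
```

Note that the original sentences are still what end up in the output chunks; only the embeddings use the windowed text.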
### Max Chunk Size

You can set a hard limit on the character length of a chunk using `max_chunk_size`. This is useful for ensuring chunks do not exceed the context window of a language model. A split will be forced even if the sentences are semantically related. The default is `1500`.

```ruby
chunker = SemanticChunker::Chunker.new(max_chunk_size: 1000)
```
### Adapters

The gem is designed to be extensible with different embedding providers. It currently ships with:

- `SemanticChunker::Adapters::OpenAIAdapter`: For OpenAI's embedding models.
- `SemanticChunker::Adapters::HuggingFaceAdapter`: For Hugging Face's embedding models.
- `SemanticChunker::Adapters::TestAdapter`: A simple adapter for testing purposes.

You can create your own adapter by creating a class that inherits from `SemanticChunker::Adapters::Base` and implements an `embed(sentences)` method.

The `embed` method must return an `Array` of `Array`s, where each inner array is an embedding (a list of floats). The `Chunker` will automatically handle the conversion of these arrays into `Vector` objects for similarity calculations.

For consistency, it's recommended to place your custom adapter class within the `SemanticChunker::Adapters` namespace, although this is not a strict requirement.
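A minimal custom adapter might look like the following. It is illustrative only: in a real application the class would inherit from `SemanticChunker::Adapters::Base`, and the deterministic "embedding" below is a toy stand-in for a real model call.

```ruby
# Illustrative custom adapter. In your app, declare it as:
#   class WordStatsAdapter < SemanticChunker::Adapters::Base
class WordStatsAdapter
  # Must return an Array of Arrays: one embedding (floats) per sentence.
  def embed(sentences)
    sentences.map do |s|
      [s.length.to_f, s.split.size.to_f, s.count("aeiou").to_f]
    end
  end
end

adapter = WordStatsAdapter.new
puts adapter.embed(["Hello world.", "Hi."]).inspect
# One 3-element float vector per input sentence.
```

Any object responding to `embed(sentences)` with that return shape can be passed as `embedding_provider:` or set as the global `config.provider`.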
## Development & Testing

To run the tests, you'll need to install the development dependencies:

    $ bundle install

### Unit Tests

Run the unit tests with:

    $ bundle exec rspec

### Integration Tests

The integration tests use third-party APIs and require API keys.

**OpenAI**
```bash
$ OPENAI_API_KEY="your-key" bundle exec ruby test_integration.rb
```

**Hugging Face**
```bash
$ HUGGING_FACE_API_KEY="your-key" bundle exec ruby test_hugging_face.rb
```
### Security Note: Handling API Keys

When using an adapter that requires an API key, **never hardcode your API keys** directly into your source code. To keep your application secure (especially if you are working on public repositories), use one of the following methods:

#### Using Rails Credentials (Recommended for Rails)

Store your key in your encrypted credentials file:
```bash
bin/rails credentials:edit
```

Then reference it in your initializer:

```ruby
SemanticChunker.configure do |config|
  config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
    api_key: Rails.application.credentials.dig(:hugging_face, :api_key)
  )
end
```

#### Using Environment Variables

Alternatively, use a gem like `dotenv` and fetch the key from the environment:

```ruby
api_key = ENV.fetch("YOUR_API_KEY") { raise "Missing API Key" }
```
## Troubleshooting

### Matrix Dependency (Ruby 3.1+)

Since Ruby 3.1, the `matrix` library has been moved from the standard library to a bundled gem.

* **If you are on Ruby 3.1, 3.2, or 3.3:** You must include `gem 'matrix'` in your Gemfile.

* **If you are on Ruby 3.0:** The library is built in. If you see a "duplicate dependency" error, ensure you are not manually adding `gem 'matrix'` to your Gemfile, as the system version will take precedence.

### Hugging Face "Model Loading"

If you receive a 503 Service Unavailable error when using the Hugging Face adapter, it usually means the model is being loaded onto the server for the first time.

* **Solution:** Wait 30 seconds and try again. The HuggingFaceAdapter is designed to be lightweight, but serverless endpoints require a "warm-up" period.

### Encoding Issues

If your text contains complex Unicode or non-UTF-8 characters, `pragmatic_segmenter` may behave unexpectedly.

* **Solution:** Ensure your input string is UTF-8 encoded: `text.encode('UTF-8', invalid: :replace, undef: :replace)`.
## Command Line Interface (CLI)

SemanticChunker includes a powerful CLI that allows you to chunk files or piped text directly from your terminal. This is ideal for quick testing or for integrating with non-Ruby applications.

### Installation

The CLI is included when you install the gem:

```bash
gem install semantic_chunker
```

### Usage

The CLI automatically looks for your `HUGGING_FACE_API_KEY` or `OPENAI_API_KEY` in your environment or a `.env` file.

```bash
# Basic usage with automatic thresholding
semantic_chunker --threshold auto path/to/document.txt

# Specify a static threshold and max chunk size
semantic_chunker -t 0.85 -m 1000 document.txt

# Pipe text from another command
echo "Long text here..." | semantic_chunker -t auto
```

### JSON Output

For integration with other languages (Python, Node.js) or databases, you can output the result as structured JSON:

```bash
semantic_chunker --format json document.txt
```

**Example JSON Output:**

```json
{
  "metadata": {
    "source": "document.txt",
    "chunk_count": 2,
    "threshold_used": "auto"
  },
  "chunks": [
    {
      "index": 0,
      "content": "First semantic topic...",
      "size": 245
    },
    {
      "index": 1,
      "content": "Second semantic topic...",
      "size": 180
    }
  ]
}
```
### Options

| Flag | Long | Description | Default |
| - | - | - | - |
| -t | --threshold | Similarity threshold (float or auto) | auto |
| -m | --max-size | Hard limit for character count per chunk | 1500 |
| -b | --buffer | Context window size (int or auto) | auto |
| -f | --format | Output format (text or json) | text |
| -v | --version | Show version info | - |
## Reliability & Resilience

The Hugging Face adapter is built for production-grade reliability:
- **Exponential Backoff**: Automatically retries requests if the model is warming up or the API is busy.
- **Smart Timeouts**: Includes connection and read timeouts to prevent your application from hanging.
- **Auto-Wait**: Uses the `X-Wait-For-Model` header to ensure stable results on the Inference API.
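Concretely, the backoff doubles the wait on each attempt. Using the adapter's constants (`INITIAL_BACKOFF = 2`, `MAX_RETRIES = 3`), the delay schedule works out to:

```ruby
# Retry delays: INITIAL_BACKOFF * 2**attempt, for each allowed attempt.
INITIAL_BACKOFF = 2 # seconds
MAX_RETRIES = 3

delays = (0...MAX_RETRIES).map { |attempt| INITIAL_BACKOFF * (2**attempt) }
puts delays.inspect # [2, 4, 8]
```

So a cold model gets roughly 14 seconds of cumulative grace before the error is re-raised to the caller.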
## 🚀 Roadmap to v1.0.0
- [x] Adaptive Dynamic Thresholding
- [x] CLI with JSON output
- [x] Robust error handling and retries
- [ ] **Next:** Local embedding cache (reduce API costs)
- [ ] **Next:** Drift protection (anchor-sentence comparison)

## Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/danielefrisanco/semantic_chunker.

## License

The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/bin/semantic_chunker
ADDED

@@ -0,0 +1,70 @@
#!/usr/bin/env ruby

# Add the local lib directory to the load path
$LOAD_PATH.unshift(File.expand_path('../lib', __dir__))

require 'semantic_chunker'
require 'optparse'
require 'json' # needed for JSON.pretty_generate below
require 'dotenv'
Dotenv.load

options = {
  threshold: :auto,
  max_size: 1500,
  buffer: :auto
}

OptionParser.new do |opts|
  opts.banner = "Usage: semantic_chunker [options] <file>"
  opts.on("-t", "--threshold VAL", "Threshold (float, :auto)") { |v| options[:threshold] = v == 'auto' ? :auto : v.to_f }
  opts.on("-m", "--max-size VAL", Integer, "Max character size") { |v| options[:max_size] = v }
  opts.on("-f", "--format FORMAT", [:text, :json], "Output format (text, json)") { |v| options[:format] = v }
  opts.on("-b", "--buffer VAL", "Buffer size (int, :auto)") { |v| options[:buffer] = v == 'auto' ? :auto : v.to_i }
  opts.on("-v", "--version", "Show version") do
    puts SemanticChunker::VERSION
    exit
  end
end.parse!

input_file = ARGV[0]
text = input_file ? File.read(input_file) : ARGF.read

if text.nil? || text.empty?
  puts "Error: No input text provided."
  exit 1
end

# Pick a provider based on which API key is present in the environment.
provider = if ENV['HUGGING_FACE_API_KEY']
  SemanticChunker::Adapters::HuggingFaceAdapter.new(api_key: ENV['HUGGING_FACE_API_KEY'])
elsif ENV['OPENAI_API_KEY']
  SemanticChunker::Adapters::OpenAIAdapter.new(api_key: ENV['OPENAI_API_KEY'])
else
  puts "Error: No API key found (HUGGING_FACE_API_KEY or OPENAI_API_KEY)."
  exit 1
end

chunker = SemanticChunker::Chunker.new(
  embedding_provider: provider,
  threshold: options[:threshold],
  max_chunk_size: options[:max_size],
  buffer_size: options[:buffer]
)

chunks = chunker.chunks_for(text)
if options[:format] == :json
  puts JSON.pretty_generate({
    metadata: {
      source: input_file || "stdin",
      chunk_count: chunks.size,
      threshold_used: options[:threshold]
    },
    chunks: chunks.map.with_index { |c, i| { index: i, content: c, size: c.length } }
  })
else
  chunks.each_with_index do |chunk, i|
    puts "--- Chunk #{i + 1} ---"
    puts chunk
    puts "\n"
  end
end
data/lib/semantic_chunker/adapters/hugging_face_adapter.rb
CHANGED

@@ -1,8 +1,18 @@
 # lib/semantic_chunker/adapters/hugging_face_adapter.rb
+require 'net/http'
+require 'json'
+require 'uri'
+
 module SemanticChunker
   module Adapters
     class HuggingFaceAdapter < Base
       BASE_URL = "https://router.huggingface.co/hf-inference/models/%{model}"
+
+      # Configuration for reliability
+      MAX_RETRIES = 3
+      INITIAL_BACKOFF = 2 # seconds
+      OPEN_TIMEOUT = 5 # seconds to open connection
+      READ_TIMEOUT = 60 # seconds to wait for embeddings

       def initialize(api_key:, model: 'intfloat/multilingual-e5-large')
         @api_key = api_key
@@ -12,23 +22,20 @@ module SemanticChunker
       end

       def embed(sentences)
-        unless response.content_type == "application/json"
-          raise "HuggingFace Error: Expected JSON, got #{response.content_type}. Body: #{response.body}"
-        end
+        retry_count = 0

-            puts "
-            sleep
+        begin
+          response = post_request(sentences)
+          handle_response(response)
+        rescue => e
+          if retryable?(e, retry_count)
+            wait_time = INITIAL_BACKOFF * (2**retry_count)
+            puts "HuggingFace: Transient error (#{e.message}). Retrying in #{wait_time}s..."
+            sleep wait_time
+            retry_count += 1
+            retry
           end
-          raise
+          raise e
         end
       end

@@ -40,17 +47,42 @@ module SemanticChunker

         request["Authorization"] = "Bearer #{@api_key}"
         request["Content-Type"] = "application/json"
-        request["X-Wait-For-Model"] = "true"
+        request["X-Wait-For-Model"] = "true" # Tells HF to wait for model load

-        request.body = {
-          inputs: sentences
-        }.to_json
+        request.body = { inputs: sentences }.to_json

         Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
-          http.
+          http.open_timeout = OPEN_TIMEOUT
+          http.read_timeout = READ_TIMEOUT
           http.request(request)
         end
       end
+
+      def handle_response(response)
+        unless response.content_type == "application/json"
+          raise "HuggingFace Error: Expected JSON, got #{response.content_type}."
+        end
+
+        parsed = JSON.parse(response.body)
+
+        if response.is_a?(Net::HTTPSuccess)
+          parsed
+        elsif parsed.is_a?(Hash) && parsed["error"]&.include?("loading")
+          # This specifically triggers a retry for model warmups
+          raise "Model is still loading"
+        else
+          raise "HuggingFace API Error: #{parsed['error'] || response.body}"
+        end
+      end
+
+      def retryable?(error, count)
+        return false if count >= MAX_RETRIES
+
+        # Retry on timeouts, loading errors, or 5xx server errors
+        error.message.include?("loading") ||
+          error.is_a?(Net::ReadTimeout) ||
+          error.is_a?(Net::OpenTimeout)
+      end
     end
   end
-end
+end
@@ -31,7 +31,10 @@ module SemanticChunker
       # Step 3: Embed the groups, not the raw sentences
       group_embeddings = @provider.embed(context_groups)
 
-
+      # Resolve the threshold dynamically if requested
+      resolved_threshold = resolve_threshold(group_embeddings)
+
+      calculate_groups(sentences, group_embeddings, resolved_threshold)
     end
 
     private
@@ -65,7 +68,7 @@ module SemanticChunker
       ps.segment
     end
 
-    def calculate_groups(sentences, embeddings)
+    def calculate_groups(sentences, embeddings, resolved_threshold)
       chunks = []
       current_chunk_text = [sentences[0]]
       current_chunk_vectors = [Vector[*embeddings[0]]]
@@ -74,22 +77,17 @@ module SemanticChunker
         new_sentence = sentences[i]
         new_vec = Vector[*embeddings[i]]
 
-        # 1. Calculate Centroid
         centroid = current_chunk_vectors.inject(:+) / current_chunk_vectors.size.to_f
         sim = cosine_similarity(centroid, new_vec)
 
-        # 2. Check Constraints: Similarity OR Size
-        # We calculate the potential size of the chunk if we added this sentence
         potential_size = current_chunk_text.join(" ").length + new_sentence.length + 1
 
-
-
+        # Use the resolved_threshold instead of @threshold
+        if sim < resolved_threshold || potential_size > @max_chunk_size
           chunks << current_chunk_text.join(" ")
-
          current_chunk_text = [new_sentence]
          current_chunk_vectors = [new_vec]
        else
-         # Keep grouping
          current_chunk_text << new_sentence
          current_chunk_vectors << new_vec
        end
@@ -98,10 +96,43 @@ module SemanticChunker
       chunks << current_chunk_text.join(" ")
       chunks
     end
-
     def cosine_similarity(v1, v2)
-
-      v1
+      # Ensure we are working with Vectors
+      v1 = Vector[*v1] unless v1.is_a?(Vector)
+      v2 = Vector[*v2] unless v2.is_a?(Vector)
+
+      mag1 = v1.magnitude
+      mag2 = v2.magnitude
+
+      return 0.0 if mag1.zero? || mag2.zero?
+      v1.inner_product(v2) / (mag1 * mag2)
+    end
+
+    def resolve_threshold(embeddings)
+      return @threshold if @threshold.is_a?(Numeric)
+      return DEFAULT_THRESHOLD if embeddings.size < 2
+
+      similarities = []
+      (0...embeddings.size - 1).each do |i|
+        # Note: We wrap them here, but ensure cosine_similarity
+        # doesn't re-wrap them if they are already Vectors.
+        v1 = Vector[*embeddings[i]]
+        v2 = Vector[*embeddings[i+1]]
+        similarities << cosine_similarity(v1, v2)
+      end
+
+      return DEFAULT_THRESHOLD if similarities.empty?
+
+      percentile_val = @threshold.is_a?(Hash) ? @threshold[:percentile] : 20
+
+      # Use (size - 1) for the index to avoid "out of bounds" on small lists
+      sorted_sims = similarities.sort
+      index = ((sorted_sims.size - 1) * (percentile_val / 100.0)).round
+
+      dynamic_val = sorted_sims[index]
+
+      # Guardrail: Clamp to prevent hyper-splitting or never-splitting
+      # 0.3 is a safe floor for 'totally different', 0.95 is a safe ceiling.
+      dynamic_val.clamp(0.3, 0.95)
     end
   end
 end
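The percentile arithmetic in `resolve_threshold` can be traced by hand. Here is a standalone walkthrough with invented similarity values; only the three lines of index/clamp math come from the diff, everything else is illustrative:

```ruby
# Walkthrough of the percentile-based threshold math shown in the diff.
# The similarity values are invented for illustration.
similarities = [0.91, 0.42, 0.78, 0.66, 0.85]
percentile_val = 20 # the default when @threshold is not a { percentile: n } hash

sorted_sims = similarities.sort                                   # [0.42, 0.66, 0.78, 0.85, 0.91]
index = ((sorted_sims.size - 1) * (percentile_val / 100.0)).round # (4 * 0.2).round => 1
dynamic_val = sorted_sims[index]                                  # 0.66

# Clamped to the [0.3, 0.95] guardrail, as in the diff
threshold = dynamic_val.clamp(0.3, 0.95)
puts threshold
```

Taking the 20th percentile means roughly the lowest fifth of adjacent-group similarities fall below the threshold, so about one boundary in five becomes a chunk split; the clamp keeps a degenerate document from producing a threshold that splits everywhere or nowhere.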
data/lib/semantic_chunker.rb CHANGED
@@ -5,7 +5,7 @@ require 'json'
 require 'net/http'
 
 # 2. Require the version and base modules
-require_relative 'semantic_chunker/version'
+require_relative 'semantic_chunker/version'
 
 # 3. Require the internal logic
 require_relative 'semantic_chunker/adapters/base'
metadata CHANGED
@@ -1,15 +1,43 @@
 --- !ruby/object:Gem::Specification
 name: semantic_chunker
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.6.3
 platform: ruby
 authors:
 - Daniele Frisanco
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2026-01-
+date: 2026-01-08 00:00:00.000000000 Z
 dependencies:
+- !ruby/object:Gem::Dependency
+  name: pragmatic_segmenter
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.3'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.3'
+- !ruby/object:Gem::Dependency
+  name: matrix
+  requirement: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
+  type: :runtime
+  prerelease: false
+  version_requirements: !ruby/object:Gem::Requirement
+    requirements:
+    - - "~>"
+      - !ruby/object:Gem::Version
+        version: '0.4'
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
@@ -53,40 +81,32 @@ dependencies:
     - !ruby/object:Gem::Version
       version: '0'
 - !ruby/object:Gem::Dependency
-  name:
+  name: webmock
   requirement: !ruby/object:Gem::Requirement
     requirements:
-    - - "
-      - !ruby/object:Gem::Version
-        version: '0.3'
-  type: :runtime
-  prerelease: false
-  version_requirements: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
-      - !ruby/object:Gem::Version
-        version: '0.3'
-- !ruby/object:Gem::Dependency
-  name: matrix
-  requirement: !ruby/object:Gem::Requirement
-    requirements:
-    - - "~>"
+    - - ">="
       - !ruby/object:Gem::Version
-        version: '0
-  type: :
+        version: '0'
+  type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
-    - - "
+    - - ">="
     - !ruby/object:Gem::Version
-      version: '0
-description:
+      version: '0'
+description: A powerful tool for RAG (Retrieval-Augmented Generation) that splits
+  text into chunks based on semantic meaning rather than just character counts. Supports
+  sliding windows, adaptive buffering, and dynamic percentile-based thresholding.
 email:
 - daniele.frisanco@gmail.com
-executables:
+executables:
+- semantic_chunker
 extensions: []
 extra_rdoc_files: []
 files:
+- CHANGELOG.md
+- README.md
+- bin/semantic_chunker
 - lib/semantic_chunker.rb
 - lib/semantic_chunker/adapters/base.rb
 - lib/semantic_chunker/adapters/hugging_face_adapter.rb
@@ -97,7 +117,13 @@ files:
 homepage: https://github.com/danielefrisanco/semantic_chunker
 licenses:
 - MIT
-metadata:
+metadata:
+  homepage_uri: https://github.com/danielefrisanco/semantic_chunker
+  source_code_uri: https://github.com/danielefrisanco/semantic_chunker
+  changelog_uri: https://github.com/danielefrisanco/semantic_chunker/blob/main/CHANGELOG.md
+  bug_tracker_uri: https://github.com/danielefrisanco/semantic_chunker/issues
+  documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.3
+  allowed_push_host: https://rubygems.org
 post_install_message:
 rdoc_options: []
 require_paths:
@@ -106,7 +132,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version:
+      version: 3.0.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
@@ -116,5 +142,5 @@ requirements: []
 rubygems_version: 3.3.26
 signing_key:
 specification_version: 4
-summary:
+summary: Semantic text chunking using embeddings and dynamic thresholding.
 test_files: []
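The dependency and metadata entries in the spec above are the serialized form of ordinary gemspec declarations. Below is a rough sketch of equivalent declarations reconstructed from this diff; the gem's actual `.gemspec` file is not part of the diff, so treat every line as an approximation:

```ruby
# Sketch only: reconstructed from the metadata diff, not the real .gemspec.
spec = Gem::Specification.new do |s|
  s.name    = "semantic_chunker"
  s.version = "0.6.3"
  s.summary = "Semantic text chunking using embeddings and dynamic thresholding."

  # Runtime dependencies added in this release
  s.add_dependency "pragmatic_segmenter", "~> 0.3"
  s.add_dependency "matrix", "~> 0.4"

  # webmock is now a development dependency rather than a runtime one
  s.add_development_dependency "webmock"

  s.metadata = {
    "source_code_uri"   => "https://github.com/danielefrisanco/semantic_chunker",
    "allowed_push_host" => "https://rubygems.org"
  }
end

puts spec.name
```

Declaring `matrix` explicitly matters because the Matrix/Vector classes were removed from Ruby's default gems in 3.1, so the `Vector` math in chunker.rb needs the gem on modern Rubies.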