semantic_chunker 0.5.3 → 0.6.3

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '0782ee5ae0f80488a985b3c12afad2eb95252ecd45849ed98a8446aef4dbfc66'
- data.tar.gz: 988a17459d404db90f460527d105e00579a44a442deabeba3c3e6462a4e440de
+ metadata.gz: 9eb71c3c0285ded52be28cc83f18c2590040bcac489bf07963134b9b877c7dd6
+ data.tar.gz: 2cbb2ad4565519fd068b3a1eb6805e7989bd7d78886884b311faa6c4109193bc
  SHA512:
- metadata.gz: 29b84d713798dabc248986ad2040da8bffe029c74dcba9e7f35aef783a8ac5b1b944c4ebdba1f1af50c79b78d92a9fb06c66c9f71a901a9c457032edd1508865
- data.tar.gz: 26a6d3ac345c0a6d88cbffc7877a2fd387293c75d81d2cb85771b2d8f66778645944c46608f9de405eb787a4c48546f6efe859790963c129d2dfdd1ded8dcd32
+ metadata.gz: 2210e23a05cc4ed601528f0c1b877d02c13d9b3da77486a0fa5845a799fd02815f21f9ecd9cd52bf1c4bcb87aff2c733914016e3efba9c1454df992ea0161fa7
+ data.tar.gz: 2541b7fcb705b410444e109979a95cc202d660ec9febc0d1bd4e1116173b99a0d1b75a4a88c204a4e590d2a20e84888c3e75bd8174f43e18b2a19201098b756e
data/CHANGELOG.md ADDED
@@ -0,0 +1,63 @@
+ # Changelog
+
+ All notable changes to this project will be documented in this file.
+
+ ## [0.6.2] - 2026-01-07
+
+ ### Added
+
+ * **Command Line Interface (CLI)**: Introduced `bin/semantic_chunker`, allowing users to chunk files or piped text directly from the terminal.
+
+ * **JSON Output**: Added a `--format json` flag to the CLI for easy integration with Python, Node.js, and other data pipelines.
+
+ * **Net::HTTP Timeouts**: Added `open_timeout` and `read_timeout` to the Hugging Face adapter to prevent application hangs during network instability.
+
+ * **Exponential Backoff**: Implemented a retry strategy for the Hugging Face API that waits progressively longer if the model is currently "loading" or "warming up."
+
+ * **Unit Testing Suite**: Established an RSpec test suite using **WebMock** to simulate API responses and verify retry/timeout logic without making real network calls.
+
+ ### Changed
+
+ * **Hugging Face Resilience**: Improved the adapter to handle transient 503 errors and "model cold start" scenarios more gracefully using the `X-Wait-For-Model` header.
+
+ * **CLI Development**: Added local load-path handling so the CLI can be run during development without the gem being installed globally.
+
+ ### Fixed
+
+ * **Unstable Network Hangs**: Fixed an issue where a slow response from the embedding provider could block the Ruby process indefinitely.
+
+ ## [0.6.0] - 2026-01-07
+
+ ### Added
+ - **Dynamic Thresholding**: Introduced model-agnostic splitting logic. The chunker now adapts to the specific "density" of a document's vector space.
+ - **Auto Mode**: Use `threshold: :auto` to automatically calculate the optimal split point based on the document's 15th percentile of similarity.
+ - **Percentile Mode**: Use `threshold: { percentile: 10 }` for fine-grained control over how sensitive the topic-shift detection should be.
+ - **Clamping Logic**: Added guardrails to dynamic thresholds (clamped between `0.3` and `0.95`) to prevent hyper-splitting in repetitive documents.
+
+ ### Fixed
+ - **Ruby 3.0 Compatibility**: Resolved CI/CD issues and Bundler version conflicts to ensure full support for Ruby 3.0.x.
+ - **Precision Indexing**: Improved percentile calculation using `round` logic to ensure accuracy in both short and long documents.
+
+ ### Summary of API Changes
+ The `threshold` parameter now accepts three types of input:
+
+ | Mode | Input | Best For... |
+ |------------|----------------------|----------------------------------------------------------------|
+ | **Static** | `0.82` (float) | Deterministic behavior with known models (e.g., OpenAI). |
+ | **Auto** | `:auto` | General purpose; handles E5/BGE/MiniLM models automatically. |
+ | **Percentile** | `{ percentile: 10 }` | Custom sensitivity; lower % = larger chunks, higher % = more splits. |
+
+ ---
+
+ ## [0.5.3] - 2025-10-08
+ ### Added
+ - **Pragmatic Segmenter Integration**: Replaced basic regex splitting with `pragmatic_segmenter` for multilingual and context-aware sentence boundary detection.
+ - **Language Support**: Added `segmenter_options` to allow users to specify document language (e.g., `hy`, `jp`, `en`) and type (e.g., `pdf`).
+
+ ## [0.2.0] - 2026-01-06
+ ### Added
+ - **Centroid Comparison:** Chunks now split based on the average semantic meaning of the entire current group rather than just the previous sentence.
+ - **Sliding Buffer Window:** Added `buffer_size` to enrich sentence embeddings with surrounding context.
+ - **Adaptive Buffering:** Introduced `:auto` mode for `buffer_size`.
+ - **Hard Size Limits:** Added `max_chunk_size` to force splits when a topic exceeds character limits.
data/README.md ADDED
@@ -0,0 +1,429 @@
+ # Semantic Chunker
+
+ [![Gem Version](https://badge.fury.io/rb/semantic_chunker.svg)](https://badge.fury.io/rb/semantic_chunker)
+
+ A Ruby gem for splitting long texts into semantically related chunks. This is useful for preparing text for language models, where you need to feed a model contextually relevant information.
+
+ ## What is Semantic Chunking?
+
+ Semantic chunking is a technique for splitting text based on meaning. Instead of splitting text by a fixed number of words or sentences, this gem groups sentences that are semantically related.
+
+ It works by:
+ 1. Splitting the text into individual sentences.
+ 2. Generating a vector embedding for each sentence using a configurable provider (e.g., OpenAI, Hugging Face).
+ 3. Comparing the new sentence's windowed embedding to the **centroid (average) of the current chunk's embeddings**.
+ 4. Starting a new chunk when the similarity between the new sentence and the chunk's centroid falls below a threshold, which prevents topic drift (sketched below).
+ 5. Enhancing the decision with a **buffer window**, which considers multiple sentences at a time to make more robust choices.
+
+ This results in chunks of text that are topically coherent.
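+
+ To make the split decision concrete, here is a minimal sketch of the centroid comparison in steps 3-4, using Ruby's `matrix` library. It illustrates the idea only; the gem's internal method names may differ:
+
+ ```ruby
+ require 'matrix'
+
+ # Embeddings of the sentences already in the current chunk (toy values)
+ chunk_vectors = [Vector[0.9, 0.1], Vector[0.8, 0.2]]
+ new_vector    = Vector[0.1, 0.9]
+
+ # Centroid: the element-wise average of the chunk's vectors
+ centroid = chunk_vectors.inject(:+) / chunk_vectors.size.to_f
+
+ # Cosine similarity between the centroid and the candidate sentence
+ similarity = centroid.inner_product(new_vector) / (centroid.magnitude * new_vector.magnitude)
+
+ threshold = 0.82
+ puts similarity < threshold ? "Start a new chunk" : "Keep grouping"
+ ```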
+
+ ## Compatibility
+
+ This gem requires Ruby 3.0 or higher.
+
+ ## Installation
+
+ This gem relies on two key dependencies for its logic:
+
+ 1. **`matrix`**: Used for vector calculations and centroid math.
+
+ 2. **`pragmatic_segmenter`**: Used for rule-based sentence boundary detection (handling abbreviations, initials, and citations).
+
+ Add these lines to your application's Gemfile:
+
+ ```ruby
+ # Required for Ruby 3.1+
+ gem 'matrix'
+
+ # Required for high-quality sentence splitting
+ gem 'pragmatic_segmenter'
+
+ gem 'semantic_chunker'
+ ```
+
+ And then execute:
+
+     $ bundle install
+
+ Or install it yourself as:
+
+     $ gem install semantic_chunker
+
+ ## Usage
+
+ Here is a basic example of how to use `semantic_chunker`:
+
+ ```ruby
+ require 'semantic_chunker'
+
+ # 1. Configure the provider.
+ # You can configure the provider globally; this is useful
+ # in a Rails initializer, for example.
+ SemanticChunker.configure do |config|
+   config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
+     api_key: ENV.fetch("HUGGING_FACE_API_KEY"),
+     model: "sentence-transformers/all-MiniLM-L6-v2"
+   )
+ end
+
+ # 2. Create a chunker and process your text
+ chunker = SemanticChunker::Chunker.new(
+   threshold: 0.8,
+   buffer_size: :auto,
+   max_chunk_size: 1000
+ )
+ text = "Your very long document text goes here. It can contain multiple paragraphs and topics. The chunker will split it into meaningful parts."
+ chunks = chunker.chunks_for(text)
+
+ # chunks is an array of strings that preserve the
+ # original formatting and whitespace.
+ chunks.each_with_index do |chunk, i|
+   puts "Chunk #{i + 1}:"
+   puts chunk
+   puts "---"
+ end
+ ```
+
+ ## Rails Integration
+
+ For Rails applications, here is a recommended setup:
+
+ ### 1. Initializer
+
+ Create an initializer to configure the gem globally. This is where you should set up your embedding provider using Rails credentials.
+
+ ```ruby
+ # config/initializers/semantic_chunker.rb
+ SemanticChunker.configure do |config|
+   config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
+     api_key: Rails.application.credentials.dig(:hugging_face, :api_key),
+     model: "sentence-transformers/all-MiniLM-L6-v2"
+   )
+ end
+ ```
+
+ ### 2. Model Usage
+
+ You can use the chunker within your models, for example, to chunk a document's content before saving or for indexing in a search engine.
+
+ ```ruby
+ # app/models/document.rb
+ class Document < ApplicationRecord
+   def semantic_chunks
+     chunker = SemanticChunker::Chunker.new
+     chunker.chunks_for(self.content)
+   end
+ end
+ ```
+
+ ### 3. Caching
+
+ To avoid re-embedding the same content, which can be slow and costly, consider implementing a caching strategy. You can cache the embeddings or the final chunks. Here is a simple example using `Rails.cache`:
+
+ ```ruby
+ # app/models/document.rb
+ class Document < ApplicationRecord
+   def semantic_chunks
+     Rails.cache.fetch("document_#{self.id}_chunks", expires_in: 12.hours) do
+       chunker = SemanticChunker::Chunker.new
+       chunker.chunks_for(self.content)
+     end
+   end
+ end
+ ```
+
+ ## Configuration
+
+ ### Sentence Splitting (Pragmatic Segmenter)
+
+ This gem uses `pragmatic_segmenter` for high-quality sentence splitting. You can pass options directly to it using the `segmenter_options` hash during chunker initialization. This is useful for handling different languages or document types.
+
+ The following options are available:
+ - `language`: Specifies the language of the text (e.g., `'en'` for English, `'hy'` for Armenian).
+ - `doc_type`: Optimizes segmentation for specific document formats (e.g., `'pdf'`).
+ - `clean`: When `false`, disables the preliminary text cleaning process.
+
+ **Examples:**
+
+ ```ruby
+ # Example 1: Processing an Armenian PDF
+ chunker = SemanticChunker::Chunker.new(
+   segmenter_options: { language: 'hy', doc_type: 'pdf' }
+ )
+
+ # Example 2: Disabling text cleaning to keep the raw input intact
+ chunker = SemanticChunker::Chunker.new(
+   segmenter_options: { clean: false }
+ )
+ ```
+
+ ### Global Configuration
+
+ You can configure the embedding provider globally, which is useful in frameworks like Rails.
+
+ ```ruby
+ # config/initializers/semantic_chunker.rb
+ SemanticChunker.configure do |config|
+   config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
+     api_key: ENV.fetch("HUGGING_FACE_API_KEY"),
+     model: "sentence-transformers/all-MiniLM-L6-v2"
+   )
+ end
+ ```
+
+ ### Per-instance Configuration
+
+ You can also pass a provider directly to the `Chunker` instance. This will override any global configuration.
+
+ ```ruby
+ provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(api_key: "your-key")
+ chunker = SemanticChunker::Chunker.new(embedding_provider: provider)
+ ```
+
+ ### Threshold
+
+ You can configure the similarity threshold. The default static value is `0.82`; since v0.6.0 the threshold can also be resolved dynamically (see below).
+
+ > **Note:** The default value is optimized for the `sentence-transformers/all-MiniLM-L6-v2` model. You may need to adjust this value significantly for other models, especially those with different embedding dimensions (e.g., OpenAI's `text-embedding-3-large`).
+
+ 1. A higher threshold (e.g., `0.95`) requires very high similarity to keep sentences together, resulting in more, smaller chunks.
+
+ 2. A lower threshold (e.g., `0.50`) is more forgiving, resulting in fewer, larger chunks.
+
+ ```ruby
+ # Lower threshold, fewer chunks
+ chunker = SemanticChunker::Chunker.new(threshold: 0.7)
+
+ # Higher threshold, more chunks
+ chunker = SemanticChunker::Chunker.new(threshold: 0.9)
+ ```
+
+ ### Dynamic Thresholding (v0.6.0)
+
+ With the introduction of **Dynamic Thresholding**, SemanticChunker is now model-agnostic. It automatically adapts to the vector density of different embedding models (e.g., OpenAI, E5, BGE, or Hugging Face).
+
+ ### Threshold Modes
+
+ | Mode | Syntax | Description |
+ | - | - | - |
+ | Static | `0.82` | Splits when similarity drops below a fixed number. Use this if you have a specific model tuned to a known threshold. |
+ | Auto | `:auto` | (Default) Calculates the 15th percentile of similarities in the document and splits at the "valleys." |
+ | Percentile | `{ percentile: 10 }` | Advanced control. A lower percentile creates fewer, larger chunks; a higher percentile creates more, smaller chunks. |
+
+ ### Which one should I use?
+
+ * **Use `:auto`** if you are swapping models frequently or using open-source models from Hugging Face. It prevents the "one giant chunk" bug that happens when models have low similarity ranges.
+
+ * **Use a static number** if you require strictly deterministic behavior across different documents and know your model's distribution. All three modes are shown side by side below.
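+
+ The three modes side by side, using the constructor arguments documented above:
+
+ ```ruby
+ # Static: split below a fixed similarity score
+ chunker = SemanticChunker::Chunker.new(threshold: 0.82)
+
+ # Auto: derive the split point from the document's own similarity distribution
+ chunker = SemanticChunker::Chunker.new(threshold: :auto)
+
+ # Percentile: tune sensitivity yourself (a lower value yields fewer, larger chunks)
+ chunker = SemanticChunker::Chunker.new(threshold: { percentile: 10 })
+ ```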
+
+ ### Buffer Windows (Buffer Size)
+
+ The `buffer_size` parameter defines a sliding "context window." Instead of embedding a single sentence in isolation, the chunker combines a sentence with its neighbors. This "semantic smoothing" prevents false splits caused by short sentences or pronouns (like "He" or "It") that lack context. A simplified sketch of the windowing follows the example below.
+
+ * **0**: No buffer. Each sentence is embedded exactly as written. Best for very long, self-contained paragraphs.
+ * **1 (Default)**: Looks 1 sentence back and 1 sentence forward. For sentence $i$, the embedding represents $S_{i-1} + S_i + S_{i+1}$.
+ * **2**: Looks 2 sentences back and 2 forward. This creates a large 5-sentence context for every comparison.
+ * **`:auto`**: The chunker analyzes the density of your text and automatically selects the best window:
+   * **Short sentences** (avg < 60 chars): Uses `buffer_size: 2` (captures conversational flow).
+   * **Medium sentences** (avg 60–150 chars): Uses `buffer_size: 1` (standard).
+   * **Long sentences** (avg > 150 chars): Uses `buffer_size: 0` (high precision).
+
+ ```ruby
+ chunker = SemanticChunker::Chunker.new(buffer_size: :auto)
+ ```
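+
+ For intuition, this is roughly how a window of size 1 groups sentences before they are embedded. This is a simplified sketch of the idea, not the gem's internal implementation:
+
+ ```ruby
+ sentences = ["Ruby is elegant.", "It is dynamic.", "Cats purr loudly."]
+ buffer_size = 1
+
+ groups = sentences.each_index.map do |i|
+   from = [i - buffer_size, 0].max
+   to   = [i + buffer_size, sentences.size - 1].min
+   sentences[from..to].join(" ")
+ end
+ # => ["Ruby is elegant. It is dynamic.",
+ #     "Ruby is elegant. It is dynamic. Cats purr loudly.",
+ #     "It is dynamic. Cats purr loudly."]
+ ```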
+
+ ### Max Chunk Size
+
+ You can set a hard limit on the character length of a chunk using `max_chunk_size`. This is useful for ensuring chunks do not exceed the context window of a language model. When the limit is reached, a split is forced even if the sentences are semantically related. The default is `1500`.
+
+ ```ruby
+ chunker = SemanticChunker::Chunker.new(max_chunk_size: 1000)
+ ```
+
+ ### Adapters
+
+ The gem is designed to be extensible with different embedding providers. It currently ships with:
+
+ - `SemanticChunker::Adapters::OpenAIAdapter`: For OpenAI's embedding models.
+ - `SemanticChunker::Adapters::HuggingFaceAdapter`: For Hugging Face's embedding models.
+ - `SemanticChunker::Adapters::TestAdapter`: A simple adapter for testing purposes.
+
+ You can create your own adapter by writing a class that inherits from `SemanticChunker::Adapters::Base` and implements an `embed(sentences)` method.
+
+ The `embed` method must return an `Array` of `Array`s, where each inner array is an embedding (a list of floats). The `Chunker` will automatically handle the conversion of these arrays into `Vector` objects for similarity calculations.
+
+ For consistency, it's recommended to place your custom adapter class within the `SemanticChunker::Adapters` namespace, although this is not a strict requirement.
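+
+ Here is a minimal sketch of a custom adapter. The class name and the toy embedding logic are hypothetical; only the `Base` superclass and the `embed(sentences)` contract come from the documentation above:
+
+ ```ruby
+ module SemanticChunker
+   module Adapters
+     # Hypothetical adapter returning toy embeddings.
+     # A real adapter would call an embedding API or a local model here.
+     class MyToyAdapter < Base
+       def embed(sentences)
+         # Must return one Array of Floats per input sentence.
+         sentences.map do |sentence|
+           [sentence.length.to_f, sentence.count("aeiou").to_f, 1.0]
+         end
+       end
+     end
+   end
+ end
+
+ chunker = SemanticChunker::Chunker.new(
+   embedding_provider: SemanticChunker::Adapters::MyToyAdapter.new
+ )
+ ```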
256
+
257
+ ## Development & Testing
258
+
259
+ To run the tests, you'll need to install the development dependencies:
260
+
261
+ $ bundle install
262
+
263
+ ### Unit Tests
264
+
265
+ Run the unit tests with:
266
+
267
+ $ bundle exec rspec
268
+
269
+ ### Integration Tests
270
+
271
+ The integration tests use third-party APIs and require API keys.
272
+
273
+ **OpenAI**
274
+ ```bash
275
+ $ OPENAI_API_KEY="your-key" bundle exec ruby test_integration.rb
276
+ ```
277
+
278
+ **Hugging Face**
279
+ ```bash
280
+ $ HUGGING_FACE_API_KEY="your-key" bundle exec ruby test_hugging_face.rb
281
+ ```
282
+
283
+ ### Security Note: Handling API Keys
284
+
285
+ When using an adapter that requires an API key, **never hardcode your API keys** directly into your source code. To keep your application secure (especially if you are working on public repositories), use one of the following methods:
286
+
287
+ #### Using Rails Credentials (Recommended for Rails)
288
+
289
+ Store your key in your encrypted credentials file:
290
+ ```bash
291
+ bin/rails credentials:edit
292
+ ```
293
+
294
+ Then reference it in your initializer:
295
+
296
+ ```ruby
297
+ SemanticChunker.configure do |config|
298
+ config.provider = SemanticChunker::Adapters::HuggingFaceAdapter.new(
299
+ api_key: Rails.application.credentials.dig(:hugging_face, :api_key)
300
+ )
301
+ end
302
+ ```
303
+
304
+
305
+ #### Using Environment Variables
306
+
307
+ Alternatively, use a gem like dotenv and fetch the key from the environment:
308
+
309
+ ```ruby
310
+ api_key = ENV.fetch("YOUR_API_KEY") { raise "Missing API Key" }
311
+ ```
312
+
313
+
+ ## Troubleshooting
+
+ ### Matrix Dependency (Ruby 3.1+)
+
+ Since Ruby 3.1, the `matrix` library has been distributed as a bundled gem rather than as part of the standard library.
+
+ * **If you are on Ruby 3.1, 3.2, or 3.3:** You must include `gem 'matrix'` in your Gemfile.
+
+ * **If you are on Ruby 3.0:** The library is built in. If you see a "duplicate dependency" error, ensure you are not manually adding `gem 'matrix'` to your Gemfile, as the system version will take precedence.
+
+ ### Hugging Face "Model Loading"
+
+ If you receive a `503 Service Unavailable` error when using the Hugging Face adapter, it usually means the model is being loaded onto the server for the first time.
+
+ * **Solution:** Wait 30 seconds and try again. The `HuggingFaceAdapter` is designed to be lightweight, but serverless endpoints require a "warm-up" period; as of v0.6.2 the adapter also retries automatically with exponential backoff.
+
+ ### Encoding Issues
+
+ If your text contains complex Unicode or non-UTF-8 characters, `pragmatic_segmenter` may behave unexpectedly.
+
+ * **Solution:** Ensure your input string is UTF-8 encoded: `text.encode('UTF-8', invalid: :replace, undef: :replace)`.
+
+ ## Command Line Interface (CLI)
+
+ SemanticChunker includes a powerful CLI that allows you to chunk files or piped text directly from your terminal. This is ideal for quick testing or integrating with non-Ruby applications.
+
+ ### Installation
+
+ The CLI is included when you install the gem:
+
+ ```bash
+ gem install semantic_chunker
+ ```
+
+ ### Usage
+
+ The CLI will automatically look for your `HUGGING_FACE_API_KEY` or `OPENAI_API_KEY` in your environment or a `.env` file.
+
+ ```bash
+ # Basic usage with automatic thresholding
+ semantic_chunker --threshold auto path/to/document.txt
+
+ # Specify a static threshold and max chunk size
+ semantic_chunker -t 0.85 -m 1000 document.txt
+
+ # Pipe text from another command
+ echo "Long text here..." | semantic_chunker -t auto
+ ```
+
+ ### JSON Output
+
+ For integration with other languages (Python, Node.js) or databases, you can output the result as structured JSON:
+
+ ```bash
+ semantic_chunker --format json document.txt
+ ```
+
+ **Example JSON Output:**
+
+ ```json
+ {
+   "metadata": {
+     "source": "document.txt",
+     "chunk_count": 2,
+     "threshold_used": "auto"
+   },
+   "chunks": [
+     {
+       "index": 0,
+       "content": "First semantic topic...",
+       "size": 245
+     },
+     {
+       "index": 1,
+       "content": "Second semantic topic...",
+       "size": 180
+     }
+   ]
+ }
+ ```
+
+ ### Options
+
+ | Flag | Long | Description | Default |
+ | - | - | - | - |
+ | `-t` | `--threshold` | Similarity threshold (float or `auto`) | `auto` |
+ | `-m` | `--max-size` | Hard limit for character count per chunk | `1500` |
+ | `-b` | `--buffer` | Context window size (int or `auto`) | `auto` |
+ | `-f` | `--format` | Output format (`text` or `json`) | `text` |
+ | `-v` | `--version` | Show version info | - |
+
+ ## Reliability & Resilience
+
+ The Hugging Face adapter is built for production-grade reliability:
+ - **Exponential Backoff**: Automatically retries requests if the model is warming up or the API is busy.
+ - **Smart Timeouts**: Includes connection and read timeouts to prevent your application from hanging.
+ - **Auto-Wait**: Uses the `X-Wait-For-Model` header to ensure stable results on the Inference API.
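+
+ The retry schedule doubles the wait on each attempt. Based on the adapter source included in this release (`INITIAL_BACKOFF = 2`, `MAX_RETRIES = 3`), the waits work out as follows:
+
+ ```ruby
+ # wait_time = INITIAL_BACKOFF * (2 ** retry_count)
+ 3.times.map { |retry_count| 2 * (2**retry_count) }
+ # => [2, 4, 8] seconds before the first, second, and third retry
+ ```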
+
+ ## 🚀 Roadmap to v1.0.0
+ - [x] Adaptive Dynamic Thresholding
+ - [x] CLI with JSON output
+ - [x] Robust error handling and retries
+ - [ ] **Next:** Local embedding cache (reduce API costs)
+ - [ ] **Next:** Drift protection (anchor-sentence comparison)
+
+ ## Contributing
+
+ Bug reports and pull requests are welcome on GitHub at https://github.com/danielefrisanco/semantic_chunker.
+
+ ## License
+
+ The gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
data/bin/semantic_chunker ADDED
@@ -0,0 +1,70 @@
+ #!/usr/bin/env ruby
+
+ # Add the local lib directory to the load path
+ $LOAD_PATH.unshift(File.expand_path('../lib', __dir__))
+
+ require 'semantic_chunker'
+ require 'optparse'
+ require 'dotenv'
+ Dotenv.load
+
+ options = {
+   threshold: :auto,
+   max_size: 1500,
+   buffer: :auto
+ }
+
+ OptionParser.new do |opts|
+   opts.banner = "Usage: semantic_chunker [options] <file>"
+   opts.on("-t", "--threshold VAL", "Threshold (float, :auto)") { |v| options[:threshold] = v == 'auto' ? :auto : v.to_f }
+   opts.on("-m", "--max-size VAL", Integer, "Max character size") { |v| options[:max_size] = v }
+   opts.on("-f", "--format FORMAT", [:text, :json], "Output format (text, json)") { |v| options[:format] = v }
+   opts.on("-b", "--buffer VAL", "Buffer size (int, :auto)") { |v| options[:buffer] = v == 'auto' ? :auto : v.to_i }
+   opts.on("-v", "--version", "Show version") do
+     puts SemanticChunker::VERSION
+     exit
+   end
+ end.parse!
+
+ input_file = ARGV[0]
+ text = input_file ? File.read(input_file) : ARGF.read
+
+ if text.nil? || text.empty?
+   puts "Error: No input text provided."
+   exit 1
+ end
+
+ # Pick the embedding provider from whichever API key is present
+ provider = if ENV['HUGGING_FACE_API_KEY']
+   SemanticChunker::Adapters::HuggingFaceAdapter.new(api_key: ENV['HUGGING_FACE_API_KEY'])
+ elsif ENV['OPENAI_API_KEY']
+   SemanticChunker::Adapters::OpenAIAdapter.new(api_key: ENV['OPENAI_API_KEY'])
+ else
+   puts "Error: No API key found (HUGGING_FACE_API_KEY or OPENAI_API_KEY)."
+   exit 1
+ end
+
+ # Build the chunker from the parsed CLI options
+ chunker = SemanticChunker::Chunker.new(
+   embedding_provider: provider,
+   threshold: options[:threshold],
+   max_chunk_size: options[:max_size],
+   buffer_size: options[:buffer]
+ )
+
+ chunks = chunker.chunks_for(text)
+
+ if options[:format] == :json
+   puts JSON.pretty_generate({
+     metadata: {
+       source: input_file || "stdin",
+       chunk_count: chunks.size,
+       threshold_used: options[:threshold]
+     },
+     chunks: chunks.map.with_index { |c, i| { index: i, content: c, size: c.length } }
+   })
+ else
+   chunks.each_with_index do |chunk, i|
+     puts "--- Chunk #{i + 1} ---"
+     puts chunk
+     puts "\n"
+   end
+ end
data/lib/semantic_chunker/adapters/hugging_face_adapter.rb CHANGED
@@ -1,8 +1,18 @@
  # lib/semantic_chunker/adapters/hugging_face_adapter.rb
+ require 'net/http'
+ require 'json'
+ require 'uri'
+
  module SemanticChunker
    module Adapters
      class HuggingFaceAdapter < Base
        BASE_URL = "https://router.huggingface.co/hf-inference/models/%{model}"
+
+       # Configuration for reliability
+       MAX_RETRIES = 3
+       INITIAL_BACKOFF = 2 # seconds
+       OPEN_TIMEOUT = 5 # seconds to open a connection
+       READ_TIMEOUT = 60 # seconds to wait for embeddings

        def initialize(api_key:, model: 'intfloat/multilingual-e5-large')
          @api_key = api_key
@@ -12,23 +22,20 @@ module SemanticChunker
        end

        def embed(sentences)
-         response = post_request(sentences)
-
-         unless response.content_type == "application/json"
-           raise "HuggingFace Error: Expected JSON, got #{response.content_type}. Body: #{response.body}"
-         end
+         retry_count = 0

-         parsed = JSON.parse(response.body)
-
-         if response.is_a?(Net::HTTPSuccess)
-           parsed
-         else
-           if parsed.is_a?(Hash) && parsed["error"]&.include?("loading")
-             puts "Model warming up... retrying in 10s"
-             sleep 10
-             return embed(sentences)
+         begin
+           response = post_request(sentences)
+           handle_response(response)
+         rescue => e
+           if retryable?(e, retry_count)
+             wait_time = INITIAL_BACKOFF * (2**retry_count)
+             puts "HuggingFace: Transient error (#{e.message}). Retrying in #{wait_time}s..."
+             sleep wait_time
+             retry_count += 1
+             retry
            end
-           raise "HuggingFace Error: #{parsed['error'] || parsed}"
+           raise e
          end
        end

@@ -40,17 +47,42 @@ module SemanticChunker

          request["Authorization"] = "Bearer #{@api_key}"
          request["Content-Type"] = "application/json"
-         request["X-Wait-For-Model"] = "true"
+         request["X-Wait-For-Model"] = "true" # Tells HF to wait for the model to load

-         request.body = {
-           inputs: sentences
-         }.to_json
+         request.body = { inputs: sentences }.to_json

          Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) do |http|
-           http.read_timeout = 60
+           http.open_timeout = OPEN_TIMEOUT
+           http.read_timeout = READ_TIMEOUT
            http.request(request)
          end
        end
+
+       def handle_response(response)
+         unless response.content_type == "application/json"
+           raise "HuggingFace Error: Expected JSON, got #{response.content_type}."
+         end
+
+         parsed = JSON.parse(response.body)
+
+         if response.is_a?(Net::HTTPSuccess)
+           parsed
+         elsif parsed.is_a?(Hash) && parsed["error"]&.include?("loading")
+           # This specifically triggers a retry for model warm-ups
+           raise "Model is still loading"
+         else
+           raise "HuggingFace API Error: #{parsed['error'] || response.body}"
+         end
+       end
+
+       def retryable?(error, count)
+         return false if count >= MAX_RETRIES
+
+         # Retry on model-loading errors and network timeouts
+         error.message.include?("loading") ||
+           error.is_a?(Net::ReadTimeout) ||
+           error.is_a?(Net::OpenTimeout)
+       end
      end
    end
- end
+ end
data/lib/semantic_chunker/chunker.rb CHANGED
@@ -31,7 +31,10 @@ module SemanticChunker
      # Step 3: Embed the groups, not the raw sentences
      group_embeddings = @provider.embed(context_groups)

-     calculate_groups(sentences, group_embeddings)
+     # Resolve the threshold dynamically if requested
+     resolved_threshold = resolve_threshold(group_embeddings)
+
+     calculate_groups(sentences, group_embeddings, resolved_threshold)
    end

    private
@@ -65,7 +68,7 @@ module SemanticChunker
      ps.segment
    end

-   def calculate_groups(sentences, embeddings)
+   def calculate_groups(sentences, embeddings, resolved_threshold)
      chunks = []
      current_chunk_text = [sentences[0]]
      current_chunk_vectors = [Vector[*embeddings[0]]]
@@ -74,22 +77,17 @@ module SemanticChunker
        new_sentence = sentences[i]
        new_vec = Vector[*embeddings[i]]

-       # 1. Calculate Centroid
        centroid = current_chunk_vectors.inject(:+) / current_chunk_vectors.size.to_f
        sim = cosine_similarity(centroid, new_vec)

-       # 2. Check Constraints: Similarity OR Size
-       # We calculate the potential size of the chunk if we added this sentence
        potential_size = current_chunk_text.join(" ").length + new_sentence.length + 1

-       if sim < @threshold || potential_size > @max_chunk_size
-         # Split if the topic changed OR the chunk is getting too fat
+       # Use the resolved threshold instead of @threshold
+       if sim < resolved_threshold || potential_size > @max_chunk_size
          chunks << current_chunk_text.join(" ")
-
          current_chunk_text = [new_sentence]
          current_chunk_vectors = [new_vec]
        else
-         # Keep grouping
          current_chunk_text << new_sentence
          current_chunk_vectors << new_vec
        end
@@ -98,10 +96,43 @@ module SemanticChunker
      chunks << current_chunk_text.join(" ")
      chunks
    end
-
    def cosine_similarity(v1, v2)
-     return 0.0 if v1.magnitude.zero? || v2.magnitude.zero?
-     v1.inner_product(v2) / (v1.magnitude * v2.magnitude)
+     # Ensure we are working with Vectors
+     v1 = Vector[*v1] unless v1.is_a?(Vector)
+     v2 = Vector[*v2] unless v2.is_a?(Vector)
+
+     mag1 = v1.magnitude
+     mag2 = v2.magnitude
+
+     return 0.0 if mag1.zero? || mag2.zero?
+     v1.inner_product(v2) / (mag1 * mag2)
+   end
+
+   def resolve_threshold(embeddings)
+     return @threshold if @threshold.is_a?(Numeric)
+     return DEFAULT_THRESHOLD if embeddings.size < 2
+
+     similarities = []
+     (0...embeddings.size - 1).each do |i|
+       # Wrap the raw arrays here; cosine_similarity leaves
+       # values that are already Vectors untouched.
+       v1 = Vector[*embeddings[i]]
+       v2 = Vector[*embeddings[i + 1]]
+       similarities << cosine_similarity(v1, v2)
+     end
+
+     return DEFAULT_THRESHOLD if similarities.empty?
+
+     percentile_val = @threshold.is_a?(Hash) ? @threshold[:percentile] : 20
+
+     # Use (size - 1) for the index to avoid going out of bounds on small lists
+     sorted_sims = similarities.sort
+     index = ((sorted_sims.size - 1) * (percentile_val / 100.0)).round
+
+     dynamic_val = sorted_sims[index]
+
+     # Guardrail: clamp to prevent hyper-splitting or never-splitting.
+     # 0.3 is a safe floor for "totally different"; 0.95 is a safe ceiling.
+     dynamic_val.clamp(0.3, 0.95)
    end
  end
end
data/lib/semantic_chunker/version.rb CHANGED
@@ -1,3 +1,3 @@
  module SemanticChunker
-   VERSION = "0.5.3"
+   VERSION = "0.6.3"
  end
data/lib/semantic_chunker.rb CHANGED
@@ -5,7 +5,7 @@ require 'json'
  require 'net/http'

  # 2. Require the version and base modules
- require_relative 'semantic_chunker/version' if File.exist?('lib/semantic_chunker/version.rb')
+ require_relative 'semantic_chunker/version'

  # 3. Require the internal logic
  require_relative 'semantic_chunker/adapters/base'
metadata CHANGED
@@ -1,15 +1,43 @@
  --- !ruby/object:Gem::Specification
  name: semantic_chunker
  version: !ruby/object:Gem::Version
-   version: 0.5.3
+   version: 0.6.3
  platform: ruby
  authors:
  - Daniele Frisanco
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2026-01-07 00:00:00.000000000 Z
+ date: 2026-01-08 00:00:00.000000000 Z
  dependencies:
+ - !ruby/object:Gem::Dependency
+   name: pragmatic_segmenter
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.3'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.3'
+ - !ruby/object:Gem::Dependency
+   name: matrix
+   requirement: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.4'
+   type: :runtime
+   prerelease: false
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - "~>"
+       - !ruby/object:Gem::Version
+         version: '0.4'
  - !ruby/object:Gem::Dependency
    name: rake
    requirement: !ruby/object:Gem::Requirement
@@ -53,40 +81,32 @@ dependencies:
      - !ruby/object:Gem::Version
        version: '0'
  - !ruby/object:Gem::Dependency
-   name: pragmatic_segmenter
+   name: webmock
    requirement: !ruby/object:Gem::Requirement
      requirements:
-     - - "~>"
-       - !ruby/object:Gem::Version
-         version: '0.3'
-   type: :runtime
-   prerelease: false
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - "~>"
-       - !ruby/object:Gem::Version
-         version: '0.3'
- - !ruby/object:Gem::Dependency
-   name: matrix
-   requirement: !ruby/object:Gem::Requirement
-     requirements:
-     - - "~>"
+     - - ">="
        - !ruby/object:Gem::Version
-         version: '0.4'
-   type: :runtime
+         version: '0'
+   type: :development
    prerelease: false
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
-     - - "~>"
+     - - ">="
        - !ruby/object:Gem::Version
-         version: '0.4'
- description: Split long text into chunks based on semantic meaning.
+         version: '0'
+ description: A powerful tool for RAG (Retrieval-Augmented Generation) that splits
+   text into chunks based on semantic meaning rather than just character counts. Supports
+   sliding windows, adaptive buffering, and dynamic percentile-based thresholding.
  email:
  - daniele.frisanco@gmail.com
- executables: []
+ executables:
+ - semantic_chunker
  extensions: []
  extra_rdoc_files: []
  files:
+ - CHANGELOG.md
+ - README.md
+ - bin/semantic_chunker
  - lib/semantic_chunker.rb
  - lib/semantic_chunker/adapters/base.rb
  - lib/semantic_chunker/adapters/hugging_face_adapter.rb
@@ -97,7 +117,13 @@ files:
  homepage: https://github.com/danielefrisanco/semantic_chunker
  licenses:
  - MIT
- metadata: {}
+ metadata:
+   homepage_uri: https://github.com/danielefrisanco/semantic_chunker
+   source_code_uri: https://github.com/danielefrisanco/semantic_chunker
+   changelog_uri: https://github.com/danielefrisanco/semantic_chunker/blob/main/CHANGELOG.md
+   bug_tracker_uri: https://github.com/danielefrisanco/semantic_chunker/issues
+   documentation_uri: https://www.rubydoc.info/gems/semantic_chunker/0.6.3
+   allowed_push_host: https://rubygems.org
  post_install_message:
  rdoc_options: []
  require_paths:
@@ -106,7 +132,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
    - !ruby/object:Gem::Version
-     version: '0'
+     version: 3.0.0
  required_rubygems_version: !ruby/object:Gem::Requirement
    requirements:
    - - ">="
@@ -116,5 +142,5 @@ requirements: []
  rubygems_version: 3.3.26
  signing_key:
  specification_version: 4
- summary: Split long text into chunks based on semantic meaning.
+ summary: Semantic text chunking using embeddings and dynamic thresholding.
  test_files: []