RubyGems - semantic_text_chunker - Versions diffs - 0.1.0 → 0.2.0 - Mend

semantic_text_chunker 0.1.0 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

checksums.yaml +4 -4
data/README.md +45 -5
data/lib/semantic_text_chunker/boundary_detector.rb +16 -6
data/lib/semantic_text_chunker/chunk_builder.rb +7 -1
data/lib/semantic_text_chunker/chunker.rb +18 -5
data/lib/semantic_text_chunker/splitters/sentence_splitter.rb +20 -7
data/lib/semantic_text_chunker/splitters/structure_splitter.rb +69 -0
data/lib/semantic_text_chunker/version.rb +1 -1
metadata +3 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: bd5a74c09a20b7a87ae6d2a3c8a4cfda92315de9d4f18ffe54de76ee35b0db4f
-  data.tar.gz: cfe5098ea9fcd78e0fd8297755f2ac1c568d34656decf4b98b31a7c7f2e69c66
+  metadata.gz: ab921bdd7c9c704dbfbb565d42b5a2d6b4bf120d026e4fede72b97df7192508f
+  data.tar.gz: 3dce3cdcd0b84ec89eb296e2c12c7ba27eac8363675266bf7420927e8035a66f
 SHA512:
-  metadata.gz: 8cc50f43d68240887b1af0498ac50a320a3c77fa97656cc60ecb8b2baa4c829c381bb2e66763d53b65dbd32def3f54952b33fd9e0278bfa681b752e046aa1d27
-  data.tar.gz: 391aba7c0d6dc181fc29a9b383dd3f07f1196dd4f4e5487e724b0893f80a6ca9a207f9f9e6cb8c6f003a91d800bdd074feb000bed985ba656cab4289f2dbbc53
+  metadata.gz: 05dd40f830c2c7e43643b99ef7e739b87acd933cbf24bdf8f115df5121e044ab206abe84b73cfd1121d4a095dd45beb96eb5ff71415b6c2f054f7e79253faeca
+  data.tar.gz: f6b20bdf6f581e29885513f9aaa97982ce4031593065969766a93b5fec8c52b2525c16866d2df1f715c36bc8897c7e9b5e4543a3062db404922dcc1b041aa114

data/README.md CHANGED

@@ -82,13 +82,17 @@ embedder = SemanticTextChunker::Embedders::Null.new
 | `threshold`         | `0.75`  | Cosine similarity threshold for detecting boundaries |
 | `max_tokens`        | `512`   | Maximum tokens per chunk (estimated at ~4 chars/token) |
 | `overlap_sentences` | `2`     | Number of sentences to overlap between chunks        |
+| `respect_structure` | `true`  | Treat paragraph breaks and markdown headings as hard chunk boundaries |
+| `extra_abbreviations` | `[]`  | Additional abbreviations the sentence splitter should not split on |
 ```ruby
 chunks = SemanticTextChunker.chunk(text,
   embedder: embedder,
   threshold: 0.8,
   max_tokens: 1024,
-  overlap_sentences: 3
+  overlap_sentences: 3,
+  respect_structure: true,
+  extra_abbreviations: ["Inc", "Ltd"]
 )
 ```
@@ -135,12 +139,48 @@ end
 The base class provides a `cosine_similarity` method used for boundary detection.
+## Sentence Splitting
+Sentences are detected with punctuation-aware rules that:
+- Keep common abbreviations intact (`Mr.`, `Dr.`, `U.S.A.`, `e.g.`, etc.)
+- Keep decimal numbers intact (`3.14`, `v1.2.3`)
+- Split dialogue ending in a closing quote (`"Stop!" He ran.`)
+- Start new sentences on digits or opening quotes, not just capital letters
+To recognize domain-specific abbreviations, pass `extra_abbreviations` (also accepted
+directly by `SemanticTextChunker.chunk`):
+```ruby
+splitter = SemanticTextChunker::Splitters::SentenceSplitter.new(
+  extra_abbreviations: ["Inc", "Ltd", "cf", "al"]
+)
+splitter.split("Acme Inc. shipped it. Done.")
+# => ["Acme Inc. shipped it.", "Done."]
+```
+## Structure-Aware Chunking
+By default (`respect_structure: true`), the chunker respects document structure so that
+chunks never blur across obvious boundaries:
+- **Paragraph breaks** (blank lines) end a chunk — two paragraphs are never merged into one.
+- **Markdown headings** (`# ...` through `###### ...`) start a new section. A standalone
+  heading is attached to the content that follows it, so each section's chunk carries its
+  heading for context.
+- **Overlap never crosses a structural boundary**, so a section's chunk won't leak the tail
+  of the previous section.
+Semantic similarity and the token limit still apply *within* each structural block. Set
+`respect_structure: false` to disable this and chunk purely by similarity and token count.
 ## How It Works
-1. **Sentence splitting** - Text is split into sentences using punctuation-aware rules that handle abbreviations (Mr., Dr., U.S., etc.)
-2. **Embedding** - Each sentence is embedded using the configured embedder
-3. **Boundary detection** - Consecutive sentences are grouped. A new chunk boundary is created when the cosine similarity between the accumulated chunk embedding and the next sentence drops below the threshold, or when the token limit is exceeded
-4. **Chunk building** - Sentences are assembled into chunks with configurable overlap for context continuity
+1. **Structure splitting** - Text is broken into blocks on paragraph breaks and markdown headings, which become hard boundaries that chunks are never merged across (when `respect_structure` is enabled)
+2. **Sentence splitting** - Each block is split into sentences using punctuation-aware rules that handle abbreviations (Mr., Dr., U.S., etc.), decimals, and dialogue
+3. **Embedding** - Each sentence is embedded using the configured embedder
+4. **Boundary detection** - Consecutive sentences are grouped. A new chunk boundary is created at a structural boundary, when the cosine similarity between the accumulated chunk embedding and the next sentence drops below the threshold, or when the token limit is exceeded
+5. **Chunk building** - Sentences are assembled into chunks with configurable overlap for context continuity (overlap never crosses a structural boundary)
 ## License

data/lib/semantic_text_chunker/boundary_detector.rb CHANGED

@@ -1,31 +1,41 @@
+require "set"
 module SemanticTextChunker
   class BoundaryDetector
-    def initialize(sentences:, embeddings:, threshold:, max_tokens:, embedder:)
+    def initialize(sentences:, embeddings:, threshold:, max_tokens:, embedder:, forced: [])
       @sentences  = sentences
       @embeddings = embeddings
       @threshold  = threshold
       @max_tokens = max_tokens
       @embedder   = embedder
+      @forced     = forced.to_set
     end
     # Returns array of sentence indices where chunks end
     def boundaries
       return [] if @sentences.size <= 1
-      boundaries    = []
-      chunk_start   = 0
-      current_text  = ""
+      boundaries  = []
+      chunk_start = 0
       @sentences.each_with_index do |sentence, i|
         next if i == 0
+        # Hard structural boundary: the previous sentence ended a block, so the
+        # chunk must end there regardless of similarity or token count.
+        if @forced.include?(i - 1)
+          boundaries << (i - 1) unless boundaries.last == (i - 1)
+          chunk_start = i
+          next
+        end
         current_text = @sentences[chunk_start..i - 1].join(" ")
         next_text    = current_text + " " + sentence
         # Force boundary if adding this sentence exceeds token limit
         if tokens(next_text) > @max_tokens
           boundaries << i - 1
-          chunk_start  = i
-          current_text = ""
+          chunk_start = i
           next
         end

data/lib/semantic_text_chunker/chunk_builder.rb CHANGED

@@ -1,9 +1,12 @@
+require "set"
 module SemanticTextChunker
   class ChunkBuilder
-    def initialize(sentences:, boundaries:, overlap_sentences:)
+    def initialize(sentences:, boundaries:, overlap_sentences:, hard_boundaries: [])
       @sentences         = sentences
       @boundaries        = boundaries
       @overlap_sentences = overlap_sentences
+      @hard_boundaries   = hard_boundaries.to_set
     end
     def build
@@ -17,6 +20,9 @@ module SemanticTextChunker
       split_points.each_with_index do |boundary, idx|
         start = if idx == 0
           0
+        elsif @hard_boundaries.include?(prev_end)
+          # Don't carry overlap across a structural boundary.
+          prev_end + 1
         else
           # Overlap: go back N sentences from previous boundary
           [prev_end - @overlap_sentences + 1, 0].max
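
To see how the new `elsif` branch changes chunk start positions, here is a small sketch that computes only the index ranges. It rests on assumptions: the hunk does not show how `split_points` and `prev_end` are derived, so this version treats `split_points` as the boundaries plus the last sentence index and `prev_end` as the previous split point. The name `sketch_chunk_ranges` is hypothetical.

```ruby
require "set"

# Returns one index range per chunk.
def sketch_chunk_ranges(sentence_count:, boundaries:, overlap_sentences:, hard_boundaries: [])
  hard         = hard_boundaries.to_set
  # Assumption: the final chunk always ends at the last sentence.
  split_points = (boundaries + [sentence_count - 1]).uniq
  prev_end     = nil

  split_points.map do |boundary|
    start =
      if prev_end.nil?
        0
      elsif hard.include?(prev_end)
        # Previous chunk ended on a structural boundary: no overlap.
        prev_end + 1
      else
        # Soft boundary: reach back N sentences for context.
        [prev_end - overlap_sentences + 1, 0].max
      end
    prev_end = boundary
    (start..boundary)
  end
end

ranges = sketch_chunk_ranges(
  sentence_count:    6,
  boundaries:        [1, 3],
  overlap_sentences: 2,
  hard_boundaries:   [1]
)
p ranges
# => [0..1, 2..3, 2..5]
```

The second chunk starts cleanly at sentence 2 because index 1 is a hard boundary, while the third chunk reaches back two sentences across the soft boundary at index 3. This sketch only checks whether `prev_end` itself is hard. Whether a larger overlap window is clamped at an earlier hard boundary depends on code outside the hunk.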

data/lib/semantic_text_chunker/chunker.rb CHANGED

@@ -1,4 +1,5 @@
 require_relative "splitters/sentence_splitter"
+require_relative "splitters/structure_splitter"
 require_relative "embedders/base"
 require_relative "embedders/null"
 require_relative "boundary_detector"
@@ -11,19 +12,29 @@ module SemanticTextChunker
       embedder: Embedders::Null.new,
       threshold: 0.75,
       max_tokens: 512,
-      overlap_sentences: 2
+      overlap_sentences: 2,
+      respect_structure: true,
+      extra_abbreviations: []
     )
       @embedder          = embedder
       @threshold         = threshold
       @max_tokens        = max_tokens
       @overlap_sentences = overlap_sentences
-      @splitter          = Splitters::SentenceSplitter.new
+      @respect_structure = respect_structure
+      @splitter          = Splitters::SentenceSplitter.new(extra_abbreviations: extra_abbreviations)
+      @structure_splitter = Splitters::StructureSplitter.new(sentence_splitter: @splitter)
     end
     def chunk(text)
       return [] if text.nil? || text.strip.empty?
-      sentences  = @splitter.split(text)
+      if @respect_structure
+        sentences, hard = @structure_splitter.split(text)
+      else
+        sentences = @splitter.split(text)
+        hard      = []
+      end
       embeddings = @embedder.embed(sentences)
       boundaries = BoundaryDetector.new(
@@ -31,13 +42,15 @@ module SemanticTextChunker
         embeddings: embeddings,
         threshold:  @threshold,
         max_tokens: @max_tokens,
-        embedder:   @embedder
+        embedder:   @embedder,
+        forced:     hard
       ).boundaries
       ChunkBuilder.new(
         sentences:         sentences,
         boundaries:        boundaries,
-        overlap_sentences: @overlap_sentences
+        overlap_sentences: @overlap_sentences,
+        hard_boundaries:   hard
       ).build
     end

data/lib/semantic_text_chunker/splitters/sentence_splitter.rb CHANGED

@@ -2,18 +2,31 @@ module SemanticTextChunker
   module Splitters
     class SentenceSplitter
       ABBREVS = %w[Mr Mrs Dr Prof Sr Jr vs etc e.g i.e U.S U.K U.S.A Fig Vol No].freeze
-      ABBREV_PATTERN = /\b(#{ABBREVS.map { |a| Regexp.escape(a) }.join("|")})\.\s/
+      # Split after a terminator (optionally followed by a closing quote/bracket)
+      # and whitespace, when the next sentence starts with an opening quote,
+      # an uppercase letter, or a digit.
+      SPLIT_PATTERN = /(?<=[.?!]|[.?!]["')\]])\s+(?=["'(\[A-Z0-9])/
+      ABBREV_PLACEHOLDER = "__STC_ABBREV__".freeze
+      DECIMAL_PLACEHOLDER = "__STC_DEC__".freeze
+      def initialize(extra_abbreviations: [])
+        @abbrevs = (ABBREVS + extra_abbreviations).freeze
+        @abbrev_pattern = /\b(#{@abbrevs.map { |a| Regexp.escape(a) }.join("|")})\.\s/
+      end
       def split(text)
+        # Protect periods inside decimal numbers (e.g. 3.14, v1.2.3)
+        protected = text.gsub(/(\d)\.(\d)/) { "#{$1}#{DECIMAL_PLACEHOLDER}#{$2}" }
         # Temporarily replace abbreviation periods
-        protected = text.gsub(ABBREV_PATTERN) { "#{$1}__ABBREV__ " }
+        protected = protected.gsub(@abbrev_pattern) { "#{$1}#{ABBREV_PLACEHOLDER} " }
-        sentences = protected
-          .split(/(?<=[.?!])\s+(?=[A-Z])/)
-          .map { |s| s.gsub("__ABBREV__", ".").strip }
+        protected
+          .split(SPLIT_PATTERN)
+          .map { |s| s.gsub(ABBREV_PLACEHOLDER, ".").gsub(DECIMAL_PLACEHOLDER, ".").strip }
           .reject(&:empty?)
-        sentences
       end
     end
   end
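
The protect-then-split approach in this file can be sketched as a standalone function. This is an illustration only: the constant and method names are hypothetical, the abbreviation list is shortened, the split pattern is a simplified version of `SPLIT_PATTERN`, and the decimal rule uses a lookahead, which differs slightly from the regex in the diff.

```ruby
DEMO_ABBREVS      = %w[Mr Dr Inc].freeze
DEMO_ABBREV_MARK  = "__DEMO_ABBREV__".freeze
DEMO_DECIMAL_MARK = "__DEMO_DEC__".freeze

def sketch_split_sentences(text, abbrevs: DEMO_ABBREVS)
  abbrev_pattern = /\b(#{abbrevs.map { |a| Regexp.escape(a) }.join("|")})\.\s/

  # 1. Mask periods between digits so "3.14" cannot look like a terminator.
  masked = text.gsub(/(\d)\.(?=\d)/) { "#{$1}#{DEMO_DECIMAL_MARK}" }
  # 2. Mask the period of known abbreviations followed by whitespace.
  masked = masked.gsub(abbrev_pattern) { "#{$1}#{DEMO_ABBREV_MARK} " }

  # 3. Split after a terminator plus whitespace when the next sentence
  #    starts with an uppercase letter, a digit, or a quote.
  # 4. Restore the masked periods in each piece.
  masked
    .split(/(?<=[.?!])\s+(?=[A-Z0-9"'])/)
    .map { |s| s.gsub(DEMO_ABBREV_MARK, ".").gsub(DEMO_DECIMAL_MARK, ".").strip }
    .reject(&:empty?)
end

p sketch_split_sentences("Dr. Smith measured 3.14 volts. Acme Inc. shipped it. Done.")
# => ["Dr. Smith measured 3.14 volts.", "Acme Inc. shipped it.", "Done."]
```

Masking before the split keeps the split regex simple. The trade-off is that an abbreviation ending a sentence (for example "Inc." followed by a new sentence) is also masked and therefore not split.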

data/lib/semantic_text_chunker/splitters/structure_splitter.rb ADDED

@@ -0,0 +1,69 @@
+require_relative "sentence_splitter"
+module SemanticTextChunker
+  module Splitters
+    # Splits text while respecting document structure. Blank-line paragraph
+    # breaks and markdown headings produce "hard" boundaries that chunks are
+    # never merged across. A standalone heading is attached to the block of
+    # content that follows it, so the heading travels with its section.
+    class StructureSplitter
+      # ATX markdown heading line, e.g. "## Section title"
+      HEADING_LINE = /\A\#{1,6}\s+\S/
+      def initialize(sentence_splitter: SentenceSplitter.new)
+        @sentence_splitter = sentence_splitter
+      end
+      # Returns [sentences, hard_boundaries] where:
+      #   sentences       - flat array of sentence strings across all blocks
+      #   hard_boundaries - sentence indices that must end a chunk
+      def split(text)
+        sentences = []
+        hard      = []
+        segment(text).each do |block|
+          block_sentences = @sentence_splitter.split(block)
+          next if block_sentences.empty?
+          sentences.concat(block_sentences)
+          hard << sentences.size - 1
+        end
+        # The last block's trailing boundary is the document end, not a split.
+        hard.pop
+        [sentences, hard]
+      end
+      private
+      # Break text into blocks on blank lines, merging a standalone heading
+      # into the following block.
+      def segment(text)
+        raw = text.split(/\n[ \t]*\n+/).map(&:strip).reject(&:empty?)
+        merged          = []
+        pending_heading = nil
+        raw.each do |block|
+          if heading_only?(block)
+            pending_heading = pending_heading ? "#{pending_heading}\n#{block}" : block
+          elsif pending_heading
+            merged << "#{pending_heading}\n#{block}"
+            pending_heading = nil
+          else
+            merged << block
+          end
+        end
+        merged << pending_heading if pending_heading
+        merged
+      end
+      def heading_only?(block)
+        lines = block.lines.map(&:strip).reject(&:empty?)
+        lines.size == 1 && lines.first.match?(HEADING_LINE)
+      end
+    end
+  end
+end
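
The block segmentation and hard-boundary bookkeeping above can be exercised without the gem. The sketch below mirrors the logic of `segment` and `split`, with two assumptions: a naive regex split stands in for the gem's sentence splitter, and the heading pattern is spelled `[#]{1,6}` to avoid the interpolation escape. All `sketch_`/`DEMO_` names are hypothetical.

```ruby
DEMO_HEADING = /\A[#]{1,6}\s+\S/

def sketch_heading_only?(block)
  lines = block.lines.map(&:strip).reject(&:empty?)
  lines.size == 1 && lines.first.match?(DEMO_HEADING)
end

# Split on blank lines, then glue each standalone heading onto the block
# that follows it.
def sketch_segment(text)
  raw     = text.split(/\n[ \t]*\n+/).map(&:strip).reject(&:empty?)
  merged  = []
  pending = nil

  raw.each do |block|
    if sketch_heading_only?(block)
      pending = pending ? "#{pending}\n#{block}" : block
    elsif pending
      merged << "#{pending}\n#{block}"
      pending = nil
    else
      merged << block
    end
  end

  merged << pending if pending
  merged
end

# Returns [sentences, hard_boundaries]. The naive split is a stand-in for
# the real sentence splitter (assumption).
def sketch_structure_split(text)
  sentences = []
  hard      = []

  sketch_segment(text).each do |block|
    block_sentences = block.split(/(?<=[.?!])\s+/)
    next if block_sentences.empty?

    sentences.concat(block_sentences)
    hard << sentences.size - 1
  end

  hard.pop # the document end is not a split point
  [sentences, hard]
end

text = [
  "# Intro", "",
  "First paragraph.", "",
  "## Details", "",
  "Second paragraph.", "Still second.", "",
  "Third paragraph."
].join("\n")

blocks = sketch_segment(text)
sentences, hard = sketch_structure_split(text)
p blocks.size # => 3
p hard        # => [0, 2]
```

Because a heading line has no terminating punctuation, it stays attached to the first sentence of its block in this sketch, so the heading travels with that sentence into the chunk.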

data/lib/semantic_text_chunker/version.rb CHANGED

@@ -1,3 +1,3 @@
 module SemanticTextChunker
-  VERSION = "0.1.0"
+  VERSION = "0.2.0"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: semantic_text_chunker
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - Vlad Tigănilă
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2026-05-19 00:00:00.000000000 Z
+date: 2026-06-02 00:00:00.000000000 Z
 dependencies: []
 description: Detects topic boundaries using embedding similarity to produce semantically
   coherent chunks from books, articles, and documents. Supports Cohere, OpenAI, and
@@ -32,6 +32,7 @@ files:
 - lib/semantic_text_chunker/embedders/openai.rb
 - lib/semantic_text_chunker/metadata.rb
 - lib/semantic_text_chunker/splitters/sentence_splitter.rb
+- lib/semantic_text_chunker/splitters/structure_splitter.rb
 - lib/semantic_text_chunker/version.rb
 homepage: https://github.com/VladTZY/semantic_text_chunker
 licenses: