semchunk 0.1.1 → 0.1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +47 -10
- data/lib/semchunk/chunker.rb +8 -12
- data/lib/semchunk/version.rb +1 -1
- data/lib/semchunk.rb +310 -202
- metadata +1 -1
checksums.yaml
CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1ca2e04c2509ef3fa509307aaf5231dcb23df17a596a2dfdfcfa9760c0f6498b
+  data.tar.gz: e2850b2ca76ca627d7bf67ba070f906092e193d19a632bbbee19dfeeb153583a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 53ca16f2cddb69d9eb7461754dee4261f592acb0aa88329a5803f9c92246529356ed4bb9932f7cb1895d3480b8cb6eac04112bb5066c8e80f1edbe1430ba3578
+  data.tar.gz: d302134a84ea39be36b7a7637f88c9107bfd452b495133fbed7ed678f4f8c6a986a265550917e5d198f73ed2922584e48bca89a14948e6c855df4ee852af395c
```
data/README.md
CHANGED

```diff
@@ -1,12 +1,18 @@
-
+<div align="center">
+
+# semchunk 🧩
 
 [](https://rubygems.org/gems/semchunk)
 [](https://www.ruby-toolbox.com/projects/semchunk)
 [](https://github.com/philip-zhan/semchunk.rb/actions/workflows/ci.yml)
 
-
+</div>
+
+**`semchunk`** is a fast and lightweight Ruby gem for splitting text into semantically meaningful chunks.
 
-This is a Ruby port of the Python [semchunk](https://github.com/
+This is a Ruby port of the Python [semchunk](https://github.com/isaacus-dev/semchunk) library by [Isaacus](https://isaacus.com/), maintaining the same efficient chunking algorithm and API design.
+
+`semchunk` produces chunks that are more semantically meaningful than regular token and recursive character chunkers, while being fast and easy to use.
 
 ## Features
 
```
````diff
@@ -293,16 +299,31 @@ text = "Your long text here..."
 chunks = chunker.call(text)
 ```
 
-## How It Works
+## How It Works 🔍
+
+`semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
 
-
+1. Splits text using the most semantically meaningful splitter possible;
+2. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
+3. Merges any chunks that are under the chunk size back together until the chunk size is reached;
+4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
+5. Excludes chunks consisting entirely of whitespace characters.
 
-
-2. **Secondary split**: Falls back to sentences (periods, question marks, etc.)
-3. **Tertiary split**: Uses clauses (commas, semicolons) if needed
-4. **Final split**: Character-level splitting as last resort
+To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
 
-
+1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
+2. The largest sequence of tabs;
+3. The largest sequence of whitespace characters (as defined by regex's `\s` character class) or, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
+4. Sentence terminators (`.`, `?`, `!` and `*`);
+5. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
+6. Sentence interrupters (`:`, `—` and `…`);
+7. Word joiners (`/`, `\`, `–`, `&` and `-`); and
+8. All other characters.
+
+If overlapping chunks have been requested, `semchunk` also:
+
+1. Internally reduces the chunk size to `min(overlap, chunk_size - overlap)` (`overlap` being computed as `floor(chunk_size * overlap)` for relative overlaps and `min(overlap, chunk_size - 1)` for absolute overlaps); and
+2. Merges every `floor(original_chunk_size / reduced_chunk_size)` chunks starting from the first chunk and then jumping by `floor((original_chunk_size - overlap) / reduced_chunk_size)` chunks until the last chunk is reached.
 
 The algorithm uses binary search to efficiently find the optimal split points, making it fast even for large documents.
 
````
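The overlap rules added in this hunk reduce to a little arithmetic. The following is a worked illustration only, in Ruby, with invented numbers (it is not code from the gem):

```ruby
chunk_size = 100

# Relative overlap (< 1): overlap = floor(chunk_size * overlap)
relative_overlap = (chunk_size * 0.2).floor                        # => 20

# Absolute overlap (>= 1): overlap = min(overlap, chunk_size - 1)
absolute_overlap = [150, chunk_size - 1].min                       # => 99

# Internally reduced chunk size: min(overlap, chunk_size - overlap)
reduced = [relative_overlap, chunk_size - relative_overlap].min    # => 20

# Chunks merged per output chunk, and the stride between merges
per_chunk = (chunk_size / reduced.to_f).floor                      # => 5
stride    = ((chunk_size - relative_overlap) / reduced.to_f).floor # => 4
```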
````diff
@@ -318,6 +339,22 @@ ruby examples/basic_usage.rb
 ruby examples/advanced_usage.rb
 ```
 
+## Benchmarks 📊
+
+You can run the included benchmark to test performance:
+
+```bash
+ruby test/bench.rb
+```
+
+The Ruby implementation maintains similar performance characteristics to the Python version:
+- Efficient binary search for optimal split points
+- O(n log n) complexity for chunking
+- Fast token count lookups with memoization
+- Low memory overhead
+
+The benchmark tests chunking multiple texts with various chunk sizes and provides detailed performance metrics.
+
 ## Differences from Python Version
 
 This Ruby port maintains feature parity with the Python version, with a few notes:
````
data/lib/semchunk/chunker.rb
CHANGED

```diff
@@ -16,27 +16,23 @@ module Semchunk
     #
     # @param text_or_texts [String, Array<String>] The text or texts to be chunked
     # @param processes [Integer] The number of processes to use when chunking multiple texts (not yet implemented)
-    # @param progress [Boolean] Whether to display a progress bar when chunking multiple texts (not yet implemented)
     # @param offsets [Boolean] Whether to return the start and end offsets of each chunk
     # @param overlap [Float, Integer, nil] The proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap
     #
     # @return [Array<String>, Array<Array>, Hash] Depending on the input and options, returns chunks and optionally offsets
-    def call(text_or_texts, processes: 1,
+    def call(text_or_texts, processes: 1, offsets: false, overlap: nil)
       chunk_function = make_chunk_function(offsets: offsets, overlap: overlap)
 
       # Handle single text
-      if text_or_texts.is_a?(String)
-        return chunk_function.call(text_or_texts)
-      end
+      return chunk_function.call(text_or_texts) if text_or_texts.is_a?(String)
 
       # Handle multiple texts
-
-
-
-
-
-
-      end
+      raise NotImplementedError, "Parallel processing not yet implemented. Please use processes: 1" unless processes == 1
+
+      # TODO: Add progress bar support
+      chunks_and_offsets = text_or_texts.map { |text| chunk_function.call(text) }
+
+      # TODO: Add parallel processing support
 
       # Return results
       if offsets
```
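The documented `call` signature above accepts a single string or an array of strings plus `offsets:` and `overlap:` keywords. A minimal sketch of how it might be exercised, assuming the module-level `Semchunk.chunkerify` entry point; the word-count token counter and sample strings are illustrative, not part of the gem:

```ruby
require "semchunk"

# Illustrative token counter: counts whitespace-delimited words.
word_counter = ->(text) { text.split.length }

chunker = Semchunk.chunkerify(word_counter, chunk_size: 8)

# Single text: returns the chunks for that text.
chunks = chunker.call("One sentence. Another, slightly longer sentence follows it.")

# Multiple texts are chunked one by one; processes: must remain 1 until
# parallel processing lands (otherwise a NotImplementedError is raised).
all_chunks = chunker.call(["First document.", "Second document."], processes: 1)
```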
data/lib/semchunk/version.rb
CHANGED
data/lib/semchunk.rb
CHANGED

```diff
@@ -23,158 +23,263 @@ module Semchunk
   # @param start [Integer] Internal parameter for tracking character offset.
   #
   # @return [Array<String>, Array<Array>] A list of chunks up to chunk_size-tokens-long, with any whitespace used to split the text removed, and, if offsets is true, a list of tuples [start, end].
-  def chunk(text, chunk_size:, token_counter:, memoize: true, offsets: false, overlap: nil, cache_maxsize: nil,
-
+  def chunk(text, chunk_size:, token_counter:, memoize: true, offsets: false, overlap: nil, cache_maxsize: nil,
+            recursion_depth: 0, start: 0)
     return_offsets = offsets
+    is_first_call = recursion_depth.zero?
+
+    # Initialize token counter and compute effective chunk size
+    token_counter, local_chunk_size, overlap, unoverlapped_chunk_size = initialize_chunking_params(
+      token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize
+    )
+
+    # Split text and prepare metadata
+    splitter, splitter_is_whitespace, splits = split_text(text)
+    split_starts, cum_lens, num_splits_plus_one = prepare_split_metadata(splits, splitter, start)
+
+    # Process splits into chunks
+    chunks, offsets_arr = process_splits(
+      splits,
+      split_starts,
+      cum_lens,
+      splitter,
+      splitter_is_whitespace,
+      token_counter,
+      local_chunk_size,
+      num_splits_plus_one,
+      recursion_depth
+    )
+
+    # Finalize first call: cleanup and overlap
+    finalize_chunks(
+      chunks,
+      offsets_arr,
+      is_first_call,
+      return_offsets,
+      overlap,
+      local_chunk_size,
+      chunk_size,
+      unoverlapped_chunk_size,
+      text
+    )
+  end
+
+  # A tuple of semantically meaningful non-whitespace splitters
+  NON_WHITESPACE_SEMANTIC_SPLITTERS = [
+    # Sentence terminators
+    ".",
+    "?",
+    "!",
+    "*",
+    # Clause separators
+    ";",
+    ",",
+    "(",
+    ")",
+    "[",
+    "]",
+    "“",
+    "”",
+    "‘",
+    "’",
+    "'",
+    '"',
+    "`",
+    # Sentence interrupters
+    ":",
+    "—",
+    "…",
+    # Word joiners
+    "/",
+    "\\",
+    "–",
+    "&",
+    "-"
+  ].freeze
+
+  private
+
+  def initialize_chunking_params(token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize)
     local_chunk_size = chunk_size
+    unoverlapped_chunk_size = nil
 
-
-    is_first_call = recursion_depth.zero?
+    return [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size] unless is_first_call
 
-    if
-      if memoize
-        token_counter = memoize_token_counter(token_counter, cache_maxsize)
-      end
+    token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
 
-
-
-
-
-
-
-    end
-
-      # If the overlap has not been zeroed, compute the effective chunk size
-      if overlap.positive?
-        unoverlapped_chunk_size = chunk_size - overlap
-        local_chunk_size = [overlap, unoverlapped_chunk_size].min
-      end
+    if overlap
+      overlap = compute_overlap(overlap, chunk_size)
+
+      if overlap.positive?
+        unoverlapped_chunk_size = chunk_size - overlap
+        local_chunk_size = [overlap, unoverlapped_chunk_size].min
       end
     end
 
-
-
+    [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size]
+  end
 
-
+  def compute_overlap(overlap, chunk_size)
+    if overlap < 1
+      (chunk_size * overlap).floor
+    else
+      [overlap, chunk_size - 1].min
+    end
+  end
+
+  def prepare_split_metadata(splits, splitter, start)
     splitter_len = splitter.length
     split_lens = splits.map(&:length)
+
     cum_lens = [0]
-    split_lens.each { |len| cum_lens << cum_lens.last + len }
+    split_lens.each { |len| cum_lens << (cum_lens.last + len) }
 
     split_starts = [0]
     split_lens.each_with_index do |split_len, i|
-      split_starts << split_starts[i] + split_len + splitter_len
+      split_starts << (split_starts[i] + split_len + splitter_len)
     end
     split_starts = split_starts.map { |s| s + start }
 
     num_splits_plus_one = splits.length + 1
 
+    [split_starts, cum_lens, num_splits_plus_one]
+  end
+
+  def process_splits(splits, split_starts, cum_lens, splitter, splitter_is_whitespace,
+                     token_counter, local_chunk_size, num_splits_plus_one, recursion_depth)
     chunks = []
+    offsets_arr = []
     skips = Set.new
+    splitter_len = splitter.length
 
-    # Iterate through the splits
     splits.each_with_index do |split, i|
-      # Skip the split if it has already been added to a chunk
       next if skips.include?(i)
 
       split_start = split_starts[i]
 
-      # If the split is over the chunk size, recursively chunk it
      if token_counter.call(split) > local_chunk_size
-        new_chunks, new_offsets =
-          split,
-          chunk_size: local_chunk_size,
-          token_counter: token_counter,
-          offsets: true,
-          recursion_depth: recursion_depth + 1,
-          start: split_start
-        )
-
+        new_chunks, new_offsets = chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
        chunks.concat(new_chunks)
        offsets_arr.concat(new_offsets)
      else
-
-
-
-
-
-          splitter: splitter,
+        final_split_i, new_chunk = merge_splits(
+          splits: splits,
+          cum_lens: cum_lens,
+          chunk_size: local_chunk_size,
+          splitter: splitter,
           token_counter: token_counter,
-          start:
-          high:
+          start: i,
+          high: num_splits_plus_one
        )
 
-
-        ((i + 1)...final_split_in_chunk_i).each { |j| skips.add(j) }
-
-        # Add the chunk
+        ((i + 1)...final_split_i).each { |j| skips.add(j) }
        chunks << new_chunk
-
-        # Add the chunk's offsets
-        split_end = split_starts[final_split_in_chunk_i] - splitter_len
+        split_end = split_starts[final_split_i] - splitter_len
        offsets_arr << [split_start, split_end]
      end
 
-
-
-
-
-
-
-
-
-
-
-
-
-      end
+      append_splitter_if_needed(
+        chunks,
+        offsets_arr,
+        splitter,
+        splitter_is_whitespace,
+        i,
+        splits,
+        skips,
+        token_counter,
+        local_chunk_size,
+        split_start
+      )
    end
 
-
-
-    # Remove empty chunks and chunks comprised entirely of whitespace
-    chunks_and_offsets = chunks.zip(offsets_arr).reject { |chunk, _| chunk.empty? || chunk.strip.empty? }
+    [chunks, offsets_arr]
+  end
 
-
-
-
-
-
-
+  def chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
+    chunk(
+      split,
+      chunk_size: local_chunk_size,
+      token_counter: token_counter,
+      offsets: true,
+      recursion_depth: recursion_depth + 1,
+      start: split_start
+    )
+  end
 
-
-
-
-
-
-
-    num_subchunks = subchunks.length
+  def append_splitter_if_needed(chunks, offsets_arr, splitter, splitter_is_whitespace,
+                                split_index, splits, skips, token_counter, local_chunk_size,
+                                split_start)
+    return if splitter_is_whitespace
+    return if split_index == splits.length - 1
+    return if ((split_index + 1)...splits.length).all? { |j| skips.include?(j) }
 
-
-
-
+    splitter_len = splitter.length
+    last_chunk_with_splitter = chunks[-1] + splitter
+    if token_counter.call(last_chunk_with_splitter) <= local_chunk_size
+      chunks[-1] = last_chunk_with_splitter
+      offset_start, offset_end = offsets_arr[-1]
+      offsets_arr[-1] = [offset_start, offset_end + splitter_len]
+    else
+      offset_start = offsets_arr.empty? ? split_start : offsets_arr[-1][1]
+      chunks << splitter
+      offsets_arr << [offset_start, offset_start + splitter_len]
+    end
+  end
 
-
+  def finalize_chunks(chunks, offsets_arr, is_first_call, return_offsets, overlap,
+                      local_chunk_size, chunk_size, unoverlapped_chunk_size, text)
+    return [chunks, offsets_arr] unless is_first_call
+
+    chunks, offsets_arr = remove_empty_chunks(chunks, offsets_arr)
+
+    if overlap&.positive? && chunks.any?
+      chunks, offsets_arr = apply_overlap(
+        chunks,
+        offsets_arr,
+        local_chunk_size,
+        chunk_size,
+        unoverlapped_chunk_size,
+        text
+      )
+    end
 
-
-      start_idx = i * subchunk_stride
-      end_idx = [start_idx + subchunks_per_chunk, num_subchunks].min - 1
-      [suboffsets[start_idx][0], suboffsets[end_idx][1]]
-    end
+    return [chunks, offsets_arr] if return_offsets
 
-
-
+    chunks
+  end
 
-
-
+  def remove_empty_chunks(chunks, offsets_arr)
+    chunks_and_offsets = chunks.zip(offsets_arr).reject { |chunk, _| chunk.empty? || chunk.strip.empty? }
 
-
+    if chunks_and_offsets.any?
+      chunks_and_offsets.transpose
+    else
+      [[], []]
    end
+  end
+
+  def apply_overlap(chunks, offsets_arr, subchunk_size, chunk_size, unoverlapped_chunk_size, text)
+    subchunks = chunks
+    suboffsets = offsets_arr
+    num_subchunks = subchunks.length
+
+    subchunks_per_chunk = (chunk_size.to_f / subchunk_size).floor
+    subchunk_stride = (unoverlapped_chunk_size.to_f / subchunk_size).floor
+
+    num_overlapping_chunks = [1, ((num_subchunks - subchunks_per_chunk).to_f / subchunk_stride).ceil + 1].max
+
+    offsets_arr = (0...num_overlapping_chunks).map do |i|
+      start_idx = i * subchunk_stride
+      end_idx = [start_idx + subchunks_per_chunk, num_subchunks].min - 1
+      [suboffsets[start_idx][0], suboffsets[end_idx][1]]
+    end
+
+    chunks = offsets_arr.map { |s, e| text[s...e] }
 
-    # Always return chunks and offsets if this is a recursive call
     [chunks, offsets_arr]
   end
 
+  public
+
   # Construct a chunker that splits one or more texts into semantically meaningful chunks
   #
   # @param tokenizer_or_token_counter [String, #encode, Proc, Method, #call] Either: the name of a tokenizer; a tokenizer that possesses an encode method; or a token counter.
```
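The hunk above refactors the core `chunk` method into helpers without changing its keyword API. Assuming the method is exposed at module level as `Semchunk.chunk` (how the module wires that up is not shown in this hunk), a minimal sketch with an illustrative word-count token counter and invented sample text:

```ruby
require "semchunk"

# Illustrative token counter; any callable returning an Integer works.
counter = ->(text) { text.split.length }

text = "Paragraph one.\n\nParagraph two is a little longer and may be split."

# Plain chunking.
chunks = Semchunk.chunk(text, chunk_size: 6, token_counter: counter)

# With offsets: true, the first (non-recursive) call returns [chunks, offsets],
# where each offset is a [start, end] pair usable as text[start...end].
chunks, offsets = Semchunk.chunk(text, chunk_size: 6, token_counter: counter, offsets: true)
```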
```diff
@@ -185,132 +290,135 @@ module Semchunk
   #
   # @return [Chunker] A chunker instance
   def chunkerify(tokenizer_or_token_counter, chunk_size: nil, max_token_chars: nil, memoize: true, cache_maxsize: nil)
-
-
-
+    validate_tokenizer_type(tokenizer_or_token_counter)
+
+    max_token_chars = determine_max_token_chars(tokenizer_or_token_counter, max_token_chars)
+    chunk_size = determine_chunk_size(tokenizer_or_token_counter, chunk_size)
+    token_counter = create_token_counter(tokenizer_or_token_counter)
+    token_counter = wrap_with_fast_counter(token_counter, max_token_chars, chunk_size) if max_token_chars
+    token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
+
+    Chunker.new(chunk_size: chunk_size, token_counter: token_counter)
+  end
+
+  private
+
+  def validate_tokenizer_type(tokenizer_or_token_counter)
+    return unless tokenizer_or_token_counter.is_a?(String)
+
+    raise NotImplementedError,
+          "String tokenizer names not yet supported in Ruby. Please pass a tokenizer object or token counter proc."
+  end
+
+  def determine_max_token_chars(tokenizer, max_token_chars)
+    return max_token_chars unless max_token_chars.nil?
+
+    if tokenizer.respond_to?(:token_byte_values)
+      vocab = tokenizer.token_byte_values
+      return vocab.map(&:length).max if vocab.respond_to?(:map)
+    elsif tokenizer.respond_to?(:get_vocab)
+      vocab = tokenizer.get_vocab
+      return vocab.keys.map(&:length).max if vocab.respond_to?(:keys)
    end
 
-
-
-
-
-
-
-
-
+    nil
+  end
+
+  def determine_chunk_size(tokenizer, chunk_size)
+    return chunk_size unless chunk_size.nil?
+
+    raise ArgumentError, "chunk_size not provided and tokenizer lacks model_max_length attribute" unless tokenizer.respond_to?(:model_max_length) && tokenizer.model_max_length.is_a?(Integer)
+
+    chunk_size = tokenizer.model_max_length
+
+    if tokenizer.respond_to?(:encode)
+      begin
+        chunk_size -= tokenizer.encode("").length
+      rescue StandardError
+        # Ignore errors
      end
    end
 
-
-
-
-
-
-
-
-
-
-
-
-
-      end
-    else
-      raise ArgumentError, "chunk_size not provided and tokenizer lacks model_max_length attribute"
-    end
+    chunk_size
+  end
+
+  def create_token_counter(tokenizer_or_token_counter)
+    return tokenizer_or_token_counter unless tokenizer_or_token_counter.respond_to?(:encode)
+
+    tokenizer = tokenizer_or_token_counter
+    encode_params = begin
+      tokenizer.method(:encode).parameters
+    rescue StandardError
+      []
    end
 
-
-
-
-
-    encode_params = tokenizer.method(:encode).parameters rescue []
-    has_special_tokens = encode_params.any? { |type, name| name == :add_special_tokens }
-
-    token_counter = if has_special_tokens
-      ->(text) { tokenizer.encode(text, add_special_tokens: false).length }
-    else
-      ->(text) { tokenizer.encode(text).length }
-    end
+    has_special_tokens = encode_params.any? { |_type, name| name == :add_special_tokens }
+
+    if has_special_tokens
+      ->(text) { tokenizer.encode(text, add_special_tokens: false).length }
    else
-
+      ->(text) { tokenizer.encode(text).length }
    end
+  end
 
-
-
-
-
-
-
-
-
-
-
-    original_token_counter.call(text)
-    end
+  def wrap_with_fast_counter(token_counter, max_token_chars, chunk_size)
+    max_token_chars -= 1
+    original_token_counter = token_counter
+
+    lambda do |text|
+      heuristic = chunk_size * 6
+      if text.length > heuristic && original_token_counter.call(text[0...(heuristic + max_token_chars)]) > chunk_size
+        chunk_size + 1
+      else
+        original_token_counter.call(text)
      end
    end
+  end
+
+  def split_text(text)
+    splitter_is_whitespace = true
+
+    splitter = find_whitespace_splitter(text)
+    if splitter
+      result = try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
+      return result if result
 
-
-    if memoize
-      token_counter = memoize_token_counter(token_counter, cache_maxsize)
+      return [splitter, splitter_is_whitespace, text.split(splitter)]
    end
 
-    #
-
+    # No whitespace found, use non-whitespace semantic splitters
+    find_non_whitespace_splitter(text)
  end
 
-
+  def find_whitespace_splitter(text)
+    return text.scan(/[\r\n]+/).max_by(&:length) if text.include?("\n") || text.include?("\r")
+    return text.scan(/\t+/).max_by(&:length) if text.include?("\t")
+    return text.scan(/\s+/).max_by(&:length) if text.match?(/\s/)
 
-
-
-    # Sentence terminators
-    ".", "?", "!", "*",
-    # Clause separators
-    ";", ",", "(", ")", "[", "]", "“", "”", "‘", "’", "'", '"', "`",
-    # Sentence interrupters
-    ":", "—", "…",
-    # Word joiners
-    "/", "\\", "–", "&", "-"
-  ].freeze
+    nil
+  end
 
-  def
-
+  def try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
+    return nil unless splitter.length == 1
 
-
-
-
-
-    elsif text.include?("\t")
-      tab_matches = text.scan(/\t+/)
-      splitter = tab_matches.max_by(&:length)
-    elsif text.match?(/\s/)
-      whitespace_matches = text.scan(/\s+/)
-      splitter = whitespace_matches.max_by(&:length)
-
-      # If the splitter is only a single character, see if we can target whitespace preceded by semantic splitters
-      if splitter.length == 1
-        NON_WHITESPACE_SEMANTIC_SPLITTERS.each do |preceder|
-          escaped_preceder = Regexp.escape(preceder)
-          if (match = text.match(/#{escaped_preceder}(\s)/))
-            splitter = match[1]
-            escaped_splitter = Regexp.escape(splitter)
-            return [splitter, splitter_is_whitespace, text.split(/(?<=#{escaped_preceder})#{escaped_splitter}/)]
-          end
-        end
-      end
-    else
-      # Find the most desirable semantically meaningful non-whitespace splitter
-      splitter = NON_WHITESPACE_SEMANTIC_SPLITTERS.find { |s| text.include?(s) }
+    NON_WHITESPACE_SEMANTIC_SPLITTERS.each do |preceder|
+      escaped_preceder = Regexp.escape(preceder)
+      match = text.match(/#{escaped_preceder}(\s)/)
+      next unless match
 
-
-
-
-      # No semantic splitter found, return characters
-      return ["", splitter_is_whitespace, text.chars]
-    end
+      matched_splitter = match[1]
+      escaped_splitter = Regexp.escape(matched_splitter)
+      return [matched_splitter, splitter_is_whitespace, text.split(/(?<=#{escaped_preceder})#{escaped_splitter}/)]
    end
 
-
+    nil
+  end
+
+  def find_non_whitespace_splitter(text)
+    splitter = NON_WHITESPACE_SEMANTIC_SPLITTERS.find { |s| text.include?(s) }
+    return ["", true, text.chars] unless splitter
+
+    [splitter, false, text.split(splitter)]
  end
 
  def bisect_left(sorted, target, low, high)
```
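The whitespace-splitter precedence implemented by `find_whitespace_splitter` above is easy to probe in isolation; the sample strings below are illustrative only:

```ruby
# Newlines and carriage returns win first; the longest run is chosen.
"line one\n\nline two\nline three".scan(/[\r\n]+/).max_by(&:length)  # => "\n\n"

# With no newlines present, the longest run of tabs is used.
"a\tb\t\t\tc".scan(/\t+/).max_by(&:length)                            # => "\t\t\t"

# Otherwise any whitespace run (regex \s) is considered.
"alpha  beta gamma".scan(/\s+/).max_by(&:length)                      # => "  "
```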
```diff
@@ -356,7 +464,7 @@ module Semchunk
     [last_split_index, splits[start...last_split_index].join(splitter)]
   end
 
-  def memoize_token_counter(token_counter, maxsize
+  def memoize_token_counter(token_counter, maxsize=nil)
     return @memoized_token_counters[token_counter] if @memoized_token_counters.key?(token_counter)
 
     cache = {}
```