semchunk 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: f00f8beee6b49e975d6b3fc6f31dbfea804325af49bdafd3e70958e74357e287
4
- data.tar.gz: c7b813779cb517ea5a4df048d244848c9475b1fc9f0a6276c96394e929fa89fa
3
+ metadata.gz: 1ca2e04c2509ef3fa509307aaf5231dcb23df17a596a2dfdfcfa9760c0f6498b
4
+ data.tar.gz: e2850b2ca76ca627d7bf67ba070f906092e193d19a632bbbee19dfeeb153583a
5
5
  SHA512:
6
- metadata.gz: f66735229364b68db50409b4e2d297217a31fb59ceff792cf0396a5ead3277fa5974fcc5c2066892e67dff224a2345b0b2d4fe8ff525a7c20990e233915ed4dd
7
- data.tar.gz: 42ce444f5a50ba23c7203cf49d1fa0b7d1ccf3fd4c6621b0cc061f3fb41d873be2f797451a3c1560c483ebbdbe388a9d16d77cf9cdf1699d1cd866de83ab4a80
6
+ metadata.gz: 53ca16f2cddb69d9eb7461754dee4261f592acb0aa88329a5803f9c92246529356ed4bb9932f7cb1895d3480b8cb6eac04112bb5066c8e80f1edbe1430ba3578
7
+ data.tar.gz: d302134a84ea39be36b7a7637f88c9107bfd452b495133fbed7ed678f4f8c6a986a265550917e5d198f73ed2922584e48bca89a14948e6c855df4ee852af395c
data/README.md CHANGED
@@ -1,29 +1,371 @@
1
- # semchunk
1
+ <div align="center">
2
+
3
+ # semchunk 🧩
2
4
 
3
5
  [![Gem Version](https://img.shields.io/gem/v/semchunk)](https://rubygems.org/gems/semchunk)
4
6
  [![Gem Downloads](https://img.shields.io/gem/dt/semchunk)](https://www.ruby-toolbox.com/projects/semchunk)
5
7
  [![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/philip-zhan/semchunk.rb/ci.yml)](https://github.com/philip-zhan/semchunk.rb/actions/workflows/ci.yml)
6
8
 
7
- TODO: Description of this gem goes here.
9
+ </div>
10
+
11
+ **`semchunk`** is a fast and lightweight Ruby gem for splitting text into semantically meaningful chunks.
12
+
13
+ This is a Ruby port of the Python [semchunk](https://github.com/isaacus-dev/semchunk) library by [Isaacus](https://isaacus.com/), maintaining the same efficient chunking algorithm and API design.
14
+
15
+ `semchunk` produces chunks that are more semantically meaningful than regular token and recursive character chunkers, while being fast and easy to use.
16
+
17
+ ## Features
18
+
19
+ - **Semantic chunking**: Splits text at natural boundaries (sentences, paragraphs, etc.) rather than at arbitrary character positions
20
+ - **Token-aware**: Respects token limits from any tokenizer you provide
21
+ - **Overlap support**: Create overlapping chunks for better context preservation
22
+ - **Offset tracking**: Get the original positions of each chunk in the source text
23
+ - **Flexible**: Works with any token counter (word count, character count, or tokenizers)
24
+ - **Memoization**: Optional caching of token counts for improved performance
8
25
 
9
26
  ---
10
27
 
28
+ - [Installation](#installation)
11
29
  - [Quick start](#quick-start)
30
+ - [API Reference](#api-reference)
31
+ - [Examples](#examples)
12
32
  - [Support](#support)
13
33
  - [License](#license)
14
34
  - [Code of conduct](#code-of-conduct)
15
35
  - [Contribution guide](#contribution-guide)
16
36
 
17
- ## Quick start
37
+ ## Installation
38
+
39
+ Add this line to your application's Gemfile:
18
40
 
41
+ ```ruby
42
+ gem 'semchunk'
19
43
  ```
44
+
45
+ Or install it directly:
46
+
47
+ ```bash
20
48
  gem install semchunk
21
49
  ```
22
50
 
51
+ ## Quick start
52
+
53
+ ```ruby
54
+ require "semchunk"
55
+
56
+ # Define a simple token counter (or use a real tokenizer)
57
+ token_counter = ->(text) { text.split.length }
58
+
59
+ # Chunk some text
60
+ text = "This is the first sentence. This is the second sentence. And this is the third sentence."
61
+ chunks = Semchunk.chunk(text, chunk_size: 6, token_counter: token_counter)
62
+
63
+ puts chunks.inspect
64
+ # => ["This is the first sentence.", "This is the second sentence.", "And this is the third sentence."]
65
+ ```
66
+
67
+ ## API Reference
68
+
69
+ ### `Semchunk.chunk`
70
+
71
+ Split a text into semantically meaningful chunks.
72
+
73
+ ```ruby
74
+ Semchunk.chunk(
75
+ text,
76
+ chunk_size:,
77
+ token_counter:,
78
+ memoize: true,
79
+ offsets: false,
80
+ overlap: nil,
81
+ cache_maxsize: nil
82
+ )
83
+ ```
84
+
85
+ **Parameters:**
86
+ - `text` (String): The text to be chunked
87
+ - `chunk_size` (Integer): The maximum number of tokens a chunk may contain
88
+ - `token_counter` (Proc, Lambda, Method): A callable that takes a string and returns the number of tokens in it
89
+ - `memoize` (Boolean, optional): Whether to memoize the token counter. Defaults to `true`
90
+ - `offsets` (Boolean, optional): Whether to return the start and end offsets of each chunk. Defaults to `false`
91
+ - `overlap` (Float, Integer, nil, optional): The proportion of the chunk size (if < 1), or the number of tokens (if >= 1), by which chunks should overlap. Defaults to `nil`
92
+ - `cache_maxsize` (Integer, nil, optional): The maximum number of text-token count pairs to cache. Defaults to `nil` (unbounded)
93
+
94
+ **Returns:**
95
+ - `Array<String>` if `offsets: false`: List of text chunks
96
+ - `[Array<String>, Array<Array<Integer>>]` if `offsets: true`: List of chunks and their `[start, end]` offsets
97
+
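+ For example, memoization with a bounded cache keeps memory predictable when chunking many documents; in practice the counter would be a real tokenizer, but a placeholder is used here for illustration:
+
+ ```ruby
+ token_counter = ->(text) { text.scan(/\w+|[^\w\s]/).length } # placeholder for a real tokenizer
+ long_text = "..." # any large document
+
+ chunks = Semchunk.chunk(
+   long_text,
+   chunk_size: 50,
+   token_counter: token_counter,
+   memoize: true,        # cache token counts so repeated strings are not recounted
+   cache_maxsize: 10_000 # evict the oldest cached counts once 10,000 entries are stored
+ )
+ ```
+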
98
+ ### `Semchunk.chunkerify`
99
+
100
+ Create a reusable chunker object.
101
+
102
+ ```ruby
103
+ Semchunk.chunkerify(
104
+ tokenizer_or_token_counter,
105
+ chunk_size: nil,
106
+ max_token_chars: nil,
107
+ memoize: true,
108
+ cache_maxsize: nil
109
+ )
110
+ ```
111
+
112
+ **Parameters:**
113
+ - `tokenizer_or_token_counter`: A tokenizer object with an `encode` method, or a callable token counter
114
+ - `chunk_size` (Integer, nil): Maximum tokens per chunk. If `nil`, will attempt to use tokenizer's `model_max_length`
115
+ - `max_token_chars` (Integer, nil): Maximum characters per token (optimization parameter)
116
+ - `memoize` (Boolean): Whether to cache token counts. Defaults to `true`
117
+ - `cache_maxsize` (Integer, nil): Cache size limit. Defaults to `nil` (unbounded)
118
+
119
+ **Returns:**
120
+ - `Semchunk::Chunker`: A chunker instance
121
+
122
+ ### `Chunker#call`
123
+
124
+ Process text(s) with the chunker.
125
+
126
+ ```ruby
127
+ chunker.call(
128
+ text_or_texts,
129
+ processes: 1,
130
+ progress: false,
131
+ offsets: false,
132
+ overlap: nil
133
+ )
134
+ ```
135
+
136
+ **Parameters:**
137
+ - `text_or_texts` (String, Array<String>): Single text or array of texts to chunk
138
+ - `processes` (Integer): Number of processes for parallel chunking (not yet implemented)
139
+ - `progress` (Boolean): Show progress bar for multiple texts (not yet implemented)
140
+ - `offsets` (Boolean): Return offset information
141
+ - `overlap` (Float, Integer, nil): Overlap configuration
142
+
143
+ **Returns:**
144
+ - For single text: `Array<String>` or `[Array<String>, Array<Array<Integer>>]`
145
+ - For multiple texts: `Array<Array<String>>` or `[Array<Array<String>>, Array<Array<Array<Integer>>>]`
146
+
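+ For example, when several texts are chunked with offsets enabled, the two top-level arrays can be destructured directly (using a `chunker` and `texts` like those in the examples below):
+
+ ```ruby
+ all_chunks, all_offsets = chunker.call(texts, offsets: true)
+ all_chunks.length == texts.length # => true; one chunk list per input text
+ ```
+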
147
+ ## Examples
148
+
149
+ ### Basic Chunking
150
+
23
151
  ```ruby
24
152
  require "semchunk"
153
+
154
+ text = "Natural language processing is fascinating. It allows computers to understand human language. This enables many applications."
155
+
156
+ # Use word count as token counter
157
+ token_counter = ->(text) { text.split.length }
158
+
159
+ chunks = Semchunk.chunk(text, chunk_size: 8, token_counter: token_counter)
160
+
161
+ chunks.each_with_index do |chunk, i|
162
+ puts "Chunk #{i + 1}: #{chunk}"
163
+ end
164
+ # => Chunk 1: Natural language processing is fascinating.
165
+ # => Chunk 2: It allows computers to understand human language.
+ # => Chunk 3: This enables many applications.
166
+ ```
167
+
168
+ ### With Offsets
169
+
170
+ Track where each chunk came from in the original text:
171
+
172
+ ```ruby
173
+ text = "First paragraph here. Second paragraph here. Third paragraph here."
174
+ token_counter = ->(text) { text.split.length }
175
+
176
+ chunks, offsets = Semchunk.chunk(
177
+ text,
178
+ chunk_size: 5,
179
+ token_counter: token_counter,
180
+ offsets: true
181
+ )
182
+
183
+ chunks.zip(offsets).each do |chunk, (start_pos, end_pos)|
184
+ puts "Chunk: '#{chunk}'"
185
+ puts "Position: #{start_pos}...#{end_pos}"
186
+ puts "Verification: '#{text[start_pos...end_pos]}'"
187
+ puts
188
+ end
189
+ ```
190
+
191
+ ### With Overlap
192
+
193
+ Create overlapping chunks to maintain context:
194
+
195
+ ```ruby
196
+ text = "One two three four five six seven eight nine ten."
197
+ token_counter = ->(text) { text.split.length }
198
+
199
+ # 50% overlap
200
+ chunks = Semchunk.chunk(
201
+ text,
202
+ chunk_size: 4,
203
+ token_counter: token_counter,
204
+ overlap: 0.5
205
+ )
206
+
207
+ puts "Overlapping chunks:"
208
+ chunks.each { |chunk| puts "- #{chunk}" }
209
+
210
+ # Fixed overlap of 2 tokens
211
+ chunks = Semchunk.chunk(
212
+ text,
213
+ chunk_size: 6,
214
+ token_counter: token_counter,
215
+ overlap: 2
216
+ )
217
+
218
+ puts "\nWith 2-token overlap:"
219
+ chunks.each { |chunk| puts "- #{chunk}" }
220
+ ```
221
+
222
+ ### Using Chunkerify for Reusable Chunkers
223
+
224
+ ```ruby
225
+ # Create a chunker once
226
+ token_counter = ->(text) { text.split.length }
227
+ chunker = Semchunk.chunkerify(token_counter, chunk_size: 10)
228
+
229
+ # Use it multiple times
230
+ texts = [
231
+ "First document to process.",
232
+ "Second document to process.",
233
+ "Third document to process."
234
+ ]
235
+
236
+ all_chunks = chunker.call(texts)
237
+
238
+ all_chunks.each_with_index do |chunks, i|
239
+ puts "Document #{i + 1} chunks: #{chunks.inspect}"
240
+ end
241
+ ```
242
+
243
+ ### Character-Level Chunking
244
+
245
+ ```ruby
246
+ text = "abcdefghijklmnopqrstuvwxyz"
247
+
248
+ # Character count as token counter
249
+ token_counter = ->(text) { text.length }
250
+
251
+ chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)
252
+
253
+ puts chunks.inspect
254
+ # => ["abcde", "fghij", "klmno", "pqrst", "uvwxy", "z"]
255
+ ```
256
+
257
+ ### Custom Token Counter
258
+
259
+ ```ruby
260
+ # Token counter that counts punctuation as separate tokens
261
+ def custom_token_counter(text)
262
+ text.scan(/\w+|[^\w\s]/).length
263
+ end
264
+
265
+ text = "Hello, world! How are you?"
266
+
267
+ chunks = Semchunk.chunk(
268
+ text,
269
+ chunk_size: 5,
270
+ token_counter: method(:custom_token_counter)
271
+ )
272
+
273
+ puts chunks.inspect
274
+ ```
275
+
276
+ ### Working with Real Tokenizers
277
+
278
+ If you have a tokenizer that implements an `encode` method:
279
+
280
+ ```ruby
281
+ # Example with a hypothetical tokenizer
282
+ class MyTokenizer
283
+ def encode(text, add_special_tokens: true)
284
+ # Your tokenization logic here
285
+ text.split.map { |word| word.hash }
286
+ end
287
+
288
+ def model_max_length
289
+ 512
290
+ end
291
+ end
292
+
293
+ tokenizer = MyTokenizer.new
294
+
295
+ # chunkerify will automatically extract the token counter
296
+ chunker = Semchunk.chunkerify(tokenizer, chunk_size: 100)
297
+
298
+ text = "Your long text here..."
299
+ chunks = chunker.call(text)
25
300
  ```
26
301
 
302
+ ## How It Works 🔍
303
+
304
+ `semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
305
+
306
+ 1. Splits text using the most semantically meaningful splitter possible;
307
+ 2. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
308
+ 3. Merges any chunks that are under the chunk size back together until the chunk size is reached;
309
+ 4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
310
+ 5. Excludes chunks consisting entirely of whitespace characters.
311
+
312
+ To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence (a short illustration follows the list):
313
+
314
+ 1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
315
+ 2. The largest sequence of tabs;
316
+ 3. The largest sequence of whitespace characters (as defined by regex's `\s` character class) or, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
317
+ 4. Sentence terminators (`.`, `?`, `!` and `*`);
318
+ 5. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
319
+ 6. Sentence interrupters (`:`, `—` and `…`);
320
+ 7. Word joiners (`/`, `\`, `–`, `&` and `-`); and
321
+ 8. All other characters.
322
+
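+ To make the precedence concrete, here is a minimal illustration (the expected output assumes the simple whitespace word counter from the examples above): the paragraph break, being the longest run of newlines, is tried before any sentence terminator.
+
+ ```ruby
+ token_counter = ->(text) { text.split.length }
+ text = "First paragraph. Still first.\n\nSecond paragraph. Still second."
+
+ chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)
+ # Each paragraph fits within 5 tokens, so the expected result is:
+ # => ["First paragraph. Still first.", "Second paragraph. Still second."]
+ ```
+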
323
+ If overlapping chunks have been requested, `semchunk` also (a worked example of the arithmetic follows the list):
324
+
325
+ 1. Internally reduces the chunk size to `min(overlap, chunk_size - overlap)` (`overlap` being computed as `floor(chunk_size * overlap)` for relative overlaps and `min(overlap, chunk_size - 1)` for absolute overlaps); and
326
+ 2. Merges every `floor(original_chunk_size / reduced_chunk_size)` chunks starting from the first chunk and then jumping by `floor((original_chunk_size - overlap) / reduced_chunk_size)` chunks until the last chunk is reached.
327
+
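+ As a concrete check of the arithmetic, with illustrative values of `chunk_size: 10` and `overlap: 0.5` (not taken from the library's own docs):
+
+ ```ruby
+ chunk_size = 10
+ overlap    = (chunk_size * 0.5).floor            # => 5 (relative overlap)
+ reduced    = [overlap, chunk_size - overlap].min # => 5 (internal sub-chunk size)
+ per_chunk  = chunk_size / reduced                # => 2 sub-chunks merged per chunk
+ stride     = (chunk_size - overlap) / reduced    # => 1 sub-chunk step between chunks
+ # Consecutive chunks therefore share one 5-token sub-chunk, i.e. a 50% overlap.
+ ```
+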
328
+ The algorithm uses binary search to efficiently find the optimal split points, making it fast even for large documents.
329
+
330
+ ## Running the Examples
331
+
332
+ This gem includes example scripts that demonstrate various features:
333
+
334
+ ```bash
335
+ # Basic usage examples
336
+ ruby examples/basic_usage.rb
337
+
338
+ # Advanced usage with longer documents
339
+ ruby examples/advanced_usage.rb
340
+ ```
341
+
342
+ ## Benchmarks 📊
343
+
344
+ You can run the included benchmark to test performance:
345
+
346
+ ```bash
347
+ ruby test/bench.rb
348
+ ```
349
+
350
+ The Ruby implementation maintains similar performance characteristics to the Python version:
351
+ - Efficient binary search for optimal split points
352
+ - O(n log n) complexity for chunking
353
+ - Fast token count lookups with memoization
354
+ - Low memory overhead
355
+
356
+ The benchmark tests chunking multiple texts with various chunk sizes and provides detailed performance metrics.
357
+
358
+ ## Differences from Python Version
359
+
360
+ This Ruby port maintains feature parity with the Python version, with a few notes:
361
+
362
+ - Multiprocessing support is not yet implemented (`processes` parameter)
363
+ - Progress bar support is not yet implemented (`progress` parameter)
364
+ - String tokenizer names (like `"gpt-4"`) are not yet supported
365
+ - Otherwise, the API and behavior match the Python version
366
+
367
+ See [MIGRATION.md](MIGRATION.md) for a detailed guide on migrating from the Python version.
368
+
27
369
  ## Support
28
370
 
29
371
  If you want to report a bug, or have ideas, feedback or questions about the gem, [let me know via GitHub issues](https://github.com/philip-zhan/semchunk.rb/issues/new) and I will do my best to provide a helpful answer. Happy hacking!
data/lib/semchunk/chunker.rb ADDED
@@ -0,0 +1,61 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "set"
4
+
5
+ module Semchunk
6
+ # A class for chunking one or more texts into semantically meaningful chunks
7
+ class Chunker
8
+ attr_reader :chunk_size, :token_counter
9
+
10
+ def initialize(chunk_size:, token_counter:)
11
+ @chunk_size = chunk_size
12
+ @token_counter = token_counter
13
+ end
14
+
15
+ # Split text or texts into semantically meaningful chunks
16
+ #
17
+ # @param text_or_texts [String, Array<String>] The text or texts to be chunked
18
+ # @param processes [Integer] The number of processes to use when chunking multiple texts (not yet implemented)
19
+ # @param offsets [Boolean] Whether to return the start and end offsets of each chunk
20
+ # @param overlap [Float, Integer, nil] The proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap
21
+ #
22
+ # @return [Array<String>, Array<Array>] Depending on the input and options, returns chunks and optionally offsets
23
+ def call(text_or_texts, processes: 1, offsets: false, overlap: nil)
24
+ chunk_function = make_chunk_function(offsets: offsets, overlap: overlap)
25
+
26
+ # Handle single text
27
+ return chunk_function.call(text_or_texts) if text_or_texts.is_a?(String)
28
+
29
+ # Handle multiple texts
30
+ raise NotImplementedError, "Parallel processing not yet implemented. Please use processes: 1" unless processes == 1
31
+
32
+ # TODO: Add progress bar support
33
+ chunks_and_offsets = text_or_texts.map { |text| chunk_function.call(text) }
34
+
35
+ # TODO: Add parallel processing support
36
+
37
+ # Return results
38
+ if offsets
39
+ chunks, offsets_arr = chunks_and_offsets.transpose
40
+ return [chunks.to_a, offsets_arr.to_a]
41
+ end
42
+
43
+ chunks_and_offsets
44
+ end
45
+
46
+ private
47
+
48
+ def make_chunk_function(offsets:, overlap:)
49
+ lambda do |text|
50
+ Semchunk.chunk(
51
+ text,
52
+ chunk_size: chunk_size,
53
+ token_counter: token_counter,
54
+ memoize: false,
55
+ offsets: offsets,
56
+ overlap: overlap
57
+ )
58
+ end
59
+ end
60
+ end
61
+ end
data/lib/semchunk/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Semchunk
4
- VERSION = "0.1.0"
4
+ VERSION = "0.1.2"
5
5
  end
data/lib/semchunk.rb CHANGED
@@ -1,5 +1,495 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require_relative "semchunk/version"
4
+ require_relative "semchunk/chunker"
5
+
3
6
  module Semchunk
4
- autoload :VERSION, "semchunk/version"
7
+ # A map of token counters to their memoized versions
8
+ @memoized_token_counters = {}
9
+
10
+ class << self
11
+ attr_reader :memoized_token_counters
12
+
13
+ # Split a text into semantically meaningful chunks of a specified size as determined by the provided token counter.
14
+ #
15
+ # @param text [String] The text to be chunked.
16
+ # @param chunk_size [Integer] The maximum number of tokens a chunk may contain.
17
+ # @param token_counter [Proc, Method, #call] A callable that takes a string and returns the number of tokens in it.
18
+ # @param memoize [Boolean] Whether to memoize the token counter. Defaults to true.
19
+ # @param offsets [Boolean] Whether to return the start and end offsets of each chunk. Defaults to false.
20
+ # @param overlap [Float, Integer, nil] The proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap. Defaults to nil.
21
+ # @param cache_maxsize [Integer, nil] The maximum number of text-token count pairs that can be stored in the token counter's cache. Defaults to nil (unbounded).
22
+ # @param recursion_depth [Integer] Internal parameter for tracking recursion depth.
23
+ # @param start [Integer] Internal parameter for tracking character offset.
24
+ #
25
+ # @return [Array<String>, Array<Array>] A list of chunks up to chunk_size-tokens-long, with any whitespace used to split the text removed, and, if offsets is true, a list of tuples [start, end].
26
+ def chunk(text, chunk_size:, token_counter:, memoize: true, offsets: false, overlap: nil, cache_maxsize: nil,
27
+ recursion_depth: 0, start: 0)
28
+ return_offsets = offsets
29
+ is_first_call = recursion_depth.zero?
30
+
31
+ # Initialize token counter and compute effective chunk size
32
+ token_counter, local_chunk_size, overlap, unoverlapped_chunk_size = initialize_chunking_params(
33
+ token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize
34
+ )
35
+
36
+ # Split text and prepare metadata
37
+ splitter, splitter_is_whitespace, splits = split_text(text)
38
+ split_starts, cum_lens, num_splits_plus_one = prepare_split_metadata(splits, splitter, start)
39
+
40
+ # Process splits into chunks
41
+ chunks, offsets_arr = process_splits(
42
+ splits,
43
+ split_starts,
44
+ cum_lens,
45
+ splitter,
46
+ splitter_is_whitespace,
47
+ token_counter,
48
+ local_chunk_size,
49
+ num_splits_plus_one,
50
+ recursion_depth
51
+ )
52
+
53
+ # Finalize first call: cleanup and overlap
54
+ finalize_chunks(
55
+ chunks,
56
+ offsets_arr,
57
+ is_first_call,
58
+ return_offsets,
59
+ overlap,
60
+ local_chunk_size,
61
+ chunk_size,
62
+ unoverlapped_chunk_size,
63
+ text
64
+ )
65
+ end
66
+
67
+ # A tuple of semantically meaningful non-whitespace splitters
68
+ NON_WHITESPACE_SEMANTIC_SPLITTERS = [
69
+ # Sentence terminators
70
+ ".",
71
+ "?",
72
+ "!",
73
+ "*",
74
+ # Clause separators
75
+ ";",
76
+ ",",
77
+ "(",
78
+ ")",
79
+ "[",
80
+ "]",
81
+ ", ",
82
+ "'",
83
+ "'",
84
+ "'",
85
+ '"',
86
+ "`",
87
+ # Sentence interrupters
88
+ ":",
89
+ "—",
90
+ "…",
91
+ # Word joiners
92
+ "/",
93
+ "\\",
94
+ "–",
95
+ "&",
96
+ "-"
97
+ ].freeze
98
+
99
+ private
100
+
101
+ def initialize_chunking_params(token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize)
102
+ local_chunk_size = chunk_size
103
+ unoverlapped_chunk_size = nil
104
+
105
+ return [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size] unless is_first_call
106
+
107
+ token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
108
+
109
+ if overlap
110
+ overlap = compute_overlap(overlap, chunk_size)
111
+
112
+ if overlap.positive?
113
+ unoverlapped_chunk_size = chunk_size - overlap
114
+ local_chunk_size = [overlap, unoverlapped_chunk_size].min
115
+ end
116
+ end
117
+
118
+ [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size]
119
+ end
120
+
121
+ def compute_overlap(overlap, chunk_size)
122
+ if overlap < 1
123
+ (chunk_size * overlap).floor
124
+ else
125
+ [overlap, chunk_size - 1].min
126
+ end
127
+ end
128
+
129
+ def prepare_split_metadata(splits, splitter, start)
130
+ splitter_len = splitter.length
131
+ split_lens = splits.map(&:length)
132
+
133
+ cum_lens = [0]
134
+ split_lens.each { |len| cum_lens << (cum_lens.last + len) }
135
+
136
+ split_starts = [0]
137
+ split_lens.each_with_index do |split_len, i|
138
+ split_starts << (split_starts[i] + split_len + splitter_len)
139
+ end
140
+ split_starts = split_starts.map { |s| s + start }
141
+
142
+ num_splits_plus_one = splits.length + 1
143
+
144
+ [split_starts, cum_lens, num_splits_plus_one]
145
+ end
146
+
147
+ def process_splits(splits, split_starts, cum_lens, splitter, splitter_is_whitespace,
148
+ token_counter, local_chunk_size, num_splits_plus_one, recursion_depth)
149
+ chunks = []
150
+ offsets_arr = []
151
+ skips = Set.new
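+ # Splits already consumed by an earlier merge are recorded in `skips` and skipped on later iterations.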
152
+ splitter_len = splitter.length
153
+
154
+ splits.each_with_index do |split, i|
155
+ next if skips.include?(i)
156
+
157
+ split_start = split_starts[i]
158
+
159
+ if token_counter.call(split) > local_chunk_size
160
+ new_chunks, new_offsets = chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
161
+ chunks.concat(new_chunks)
162
+ offsets_arr.concat(new_offsets)
163
+ else
164
+ final_split_i, new_chunk = merge_splits(
165
+ splits: splits,
166
+ cum_lens: cum_lens,
167
+ chunk_size: local_chunk_size,
168
+ splitter: splitter,
169
+ token_counter: token_counter,
170
+ start: i,
171
+ high: num_splits_plus_one
172
+ )
173
+
174
+ ((i + 1)...final_split_i).each { |j| skips.add(j) }
175
+ chunks << new_chunk
176
+ split_end = split_starts[final_split_i] - splitter_len
177
+ offsets_arr << [split_start, split_end]
178
+ end
179
+
180
+ append_splitter_if_needed(
181
+ chunks,
182
+ offsets_arr,
183
+ splitter,
184
+ splitter_is_whitespace,
185
+ i,
186
+ splits,
187
+ skips,
188
+ token_counter,
189
+ local_chunk_size,
190
+ split_start
191
+ )
192
+ end
193
+
194
+ [chunks, offsets_arr]
195
+ end
196
+
197
+ def chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
198
+ chunk(
199
+ split,
200
+ chunk_size: local_chunk_size,
201
+ token_counter: token_counter,
202
+ offsets: true,
203
+ recursion_depth: recursion_depth + 1,
204
+ start: split_start
205
+ )
206
+ end
207
+
208
+ def append_splitter_if_needed(chunks, offsets_arr, splitter, splitter_is_whitespace,
209
+ split_index, splits, skips, token_counter, local_chunk_size,
210
+ split_start)
211
+ return if splitter_is_whitespace
212
+ return if split_index == splits.length - 1
213
+ return if ((split_index + 1)...splits.length).all? { |j| skips.include?(j) }
214
+
215
+ splitter_len = splitter.length
216
+ last_chunk_with_splitter = chunks[-1] + splitter
217
+ if token_counter.call(last_chunk_with_splitter) <= local_chunk_size
218
+ chunks[-1] = last_chunk_with_splitter
219
+ offset_start, offset_end = offsets_arr[-1]
220
+ offsets_arr[-1] = [offset_start, offset_end + splitter_len]
221
+ else
222
+ offset_start = offsets_arr.empty? ? split_start : offsets_arr[-1][1]
223
+ chunks << splitter
224
+ offsets_arr << [offset_start, offset_start + splitter_len]
225
+ end
226
+ end
227
+
228
+ def finalize_chunks(chunks, offsets_arr, is_first_call, return_offsets, overlap,
229
+ local_chunk_size, chunk_size, unoverlapped_chunk_size, text)
230
+ return [chunks, offsets_arr] unless is_first_call
231
+
232
+ chunks, offsets_arr = remove_empty_chunks(chunks, offsets_arr)
233
+
234
+ if overlap&.positive? && chunks.any?
235
+ chunks, offsets_arr = apply_overlap(
236
+ chunks,
237
+ offsets_arr,
238
+ local_chunk_size,
239
+ chunk_size,
240
+ unoverlapped_chunk_size,
241
+ text
242
+ )
243
+ end
244
+
245
+ return [chunks, offsets_arr] if return_offsets
246
+
247
+ chunks
248
+ end
249
+
250
+ def remove_empty_chunks(chunks, offsets_arr)
251
+ chunks_and_offsets = chunks.zip(offsets_arr).reject { |chunk, _| chunk.empty? || chunk.strip.empty? }
252
+
253
+ if chunks_and_offsets.any?
254
+ chunks_and_offsets.transpose
255
+ else
256
+ [[], []]
257
+ end
258
+ end
259
+
260
+ def apply_overlap(chunks, offsets_arr, subchunk_size, chunk_size, unoverlapped_chunk_size, text)
261
+ subchunks = chunks
262
+ suboffsets = offsets_arr
263
+ num_subchunks = subchunks.length
264
+
265
+ subchunks_per_chunk = (chunk_size.to_f / subchunk_size).floor
266
+ subchunk_stride = (unoverlapped_chunk_size.to_f / subchunk_size).floor
267
+
268
+ num_overlapping_chunks = [1, ((num_subchunks - subchunks_per_chunk).to_f / subchunk_stride).ceil + 1].max
269
+
270
+ offsets_arr = (0...num_overlapping_chunks).map do |i|
271
+ start_idx = i * subchunk_stride
272
+ end_idx = [start_idx + subchunks_per_chunk, num_subchunks].min - 1
273
+ [suboffsets[start_idx][0], suboffsets[end_idx][1]]
274
+ end
275
+
276
+ chunks = offsets_arr.map { |s, e| text[s...e] }
277
+
278
+ [chunks, offsets_arr]
279
+ end
280
+
281
+ public
282
+
283
+ # Construct a chunker that splits one or more texts into semantically meaningful chunks
284
+ #
285
+ # @param tokenizer_or_token_counter [String, #encode, Proc, Method, #call] Either: the name of a tokenizer; a tokenizer that possesses an encode method; or a token counter.
286
+ # @param chunk_size [Integer, nil] The maximum number of tokens a chunk may contain. Defaults to nil.
287
+ # @param max_token_chars [Integer, nil] The maximum number of characters a token may contain. Defaults to nil.
288
+ # @param memoize [Boolean] Whether to memoize the token counter. Defaults to true.
289
+ # @param cache_maxsize [Integer, nil] The maximum number of text-token count pairs that can be stored in the token counter's cache.
290
+ #
291
+ # @return [Chunker] A chunker instance
292
+ def chunkerify(tokenizer_or_token_counter, chunk_size: nil, max_token_chars: nil, memoize: true, cache_maxsize: nil)
293
+ validate_tokenizer_type(tokenizer_or_token_counter)
294
+
295
+ max_token_chars = determine_max_token_chars(tokenizer_or_token_counter, max_token_chars)
296
+ chunk_size = determine_chunk_size(tokenizer_or_token_counter, chunk_size)
297
+ token_counter = create_token_counter(tokenizer_or_token_counter)
298
+ token_counter = wrap_with_fast_counter(token_counter, max_token_chars, chunk_size) if max_token_chars
299
+ token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
300
+
301
+ Chunker.new(chunk_size: chunk_size, token_counter: token_counter)
302
+ end
303
+
304
+ private
305
+
306
+ def validate_tokenizer_type(tokenizer_or_token_counter)
307
+ return unless tokenizer_or_token_counter.is_a?(String)
308
+
309
+ raise NotImplementedError,
310
+ "String tokenizer names not yet supported in Ruby. Please pass a tokenizer object or token counter proc."
311
+ end
312
+
313
+ def determine_max_token_chars(tokenizer, max_token_chars)
314
+ return max_token_chars unless max_token_chars.nil?
315
+
316
+ if tokenizer.respond_to?(:token_byte_values)
317
+ vocab = tokenizer.token_byte_values
318
+ return vocab.map(&:length).max if vocab.respond_to?(:map)
319
+ elsif tokenizer.respond_to?(:get_vocab)
320
+ vocab = tokenizer.get_vocab
321
+ return vocab.keys.map(&:length).max if vocab.respond_to?(:keys)
322
+ end
323
+
324
+ nil
325
+ end
326
+
327
+ def determine_chunk_size(tokenizer, chunk_size)
328
+ return chunk_size unless chunk_size.nil?
329
+
330
+ raise ArgumentError, "chunk_size not provided and tokenizer lacks model_max_length attribute" unless tokenizer.respond_to?(:model_max_length) && tokenizer.model_max_length.is_a?(Integer)
331
+
332
+ chunk_size = tokenizer.model_max_length
333
+
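+ # Reserve room for any special tokens the tokenizer adds even to an empty input.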
334
+ if tokenizer.respond_to?(:encode)
335
+ begin
336
+ chunk_size -= tokenizer.encode("").length
337
+ rescue StandardError
338
+ # Ignore errors
339
+ end
340
+ end
341
+
342
+ chunk_size
343
+ end
344
+
345
+ def create_token_counter(tokenizer_or_token_counter)
346
+ return tokenizer_or_token_counter unless tokenizer_or_token_counter.respond_to?(:encode)
347
+
348
+ tokenizer = tokenizer_or_token_counter
349
+ encode_params = begin
350
+ tokenizer.method(:encode).parameters
351
+ rescue StandardError
352
+ []
353
+ end
354
+
355
+ has_special_tokens = encode_params.any? { |_type, name| name == :add_special_tokens }
356
+
357
+ if has_special_tokens
358
+ ->(text) { tokenizer.encode(text, add_special_tokens: false).length }
359
+ else
360
+ ->(text) { tokenizer.encode(text).length }
361
+ end
362
+ end
363
+
364
+ def wrap_with_fast_counter(token_counter, max_token_chars, chunk_size)
365
+ max_token_chars -= 1
366
+ original_token_counter = token_counter
367
+
368
+ lambda do |text|
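+ # Fast path for very long texts: if a generous prefix (about six characters per token, plus the longest possible token) already exceeds chunk_size tokens, skip counting the rest and report chunk_size + 1.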
369
+ heuristic = chunk_size * 6
370
+ if text.length > heuristic && original_token_counter.call(text[0...(heuristic + max_token_chars)]) > chunk_size
371
+ chunk_size + 1
372
+ else
373
+ original_token_counter.call(text)
374
+ end
375
+ end
376
+ end
377
+
378
+ def split_text(text)
379
+ splitter_is_whitespace = true
380
+
381
+ splitter = find_whitespace_splitter(text)
382
+ if splitter
383
+ result = try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
384
+ return result if result
385
+
386
+ return [splitter, splitter_is_whitespace, text.split(splitter)]
387
+ end
388
+
389
+ # No whitespace found, use non-whitespace semantic splitters
390
+ find_non_whitespace_splitter(text)
391
+ end
392
+
393
+ def find_whitespace_splitter(text)
394
+ return text.scan(/[\r\n]+/).max_by(&:length) if text.include?("\n") || text.include?("\r")
395
+ return text.scan(/\t+/).max_by(&:length) if text.include?("\t")
396
+ return text.scan(/\s+/).max_by(&:length) if text.match?(/\s/)
397
+
398
+ nil
399
+ end
400
+
401
+ def try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
402
+ return nil unless splitter.length == 1
403
+
404
+ NON_WHITESPACE_SEMANTIC_SPLITTERS.each do |preceder|
405
+ escaped_preceder = Regexp.escape(preceder)
406
+ match = text.match(/#{escaped_preceder}(\s)/)
407
+ next unless match
408
+
409
+ matched_splitter = match[1]
410
+ escaped_splitter = Regexp.escape(matched_splitter)
411
+ return [matched_splitter, splitter_is_whitespace, text.split(/(?<=#{escaped_preceder})#{escaped_splitter}/)]
412
+ end
413
+
414
+ nil
415
+ end
416
+
417
+ def find_non_whitespace_splitter(text)
418
+ splitter = NON_WHITESPACE_SEMANTIC_SPLITTERS.find { |s| text.include?(s) }
419
+ return ["", true, text.chars] unless splitter
420
+
421
+ [splitter, false, text.split(splitter)]
422
+ end
423
+
424
+ def bisect_left(sorted, target, low, high)
425
+ while low < high
426
+ mid = (low + high) / 2
427
+ if sorted[mid] < target
428
+ low = mid + 1
429
+ else
430
+ high = mid
431
+ end
432
+ end
433
+ low
434
+ end
435
+
436
+ def merge_splits(splits:, cum_lens:, chunk_size:, splitter:, token_counter:, start:, high:)
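+ # `average` is a running estimate of characters per token: it starts low and is refined from observed counts so the binary search converges on the largest run of splits that still fits within chunk_size.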
437
+ average = 0.2
438
+ low = start
439
+
440
+ offset = cum_lens[start]
441
+ target = offset + (chunk_size * average)
442
+
443
+ while low < high
444
+ i = bisect_left(cum_lens, target, low, high)
445
+ midpoint = [i, high - 1].min
446
+
447
+ tokens = token_counter.call(splits[start...midpoint].join(splitter))
448
+
449
+ local_cum = cum_lens[midpoint] - offset
450
+
451
+ if local_cum.positive? && tokens.positive?
452
+ average = local_cum.to_f / tokens
453
+ target = offset + (chunk_size * average)
454
+ end
455
+
456
+ if tokens > chunk_size
457
+ high = midpoint
458
+ else
459
+ low = midpoint + 1
460
+ end
461
+ end
462
+
463
+ last_split_index = low - 1
464
+ [last_split_index, splits[start...last_split_index].join(splitter)]
465
+ end
466
+
467
+ def memoize_token_counter(token_counter, maxsize=nil)
468
+ return @memoized_token_counters[token_counter] if @memoized_token_counters.key?(token_counter)
469
+
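+ # Simple memoization: `cache` maps each text to its token count; when maxsize is set, `queue` records insertion order so the oldest entry is evicted first (FIFO).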
470
+ cache = {}
471
+ queue = []
472
+
473
+ memoized = lambda do |text|
474
+ if cache.key?(text)
475
+ cache[text]
476
+ else
477
+ result = token_counter.call(text)
478
+ cache[text] = result
479
+
480
+ if maxsize
481
+ queue << text
482
+ if queue.length > maxsize
483
+ oldest = queue.shift
484
+ cache.delete(oldest)
485
+ end
486
+ end
487
+
488
+ result
489
+ end
490
+ end
491
+
492
+ @memoized_token_counters[token_counter] = memoized
493
+ end
494
+ end
5
495
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: semchunk
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Philip Zhan
@@ -18,6 +18,7 @@ files:
18
18
  - LICENSE.txt
19
19
  - README.md
20
20
  - lib/semchunk.rb
21
+ - lib/semchunk/chunker.rb
21
22
  - lib/semchunk/version.rb
22
23
  homepage: https://github.com/philip-zhan/semchunk.rb
23
24
  licenses: