semchunk 0.1.0 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: f00f8beee6b49e975d6b3fc6f31dbfea804325af49bdafd3e70958e74357e287
4
- data.tar.gz: c7b813779cb517ea5a4df048d244848c9475b1fc9f0a6276c96394e929fa89fa
3
+ metadata.gz: 1ca2e04c2509ef3fa509307aaf5231dcb23df17a596a2dfdfcfa9760c0f6498b
4
+ data.tar.gz: e2850b2ca76ca627d7bf67ba070f906092e193d19a632bbbee19dfeeb153583a
5
5
  SHA512:
6
- metadata.gz: f66735229364b68db50409b4e2d297217a31fb59ceff792cf0396a5ead3277fa5974fcc5c2066892e67dff224a2345b0b2d4fe8ff525a7c20990e233915ed4dd
7
- data.tar.gz: 42ce444f5a50ba23c7203cf49d1fa0b7d1ccf3fd4c6621b0cc061f3fb41d873be2f797451a3c1560c483ebbdbe388a9d16d77cf9cdf1699d1cd866de83ab4a80
6
+ metadata.gz: 53ca16f2cddb69d9eb7461754dee4261f592acb0aa88329a5803f9c92246529356ed4bb9932f7cb1895d3480b8cb6eac04112bb5066c8e80f1edbe1430ba3578
7
+ data.tar.gz: d302134a84ea39be36b7a7637f88c9107bfd452b495133fbed7ed678f4f8c6a986a265550917e5d198f73ed2922584e48bca89a14948e6c855df4ee852af395c
data/README.md CHANGED
@@ -1,29 +1,371 @@
1
- # semchunk
1
+ <div align="center">
2
+
3
+ # semchunk 🧩
2
4
 
3
5
  [![Gem Version](https://img.shields.io/gem/v/semchunk)](https://rubygems.org/gems/semchunk)
4
6
  [![Gem Downloads](https://img.shields.io/gem/dt/semchunk)](https://www.ruby-toolbox.com/projects/semchunk)
5
7
  [![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/philip-zhan/semchunk.rb/ci.yml)](https://github.com/philip-zhan/semchunk.rb/actions/workflows/ci.yml)
6
8
 
7
- TODO: Description of this gem goes here.
9
+ </div>
10
+
11
+ **`semchunk`** is a fast and lightweight Ruby gem for splitting text into semantically meaningful chunks.
12
+
13
+ This is a Ruby port of the Python [semchunk](https://github.com/isaacus-dev/semchunk) library by [Isaacus](https://isaacus.com/), maintaining the same efficient chunking algorithm and API design.
14
+
15
+ `semchunk` produces chunks that are more semantically meaningful than regular token and recursive character chunkers, while being fast and easy to use.
16
+
17
+ ## Features
18
+
19
+ - **Semantic chunking**: Splits text at natural boundaries (sentences, paragraphs, etc.) rather than at arbitrary character positions
20
+ - **Token-aware**: Respects token limits from any tokenizer you provide
21
+ - **Overlap support**: Create overlapping chunks for better context preservation
22
+ - **Offset tracking**: Get the original positions of each chunk in the source text
23
+ - **Flexible**: Works with any token counter (word count, character count, or tokenizers)
24
+ - **Memoization**: Optional caching of token counts for improved performance
8
25
 
9
26
  ---
10
27
 
28
+ - [Installation](#installation)
11
29
  - [Quick start](#quick-start)
30
+ - [API Reference](#api-reference)
31
+ - [Examples](#examples)
12
32
  - [Support](#support)
13
33
  - [License](#license)
14
34
  - [Code of conduct](#code-of-conduct)
15
35
  - [Contribution guide](#contribution-guide)
16
36
 
17
- ## Quick start
37
+ ## Installation
38
+
39
+ Add this line to your application's Gemfile:
18
40
 
41
+ ```ruby
42
+ gem 'semchunk'
19
43
  ```
44
+
45
+ Or install it directly:
46
+
47
+ ```bash
20
48
  gem install semchunk
21
49
  ```
22
50
 
51
+ ## Quick start
52
+
53
+ ```ruby
54
+ require "semchunk"
55
+
56
+ # Define a simple token counter (or use a real tokenizer)
57
+ token_counter = ->(text) { text.split.length }
58
+
59
+ # Chunk some text
60
+ text = "This is the first sentence. This is the second sentence. And this is the third sentence."
61
+ chunks = Semchunk.chunk(text, chunk_size: 6, token_counter: token_counter)
62
+
63
+ puts chunks.inspect
64
+ # => ["This is the first sentence.", "This is the second sentence.", "And this is the third sentence."]
65
+ ```
66
+
67
+ ## API Reference
68
+
69
+ ### `Semchunk.chunk`
70
+
71
+ Split a text into semantically meaningful chunks.
72
+
73
+ ```ruby
74
+ Semchunk.chunk(
75
+ text,
76
+ chunk_size:,
77
+ token_counter:,
78
+ memoize: true,
79
+ offsets: false,
80
+ overlap: nil,
81
+ cache_maxsize: nil
82
+ )
83
+ ```
84
+
85
+ **Parameters:**
86
+ - `text` (String): The text to be chunked
87
+ - `chunk_size` (Integer): The maximum number of tokens a chunk may contain
88
+ - `token_counter` (Proc, Lambda, Method): A callable that takes a string and returns the number of tokens in it
89
+ - `memoize` (Boolean, optional): Whether to memoize the token counter. Defaults to `true`
90
+ - `offsets` (Boolean, optional): Whether to return the start and end offsets of each chunk. Defaults to `false`
91
+ - `overlap` (Float, Integer, nil, optional): The proportion of the chunk size (if < 1), or the number of tokens (if >= 1), by which chunks should overlap. Defaults to `nil`
92
+ - `cache_maxsize` (Integer, nil, optional): The maximum number of text-token count pairs to cache. Defaults to `nil` (unbounded)
93
+
94
+ **Returns:**
95
+ - `Array<String>` if `offsets: false`: List of text chunks
96
+ - `[Array<String>, Array<Array<Integer>>]` if `offsets: true`: List of chunks and their `[start, end]` offsets
97
+
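+ For example, memoization with a bounded cache keeps memory predictable when chunking many documents; in practice the counter would be a real tokenizer, but a placeholder is used here for illustration:
+
+ ```ruby
+ token_counter = ->(text) { text.scan(/\w+|[^\w\s]/).length } # placeholder for a real tokenizer
+ long_text = "..." # any large document
+
+ chunks = Semchunk.chunk(
+   long_text,
+   chunk_size: 50,
+   token_counter: token_counter,
+   memoize: true,        # cache token counts so repeated strings are not recounted
+   cache_maxsize: 10_000 # evict the oldest cached counts once 10,000 entries are stored
+ )
+ ```
+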
98
+ ### `Semchunk.chunkerify`
99
+
100
+ Create a reusable chunker object.
101
+
102
+ ```ruby
103
+ Semchunk.chunkerify(
104
+ tokenizer_or_token_counter,
105
+ chunk_size: nil,
106
+ max_token_chars: nil,
107
+ memoize: true,
108
+ cache_maxsize: nil
109
+ )
110
+ ```
111
+
112
+ **Parameters:**
113
+ - `tokenizer_or_token_counter`: A tokenizer object with an `encode` method, or a callable token counter
114
+ - `chunk_size` (Integer, nil): Maximum tokens per chunk. If `nil`, will attempt to use tokenizer's `model_max_length`
115
+ - `max_token_chars` (Integer, nil): Maximum characters per token (optimization parameter)
116
+ - `memoize` (Boolean): Whether to cache token counts. Defaults to `true`
117
+ - `cache_maxsize` (Integer, nil): Cache size limit. Defaults to `nil` (unbounded)
118
+
119
+ **Returns:**
120
+ - `Semchunk::Chunker`: A chunker instance
121
+
122
+ ### `Chunker#call`
123
+
124
+ Process text(s) with the chunker.
125
+
126
+ ```ruby
127
+ chunker.call(
128
+ text_or_texts,
129
+ processes: 1,
130
+ progress: false,
131
+ offsets: false,
132
+ overlap: nil
133
+ )
134
+ ```
135
+
136
+ **Parameters:**
137
+ - `text_or_texts` (String, Array<String>): Single text or array of texts to chunk
138
+ - `processes` (Integer): Number of processes for parallel chunking (not yet implemented)
139
+ - `progress` (Boolean): Show progress bar for multiple texts (not yet implemented)
140
+ - `offsets` (Boolean): Return offset information
141
+ - `overlap` (Float, Integer, nil): Overlap configuration
142
+
143
+ **Returns:**
144
+ - For single text: `Array<String>` or `[Array<String>, Array<Array<Integer>>]`
145
+ - For multiple texts: `Array<Array<String>>` or `[Array<Array<String>>, Array<Array<Array<Integer>>>]`
146
+
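+ For example, when several texts are chunked with offsets enabled, the two top-level arrays can be destructured directly (using a `chunker` and `texts` like those in the examples below):
+
+ ```ruby
+ all_chunks, all_offsets = chunker.call(texts, offsets: true)
+ all_chunks.length == texts.length # => true; one chunk list per input text
+ ```
+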
147
+ ## Examples
148
+
149
+ ### Basic Chunking
150
+
23
151
  ```ruby
24
152
  require "semchunk"
153
+
154
+ text = "Natural language processing is fascinating. It allows computers to understand human language. This enables many applications."
155
+
156
+ # Use word count as token counter
157
+ token_counter = ->(text) { text.split.length }
158
+
159
+ chunks = Semchunk.chunk(text, chunk_size: 8, token_counter: token_counter)
160
+
161
+ chunks.each_with_index do |chunk, i|
162
+ puts "Chunk #{i + 1}: #{chunk}"
163
+ end
164
+ # => Chunk 1: Natural language processing is fascinating.
165
+ # => Chunk 2: It allows computers to understand human language.
+ # => Chunk 3: This enables many applications.
166
+ ```
167
+
168
+ ### With Offsets
169
+
170
+ Track where each chunk came from in the original text:
171
+
172
+ ```ruby
173
+ text = "First paragraph here. Second paragraph here. Third paragraph here."
174
+ token_counter = ->(text) { text.split.length }
175
+
176
+ chunks, offsets = Semchunk.chunk(
177
+ text,
178
+ chunk_size: 5,
179
+ token_counter: token_counter,
180
+ offsets: true
181
+ )
182
+
183
+ chunks.zip(offsets).each do |chunk, (start_pos, end_pos)|
184
+ puts "Chunk: '#{chunk}'"
185
+ puts "Position: #{start_pos}...#{end_pos}"
186
+ puts "Verification: '#{text[start_pos...end_pos]}'"
187
+ puts
188
+ end
189
+ ```
190
+
191
+ ### With Overlap
192
+
193
+ Create overlapping chunks to maintain context:
194
+
195
+ ```ruby
196
+ text = "One two three four five six seven eight nine ten."
197
+ token_counter = ->(text) { text.split.length }
198
+
199
+ # 50% overlap
200
+ chunks = Semchunk.chunk(
201
+ text,
202
+ chunk_size: 4,
203
+ token_counter: token_counter,
204
+ overlap: 0.5
205
+ )
206
+
207
+ puts "Overlapping chunks:"
208
+ chunks.each { |chunk| puts "- #{chunk}" }
209
+
210
+ # Fixed overlap of 2 tokens
211
+ chunks = Semchunk.chunk(
212
+ text,
213
+ chunk_size: 6,
214
+ token_counter: token_counter,
215
+ overlap: 2
216
+ )
217
+
218
+ puts "\nWith 2-token overlap:"
219
+ chunks.each { |chunk| puts "- #{chunk}" }
220
+ ```
221
+
222
+ ### Using Chunkerify for Reusable Chunkers
223
+
224
+ ```ruby
225
+ # Create a chunker once
226
+ token_counter = ->(text) { text.split.length }
227
+ chunker = Semchunk.chunkerify(token_counter, chunk_size: 10)
228
+
229
+ # Use it multiple times
230
+ texts = [
231
+ "First document to process.",
232
+ "Second document to process.",
233
+ "Third document to process."
234
+ ]
235
+
236
+ all_chunks = chunker.call(texts)
237
+
238
+ all_chunks.each_with_index do |chunks, i|
239
+ puts "Document #{i + 1} chunks: #{chunks.inspect}"
240
+ end
241
+ ```
242
+
243
+ ### Character-Level Chunking
244
+
245
+ ```ruby
246
+ text = "abcdefghijklmnopqrstuvwxyz"
247
+
248
+ # Character count as token counter
249
+ token_counter = ->(text) { text.length }
250
+
251
+ chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)
252
+
253
+ puts chunks.inspect
254
+ # => ["abcde", "fghij", "klmno", "pqrst", "uvwxy", "z"]
255
+ ```
256
+
257
+ ### Custom Token Counter
258
+
259
+ ```ruby
260
+ # Token counter that counts punctuation as separate tokens
261
+ def custom_token_counter(text)
262
+ text.scan(/\w+|[^\w\s]/).length
263
+ end
264
+
265
+ text = "Hello, world! How are you?"
266
+
267
+ chunks = Semchunk.chunk(
268
+ text,
269
+ chunk_size: 5,
270
+ token_counter: method(:custom_token_counter)
271
+ )
272
+
273
+ puts chunks.inspect
274
+ ```
275
+
276
+ ### Working with Real Tokenizers
277
+
278
+ If you have a tokenizer that implements an `encode` method:
279
+
280
+ ```ruby
281
+ # Example with a hypothetical tokenizer
282
+ class MyTokenizer
283
+ def encode(text, add_special_tokens: true)
284
+ # Your tokenization logic here
285
+ text.split.map { |word| word.hash }
286
+ end
287
+
288
+ def model_max_length
289
+ 512
290
+ end
291
+ end
292
+
293
+ tokenizer = MyTokenizer.new
294
+
295
+ # chunkerify will automatically extract the token counter
296
+ chunker = Semchunk.chunkerify(tokenizer, chunk_size: 100)
297
+
298
+ text = "Your long text here..."
299
+ chunks = chunker.call(text)
25
300
  ```
26
301
 
302
+ ## How It Works 🔍
303
+
304
+ `semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
305
+
306
+ 1. Splits text using the most semantically meaningful splitter possible;
307
+ 2. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
308
+ 3. Merges any chunks that are under the chunk size back together until the chunk size is reached;
309
+ 4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
310
+ 5. Excludes chunks consisting entirely of whitespace characters.
311
+
312
+ To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence (a short illustration follows the list):
313
+
314
+ 1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
315
+ 2. The largest sequence of tabs;
316
+ 3. The largest sequence of whitespace characters (as defined by regex's `\s` character class) or, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
317
+ 4. Sentence terminators (`.`, `?`, `!` and `*`);
318
+ 5. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
319
+ 6. Sentence interrupters (`:`, `—` and `…`);
320
+ 7. Word joiners (`/`, `\`, `–`, `&` and `-`); and
321
+ 8. All other characters.
322
+
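+ To make the precedence concrete, here is a minimal illustration (the expected output assumes the simple whitespace word counter from the examples above): the paragraph break, being the longest run of newlines, is tried before any sentence terminator.
+
+ ```ruby
+ token_counter = ->(text) { text.split.length }
+ text = "First paragraph. Still first.\n\nSecond paragraph. Still second."
+
+ chunks = Semchunk.chunk(text, chunk_size: 5, token_counter: token_counter)
+ # Each paragraph fits within 5 tokens, so the expected result is:
+ # => ["First paragraph. Still first.", "Second paragraph. Still second."]
+ ```
+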
323
+ If overlapping chunks have been requested, `semchunk` also (a worked example of the arithmetic follows the list):
324
+
325
+ 1. Internally reduces the chunk size to `min(overlap, chunk_size - overlap)` (`overlap` being computed as `floor(chunk_size * overlap)` for relative overlaps and `min(overlap, chunk_size - 1)` for absolute overlaps); and
326
+ 2. Merges every `floor(original_chunk_size / reduced_chunk_size)` chunks starting from the first chunk and then jumping by `floor((original_chunk_size - overlap) / reduced_chunk_size)` chunks until the last chunk is reached.
327
+
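+ As a concrete check of the arithmetic, with illustrative values of `chunk_size: 10` and `overlap: 0.5` (not taken from the library's own docs):
+
+ ```ruby
+ chunk_size = 10
+ overlap    = (chunk_size * 0.5).floor            # => 5 (relative overlap)
+ reduced    = [overlap, chunk_size - overlap].min # => 5 (internal sub-chunk size)
+ per_chunk  = chunk_size / reduced                # => 2 sub-chunks merged per chunk
+ stride     = (chunk_size - overlap) / reduced    # => 1 sub-chunk step between chunks
+ # Consecutive chunks therefore share one 5-token sub-chunk, i.e. a 50% overlap.
+ ```
+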
328
+ The algorithm uses binary search to efficiently find the optimal split points, making it fast even for large documents.
329
+
330
+ ## Running the Examples
331
+
332
+ This gem includes example scripts that demonstrate various features:
333
+
334
+ ```bash
335
+ # Basic usage examples
336
+ ruby examples/basic_usage.rb
337
+
338
+ # Advanced usage with longer documents
339
+ ruby examples/advanced_usage.rb
340
+ ```
341
+
342
+ ## Benchmarks 📊
343
+
344
+ You can run the included benchmark to test performance:
345
+
346
+ ```bash
347
+ ruby test/bench.rb
348
+ ```
349
+
350
+ The Ruby implementation maintains similar performance characteristics to the Python version:
351
+ - Efficient binary search for optimal split points
352
+ - O(n log n) complexity for chunking
353
+ - Fast token count lookups with memoization
354
+ - Low memory overhead
355
+
356
+ The benchmark tests chunking multiple texts with various chunk sizes and provides detailed performance metrics.
357
+
358
+ ## Differences from Python Version
359
+
360
+ This Ruby port maintains feature parity with the Python version, with a few notes:
361
+
362
+ - Multiprocessing support is not yet implemented (`processes` parameter)
363
+ - Progress bar support is not yet implemented (`progress` parameter)
364
+ - String tokenizer names (like `"gpt-4"`) are not yet supported
365
+ - Otherwise, the API and behavior match the Python version
366
+
367
+ See [MIGRATION.md](MIGRATION.md) for a detailed guide on migrating from the Python version.
368
+
27
369
  ## Support
28
370
 
29
371
  If you want to report a bug, or have ideas, feedback or questions about the gem, [let me know via GitHub issues](https://github.com/philip-zhan/semchunk.rb/issues/new) and I will do my best to provide a helpful answer. Happy hacking!
data/lib/semchunk/chunker.rb ADDED
@@ -0,0 +1,61 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "set"
4
+
5
+ module Semchunk
6
+ # A class for chunking one or more texts into semantically meaningful chunks
7
+ class Chunker
8
+ attr_reader :chunk_size, :token_counter
9
+
10
+ def initialize(chunk_size:, token_counter:)
11
+ @chunk_size = chunk_size
12
+ @token_counter = token_counter
13
+ end
14
+
15
+ # Split text or texts into semantically meaningful chunks
16
+ #
17
+ # @param text_or_texts [String, Array<String>] The text or texts to be chunked
18
+ # @param processes [Integer] The number of processes to use when chunking multiple texts (not yet implemented)
19
+ # @param offsets [Boolean] Whether to return the start and end offsets of each chunk
20
+ # @param overlap [Float, Integer, nil] The proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap
21
+ #
22
+ # @return [Array<String>, Array<Array>] Depending on the input and options, returns chunks and optionally offsets
23
+ def call(text_or_texts, processes: 1, offsets: false, overlap: nil)
24
+ chunk_function = make_chunk_function(offsets: offsets, overlap: overlap)
25
+
26
+ # Handle single text
27
+ return chunk_function.call(text_or_texts) if text_or_texts.is_a?(String)
28
+
29
+ # Handle multiple texts
30
+ raise NotImplementedError, "Parallel processing not yet implemented. Please use processes: 1" unless processes == 1
31
+
32
+ # TODO: Add progress bar support
33
+ chunks_and_offsets = text_or_texts.map { |text| chunk_function.call(text) }
34
+
35
+ # TODO: Add parallel processing support
36
+
37
+ # Return results
38
+ if offsets
39
+ chunks, offsets_arr = chunks_and_offsets.transpose
40
+ return [chunks.to_a, offsets_arr.to_a]
41
+ end
42
+
43
+ chunks_and_offsets
44
+ end
45
+
46
+ private
47
+
48
+ def make_chunk_function(offsets:, overlap:)
49
+ lambda do |text|
50
+ Semchunk.chunk(
51
+ text,
52
+ chunk_size: chunk_size,
53
+ token_counter: token_counter,
54
+ memoize: false,
55
+ offsets: offsets,
56
+ overlap: overlap
57
+ )
58
+ end
59
+ end
60
+ end
61
+ end
data/lib/semchunk/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module Semchunk
4
- VERSION = "0.1.0"
4
+ VERSION = "0.1.2"
5
5
  end
data/lib/semchunk.rb CHANGED
@@ -1,5 +1,495 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require_relative "semchunk/version"
4
+ require_relative "semchunk/chunker"
5
+
3
6
  module Semchunk
4
- autoload :VERSION, "semchunk/version"
7
+ # A map of token counters to their memoized versions
8
+ @memoized_token_counters = {}
9
+
10
+ class << self
11
+ attr_reader :memoized_token_counters
12
+
13
+ # Split a text into semantically meaningful chunks of a specified size as determined by the provided token counter.
14
+ #
15
+ # @param text [String] The text to be chunked.
16
+ # @param chunk_size [Integer] The maximum number of tokens a chunk may contain.
17
+ # @param token_counter [Proc, Method, #call] A callable that takes a string and returns the number of tokens in it.
18
+ # @param memoize [Boolean] Whether to memoize the token counter. Defaults to true.
19
+ # @param offsets [Boolean] Whether to return the start and end offsets of each chunk. Defaults to false.
20
+ # @param overlap [Float, Integer, nil] The proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap. Defaults to nil.
21
+ # @param cache_maxsize [Integer, nil] The maximum number of text-token count pairs that can be stored in the token counter's cache. Defaults to nil (unbounded).
22
+ # @param recursion_depth [Integer] Internal parameter for tracking recursion depth.
23
+ # @param start [Integer] Internal parameter for tracking character offset.
24
+ #
25
+ # @return [Array<String>, Array<Array>] A list of chunks up to chunk_size-tokens-long, with any whitespace used to split the text removed, and, if offsets is true, a list of tuples [start, end].
26
+ def chunk(text, chunk_size:, token_counter:, memoize: true, offsets: false, overlap: nil, cache_maxsize: nil,
27
+ recursion_depth: 0, start: 0)
28
+ return_offsets = offsets
29
+ is_first_call = recursion_depth.zero?
30
+
31
+ # Initialize token counter and compute effective chunk size
32
+ token_counter, local_chunk_size, overlap, unoverlapped_chunk_size = initialize_chunking_params(
33
+ token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize
34
+ )
35
+
36
+ # Split text and prepare metadata
37
+ splitter, splitter_is_whitespace, splits = split_text(text)
38
+ split_starts, cum_lens, num_splits_plus_one = prepare_split_metadata(splits, splitter, start)
39
+
40
+ # Process splits into chunks
41
+ chunks, offsets_arr = process_splits(
42
+ splits,
43
+ split_starts,
44
+ cum_lens,
45
+ splitter,
46
+ splitter_is_whitespace,
47
+ token_counter,
48
+ local_chunk_size,
49
+ num_splits_plus_one,
50
+ recursion_depth
51
+ )
52
+
53
+ # Finalize first call: cleanup and overlap
54
+ finalize_chunks(
55
+ chunks,
56
+ offsets_arr,
57
+ is_first_call,
58
+ return_offsets,
59
+ overlap,
60
+ local_chunk_size,
61
+ chunk_size,
62
+ unoverlapped_chunk_size,
63
+ text
64
+ )
65
+ end
66
+
67
+ # A tuple of semantically meaningful non-whitespace splitters
68
+ NON_WHITESPACE_SEMANTIC_SPLITTERS = [
69
+ # Sentence terminators
70
+ ".",
71
+ "?",
72
+ "!",
73
+ "*",
74
+ # Clause separators
75
+ ";",
76
+ ",",
77
+ "(",
78
+ ")",
79
+ "[",
80
+ "]",
81
+ ", ",
82
+ "'",
83
+ "'",
84
+ "'",
85
+ '"',
86
+ "`",
87
+ # Sentence interrupters
88
+ ":",
89
+ "—",
90
+ "…",
91
+ # Word joiners
92
+ "/",
93
+ "\\",
94
+ "–",
95
+ "&",
96
+ "-"
97
+ ].freeze
98
+
99
+ private
100
+
101
+ def initialize_chunking_params(token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize)
102
+ local_chunk_size = chunk_size
103
+ unoverlapped_chunk_size = nil
104
+
105
+ return [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size] unless is_first_call
106
+
107
+ token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
108
+
109
+ if overlap
110
+ overlap = compute_overlap(overlap, chunk_size)
111
+
112
+ if overlap.positive?
113
+ unoverlapped_chunk_size = chunk_size - overlap
114
+ local_chunk_size = [overlap, unoverlapped_chunk_size].min
115
+ end
116
+ end
117
+
118
+ [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size]
119
+ end
120
+
121
+ def compute_overlap(overlap, chunk_size)
122
+ if overlap < 1
123
+ (chunk_size * overlap).floor
124
+ else
125
+ [overlap, chunk_size - 1].min
126
+ end
127
+ end
128
+
129
+ def prepare_split_metadata(splits, splitter, start)
130
+ splitter_len = splitter.length
131
+ split_lens = splits.map(&:length)
132
+
133
+ cum_lens = [0]
134
+ split_lens.each { |len| cum_lens << (cum_lens.last + len) }
135
+
136
+ split_starts = [0]
137
+ split_lens.each_with_index do |split_len, i|
138
+ split_starts << (split_starts[i] + split_len + splitter_len)
139
+ end
140
+ split_starts = split_starts.map { |s| s + start }
141
+
142
+ num_splits_plus_one = splits.length + 1
143
+
144
+ [split_starts, cum_lens, num_splits_plus_one]
145
+ end
146
+
147
+ def process_splits(splits, split_starts, cum_lens, splitter, splitter_is_whitespace,
148
+ token_counter, local_chunk_size, num_splits_plus_one, recursion_depth)
149
+ chunks = []
150
+ offsets_arr = []
151
+ skips = Set.new
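+ # Splits already consumed by an earlier merge are recorded in `skips` and skipped on later iterations.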
152
+ splitter_len = splitter.length
153
+
154
+ splits.each_with_index do |split, i|
155
+ next if skips.include?(i)
156
+
157
+ split_start = split_starts[i]
158
+
159
+ if token_counter.call(split) > local_chunk_size
160
+ new_chunks, new_offsets = chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
161
+ chunks.concat(new_chunks)
162
+ offsets_arr.concat(new_offsets)
163
+ else
164
+ final_split_i, new_chunk = merge_splits(
165
+ splits: splits,
166
+ cum_lens: cum_lens,
167
+ chunk_size: local_chunk_size,
168
+ splitter: splitter,
169
+ token_counter: token_counter,
170
+ start: i,
171
+ high: num_splits_plus_one
172
+ )
173
+
174
+ ((i + 1)...final_split_i).each { |j| skips.add(j) }
175
+ chunks << new_chunk
176
+ split_end = split_starts[final_split_i] - splitter_len
177
+ offsets_arr << [split_start, split_end]
178
+ end
179
+
180
+ append_splitter_if_needed(
181
+ chunks,
182
+ offsets_arr,
183
+ splitter,
184
+ splitter_is_whitespace,
185
+ i,
186
+ splits,
187
+ skips,
188
+ token_counter,
189
+ local_chunk_size,
190
+ split_start
191
+ )
192
+ end
193
+
194
+ [chunks, offsets_arr]
195
+ end
196
+
197
+ def chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
198
+ chunk(
199
+ split,
200
+ chunk_size: local_chunk_size,
201
+ token_counter: token_counter,
202
+ offsets: true,
203
+ recursion_depth: recursion_depth + 1,
204
+ start: split_start
205
+ )
206
+ end
207
+
208
+ def append_splitter_if_needed(chunks, offsets_arr, splitter, splitter_is_whitespace,
209
+ split_index, splits, skips, token_counter, local_chunk_size,
210
+ split_start)
211
+ return if splitter_is_whitespace
212
+ return if split_index == splits.length - 1
213
+ return if ((split_index + 1)...splits.length).all? { |j| skips.include?(j) }
214
+
215
+ splitter_len = splitter.length
216
+ last_chunk_with_splitter = chunks[-1] + splitter
217
+ if token_counter.call(last_chunk_with_splitter) <= local_chunk_size
218
+ chunks[-1] = last_chunk_with_splitter
219
+ offset_start, offset_end = offsets_arr[-1]
220
+ offsets_arr[-1] = [offset_start, offset_end + splitter_len]
221
+ else
222
+ offset_start = offsets_arr.empty? ? split_start : offsets_arr[-1][1]
223
+ chunks << splitter
224
+ offsets_arr << [offset_start, offset_start + splitter_len]
225
+ end
226
+ end
227
+
228
+ def finalize_chunks(chunks, offsets_arr, is_first_call, return_offsets, overlap,
229
+ local_chunk_size, chunk_size, unoverlapped_chunk_size, text)
230
+ return [chunks, offsets_arr] unless is_first_call
231
+
232
+ chunks, offsets_arr = remove_empty_chunks(chunks, offsets_arr)
233
+
234
+ if overlap&.positive? && chunks.any?
235
+ chunks, offsets_arr = apply_overlap(
236
+ chunks,
237
+ offsets_arr,
238
+ local_chunk_size,
239
+ chunk_size,
240
+ unoverlapped_chunk_size,
241
+ text
242
+ )
243
+ end
244
+
245
+ return [chunks, offsets_arr] if return_offsets
246
+
247
+ chunks
248
+ end
249
+
250
+ def remove_empty_chunks(chunks, offsets_arr)
251
+ chunks_and_offsets = chunks.zip(offsets_arr).reject { |chunk, _| chunk.empty? || chunk.strip.empty? }
252
+
253
+ if chunks_and_offsets.any?
254
+ chunks_and_offsets.transpose
255
+ else
256
+ [[], []]
257
+ end
258
+ end
259
+
260
+ def apply_overlap(chunks, offsets_arr, subchunk_size, chunk_size, unoverlapped_chunk_size, text)
261
+ subchunks = chunks
262
+ suboffsets = offsets_arr
263
+ num_subchunks = subchunks.length
264
+
265
+ subchunks_per_chunk = (chunk_size.to_f / subchunk_size).floor
266
+ subchunk_stride = (unoverlapped_chunk_size.to_f / subchunk_size).floor
267
+
268
+ num_overlapping_chunks = [1, ((num_subchunks - subchunks_per_chunk).to_f / subchunk_stride).ceil + 1].max
269
+
270
+ offsets_arr = (0...num_overlapping_chunks).map do |i|
271
+ start_idx = i * subchunk_stride
272
+ end_idx = [start_idx + subchunks_per_chunk, num_subchunks].min - 1
273
+ [suboffsets[start_idx][0], suboffsets[end_idx][1]]
274
+ end
275
+
276
+ chunks = offsets_arr.map { |s, e| text[s...e] }
277
+
278
+ [chunks, offsets_arr]
279
+ end
280
+
281
+ public
282
+
283
+ # Construct a chunker that splits one or more texts into semantically meaningful chunks
284
+ #
285
+ # @param tokenizer_or_token_counter [String, #encode, Proc, Method, #call] Either: the name of a tokenizer; a tokenizer that possesses an encode method; or a token counter.
286
+ # @param chunk_size [Integer, nil] The maximum number of tokens a chunk may contain. Defaults to nil.
287
+ # @param max_token_chars [Integer, nil] The maximum number of characters a token may contain. Defaults to nil.
288
+ # @param memoize [Boolean] Whether to memoize the token counter. Defaults to true.
289
+ # @param cache_maxsize [Integer, nil] The maximum number of text-token count pairs that can be stored in the token counter's cache.
290
+ #
291
+ # @return [Chunker] A chunker instance
292
+ def chunkerify(tokenizer_or_token_counter, chunk_size: nil, max_token_chars: nil, memoize: true, cache_maxsize: nil)
293
+ validate_tokenizer_type(tokenizer_or_token_counter)
294
+
295
+ max_token_chars = determine_max_token_chars(tokenizer_or_token_counter, max_token_chars)
296
+ chunk_size = determine_chunk_size(tokenizer_or_token_counter, chunk_size)
297
+ token_counter = create_token_counter(tokenizer_or_token_counter)
298
+ token_counter = wrap_with_fast_counter(token_counter, max_token_chars, chunk_size) if max_token_chars
299
+ token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
300
+
301
+ Chunker.new(chunk_size: chunk_size, token_counter: token_counter)
302
+ end
303
+
304
+ private
305
+
306
+ def validate_tokenizer_type(tokenizer_or_token_counter)
307
+ return unless tokenizer_or_token_counter.is_a?(String)
308
+
309
+ raise NotImplementedError,
310
+ "String tokenizer names not yet supported in Ruby. Please pass a tokenizer object or token counter proc."
311
+ end
312
+
313
+ def determine_max_token_chars(tokenizer, max_token_chars)
314
+ return max_token_chars unless max_token_chars.nil?
315
+
316
+ if tokenizer.respond_to?(:token_byte_values)
317
+ vocab = tokenizer.token_byte_values
318
+ return vocab.map(&:length).max if vocab.respond_to?(:map)
319
+ elsif tokenizer.respond_to?(:get_vocab)
320
+ vocab = tokenizer.get_vocab
321
+ return vocab.keys.map(&:length).max if vocab.respond_to?(:keys)
322
+ end
323
+
324
+ nil
325
+ end
326
+
327
+ def determine_chunk_size(tokenizer, chunk_size)
328
+ return chunk_size unless chunk_size.nil?
329
+
330
+ raise ArgumentError, "chunk_size not provided and tokenizer lacks model_max_length attribute" unless tokenizer.respond_to?(:model_max_length) && tokenizer.model_max_length.is_a?(Integer)
331
+
332
+ chunk_size = tokenizer.model_max_length
333
+
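+ # Reserve room for any special tokens the tokenizer adds even to an empty input.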
334
+ if tokenizer.respond_to?(:encode)
335
+ begin
336
+ chunk_size -= tokenizer.encode("").length
337
+ rescue StandardError
338
+ # Ignore errors
339
+ end
340
+ end
341
+
342
+ chunk_size
343
+ end
344
+
345
+ def create_token_counter(tokenizer_or_token_counter)
346
+ return tokenizer_or_token_counter unless tokenizer_or_token_counter.respond_to?(:encode)
347
+
348
+ tokenizer = tokenizer_or_token_counter
349
+ encode_params = begin
350
+ tokenizer.method(:encode).parameters
351
+ rescue StandardError
352
+ []
353
+ end
354
+
355
+ has_special_tokens = encode_params.any? { |_type, name| name == :add_special_tokens }
356
+
357
+ if has_special_tokens
358
+ ->(text) { tokenizer.encode(text, add_special_tokens: false).length }
359
+ else
360
+ ->(text) { tokenizer.encode(text).length }
361
+ end
362
+ end
363
+
364
+ def wrap_with_fast_counter(token_counter, max_token_chars, chunk_size)
365
+ max_token_chars -= 1
366
+ original_token_counter = token_counter
367
+
368
+ lambda do |text|
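+ # Fast path for very long texts: if a generous prefix (about six characters per token, plus the longest possible token) already exceeds chunk_size tokens, skip counting the rest and report chunk_size + 1.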
369
+ heuristic = chunk_size * 6
370
+ if text.length > heuristic && original_token_counter.call(text[0...(heuristic + max_token_chars)]) > chunk_size
371
+ chunk_size + 1
372
+ else
373
+ original_token_counter.call(text)
374
+ end
375
+ end
376
+ end
377
+
378
+ def split_text(text)
379
+ splitter_is_whitespace = true
380
+
381
+ splitter = find_whitespace_splitter(text)
382
+ if splitter
383
+ result = try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
384
+ return result if result
385
+
386
+ return [splitter, splitter_is_whitespace, text.split(splitter)]
387
+ end
388
+
389
+ # No whitespace found, use non-whitespace semantic splitters
390
+ find_non_whitespace_splitter(text)
391
+ end
392
+
393
+ def find_whitespace_splitter(text)
394
+ return text.scan(/[\r\n]+/).max_by(&:length) if text.include?("\n") || text.include?("\r")
395
+ return text.scan(/\t+/).max_by(&:length) if text.include?("\t")
396
+ return text.scan(/\s+/).max_by(&:length) if text.match?(/\s/)
397
+
398
+ nil
399
+ end
400
+
401
+ def try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
402
+ return nil unless splitter.length == 1
403
+
404
+ NON_WHITESPACE_SEMANTIC_SPLITTERS.each do |preceder|
405
+ escaped_preceder = Regexp.escape(preceder)
406
+ match = text.match(/#{escaped_preceder}(\s)/)
407
+ next unless match
408
+
409
+ matched_splitter = match[1]
410
+ escaped_splitter = Regexp.escape(matched_splitter)
411
+ return [matched_splitter, splitter_is_whitespace, text.split(/(?<=#{escaped_preceder})#{escaped_splitter}/)]
412
+ end
413
+
414
+ nil
415
+ end
416
+
417
+ def find_non_whitespace_splitter(text)
418
+ splitter = NON_WHITESPACE_SEMANTIC_SPLITTERS.find { |s| text.include?(s) }
419
+ return ["", true, text.chars] unless splitter
420
+
421
+ [splitter, false, text.split(splitter)]
422
+ end
423
+
424
+ def bisect_left(sorted, target, low, high)
425
+ while low < high
426
+ mid = (low + high) / 2
427
+ if sorted[mid] < target
428
+ low = mid + 1
429
+ else
430
+ high = mid
431
+ end
432
+ end
433
+ low
434
+ end
435
+
436
+ def merge_splits(splits:, cum_lens:, chunk_size:, splitter:, token_counter:, start:, high:)
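+ # `average` is a running estimate of characters per token: it starts low and is refined from observed counts so the binary search converges on the largest run of splits that still fits within chunk_size.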
437
+ average = 0.2
438
+ low = start
439
+
440
+ offset = cum_lens[start]
441
+ target = offset + (chunk_size * average)
442
+
443
+ while low < high
444
+ i = bisect_left(cum_lens, target, low, high)
445
+ midpoint = [i, high - 1].min
446
+
447
+ tokens = token_counter.call(splits[start...midpoint].join(splitter))
448
+
449
+ local_cum = cum_lens[midpoint] - offset
450
+
451
+ if local_cum.positive? && tokens.positive?
452
+ average = local_cum.to_f / tokens
453
+ target = offset + (chunk_size * average)
454
+ end
455
+
456
+ if tokens > chunk_size
457
+ high = midpoint
458
+ else
459
+ low = midpoint + 1
460
+ end
461
+ end
462
+
463
+ last_split_index = low - 1
464
+ [last_split_index, splits[start...last_split_index].join(splitter)]
465
+ end
466
+
467
+ def memoize_token_counter(token_counter, maxsize=nil)
468
+ return @memoized_token_counters[token_counter] if @memoized_token_counters.key?(token_counter)
469
+
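+ # Simple memoization: `cache` maps each text to its token count; when maxsize is set, `queue` records insertion order so the oldest entry is evicted first (FIFO).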
470
+ cache = {}
471
+ queue = []
472
+
473
+ memoized = lambda do |text|
474
+ if cache.key?(text)
475
+ cache[text]
476
+ else
477
+ result = token_counter.call(text)
478
+ cache[text] = result
479
+
480
+ if maxsize
481
+ queue << text
482
+ if queue.length > maxsize
483
+ oldest = queue.shift
484
+ cache.delete(oldest)
485
+ end
486
+ end
487
+
488
+ result
489
+ end
490
+ end
491
+
492
+ @memoized_token_counters[token_counter] = memoized
493
+ end
494
+ end
5
495
  end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: semchunk
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.0
4
+ version: 0.1.2
5
5
  platform: ruby
6
6
  authors:
7
7
  - Philip Zhan
@@ -18,6 +18,7 @@ files:
18
18
  - LICENSE.txt
19
19
  - README.md
20
20
  - lib/semchunk.rb
21
+ - lib/semchunk/chunker.rb
21
22
  - lib/semchunk/version.rb
22
23
  homepage: https://github.com/philip-zhan/semchunk.rb
23
24
  licenses: