semchunk 0.1.1 → 0.1.2

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: ba78c1e8ca6f2d54a7345a1f2c33a4492224c121022db4fdc4dd46742e4891c8
- data.tar.gz: f7c5e2d0a24964d97f59aa6b1b16649712108e5ce9a3253a178e36d2a3196ecd
+ metadata.gz: 1ca2e04c2509ef3fa509307aaf5231dcb23df17a596a2dfdfcfa9760c0f6498b
+ data.tar.gz: e2850b2ca76ca627d7bf67ba070f906092e193d19a632bbbee19dfeeb153583a
  SHA512:
- metadata.gz: e41759111d572eff3fe47d518f33524e2fac447a09992e96e76a488c27cb760cbe82352b7533680a8761f8360ddfc2613ac5c7b148e21c9800357dd9dcbbed94
- data.tar.gz: e4b4acaf2229dcf85d16d1c3869a4978f17f69fe60b3c2b2a86c67354a4e90f9982691f88f8c6595189863109bb0df82be2232b5144ea036ff2db4e0ce0425df
+ metadata.gz: 53ca16f2cddb69d9eb7461754dee4261f592acb0aa88329a5803f9c92246529356ed4bb9932f7cb1895d3480b8cb6eac04112bb5066c8e80f1edbe1430ba3578
+ data.tar.gz: d302134a84ea39be36b7a7637f88c9107bfd452b495133fbed7ed678f4f8c6a986a265550917e5d198f73ed2922584e48bca89a14948e6c855df4ee852af395c
data/README.md CHANGED
@@ -1,12 +1,18 @@
- # Semchunk
+ <div align="center">
+
+ # semchunk 🧩

  [![Gem Version](https://img.shields.io/gem/v/semchunk)](https://rubygems.org/gems/semchunk)
  [![Gem Downloads](https://img.shields.io/gem/dt/semchunk)](https://www.ruby-toolbox.com/projects/semchunk)
  [![GitHub Workflow Status](https://img.shields.io/github/actions/workflow/status/philip-zhan/semchunk.rb/ci.yml)](https://github.com/philip-zhan/semchunk.rb/actions/workflows/ci.yml)

- Split text into semantically meaningful chunks of a specified size as determined by a provided token counter.
+ </div>
+
+ **`semchunk`** is a fast and lightweight Ruby gem for splitting text into semantically meaningful chunks.

- This is a Ruby port of the Python [semchunk](https://github.com/umarbutler/semchunk) package.
+ This is a Ruby port of the Python [semchunk](https://github.com/isaacus-dev/semchunk) library by [Isaacus](https://isaacus.com/), maintaining the same efficient chunking algorithm and API design.
+
+ `semchunk` produces chunks that are more semantically meaningful than regular token and recursive character chunkers, while being fast and easy to use.

  ## Features

@@ -293,16 +299,31 @@ text = "Your long text here..."
  chunks = chunker.call(text)
  ```

- ## How It Works
+ ## How It Works 🔍
+
+ `semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:

- Semchunk uses a hierarchical splitting strategy:
+ 1. Splits text using the most semantically meaningful splitter possible;
+ 2. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
+ 3. Merges any chunks that are under the chunk size back together until the chunk size is reached;
+ 4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
+ 5. Excludes chunks consisting entirely of whitespace characters.

- 1. **Primary split**: Tries to split on paragraph breaks (newlines)
- 2. **Secondary split**: Falls back to sentences (periods, question marks, etc.)
- 3. **Tertiary split**: Uses clauses (commas, semicolons) if needed
- 4. **Final split**: Character-level splitting as last resort
+ To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:

- This ensures that chunks are semantically meaningful while respecting your token limits.
+ 1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
+ 2. The largest sequence of tabs;
+ 3. The largest sequence of whitespace characters (as defined by regex's `\s` character class) or, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
+ 4. Sentence terminators (`.`, `?`, `!` and `*`);
+ 5. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `'`, `‘`, `’`, `"` and `` ` ``);
+ 6. Sentence interrupters (`:`, `—` and `…`);
+ 7. Word joiners (`/`, `\`, `–`, `&` and `-`); and
+ 8. All other characters.
+
+ If overlapping chunks have been requested, `semchunk` also:
+
+ 1. Internally reduces the chunk size to `min(overlap, chunk_size - overlap)` (`overlap` being computed as `floor(chunk_size * overlap)` for relative overlaps and `min(overlap, chunk_size - 1)` for absolute overlaps); and
+ 2. Merges every `floor(original_chunk_size / reduced_chunk_size)` chunks starting from the first chunk and then jumping by `floor((original_chunk_size - overlap) / reduced_chunk_size)` chunks until the last chunk is reached.

  The algorithm uses binary search to efficiently find the optimal split points, making it fast even for large documents.

@@ -318,6 +339,22 @@ ruby examples/basic_usage.rb
  ruby examples/advanced_usage.rb
  ```

+ ## Benchmarks 📊
+
+ You can run the included benchmark to test performance:
+
+ ```bash
+ ruby test/bench.rb
+ ```
+
+ The Ruby implementation maintains similar performance characteristics to the Python version:
+ - Efficient binary search for optimal split points
+ - O(n log n) complexity for chunking
+ - Fast token count lookups with memoization
+ - Low memory overhead
+
+ The benchmark tests chunking multiple texts with various chunk sizes and provides detailed performance metrics.
+
  ## Differences from Python Version

  This Ruby port maintains feature parity with the Python version, with a few notes:
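A minimal sketch of how the chunking and overlap arithmetic described in the new "How It Works" section plays out. The word-counting token counter, the `chunk_size` of 8 and the `overlap` of 0.5 are illustrative values, not part of the gem, and the sketch assumes the `chunkerify` helper shown later in this diff is exposed as `Semchunk.chunkerify`.

```ruby
require "semchunk"

# Illustrative token counter: counts whitespace-separated words.
word_counter = ->(text) { text.split.length }

chunker = Semchunk.chunkerify(word_counter, chunk_size: 8)

text = "The quick brown fox jumps over the lazy dog. " \
       "It was the best of times, it was the worst of times."

chunks = chunker.call(text)
# => an array of chunk strings, each at most 8 "tokens" long by this counter

# With a relative overlap of 0.5 the absolute overlap is floor(8 * 0.5) = 4,
# so the effective (reduced) chunk size becomes min(4, 8 - 4) = 4 and
# consecutive chunks share roughly 4 tokens.
overlapping = chunker.call(text, overlap: 0.5)
```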
@@ -16,27 +16,23 @@ module Semchunk
  #
  # @param text_or_texts [String, Array<String>] The text or texts to be chunked
  # @param processes [Integer] The number of processes to use when chunking multiple texts (not yet implemented)
- # @param progress [Boolean] Whether to display a progress bar when chunking multiple texts (not yet implemented)
  # @param offsets [Boolean] Whether to return the start and end offsets of each chunk
  # @param overlap [Float, Integer, nil] The proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap
  #
  # @return [Array<String>, Array<Array>, Hash] Depending on the input and options, returns chunks and optionally offsets
- def call(text_or_texts, processes: 1, progress: false, offsets: false, overlap: nil)
+ def call(text_or_texts, processes: 1, offsets: false, overlap: nil)
  chunk_function = make_chunk_function(offsets: offsets, overlap: overlap)

  # Handle single text
- if text_or_texts.is_a?(String)
- return chunk_function.call(text_or_texts)
- end
+ return chunk_function.call(text_or_texts) if text_or_texts.is_a?(String)

  # Handle multiple texts
- if processes == 1
- # TODO: Add progress bar support
- chunks_and_offsets = text_or_texts.map { |text| chunk_function.call(text) }
- else
- # TODO: Add parallel processing support
- raise NotImplementedError, "Parallel processing not yet implemented. Please use processes: 1"
- end
+ raise NotImplementedError, "Parallel processing not yet implemented. Please use processes: 1" unless processes == 1
+
+ # TODO: Add progress bar support
+ chunks_and_offsets = text_or_texts.map { |text| chunk_function.call(text) }
+
+ # TODO: Add parallel processing support

  # Return results
  if offsets
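A short usage sketch of the refactored `Chunker#call` above. The chunker construction and input texts are illustrative, and the sketch highlights the guard clause introduced in this version: for an array of texts, any `processes` value other than 1 raises `NotImplementedError`.

```ruby
# Illustrative word-count token counter and chunk size.
chunker = Semchunk.chunkerify(->(t) { t.split.length }, chunk_size: 6)

# A single String is passed straight to the underlying chunk function.
single = chunker.call("one two three four five six seven eight")

# An Array of texts is chunked one text at a time.
batch = chunker.call(["first document text", "second document text"])

# Parallel processing is not implemented yet, so this raises.
begin
  chunker.call(["first document text"], processes: 4)
rescue NotImplementedError => e
  warn e.message
end
```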
@@ -1,5 +1,5 @@
  # frozen_string_literal: true

  module Semchunk
- VERSION = "0.1.1"
+ VERSION = "0.1.2"
  end
data/lib/semchunk.rb CHANGED
@@ -23,158 +23,263 @@ module Semchunk
  # @param start [Integer] Internal parameter for tracking character offset.
  #
  # @return [Array<String>, Array<Array>] A list of chunks up to chunk_size-tokens-long, with any whitespace used to split the text removed, and, if offsets is true, a list of tuples [start, end].
- def chunk(text, chunk_size:, token_counter:, memoize: true, offsets: false, overlap: nil, cache_maxsize: nil, recursion_depth: 0, start: 0)
- # Rename variables for clarity
+ def chunk(text, chunk_size:, token_counter:, memoize: true, offsets: false, overlap: nil, cache_maxsize: nil,
+ recursion_depth: 0, start: 0)
  return_offsets = offsets
+ is_first_call = recursion_depth.zero?
+
+ # Initialize token counter and compute effective chunk size
+ token_counter, local_chunk_size, overlap, unoverlapped_chunk_size = initialize_chunking_params(
+ token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize
+ )
+
+ # Split text and prepare metadata
+ splitter, splitter_is_whitespace, splits = split_text(text)
+ split_starts, cum_lens, num_splits_plus_one = prepare_split_metadata(splits, splitter, start)
+
+ # Process splits into chunks
+ chunks, offsets_arr = process_splits(
+ splits,
+ split_starts,
+ cum_lens,
+ splitter,
+ splitter_is_whitespace,
+ token_counter,
+ local_chunk_size,
+ num_splits_plus_one,
+ recursion_depth
+ )
+
+ # Finalize first call: cleanup and overlap
+ finalize_chunks(
+ chunks,
+ offsets_arr,
+ is_first_call,
+ return_offsets,
+ overlap,
+ local_chunk_size,
+ chunk_size,
+ unoverlapped_chunk_size,
+ text
+ )
+ end
+
+ # A tuple of semantically meaningful non-whitespace splitters
+ NON_WHITESPACE_SEMANTIC_SPLITTERS = [
+ # Sentence terminators
+ ".",
+ "?",
+ "!",
+ "*",
+ # Clause separators
+ ";",
+ ",",
+ "(",
+ ")",
+ "[",
+ "]",
+ "“",
+ "”",
+ "'",
+ "‘",
+ "’",
+ '"',
+ "`",
+ # Sentence interrupters
+ ":",
+ "—",
+ "…",
+ # Word joiners
+ "/",
+ "\\",
+ "–",
+ "&",
+ "-"
+ ].freeze
+
+ private
+
+ def initialize_chunking_params(token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize)
  local_chunk_size = chunk_size
+ unoverlapped_chunk_size = nil

- # If this is the first call, memoize the token counter if memoization is enabled and reduce the effective chunk size if overlapping chunks
- is_first_call = recursion_depth.zero?
+ return [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size] unless is_first_call

- if is_first_call
- if memoize
- token_counter = memoize_token_counter(token_counter, cache_maxsize)
- end
+ token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize

- if overlap
- # Make relative overlaps absolute and floor both relative and absolute overlaps
- overlap = if overlap < 1
- (chunk_size * overlap).floor
- else
- [overlap, chunk_size - 1].min
- end
-
- # If the overlap has not been zeroed, compute the effective chunk size
- if overlap.positive?
- unoverlapped_chunk_size = chunk_size - overlap
- local_chunk_size = [overlap, unoverlapped_chunk_size].min
- end
+ if overlap
+ overlap = compute_overlap(overlap, chunk_size)
+
+ if overlap.positive?
+ unoverlapped_chunk_size = chunk_size - overlap
+ local_chunk_size = [overlap, unoverlapped_chunk_size].min
  end
  end

- # Split the text using the most semantically meaningful splitter possible
- splitter, splitter_is_whitespace, splits = split_text(text)
+ [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size]
+ end

- offsets_arr = []
+ def compute_overlap(overlap, chunk_size)
+ if overlap < 1
+ (chunk_size * overlap).floor
+ else
+ [overlap, chunk_size - 1].min
+ end
+ end
+
+ def prepare_split_metadata(splits, splitter, start)
  splitter_len = splitter.length
  split_lens = splits.map(&:length)
+
  cum_lens = [0]
- split_lens.each { |len| cum_lens << cum_lens.last + len }
+ split_lens.each { |len| cum_lens << (cum_lens.last + len) }

  split_starts = [0]
  split_lens.each_with_index do |split_len, i|
- split_starts << split_starts[i] + split_len + splitter_len
+ split_starts << (split_starts[i] + split_len + splitter_len)
  end
  split_starts = split_starts.map { |s| s + start }

  num_splits_plus_one = splits.length + 1

+ [split_starts, cum_lens, num_splits_plus_one]
+ end
+
+ def process_splits(splits, split_starts, cum_lens, splitter, splitter_is_whitespace,
+ token_counter, local_chunk_size, num_splits_plus_one, recursion_depth)
  chunks = []
+ offsets_arr = []
  skips = Set.new
+ splitter_len = splitter.length

- # Iterate through the splits
  splits.each_with_index do |split, i|
- # Skip the split if it has already been added to a chunk
  next if skips.include?(i)

  split_start = split_starts[i]

- # If the split is over the chunk size, recursively chunk it
  if token_counter.call(split) > local_chunk_size
- new_chunks, new_offsets = chunk(
- split,
- chunk_size: local_chunk_size,
- token_counter: token_counter,
- offsets: true,
- recursion_depth: recursion_depth + 1,
- start: split_start
- )
-
+ new_chunks, new_offsets = chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
  chunks.concat(new_chunks)
  offsets_arr.concat(new_offsets)
  else
- # Merge the split with subsequent splits until the chunk size is reached
- final_split_in_chunk_i, new_chunk = merge_splits(
- splits: splits,
- cum_lens: cum_lens,
- chunk_size: local_chunk_size,
- splitter: splitter,
+ final_split_i, new_chunk = merge_splits(
+ splits: splits,
+ cum_lens: cum_lens,
+ chunk_size: local_chunk_size,
+ splitter: splitter,
  token_counter: token_counter,
- start: i,
- high: num_splits_plus_one
+ start: i,
+ high: num_splits_plus_one
  )

- # Mark any splits included in the new chunk for exclusion from future chunks
- ((i + 1)...final_split_in_chunk_i).each { |j| skips.add(j) }
-
- # Add the chunk
+ ((i + 1)...final_split_i).each { |j| skips.add(j) }
  chunks << new_chunk
-
- # Add the chunk's offsets
- split_end = split_starts[final_split_in_chunk_i] - splitter_len
+ split_end = split_starts[final_split_i] - splitter_len
  offsets_arr << [split_start, split_end]
  end

- # If the splitter is not whitespace and the split is not the last split, add the splitter to the end of the latest chunk
- unless splitter_is_whitespace || (i == splits.length - 1 || ((i + 1)...splits.length).all? { |j| skips.include?(j) })
- last_chunk_with_splitter = chunks[-1] + splitter
- if token_counter.call(last_chunk_with_splitter) <= local_chunk_size
- chunks[-1] = last_chunk_with_splitter
- offset_start, offset_end = offsets_arr[-1]
- offsets_arr[-1] = [offset_start, offset_end + splitter_len]
- else
- offset_start = offsets_arr.empty? ? split_start : offsets_arr[-1][1]
- chunks << splitter
- offsets_arr << [offset_start, offset_start + splitter_len]
- end
- end
+ append_splitter_if_needed(
+ chunks,
+ offsets_arr,
+ splitter,
+ splitter_is_whitespace,
+ i,
+ splits,
+ skips,
+ token_counter,
+ local_chunk_size,
+ split_start
+ )
  end

- # If this is the first call, remove empty chunks and overlap if desired
- if is_first_call
- # Remove empty chunks and chunks comprised entirely of whitespace
- chunks_and_offsets = chunks.zip(offsets_arr).reject { |chunk, _| chunk.empty? || chunk.strip.empty? }
+ [chunks, offsets_arr]
+ end

- if chunks_and_offsets.any?
- chunks, offsets_arr = chunks_and_offsets.transpose
- else
- chunks = []
- offsets_arr = []
- end
+ def chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
+ chunk(
+ split,
+ chunk_size: local_chunk_size,
+ token_counter: token_counter,
+ offsets: true,
+ recursion_depth: recursion_depth + 1,
+ start: split_start
+ )
+ end

- # Overlap chunks if desired and there are chunks to overlap
- if overlap && overlap.positive? && chunks.any?
- # Rename variables for clarity
- subchunk_size = local_chunk_size
- subchunks = chunks
- suboffsets = offsets_arr
- num_subchunks = subchunks.length
+ def append_splitter_if_needed(chunks, offsets_arr, splitter, splitter_is_whitespace,
+ split_index, splits, skips, token_counter, local_chunk_size,
+ split_start)
+ return if splitter_is_whitespace
+ return if split_index == splits.length - 1
+ return if ((split_index + 1)...splits.length).all? { |j| skips.include?(j) }

- # Merge the subchunks into overlapping chunks
- subchunks_per_chunk = (chunk_size.to_f / subchunk_size).floor
- subchunk_stride = (unoverlapped_chunk_size.to_f / subchunk_size).floor
+ splitter_len = splitter.length
+ last_chunk_with_splitter = chunks[-1] + splitter
+ if token_counter.call(last_chunk_with_splitter) <= local_chunk_size
+ chunks[-1] = last_chunk_with_splitter
+ offset_start, offset_end = offsets_arr[-1]
+ offsets_arr[-1] = [offset_start, offset_end + splitter_len]
+ else
+ offset_start = offsets_arr.empty? ? split_start : offsets_arr[-1][1]
+ chunks << splitter
+ offsets_arr << [offset_start, offset_start + splitter_len]
+ end
+ end

- num_overlapping_chunks = [1, ((num_subchunks - subchunks_per_chunk).to_f / subchunk_stride).ceil + 1].max
+ def finalize_chunks(chunks, offsets_arr, is_first_call, return_offsets, overlap,
+ local_chunk_size, chunk_size, unoverlapped_chunk_size, text)
+ return [chunks, offsets_arr] unless is_first_call
+
+ chunks, offsets_arr = remove_empty_chunks(chunks, offsets_arr)
+
+ if overlap&.positive? && chunks.any?
+ chunks, offsets_arr = apply_overlap(
+ chunks,
+ offsets_arr,
+ local_chunk_size,
+ chunk_size,
+ unoverlapped_chunk_size,
+ text
+ )
+ end

- offsets_arr = (0...num_overlapping_chunks).map do |i|
- start_idx = i * subchunk_stride
- end_idx = [start_idx + subchunks_per_chunk, num_subchunks].min - 1
- [suboffsets[start_idx][0], suboffsets[end_idx][1]]
- end
+ return [chunks, offsets_arr] if return_offsets

- chunks = offsets_arr.map { |s, e| text[s...e] }
- end
+ chunks
+ end

- # Return offsets if desired
- return [chunks, offsets_arr] if return_offsets
+ def remove_empty_chunks(chunks, offsets_arr)
+ chunks_and_offsets = chunks.zip(offsets_arr).reject { |chunk, _| chunk.empty? || chunk.strip.empty? }

- return chunks
+ if chunks_and_offsets.any?
+ chunks_and_offsets.transpose
+ else
+ [[], []]
  end
+ end
+
+ def apply_overlap(chunks, offsets_arr, subchunk_size, chunk_size, unoverlapped_chunk_size, text)
+ subchunks = chunks
+ suboffsets = offsets_arr
+ num_subchunks = subchunks.length
+
+ subchunks_per_chunk = (chunk_size.to_f / subchunk_size).floor
+ subchunk_stride = (unoverlapped_chunk_size.to_f / subchunk_size).floor
+
+ num_overlapping_chunks = [1, ((num_subchunks - subchunks_per_chunk).to_f / subchunk_stride).ceil + 1].max
+
+ offsets_arr = (0...num_overlapping_chunks).map do |i|
+ start_idx = i * subchunk_stride
+ end_idx = [start_idx + subchunks_per_chunk, num_subchunks].min - 1
+ [suboffsets[start_idx][0], suboffsets[end_idx][1]]
+ end
+
+ chunks = offsets_arr.map { |s, e| text[s...e] }

- # Always return chunks and offsets if this is a recursive call
  [chunks, offsets_arr]
  end

+ public
+
  # Construct a chunker that splits one or more texts into semantically meaningful chunks
  #
  # @param tokenizer_or_token_counter [String, #encode, Proc, Method, #call] Either: the name of a tokenizer; a tokenizer that possesses an encode method; or a token counter.
@@ -185,132 +290,135 @@ module Semchunk
  #
  # @return [Chunker] A chunker instance
  def chunkerify(tokenizer_or_token_counter, chunk_size: nil, max_token_chars: nil, memoize: true, cache_maxsize: nil)
- # Handle string tokenizer names (would require tiktoken/transformers Ruby equivalents)
- if tokenizer_or_token_counter.is_a?(String)
- raise NotImplementedError, "String tokenizer names not yet supported in Ruby. Please pass a tokenizer object or token counter proc."
+ validate_tokenizer_type(tokenizer_or_token_counter)
+
+ max_token_chars = determine_max_token_chars(tokenizer_or_token_counter, max_token_chars)
+ chunk_size = determine_chunk_size(tokenizer_or_token_counter, chunk_size)
+ token_counter = create_token_counter(tokenizer_or_token_counter)
+ token_counter = wrap_with_fast_counter(token_counter, max_token_chars, chunk_size) if max_token_chars
+ token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
+
+ Chunker.new(chunk_size: chunk_size, token_counter: token_counter)
+ end
+
+ private
+
+ def validate_tokenizer_type(tokenizer_or_token_counter)
+ return unless tokenizer_or_token_counter.is_a?(String)
+
+ raise NotImplementedError,
+ "String tokenizer names not yet supported in Ruby. Please pass a tokenizer object or token counter proc."
+ end
+
+ def determine_max_token_chars(tokenizer, max_token_chars)
+ return max_token_chars unless max_token_chars.nil?
+
+ if tokenizer.respond_to?(:token_byte_values)
+ vocab = tokenizer.token_byte_values
+ return vocab.map(&:length).max if vocab.respond_to?(:map)
+ elsif tokenizer.respond_to?(:get_vocab)
+ vocab = tokenizer.get_vocab
+ return vocab.keys.map(&:length).max if vocab.respond_to?(:keys)
  end

- # Determine max_token_chars if not provided
- if max_token_chars.nil?
- if tokenizer_or_token_counter.respond_to?(:token_byte_values)
- vocab = tokenizer_or_token_counter.token_byte_values
- max_token_chars = vocab.map(&:length).max if vocab.respond_to?(:map)
- elsif tokenizer_or_token_counter.respond_to?(:get_vocab)
- vocab = tokenizer_or_token_counter.get_vocab
- max_token_chars = vocab.keys.map(&:length).max if vocab.respond_to?(:keys)
+ nil
+ end
+
+ def determine_chunk_size(tokenizer, chunk_size)
+ return chunk_size unless chunk_size.nil?
+
+ raise ArgumentError, "chunk_size not provided and tokenizer lacks model_max_length attribute" unless tokenizer.respond_to?(:model_max_length) && tokenizer.model_max_length.is_a?(Integer)
+
+ chunk_size = tokenizer.model_max_length
+
+ if tokenizer.respond_to?(:encode)
+ begin
+ chunk_size -= tokenizer.encode("").length
+ rescue StandardError
+ # Ignore errors
  end
  end

- # Determine chunk_size if not provided
- if chunk_size.nil?
- if tokenizer_or_token_counter.respond_to?(:model_max_length) && tokenizer_or_token_counter.model_max_length.is_a?(Integer)
- chunk_size = tokenizer_or_token_counter.model_max_length
-
- # Attempt to reduce the chunk size by the number of special characters
- if tokenizer_or_token_counter.respond_to?(:encode)
- begin
- chunk_size -= tokenizer_or_token_counter.encode("").length
- rescue StandardError
- # Ignore errors
- end
- end
- else
- raise ArgumentError, "chunk_size not provided and tokenizer lacks model_max_length attribute"
- end
+ chunk_size
+ end
+
+ def create_token_counter(tokenizer_or_token_counter)
+ return tokenizer_or_token_counter unless tokenizer_or_token_counter.respond_to?(:encode)
+
+ tokenizer = tokenizer_or_token_counter
+ encode_params = begin
+ tokenizer.method(:encode).parameters
+ rescue StandardError
+ []
  end

- # Construct token counter from tokenizer if needed
- if tokenizer_or_token_counter.respond_to?(:encode)
- tokenizer = tokenizer_or_token_counter
- # Check if encode accepts add_special_tokens parameter
- encode_params = tokenizer.method(:encode).parameters rescue []
- has_special_tokens = encode_params.any? { |type, name| name == :add_special_tokens }
-
- token_counter = if has_special_tokens
- ->(text) { tokenizer.encode(text, add_special_tokens: false).length }
- else
- ->(text) { tokenizer.encode(text).length }
- end
+ has_special_tokens = encode_params.any? { |_type, name| name == :add_special_tokens }
+
+ if has_special_tokens
+ ->(text) { tokenizer.encode(text, add_special_tokens: false).length }
  else
- token_counter = tokenizer_or_token_counter
+ ->(text) { tokenizer.encode(text).length }
  end
+ end

- # Add fast token counter optimization if max_token_chars is known
- if max_token_chars
- max_token_chars -= 1
- original_token_counter = token_counter
-
- token_counter = lambda do |text|
- heuristic = chunk_size * 6
- if text.length > heuristic && original_token_counter.call(text[0...(heuristic + max_token_chars)]) > chunk_size
- chunk_size + 1
- else
- original_token_counter.call(text)
- end
+ def wrap_with_fast_counter(token_counter, max_token_chars, chunk_size)
+ max_token_chars -= 1
+ original_token_counter = token_counter
+
+ lambda do |text|
+ heuristic = chunk_size * 6
+ if text.length > heuristic && original_token_counter.call(text[0...(heuristic + max_token_chars)]) > chunk_size
+ chunk_size + 1
+ else
+ original_token_counter.call(text)
  end
  end
+ end
+
+ def split_text(text)
+ splitter_is_whitespace = true
+
+ splitter = find_whitespace_splitter(text)
+ if splitter
+ result = try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
+ return result if result

- # Memoize the token counter if necessary
- if memoize
- token_counter = memoize_token_counter(token_counter, cache_maxsize)
+ return [splitter, splitter_is_whitespace, text.split(splitter)]
  end

- # Construct and return the chunker
- Chunker.new(chunk_size: chunk_size, token_counter: token_counter)
+ # No whitespace found, use non-whitespace semantic splitters
+ find_non_whitespace_splitter(text)
  end

- private
+ def find_whitespace_splitter(text)
+ return text.scan(/[\r\n]+/).max_by(&:length) if text.include?("\n") || text.include?("\r")
+ return text.scan(/\t+/).max_by(&:length) if text.include?("\t")
+ return text.scan(/\s+/).max_by(&:length) if text.match?(/\s/)

- # A tuple of semantically meaningful non-whitespace splitters
- NON_WHITESPACE_SEMANTIC_SPLITTERS = [
- # Sentence terminators
- ".", "?", "!", "*",
- # Clause separators
- ";", ",", "(", ")", "[", "]", "“", "”", "'", "‘", "’", '"', "`",
- # Sentence interrupters
- ":", "—", "…",
- # Word joiners
- "/", "\\", "–", "&", "-"
- ].freeze
+ nil
+ end

- def split_text(text)
- splitter_is_whitespace = true
+ def try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
+ return nil unless splitter.length == 1

- # Try splitting at various levels
- if text.include?("\n") || text.include?("\r")
- newline_matches = text.scan(/[\r\n]+/)
- splitter = newline_matches.max_by(&:length)
- elsif text.include?("\t")
- tab_matches = text.scan(/\t+/)
- splitter = tab_matches.max_by(&:length)
- elsif text.match?(/\s/)
- whitespace_matches = text.scan(/\s+/)
- splitter = whitespace_matches.max_by(&:length)
-
- # If the splitter is only a single character, see if we can target whitespace preceded by semantic splitters
- if splitter.length == 1
- NON_WHITESPACE_SEMANTIC_SPLITTERS.each do |preceder|
- escaped_preceder = Regexp.escape(preceder)
- if (match = text.match(/#{escaped_preceder}(\s)/))
- splitter = match[1]
- escaped_splitter = Regexp.escape(splitter)
- return [splitter, splitter_is_whitespace, text.split(/(?<=#{escaped_preceder})#{escaped_splitter}/)]
- end
- end
- end
- else
- # Find the most desirable semantically meaningful non-whitespace splitter
- splitter = NON_WHITESPACE_SEMANTIC_SPLITTERS.find { |s| text.include?(s) }
+ NON_WHITESPACE_SEMANTIC_SPLITTERS.each do |preceder|
+ escaped_preceder = Regexp.escape(preceder)
+ match = text.match(/#{escaped_preceder}(\s)/)
+ next unless match

- if splitter
- splitter_is_whitespace = false
- else
- # No semantic splitter found, return characters
- return ["", splitter_is_whitespace, text.chars]
- end
+ matched_splitter = match[1]
+ escaped_splitter = Regexp.escape(matched_splitter)
+ return [matched_splitter, splitter_is_whitespace, text.split(/(?<=#{escaped_preceder})#{escaped_splitter}/)]
  end

- [splitter, splitter_is_whitespace, text.split(splitter)]
+ nil
+ end
+
+ def find_non_whitespace_splitter(text)
+ splitter = NON_WHITESPACE_SEMANTIC_SPLITTERS.find { |s| text.include?(s) }
+ return ["", true, text.chars] unless splitter
+
+ [splitter, false, text.split(splitter)]
  end

  def bisect_left(sorted, target, low, high)
@@ -356,7 +464,7 @@ module Semchunk
  [last_split_index, splits[start...last_split_index].join(splitter)]
  end

- def memoize_token_counter(token_counter, maxsize = nil)
+ def memoize_token_counter(token_counter, maxsize=nil)
  return @memoized_token_counters[token_counter] if @memoized_token_counters.key?(token_counter)

  cache = {}
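To make the token-counter plumbing in `chunkerify` above concrete, here is a hedged sketch of the two kinds of arguments this diff shows it accepting: a plain counter proc, or a tokenizer-like object with an `encode` method, in which case `create_token_counter` derives the counter and passes `add_special_tokens: false` when the method supports it. The `ToyTokenizer` class and the `chunk_size` values are purely illustrative.

```ruby
# Option 1: a plain token-counter proc.
counter = ->(text) { text.split.length }
chunker = Semchunk.chunkerify(counter, chunk_size: 16)

# Option 2: an object that responds to #encode, standing in for a real tokenizer.
class ToyTokenizer
  def encode(text, add_special_tokens: true)
    tokens = text.scan(/\S+/)
    add_special_tokens ? ["<s>"] + tokens + ["</s>"] : tokens
  end
end

# chunkerify wraps ToyTokenizer#encode into a token counter and, because the
# method accepts add_special_tokens, calls it with add_special_tokens: false.
tok_chunker = Semchunk.chunkerify(ToyTokenizer.new, chunk_size: 16)

# Passing a tokenizer *name* as a String is not supported in the Ruby port yet:
# Semchunk.chunkerify("gpt2")  # => raises NotImplementedError
```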
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: semchunk
  version: !ruby/object:Gem::Version
- version: 0.1.1
+ version: 0.1.2
  platform: ruby
  authors:
  - Philip Zhan