semchunk 0.1.1 → 0.1.2
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/README.md +47 -10
- data/lib/semchunk/chunker.rb +8 -12
- data/lib/semchunk/version.rb +1 -1
- data/lib/semchunk.rb +310 -202
- metadata +1 -1
checksums.yaml
CHANGED

```diff
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 1ca2e04c2509ef3fa509307aaf5231dcb23df17a596a2dfdfcfa9760c0f6498b
+  data.tar.gz: e2850b2ca76ca627d7bf67ba070f906092e193d19a632bbbee19dfeeb153583a
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 53ca16f2cddb69d9eb7461754dee4261f592acb0aa88329a5803f9c92246529356ed4bb9932f7cb1895d3480b8cb6eac04112bb5066c8e80f1edbe1430ba3578
+  data.tar.gz: d302134a84ea39be36b7a7637f88c9107bfd452b495133fbed7ed678f4f8c6a986a265550917e5d198f73ed2922584e48bca89a14948e6c855df4ee852af395c
```
data/README.md
CHANGED

```diff
@@ -1,12 +1,18 @@
-
+<div align="center">
+
+# semchunk 🧩
 
 [](https://rubygems.org/gems/semchunk)
 [](https://www.ruby-toolbox.com/projects/semchunk)
 [](https://github.com/philip-zhan/semchunk.rb/actions/workflows/ci.yml)
 
-
+</div>
+
+**`semchunk`** is a fast and lightweight Ruby gem for splitting text into semantically meaningful chunks.
 
-This is a Ruby port of the Python [semchunk](https://github.com/
+This is a Ruby port of the Python [semchunk](https://github.com/isaacus-dev/semchunk) library by [Isaacus](https://isaacus.com/), maintaining the same efficient chunking algorithm and API design.
+
+`semchunk` produces chunks that are more semantically meaningful than regular token and recursive character chunkers, while being fast and easy to use.
 
 ## Features
 
```
````diff
@@ -293,16 +299,31 @@ text = "Your long text here..."
 chunks = chunker.call(text)
 ```
 
-## How It Works
+## How It Works 🔍
+
+`semchunk` works by recursively splitting texts until all resulting chunks are equal to or less than a specified chunk size. In particular, it:
 
-
+1. Splits text using the most semantically meaningful splitter possible;
+2. Recursively splits the resulting chunks until a set of chunks equal to or less than the specified chunk size is produced;
+3. Merges any chunks that are under the chunk size back together until the chunk size is reached;
+4. Reattaches any non-whitespace splitters back to the ends of chunks barring the final chunk if doing so does not bring chunks over the chunk size, otherwise adds non-whitespace splitters as their own chunks; and
+5. Excludes chunks consisting entirely of whitespace characters.
 
-
-2. **Secondary split**: Falls back to sentences (periods, question marks, etc.)
-3. **Tertiary split**: Uses clauses (commas, semicolons) if needed
-4. **Final split**: Character-level splitting as last resort
+To ensure that chunks are as semantically meaningful as possible, `semchunk` uses the following splitters, in order of precedence:
 
-
+1. The largest sequence of newlines (`\n`) and/or carriage returns (`\r`);
+2. The largest sequence of tabs;
+3. The largest sequence of whitespace characters (as defined by regex's `\s` character class) or, if the largest sequence of whitespace characters is only a single character and there exist whitespace characters preceded by any of the semantically meaningful non-whitespace characters listed below (in the same order of precedence), then only those specific whitespace characters;
+4. Sentence terminators (`.`, `?`, `!` and `*`);
+5. Clause separators (`;`, `,`, `(`, `)`, `[`, `]`, `“`, `”`, `‘`, `’`, `'`, `"` and `` ` ``);
+6. Sentence interrupters (`:`, `—` and `…`);
+7. Word joiners (`/`, `\`, `–`, `&` and `-`); and
+8. All other characters.
+
+If overlapping chunks have been requested, `semchunk` also:
+
+1. Internally reduces the chunk size to `min(overlap, chunk_size - overlap)` (`overlap` being computed as `floor(chunk_size * overlap)` for relative overlaps and `min(overlap, chunk_size - 1)` for absolute overlaps); and
+2. Merges every `floor(original_chunk_size / reduced_chunk_size)` chunks starting from the first chunk and then jumping by `floor((original_chunk_size - overlap) / reduced_chunk_size)` chunks until the last chunk is reached.
 
 The algorithm uses binary search to efficiently find the optimal split points, making it fast even for large documents.
 
````
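The overlap rules added in this hunk reduce to a little arithmetic. The following is a worked illustration only, in Ruby, with invented numbers (it is not code from the gem):

```ruby
chunk_size = 100

# Relative overlap (< 1): overlap = floor(chunk_size * overlap)
relative_overlap = (chunk_size * 0.2).floor                        # => 20

# Absolute overlap (>= 1): overlap = min(overlap, chunk_size - 1)
absolute_overlap = [150, chunk_size - 1].min                       # => 99

# Internally reduced chunk size: min(overlap, chunk_size - overlap)
reduced = [relative_overlap, chunk_size - relative_overlap].min    # => 20

# Chunks merged per output chunk, and the stride between merges
per_chunk = (chunk_size / reduced.to_f).floor                      # => 5
stride    = ((chunk_size - relative_overlap) / reduced.to_f).floor # => 4
```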
````diff
@@ -318,6 +339,22 @@ ruby examples/basic_usage.rb
 ruby examples/advanced_usage.rb
 ```
 
+## Benchmarks 📊
+
+You can run the included benchmark to test performance:
+
+```bash
+ruby test/bench.rb
+```
+
+The Ruby implementation maintains similar performance characteristics to the Python version:
+- Efficient binary search for optimal split points
+- O(n log n) complexity for chunking
+- Fast token count lookups with memoization
+- Low memory overhead
+
+The benchmark tests chunking multiple texts with various chunk sizes and provides detailed performance metrics.
+
 ## Differences from Python Version
 
 This Ruby port maintains feature parity with the Python version, with a few notes:
````
data/lib/semchunk/chunker.rb
CHANGED

```diff
@@ -16,27 +16,23 @@ module Semchunk
     #
     # @param text_or_texts [String, Array<String>] The text or texts to be chunked
     # @param processes [Integer] The number of processes to use when chunking multiple texts (not yet implemented)
-    # @param progress [Boolean] Whether to display a progress bar when chunking multiple texts (not yet implemented)
     # @param offsets [Boolean] Whether to return the start and end offsets of each chunk
     # @param overlap [Float, Integer, nil] The proportion of the chunk size, or, if >=1, the number of tokens, by which chunks should overlap
     #
     # @return [Array<String>, Array<Array>, Hash] Depending on the input and options, returns chunks and optionally offsets
-    def call(text_or_texts, processes: 1,
+    def call(text_or_texts, processes: 1, offsets: false, overlap: nil)
       chunk_function = make_chunk_function(offsets: offsets, overlap: overlap)
 
       # Handle single text
-      if text_or_texts.is_a?(String)
-        return chunk_function.call(text_or_texts)
-      end
+      return chunk_function.call(text_or_texts) if text_or_texts.is_a?(String)
 
       # Handle multiple texts
-
-
-
-
-
-
-      end
+      raise NotImplementedError, "Parallel processing not yet implemented. Please use processes: 1" unless processes == 1
+
+      # TODO: Add progress bar support
+      chunks_and_offsets = text_or_texts.map { |text| chunk_function.call(text) }
+
+      # TODO: Add parallel processing support
 
       # Return results
       if offsets
```
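The documented `call` signature above accepts a single string or an array of strings plus `offsets:` and `overlap:` keywords. A minimal sketch of how it might be exercised, assuming the module-level `Semchunk.chunkerify` entry point; the word-count token counter and sample strings are illustrative, not part of the gem:

```ruby
require "semchunk"

# Illustrative token counter: counts whitespace-delimited words.
word_counter = ->(text) { text.split.length }

chunker = Semchunk.chunkerify(word_counter, chunk_size: 8)

# Single text: returns the chunks for that text.
chunks = chunker.call("One sentence. Another, slightly longer sentence follows it.")

# Multiple texts are chunked one by one; processes: must remain 1 until
# parallel processing lands (otherwise a NotImplementedError is raised).
all_chunks = chunker.call(["First document.", "Second document."], processes: 1)
```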
data/lib/semchunk/version.rb
CHANGED
data/lib/semchunk.rb
CHANGED

```diff
@@ -23,158 +23,263 @@ module Semchunk
   # @param start [Integer] Internal parameter for tracking character offset.
   #
   # @return [Array<String>, Array<Array>] A list of chunks up to chunk_size-tokens-long, with any whitespace used to split the text removed, and, if offsets is true, a list of tuples [start, end].
-  def chunk(text, chunk_size:, token_counter:, memoize: true, offsets: false, overlap: nil, cache_maxsize: nil,
-
+  def chunk(text, chunk_size:, token_counter:, memoize: true, offsets: false, overlap: nil, cache_maxsize: nil,
+            recursion_depth: 0, start: 0)
     return_offsets = offsets
+    is_first_call = recursion_depth.zero?
+
+    # Initialize token counter and compute effective chunk size
+    token_counter, local_chunk_size, overlap, unoverlapped_chunk_size = initialize_chunking_params(
+      token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize
+    )
+
+    # Split text and prepare metadata
+    splitter, splitter_is_whitespace, splits = split_text(text)
+    split_starts, cum_lens, num_splits_plus_one = prepare_split_metadata(splits, splitter, start)
+
+    # Process splits into chunks
+    chunks, offsets_arr = process_splits(
+      splits,
+      split_starts,
+      cum_lens,
+      splitter,
+      splitter_is_whitespace,
+      token_counter,
+      local_chunk_size,
+      num_splits_plus_one,
+      recursion_depth
+    )
+
+    # Finalize first call: cleanup and overlap
+    finalize_chunks(
+      chunks,
+      offsets_arr,
+      is_first_call,
+      return_offsets,
+      overlap,
+      local_chunk_size,
+      chunk_size,
+      unoverlapped_chunk_size,
+      text
+    )
+  end
+
+  # A tuple of semantically meaningful non-whitespace splitters
+  NON_WHITESPACE_SEMANTIC_SPLITTERS = [
+    # Sentence terminators
+    ".",
+    "?",
+    "!",
+    "*",
+    # Clause separators
+    ";",
+    ",",
+    "(",
+    ")",
+    "[",
+    "]",
+    "“",
+    "”",
+    "‘",
+    "’",
+    "'",
+    '"',
+    "`",
+    # Sentence interrupters
+    ":",
+    "—",
+    "…",
+    # Word joiners
+    "/",
+    "\\",
+    "–",
+    "&",
+    "-"
+  ].freeze
+
+  private
+
+  def initialize_chunking_params(token_counter, chunk_size, overlap, is_first_call, memoize, cache_maxsize)
     local_chunk_size = chunk_size
+    unoverlapped_chunk_size = nil
 
-
-    is_first_call = recursion_depth.zero?
+    return [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size] unless is_first_call
 
-    if
-      if memoize
-        token_counter = memoize_token_counter(token_counter, cache_maxsize)
-      end
+    token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
 
-
-
-
-
-
-
-    end
-
-      # If the overlap has not been zeroed, compute the effective chunk size
-      if overlap.positive?
-        unoverlapped_chunk_size = chunk_size - overlap
-        local_chunk_size = [overlap, unoverlapped_chunk_size].min
-      end
+    if overlap
+      overlap = compute_overlap(overlap, chunk_size)
+
+      if overlap.positive?
+        unoverlapped_chunk_size = chunk_size - overlap
+        local_chunk_size = [overlap, unoverlapped_chunk_size].min
       end
     end
 
-
-
+    [token_counter, local_chunk_size, overlap, unoverlapped_chunk_size]
+  end
 
-
+  def compute_overlap(overlap, chunk_size)
+    if overlap < 1
+      (chunk_size * overlap).floor
+    else
+      [overlap, chunk_size - 1].min
+    end
+  end
+
+  def prepare_split_metadata(splits, splitter, start)
     splitter_len = splitter.length
     split_lens = splits.map(&:length)
+
     cum_lens = [0]
-    split_lens.each { |len| cum_lens << cum_lens.last + len }
+    split_lens.each { |len| cum_lens << (cum_lens.last + len) }
 
     split_starts = [0]
     split_lens.each_with_index do |split_len, i|
-      split_starts << split_starts[i] + split_len + splitter_len
+      split_starts << (split_starts[i] + split_len + splitter_len)
     end
     split_starts = split_starts.map { |s| s + start }
 
     num_splits_plus_one = splits.length + 1
 
+    [split_starts, cum_lens, num_splits_plus_one]
+  end
+
+  def process_splits(splits, split_starts, cum_lens, splitter, splitter_is_whitespace,
+                     token_counter, local_chunk_size, num_splits_plus_one, recursion_depth)
     chunks = []
+    offsets_arr = []
     skips = Set.new
+    splitter_len = splitter.length
 
-    # Iterate through the splits
     splits.each_with_index do |split, i|
-      # Skip the split if it has already been added to a chunk
       next if skips.include?(i)
 
       split_start = split_starts[i]
 
-      # If the split is over the chunk size, recursively chunk it
      if token_counter.call(split) > local_chunk_size
-        new_chunks, new_offsets =
-          split,
-          chunk_size: local_chunk_size,
-          token_counter: token_counter,
-          offsets: true,
-          recursion_depth: recursion_depth + 1,
-          start: split_start
-        )
-
+        new_chunks, new_offsets = chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
        chunks.concat(new_chunks)
        offsets_arr.concat(new_offsets)
      else
-
-
-
-
-
-          splitter: splitter,
+        final_split_i, new_chunk = merge_splits(
+          splits: splits,
+          cum_lens: cum_lens,
+          chunk_size: local_chunk_size,
+          splitter: splitter,
           token_counter: token_counter,
-          start:
-          high:
+          start: i,
+          high: num_splits_plus_one
        )
 
-
-        ((i + 1)...final_split_in_chunk_i).each { |j| skips.add(j) }
-
-        # Add the chunk
+        ((i + 1)...final_split_i).each { |j| skips.add(j) }
        chunks << new_chunk
-
-        # Add the chunk's offsets
-        split_end = split_starts[final_split_in_chunk_i] - splitter_len
+        split_end = split_starts[final_split_i] - splitter_len
        offsets_arr << [split_start, split_end]
      end
 
-
-
-
-
-
-
-
-
-
-
-
-
-      end
+      append_splitter_if_needed(
+        chunks,
+        offsets_arr,
+        splitter,
+        splitter_is_whitespace,
+        i,
+        splits,
+        skips,
+        token_counter,
+        local_chunk_size,
+        split_start
+      )
    end
 
-
-
-    # Remove empty chunks and chunks comprised entirely of whitespace
-    chunks_and_offsets = chunks.zip(offsets_arr).reject { |chunk, _| chunk.empty? || chunk.strip.empty? }
+    [chunks, offsets_arr]
+  end
 
-
-
-
-
-
-
+  def chunk_recursively(split, local_chunk_size, token_counter, recursion_depth, split_start)
+    chunk(
+      split,
+      chunk_size: local_chunk_size,
+      token_counter: token_counter,
+      offsets: true,
+      recursion_depth: recursion_depth + 1,
+      start: split_start
+    )
+  end
 
-
-
-
-
-
-
-    num_subchunks = subchunks.length
+  def append_splitter_if_needed(chunks, offsets_arr, splitter, splitter_is_whitespace,
+                                split_index, splits, skips, token_counter, local_chunk_size,
+                                split_start)
+    return if splitter_is_whitespace
+    return if split_index == splits.length - 1
+    return if ((split_index + 1)...splits.length).all? { |j| skips.include?(j) }
 
-
-
-
+    splitter_len = splitter.length
+    last_chunk_with_splitter = chunks[-1] + splitter
+    if token_counter.call(last_chunk_with_splitter) <= local_chunk_size
+      chunks[-1] = last_chunk_with_splitter
+      offset_start, offset_end = offsets_arr[-1]
+      offsets_arr[-1] = [offset_start, offset_end + splitter_len]
+    else
+      offset_start = offsets_arr.empty? ? split_start : offsets_arr[-1][1]
+      chunks << splitter
+      offsets_arr << [offset_start, offset_start + splitter_len]
+    end
+  end
 
-
+  def finalize_chunks(chunks, offsets_arr, is_first_call, return_offsets, overlap,
+                      local_chunk_size, chunk_size, unoverlapped_chunk_size, text)
+    return [chunks, offsets_arr] unless is_first_call
+
+    chunks, offsets_arr = remove_empty_chunks(chunks, offsets_arr)
+
+    if overlap&.positive? && chunks.any?
+      chunks, offsets_arr = apply_overlap(
+        chunks,
+        offsets_arr,
+        local_chunk_size,
+        chunk_size,
+        unoverlapped_chunk_size,
+        text
+      )
+    end
 
-
-      start_idx = i * subchunk_stride
-      end_idx = [start_idx + subchunks_per_chunk, num_subchunks].min - 1
-      [suboffsets[start_idx][0], suboffsets[end_idx][1]]
-    end
+    return [chunks, offsets_arr] if return_offsets
 
-
-
+    chunks
+  end
 
-
-
+  def remove_empty_chunks(chunks, offsets_arr)
+    chunks_and_offsets = chunks.zip(offsets_arr).reject { |chunk, _| chunk.empty? || chunk.strip.empty? }
 
-
+    if chunks_and_offsets.any?
+      chunks_and_offsets.transpose
+    else
+      [[], []]
    end
+  end
+
+  def apply_overlap(chunks, offsets_arr, subchunk_size, chunk_size, unoverlapped_chunk_size, text)
+    subchunks = chunks
+    suboffsets = offsets_arr
+    num_subchunks = subchunks.length
+
+    subchunks_per_chunk = (chunk_size.to_f / subchunk_size).floor
+    subchunk_stride = (unoverlapped_chunk_size.to_f / subchunk_size).floor
+
+    num_overlapping_chunks = [1, ((num_subchunks - subchunks_per_chunk).to_f / subchunk_stride).ceil + 1].max
+
+    offsets_arr = (0...num_overlapping_chunks).map do |i|
+      start_idx = i * subchunk_stride
+      end_idx = [start_idx + subchunks_per_chunk, num_subchunks].min - 1
+      [suboffsets[start_idx][0], suboffsets[end_idx][1]]
+    end
+
+    chunks = offsets_arr.map { |s, e| text[s...e] }
 
-    # Always return chunks and offsets if this is a recursive call
     [chunks, offsets_arr]
   end
 
+  public
+
   # Construct a chunker that splits one or more texts into semantically meaningful chunks
   #
   # @param tokenizer_or_token_counter [String, #encode, Proc, Method, #call] Either: the name of a tokenizer; a tokenizer that possesses an encode method; or a token counter.
```
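The hunk above refactors the core `chunk` method into helpers without changing its keyword API. Assuming the method is exposed at module level as `Semchunk.chunk` (how the module wires that up is not shown in this hunk), a minimal sketch with an illustrative word-count token counter and invented sample text:

```ruby
require "semchunk"

# Illustrative token counter; any callable returning an Integer works.
counter = ->(text) { text.split.length }

text = "Paragraph one.\n\nParagraph two is a little longer and may be split."

# Plain chunking.
chunks = Semchunk.chunk(text, chunk_size: 6, token_counter: counter)

# With offsets: true, the first (non-recursive) call returns [chunks, offsets],
# where each offset is a [start, end] pair usable as text[start...end].
chunks, offsets = Semchunk.chunk(text, chunk_size: 6, token_counter: counter, offsets: true)
```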
```diff
@@ -185,132 +290,135 @@ module Semchunk
   #
   # @return [Chunker] A chunker instance
   def chunkerify(tokenizer_or_token_counter, chunk_size: nil, max_token_chars: nil, memoize: true, cache_maxsize: nil)
-
-
-
+    validate_tokenizer_type(tokenizer_or_token_counter)
+
+    max_token_chars = determine_max_token_chars(tokenizer_or_token_counter, max_token_chars)
+    chunk_size = determine_chunk_size(tokenizer_or_token_counter, chunk_size)
+    token_counter = create_token_counter(tokenizer_or_token_counter)
+    token_counter = wrap_with_fast_counter(token_counter, max_token_chars, chunk_size) if max_token_chars
+    token_counter = memoize_token_counter(token_counter, cache_maxsize) if memoize
+
+    Chunker.new(chunk_size: chunk_size, token_counter: token_counter)
+  end
+
+  private
+
+  def validate_tokenizer_type(tokenizer_or_token_counter)
+    return unless tokenizer_or_token_counter.is_a?(String)
+
+    raise NotImplementedError,
+          "String tokenizer names not yet supported in Ruby. Please pass a tokenizer object or token counter proc."
+  end
+
+  def determine_max_token_chars(tokenizer, max_token_chars)
+    return max_token_chars unless max_token_chars.nil?
+
+    if tokenizer.respond_to?(:token_byte_values)
+      vocab = tokenizer.token_byte_values
+      return vocab.map(&:length).max if vocab.respond_to?(:map)
+    elsif tokenizer.respond_to?(:get_vocab)
+      vocab = tokenizer.get_vocab
+      return vocab.keys.map(&:length).max if vocab.respond_to?(:keys)
    end
 
-
-
-
-
-
-
-
-
+    nil
+  end
+
+  def determine_chunk_size(tokenizer, chunk_size)
+    return chunk_size unless chunk_size.nil?
+
+    raise ArgumentError, "chunk_size not provided and tokenizer lacks model_max_length attribute" unless tokenizer.respond_to?(:model_max_length) && tokenizer.model_max_length.is_a?(Integer)
+
+    chunk_size = tokenizer.model_max_length
+
+    if tokenizer.respond_to?(:encode)
+      begin
+        chunk_size -= tokenizer.encode("").length
+      rescue StandardError
+        # Ignore errors
      end
    end
 
-
-
-
-
-
-
-
-
-
-
-
-
-      end
-    else
-      raise ArgumentError, "chunk_size not provided and tokenizer lacks model_max_length attribute"
-    end
+    chunk_size
+  end
+
+  def create_token_counter(tokenizer_or_token_counter)
+    return tokenizer_or_token_counter unless tokenizer_or_token_counter.respond_to?(:encode)
+
+    tokenizer = tokenizer_or_token_counter
+    encode_params = begin
+      tokenizer.method(:encode).parameters
+    rescue StandardError
+      []
    end
 
-
-
-
-
-    encode_params = tokenizer.method(:encode).parameters rescue []
-    has_special_tokens = encode_params.any? { |type, name| name == :add_special_tokens }
-
-    token_counter = if has_special_tokens
-      ->(text) { tokenizer.encode(text, add_special_tokens: false).length }
-    else
-      ->(text) { tokenizer.encode(text).length }
-    end
+    has_special_tokens = encode_params.any? { |_type, name| name == :add_special_tokens }
+
+    if has_special_tokens
+      ->(text) { tokenizer.encode(text, add_special_tokens: false).length }
    else
-
+      ->(text) { tokenizer.encode(text).length }
    end
+  end
 
-
-
-
-
-
-
-
-
-
-
-    original_token_counter.call(text)
-    end
+  def wrap_with_fast_counter(token_counter, max_token_chars, chunk_size)
+    max_token_chars -= 1
+    original_token_counter = token_counter
+
+    lambda do |text|
+      heuristic = chunk_size * 6
+      if text.length > heuristic && original_token_counter.call(text[0...(heuristic + max_token_chars)]) > chunk_size
+        chunk_size + 1
+      else
+        original_token_counter.call(text)
      end
    end
+  end
+
+  def split_text(text)
+    splitter_is_whitespace = true
+
+    splitter = find_whitespace_splitter(text)
+    if splitter
+      result = try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
+      return result if result
 
-
-    if memoize
-      token_counter = memoize_token_counter(token_counter, cache_maxsize)
+      return [splitter, splitter_is_whitespace, text.split(splitter)]
    end
 
-    #
-
+    # No whitespace found, use non-whitespace semantic splitters
+    find_non_whitespace_splitter(text)
  end
 
-
+  def find_whitespace_splitter(text)
+    return text.scan(/[\r\n]+/).max_by(&:length) if text.include?("\n") || text.include?("\r")
+    return text.scan(/\t+/).max_by(&:length) if text.include?("\t")
+    return text.scan(/\s+/).max_by(&:length) if text.match?(/\s/)
 
-
-
-    # Sentence terminators
-    ".", "?", "!", "*",
-    # Clause separators
-    ";", ",", "(", ")", "[", "]", "“", "”", "‘", "’", "'", '"', "`",
-    # Sentence interrupters
-    ":", "—", "…",
-    # Word joiners
-    "/", "\\", "–", "&", "-"
-  ].freeze
+    nil
+  end
 
-  def
-
+  def try_whitespace_with_semantic_preceder(text, splitter, splitter_is_whitespace)
+    return nil unless splitter.length == 1
 
-
-
-
-
-    elsif text.include?("\t")
-      tab_matches = text.scan(/\t+/)
-      splitter = tab_matches.max_by(&:length)
-    elsif text.match?(/\s/)
-      whitespace_matches = text.scan(/\s+/)
-      splitter = whitespace_matches.max_by(&:length)
-
-      # If the splitter is only a single character, see if we can target whitespace preceded by semantic splitters
-      if splitter.length == 1
-        NON_WHITESPACE_SEMANTIC_SPLITTERS.each do |preceder|
-          escaped_preceder = Regexp.escape(preceder)
-          if (match = text.match(/#{escaped_preceder}(\s)/))
-            splitter = match[1]
-            escaped_splitter = Regexp.escape(splitter)
-            return [splitter, splitter_is_whitespace, text.split(/(?<=#{escaped_preceder})#{escaped_splitter}/)]
-          end
-        end
-      end
-    else
-      # Find the most desirable semantically meaningful non-whitespace splitter
-      splitter = NON_WHITESPACE_SEMANTIC_SPLITTERS.find { |s| text.include?(s) }
+    NON_WHITESPACE_SEMANTIC_SPLITTERS.each do |preceder|
+      escaped_preceder = Regexp.escape(preceder)
+      match = text.match(/#{escaped_preceder}(\s)/)
+      next unless match
 
-
-
-
-      # No semantic splitter found, return characters
-      return ["", splitter_is_whitespace, text.chars]
-    end
+      matched_splitter = match[1]
+      escaped_splitter = Regexp.escape(matched_splitter)
+      return [matched_splitter, splitter_is_whitespace, text.split(/(?<=#{escaped_preceder})#{escaped_splitter}/)]
    end
 
-
+    nil
+  end
+
+  def find_non_whitespace_splitter(text)
+    splitter = NON_WHITESPACE_SEMANTIC_SPLITTERS.find { |s| text.include?(s) }
+    return ["", true, text.chars] unless splitter
+
+    [splitter, false, text.split(splitter)]
  end
 
  def bisect_left(sorted, target, low, high)
```
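The whitespace-splitter precedence implemented by `find_whitespace_splitter` above is easy to probe in isolation; the sample strings below are illustrative only:

```ruby
# Newlines and carriage returns win first; the longest run is chosen.
"line one\n\nline two\nline three".scan(/[\r\n]+/).max_by(&:length)  # => "\n\n"

# With no newlines present, the longest run of tabs is used.
"a\tb\t\t\tc".scan(/\t+/).max_by(&:length)                            # => "\t\t\t"

# Otherwise any whitespace run (regex \s) is considered.
"alpha  beta gamma".scan(/\s+/).max_by(&:length)                      # => "  "
```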
```diff
@@ -356,7 +464,7 @@ module Semchunk
     [last_split_index, splits[start...last_split_index].join(splitter)]
   end
 
-  def memoize_token_counter(token_counter, maxsize
+  def memoize_token_counter(token_counter, maxsize=nil)
     return @memoized_token_counters[token_counter] if @memoized_token_counters.key?(token_counter)
 
     cache = {}
```