tokenkit 0.1.0.pre.1

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. checksums.yaml +7 -0
  2. data/.rspec +3 -0
  3. data/.standard.yml +3 -0
  4. data/.yardopts +12 -0
  5. data/CODE_OF_CONDUCT.md +132 -0
  6. data/LICENSE.txt +21 -0
  7. data/README.md +644 -0
  8. data/Rakefile +18 -0
  9. data/benchmarks/cache_test.rb +63 -0
  10. data/benchmarks/final_comparison.rb +83 -0
  11. data/benchmarks/tokenizer_benchmark.rb +250 -0
  12. data/docs/ARCHITECTURE.md +469 -0
  13. data/docs/PERFORMANCE.md +382 -0
  14. data/docs/README.md +118 -0
  15. data/ext/tokenkit/Cargo.toml +21 -0
  16. data/ext/tokenkit/extconf.rb +4 -0
  17. data/ext/tokenkit/src/config.rs +37 -0
  18. data/ext/tokenkit/src/error.rs +67 -0
  19. data/ext/tokenkit/src/lib.rs +346 -0
  20. data/ext/tokenkit/src/tokenizer/base.rs +41 -0
  21. data/ext/tokenkit/src/tokenizer/char_group.rs +62 -0
  22. data/ext/tokenkit/src/tokenizer/edge_ngram.rs +73 -0
  23. data/ext/tokenkit/src/tokenizer/grapheme.rs +26 -0
  24. data/ext/tokenkit/src/tokenizer/keyword.rs +25 -0
  25. data/ext/tokenkit/src/tokenizer/letter.rs +41 -0
  26. data/ext/tokenkit/src/tokenizer/lowercase.rs +51 -0
  27. data/ext/tokenkit/src/tokenizer/mod.rs +254 -0
  28. data/ext/tokenkit/src/tokenizer/ngram.rs +80 -0
  29. data/ext/tokenkit/src/tokenizer/path_hierarchy.rs +187 -0
  30. data/ext/tokenkit/src/tokenizer/pattern.rs +38 -0
  31. data/ext/tokenkit/src/tokenizer/sentence.rs +89 -0
  32. data/ext/tokenkit/src/tokenizer/unicode.rs +36 -0
  33. data/ext/tokenkit/src/tokenizer/url_email.rs +108 -0
  34. data/ext/tokenkit/src/tokenizer/whitespace.rs +31 -0
  35. data/lib/tokenkit/config.rb +74 -0
  36. data/lib/tokenkit/config_builder.rb +209 -0
  37. data/lib/tokenkit/config_compat.rb +52 -0
  38. data/lib/tokenkit/configuration.rb +194 -0
  39. data/lib/tokenkit/regex_converter.rb +58 -0
  40. data/lib/tokenkit/version.rb +5 -0
  41. data/lib/tokenkit.rb +336 -0
  42. data/sig/tokenkit.rbs +4 -0
  43. metadata +172 -0
data/README.md ADDED
@@ -0,0 +1,644 @@
1
+ # TokenKit
2
+
3
+ Fast, Rust-backed word-level tokenization for Ruby with pattern preservation.
4
+
5
+ TokenKit is a Ruby wrapper around Rust's [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) crate, providing lightweight, Unicode-aware tokenization designed for NLP pipelines, search applications, and text processing where you need consistent, high-quality word segmentation.
6
+
7
+ ## Quickstart
8
+
9
+ ```ruby
10
+ # Install the gem
11
+ gem install tokenkit
12
+
13
+ # Or add to your Gemfile
14
+ gem 'tokenkit'
15
+ ```
16
+
17
+ ```ruby
18
+ require 'tokenkit'
19
+
20
+ # Basic tokenization - handles Unicode, contractions, accents
21
+ TokenKit.tokenize("Hello, world! café can't")
22
+ # => ["hello", "world", "café", "can't"]
23
+
24
+ # Preserve domain-specific terms even when lowercasing
25
+ TokenKit.configure do |config|
26
+ config.lowercase = true
27
+ config.preserve_patterns = [
28
+ /\d+ug/i, # Measurements: 100ug
29
+ /[A-Z][A-Z0-9]+/ # Gene names: BRCA1, TP53
30
+ ]
31
+ end
32
+
33
+ TokenKit.tokenize("Patient received 100ug for BRCA1 study")
34
+ # => ["patient", "received", "100ug", "for", "BRCA1", "study"]
35
+ ```
36
+
37
+ ## Features
38
+
39
+ - **Thirteen tokenization strategies**: whitespace, unicode (recommended), custom regex patterns, sentence, grapheme, keyword, edge n-gram, n-gram, path hierarchy, URL/email-aware, character group, letter, and lowercase
40
+ - **Pattern preservation**: Keep domain-specific terms (gene names, measurements, antibodies) intact even with case normalization
41
+ - **Fast**: Rust-backed implementation (~100K docs/sec)
42
+ - **Thread-safe**: Safe for concurrent use
43
+ - **Simple API**: Configure once, use everywhere
44
+ - **Zero dependencies**: Pure Ruby API backed by a native Rust extension
45
+
46
+ ## Tokenization Strategies
47
+
48
+ ### Unicode (Recommended)
49
+
50
+ Uses Unicode word segmentation for proper handling of contractions, accents, and multi-language text.
51
+
52
+ **✅ Supports `preserve_patterns`**
53
+
54
+ ```ruby
55
+ TokenKit.configure do |config|
56
+ config.strategy = :unicode
57
+ config.lowercase = true
58
+ end
59
+
60
+ TokenKit.tokenize("Don't worry about café!")
61
+ # => ["don't", "worry", "about", "café"]
62
+ ```
63
+
64
+ ### Whitespace
65
+
66
+ Simple whitespace splitting.
67
+
68
+ **✅ Supports `preserve_patterns`**
69
+
70
+ ```ruby
71
+ TokenKit.configure do |config|
72
+ config.strategy = :whitespace
73
+ config.lowercase = true
74
+ end
75
+
76
+ TokenKit.tokenize("hello world")
77
+ # => ["hello", "world"]
78
+ ```
79
+
80
+ ### Pattern (Custom Regex)
81
+
82
+ Custom tokenization using regex patterns.
83
+
84
+ **✅ Supports `preserve_patterns`**
85
+
86
+ ```ruby
87
+ TokenKit.configure do |config|
88
+ config.strategy = :pattern
89
+ config.regex = /[\w-]+/ # Keep words and hyphens
90
+ config.lowercase = true
91
+ end
92
+
93
+ TokenKit.tokenize("anti-CD3 antibody")
94
+ # => ["anti-cd3", "antibody"]
95
+ ```
96
+
97
+ ### Sentence
98
+
99
+ Splits text into sentences using Unicode sentence boundaries.
100
+
101
+ **✅ Supports `preserve_patterns`** (preserves patterns within each sentence)
102
+
103
+ ```ruby
104
+ TokenKit.configure do |config|
105
+ config.strategy = :sentence
106
+ config.lowercase = false
107
+ end
108
+
109
+ TokenKit.tokenize("Hello world! How are you? I am fine.")
110
+ # => ["Hello world! ", "How are you? ", "I am fine."]
111
+ ```
112
+
113
+ Useful for document-level processing, sentence embeddings, or paragraph analysis.
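+
+ A natural two-pass setup is to split into sentences first, then word-tokenize each sentence with a per-call strategy override (a rough sketch; per-call options are described later in this README):
+
+ ```ruby
+ sentences = TokenKit.tokenize("Hello world! How are you?", strategy: :sentence)
+ sentences.map { |s| TokenKit.tokenize(s, strategy: :unicode, lowercase: true) }
+ # => [["hello", "world"], ["how", "are", "you"]]
+ ```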
114
+
115
+ ### Grapheme
116
+
117
+ Splits text into grapheme clusters (user-perceived characters).
118
+
119
+ **⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)
120
+
121
+ ```ruby
122
+ TokenKit.configure do |config|
123
+ config.strategy = :grapheme
124
+ config.grapheme_extended = true # Use extended grapheme clusters (default)
125
+ config.lowercase = false
126
+ end
127
+
128
+ TokenKit.tokenize("👨‍👩‍👧‍👦café")
129
+ # => ["👨‍👩‍👧‍👦", "c", "a", "f", "é"]
130
+ ```
131
+
132
+ Perfect for handling emoji, combining characters, and complex scripts. Set `grapheme_extended = false` for legacy grapheme boundaries.
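+
+ As a quick illustration (a sketch using the per-call options described later), the grapheme token count reflects user-perceived characters rather than code points:
+
+ ```ruby
+ text = "👨‍👩‍👧‍👦café"
+ TokenKit.tokenize(text, strategy: :grapheme, lowercase: false).size
+ # => 5 user-perceived characters (String#length is much larger: the family emoji alone spans 7 code points)
+ ```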
133
+
134
+ ### Keyword
135
+
136
+ Treats entire input as a single token (no splitting).
137
+
138
+ **⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)
139
+
140
+ ```ruby
141
+ TokenKit.configure do |config|
142
+ config.strategy = :keyword
143
+ config.lowercase = false
144
+ end
145
+
146
+ TokenKit.tokenize("PROD-2024-ABC-001")
147
+ # => ["PROD-2024-ABC-001"]
148
+ ```
149
+
150
+ Ideal for exact matching of SKUs, IDs, product codes, or category names where splitting would lose meaning.
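+
+ A common pattern is to mix strategies per field: keep identifier fields whole with `:keyword` while word-tokenizing free text (a sketch using per-call overrides):
+
+ ```ruby
+ TokenKit.tokenize("PROD-2024-ABC-001", strategy: :keyword, lowercase: false)
+ # => ["PROD-2024-ABC-001"]
+
+ TokenKit.tokenize("Wireless ergonomic mouse", strategy: :unicode, lowercase: true)
+ # => ["wireless", "ergonomic", "mouse"]
+ ```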
151
+
152
+ ### Edge N-gram (Search-as-you-type)
153
+
154
+ Generates prefixes from the beginning of words for autocomplete functionality.
155
+
156
+ **⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)
157
+
158
+ ```ruby
159
+ TokenKit.configure do |config|
160
+ config.strategy = :edge_ngram
161
+ config.min_gram = 2 # Minimum prefix length
162
+ config.max_gram = 10 # Maximum prefix length
163
+ config.lowercase = true
164
+ end
165
+
166
+ TokenKit.tokenize("laptop")
167
+ # => ["la", "lap", "lapt", "lapto", "laptop"]
168
+ ```
169
+
170
+ Essential for autocomplete, type-ahead search, and prefix matching. At index time, generate edge n-grams of your product names or search terms.
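+
+ For example, a toy in-memory autocomplete index could map each prefix token to the names that produce it (an illustrative sketch, assuming the `:edge_ngram` configuration above):
+
+ ```ruby
+ index = Hash.new { |h, k| h[k] = [] }
+
+ ["laptop", "lamp"].each do |name|
+   TokenKit.tokenize(name).each { |prefix| index[prefix] << name }
+ end
+
+ index["lap"] # => ["laptop"]
+ index["la"]  # => ["laptop", "lamp"] -- both match as the user types
+ ```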
171
+
172
+ ### N-gram (Fuzzy Matching)
173
+
174
+ Generates all substring n-grams (sliding window) for fuzzy matching and misspelling tolerance.
175
+
176
+ **⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)
177
+
178
+ ```ruby
179
+ TokenKit.configure do |config|
180
+ config.strategy = :ngram
181
+ config.min_gram = 2 # Minimum n-gram length
182
+ config.max_gram = 3 # Maximum n-gram length
183
+ config.lowercase = true
184
+ end
185
+
186
+ TokenKit.tokenize("quick")
187
+ # => ["qu", "ui", "ic", "ck", "qui", "uic", "ick"]
188
+ ```
189
+
190
+ Perfect for fuzzy search, typo tolerance, and partial matching. Unlike edge n-grams which only generate prefixes, n-grams generate all possible substrings.
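+
+ N-gram overlap can then serve as a crude fuzzy-match score (a sketch, assuming the `:ngram` configuration above with `min_gram` 2 and `max_gram` 3):
+
+ ```ruby
+ expected = TokenKit.tokenize("quick")
+ typo     = TokenKit.tokenize("quck")
+
+ (expected & typo).size.to_f / (expected | typo).size
+ # => 0.2 -- nonzero overlap lets the misspelling still match
+ ```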
191
+
192
+ ### Path Hierarchy (Hierarchical Navigation)
193
+
194
+ Creates tokens for each level of a path hierarchy.
195
+
196
+ **⚠️ Partially supports `preserve_patterns`** (has limitations with hierarchical structure)
197
+
198
+ ```ruby
199
+ TokenKit.configure do |config|
200
+ config.strategy = :path_hierarchy
201
+ config.delimiter = "/" # Use "\\" for Windows paths
202
+ config.lowercase = false
203
+ end
204
+
205
+ TokenKit.tokenize("/usr/local/bin/ruby")
206
+ # => ["/usr", "/usr/local", "/usr/local/bin", "/usr/local/bin/ruby"]
207
+
208
+ # Works for category hierarchies too
209
+ TokenKit.tokenize("electronics/computers/laptops")
210
+ # => ["electronics", "electronics/computers", "electronics/computers/laptops"]
211
+ ```
212
+
213
+ Perfect for filesystem paths, URL structures, category hierarchies, and breadcrumb navigation.
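+
+ Because every ancestor path becomes a token, per-level facet counts fall out naturally (an illustrative sketch, assuming the configuration above):
+
+ ```ruby
+ facets = Hash.new(0)
+
+ ["electronics/computers/laptops", "electronics/phones"].each do |path|
+   TokenKit.tokenize(path).each { |prefix| facets[prefix] += 1 }
+ end
+
+ facets["electronics"]           # => 2
+ facets["electronics/computers"] # => 1
+ ```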
214
+
215
+ ### URL/Email-Aware (Web Content)
216
+
217
+ Preserves URLs and email addresses as single tokens while tokenizing surrounding text.
218
+
219
+ **✅ Supports `preserve_patterns`** (preserves patterns alongside URLs/emails)
220
+
221
+ ```ruby
222
+ TokenKit.configure do |config|
223
+ config.strategy = :url_email
224
+ config.lowercase = true
225
+ end
226
+
227
+ TokenKit.tokenize("Contact support@example.com or visit https://example.com")
228
+ # => ["contact", "support@example.com", "or", "visit", "https://example.com"]
229
+ ```
230
+
231
+ Essential for user-generated content, customer support messages, product descriptions with links, and social media text.
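+
+ Because URLs and emails survive as single tokens, extracting them afterwards is a simple filter (a sketch, reusing the configuration and text above):
+
+ ```ruby
+ tokens = TokenKit.tokenize("Contact support@example.com or visit https://example.com")
+ tokens.grep(%r{\Ahttps?://|@})
+ # => ["support@example.com", "https://example.com"]
+ ```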
232
+
233
+ ### Character Group (Fast Custom Splitting)
234
+
235
+ Splits text based on a custom set of characters (faster than regex for simple delimiters).
236
+
237
+ **⚠️ Partially supports `preserve_patterns`** (works best with whitespace delimiters; non-whitespace delimiters may have issues)
238
+
239
+ ```ruby
240
+ TokenKit.configure do |config|
241
+ config.strategy = :char_group
242
+ config.split_on_chars = ",;" # Split on commas and semicolons
243
+ config.lowercase = false
244
+ end
245
+
246
+ TokenKit.tokenize("apple,banana;cherry")
247
+ # => ["apple", "banana", "cherry"]
248
+
249
+ # CSV parsing
250
+ TokenKit.tokenize("John Doe,30,Software Engineer")
251
+ # => ["John Doe", "30", "Software Engineer"]
252
+ ```
253
+
254
+ Ideal for structured data (CSV, TSV), log parsing, and custom delimiter-based formats. Default split characters are ` \t\n\r` (whitespace).
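+
+ The delimiter set can also be overridden per call, which is handy for pipe-delimited logs (a sketch using the per-call options described later):
+
+ ```ruby
+ TokenKit.tokenize(
+   "2024-01-01|INFO|Server started",
+   strategy: :char_group,
+   split_on_chars: "|",
+   lowercase: false
+ )
+ # => ["2024-01-01", "INFO", "Server started"]
+ ```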
255
+
256
+ ### Letter (Language-Agnostic)
257
+
258
+ Splits on any non-letter character (simpler than the Unicode tokenizer, with no special handling for contractions).
259
+
260
+ **✅ Supports `preserve_patterns`**
261
+
262
+ ```ruby
263
+ TokenKit.configure do |config|
264
+ config.strategy = :letter
265
+ config.lowercase = true
266
+ end
267
+
268
+ TokenKit.tokenize("hello-world123test")
269
+ # => ["hello", "world", "test"]
270
+
271
+ # Handles multiple scripts
272
+ TokenKit.tokenize("Hello-世界-test")
273
+ # => ["hello", "世界", "test"]
274
+ ```
275
+
276
+ Great for noisy text, mixed scripts, and cases where you want aggressive splitting on any non-letter character.
277
+
278
+ ### Lowercase (Efficient Case Normalization)
279
+
280
+ Like the Letter tokenizer but always lowercases in a single pass (more efficient than letter + lowercase filter).
281
+
282
+ **✅ Supports `preserve_patterns`** (preserved patterns keep their original case even though this tokenizer always lowercases)
283
+
284
+ ```ruby
285
+ TokenKit.configure do |config|
286
+ config.strategy = :lowercase
287
+ # Note: config.lowercase setting is ignored - this tokenizer ALWAYS lowercases
288
+ end
289
+
290
+ TokenKit.tokenize("HELLO-WORLD")
291
+ # => ["hello", "world"]
292
+
293
+ # Case-insensitive search indexing
294
+ TokenKit.tokenize("User-Agent: Mozilla/5.0")
295
+ # => ["user", "agent", "mozilla"]
296
+ ```
297
+
298
+ **⚠️ Important**: The `:lowercase` strategy **always** lowercases text, regardless of the `config.lowercase` setting. If you need control over lowercasing, use the `:letter` strategy instead with `config.lowercase = true/false`.
299
+
300
+ Perfect for case-insensitive search indexing, normalizing product codes, and cleaning social media text. Handles Unicode correctly, including characters that lowercase to multiple characters (e.g., Turkish İ).
301
+
302
+ ## Pattern Preservation
303
+
304
+ Preserve domain-specific terms even when lowercasing.
305
+
306
+ **Fully Supported by:** Unicode, Pattern, Whitespace, Letter, Lowercase, Sentence, and URL/Email tokenizers.
307
+
308
+ **Partially Supported by:** Character Group (works best with whitespace delimiters) and Path Hierarchy (limitations with hierarchical structure) tokenizers.
309
+
310
+ **Not Supported by:** Grapheme, Keyword, Edge N-gram, and N-gram tokenizers.
311
+
312
+ ```ruby
313
+ TokenKit.configure do |config|
314
+ config.strategy = :unicode
315
+ config.lowercase = true
316
+ config.preserve_patterns = [
317
+ /\d+(ug|mg|ml|units)/i, # Measurements: 100ug, 50mg
318
+ /anti-cd\d+/i, # Antibodies: Anti-CD3, anti-CD28
319
+ /[A-Z][A-Z0-9]+/ # Gene names: BRCA1, TP53, EGFR
320
+ ]
321
+ end
322
+
323
+ text = "Patient received 100ug Anti-CD3 with BRCA1 mutation"
324
+ tokens = TokenKit.tokenize(text)
325
+ # => ["patient", "received", "100ug", "Anti-CD3", "with", "BRCA1", "mutation"]
326
+ ```
327
+
328
+ Pattern matches maintain their original case despite `lowercase=true`.
329
+
330
+ ### Regex Flags
331
+
332
+ TokenKit supports Ruby regex flags for both `preserve_patterns` and the `:pattern` strategy:
333
+
334
+ ```ruby
335
+ # Case-insensitive matching (i flag)
336
+ TokenKit.configure do |config|
337
+ config.preserve_patterns = [/gene-\d+/i]
338
+ end
339
+
340
+ TokenKit.tokenize("Found GENE-123 and gene-456")
341
+ # => ["found", "GENE-123", "and", "gene-456"]
342
+
343
+ # Multiline mode (m flag) - dot matches newlines
344
+ TokenKit.configure do |config|
345
+ config.strategy = :pattern
346
+ config.regex = /test./m
347
+ end
348
+
349
+ # Extended mode (x flag) - allows comments and whitespace
350
+ pattern = /
351
+ \w+ # word characters
352
+ @ # at sign
353
+ \w+\.\w+ # domain.tld
354
+ /x
355
+
356
+ TokenKit.configure do |config|
357
+ config.preserve_patterns = [pattern]
358
+ end
359
+
360
+ # Combine flags
361
+ TokenKit.configure do |config|
362
+ config.preserve_patterns = [/code-\d+/im] # case-insensitive + multiline
363
+ end
364
+ ```
365
+
366
+ Supported flags:
367
+ - `i` - Case-insensitive matching
368
+ - `m` - Multiline mode (`.` matches newlines)
369
+ - `x` - Extended mode (ignore whitespace, allow comments)
370
+
371
+ Flags work with both Regexp objects and string patterns passed to the `:pattern` strategy.
372
+
373
+ ## Configuration
374
+
375
+ ### Global Configuration
376
+
377
+ ```ruby
378
+ TokenKit.configure do |config|
379
+ config.strategy = :unicode # :whitespace, :unicode, :pattern, :sentence, :grapheme, :keyword, :edge_ngram, :ngram, :path_hierarchy, :url_email, :char_group, :letter, :lowercase
380
+ config.lowercase = true # Normalize to lowercase
381
+ config.remove_punctuation = false # Remove punctuation from tokens
382
+ config.preserve_patterns = [] # Regex patterns to preserve
383
+
384
+ # Strategy-specific options
385
+ config.regex = /\w+/ # Only for :pattern strategy
386
+ config.grapheme_extended = true # Only for :grapheme strategy (default: true)
387
+ config.min_gram = 2 # For :edge_ngram and :ngram strategies (default: 2)
388
+ config.max_gram = 10 # For :edge_ngram and :ngram strategies (default: 10)
389
+ config.delimiter = "/" # Only for :path_hierarchy strategy (default: "/")
390
+ config.split_on_chars = " \t\n\r" # Only for :char_group strategy (default: whitespace)
391
+ end
392
+ ```
393
+
394
+ ### Per-Call Options
395
+
396
+ Override global config for specific calls:
397
+
398
+ ```ruby
399
+ # Override general options
400
+ TokenKit.tokenize("BRCA1 Gene", lowercase: false)
401
+ # => ["BRCA1", "Gene"]
402
+
403
+ # Override strategy-specific options
404
+ TokenKit.tokenize("laptop", strategy: :edge_ngram, min_gram: 3, max_gram: 5)
405
+ # => ["lap", "lapt", "lapto"]
406
+
407
+ TokenKit.tokenize("C:\\Windows\\System", strategy: :path_hierarchy, delimiter: "\\")
408
+ # => ["C:", "C:\\Windows", "C:\\Windows\\System"]
409
+
410
+ # Combine multiple overrides
411
+ TokenKit.tokenize(
412
+ "TEST",
413
+ strategy: :edge_ngram,
414
+ min_gram: 2,
415
+ max_gram: 3,
416
+ lowercase: false
417
+ )
418
+ # => ["TE", "TES"]
419
+ ```
420
+
421
+ All strategy-specific options can be overridden per-call (a combined sketch follows this list):
422
+ - `:pattern` - `regex: /pattern/`
423
+ - `:grapheme` - `extended: true/false`
424
+ - `:edge_ngram` - `min_gram: n, max_gram: n`
425
+ - `:ngram` - `min_gram: n, max_gram: n`
426
+ - `:path_hierarchy` - `delimiter: "/"`
427
+ - `:char_group` - `split_on_chars: ",;"`
428
+
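+ For instance, the `:grapheme` and `:char_group` options not shown above can be overridden the same way (an illustrative sketch):
+
+ ```ruby
+ # Legacy (non-extended) grapheme boundaries for a single call
+ TokenKit.tokenize("café", strategy: :grapheme, extended: false, lowercase: false)
+
+ # Split a query string on "=" and "&" for a single call
+ TokenKit.tokenize("a=1&b=2", strategy: :char_group, split_on_chars: "=&", lowercase: false)
+ # => ["a", "1", "b", "2"]
+ ```
+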
429
+ ### Get Current Config
430
+
431
+ ```ruby
432
+ config = TokenKit.config_hash
433
+ # Returns a Configuration object with accessor methods
434
+
435
+ config.strategy # => :unicode
436
+ config.lowercase # => true
437
+ config.remove_punctuation # => false
438
+ config.preserve_patterns # => [...]
439
+
440
+ # Strategy predicates
441
+ config.edge_ngram? # => false
442
+ config.ngram? # => false
443
+ config.pattern? # => false
444
+ config.grapheme? # => false
445
+ config.path_hierarchy? # => false
446
+ config.char_group? # => false
447
+ config.letter? # => false
448
+ config.lowercase? # => false
449
+
450
+ # Strategy-specific accessors
451
+ config.min_gram # => 2 (for edge_ngram and ngram)
452
+ config.max_gram # => 10 (for edge_ngram and ngram)
453
+ config.delimiter # => "/" (for path_hierarchy)
454
+ config.split_on_chars # => " \t\n\r" (for char_group)
455
+ config.extended # => true (for grapheme)
456
+ config.regex # => "..." (for pattern)
457
+
458
+ # Convert to hash if needed
459
+ config.to_h
460
+ # => {"strategy" => "unicode", "lowercase" => true, ...}
461
+ ```
462
+
463
+ ### Reset to Defaults
464
+
465
+ ```ruby
466
+ TokenKit.reset
467
+ ```
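+
+ In test suites, one common pattern is to reset after each example so configuration does not leak between specs (a sketch):
+
+ ```ruby
+ # spec/spec_helper.rb (illustrative)
+ RSpec.configure do |rspec|
+   rspec.after(:each) { TokenKit.reset }
+ end
+ ```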
468
+
469
+ ## Use Cases
470
+
471
+ ### Biotech/Life Sciences
472
+
473
+ ```ruby
474
+ TokenKit.configure do |config|
475
+ config.strategy = :unicode
476
+ config.lowercase = true
477
+ config.preserve_patterns = [
478
+ /\d+(ug|mg|ml|ul|units)/i, # Measurements
479
+ /anti-[a-z0-9-]+/i, # Antibodies
480
+ /[A-Z]{2,10}/, # Gene names (CDK10, BRCA1, TP53)
481
+ /cd\d+/i, # Cell markers (CD3, CD4, CD8)
482
+ /ig[gmaed]/i # Immunoglobulins (IgG, IgM)
483
+ ]
484
+ end
485
+
486
+ text = "Anti-CD3 IgG antibody 100ug for BRCA1 research"
487
+ tokens = TokenKit.tokenize(text)
488
+ # => ["Anti-CD3", "IgG", "antibody", "100ug", "for", "BRCA1", "research"]
489
+ ```
490
+
491
+ ### E-commerce/Catalogs
492
+
493
+ ```ruby
494
+ TokenKit.configure do |config|
495
+ config.strategy = :unicode
496
+ config.lowercase = true
497
+ config.preserve_patterns = [
498
+ /\$\d+(\.\d{2})?/, # Prices: $99.99
499
+ /\d+(-\d+)+/, # SKUs: 123-456-789
500
+ /\d+(mm|cm|inch)/i # Dimensions: 10mm, 5cm
501
+ ]
502
+ end
503
+
504
+ text = "Widget $49.99 SKU: 123-456 size: 10cm"
505
+ tokens = TokenKit.tokenize(text)
506
+ # => ["widget", "$49.99", "sku", "123-456", "size", "10cm"]
507
+ ```
508
+
509
+ ### Search Applications
510
+
511
+ ```ruby
512
+ # Exact matching with case normalization
513
+ TokenKit.configure do |config|
514
+ config.strategy = :lowercase
515
+ config.lowercase = true
516
+ end
517
+
518
+ # Index time: normalize documents
519
+ doc_tokens = TokenKit.tokenize("Product Code: ABC-123")
520
+ # => ["product", "code", "abc"]
521
+
522
+ # Query time: normalize user input
523
+ query_tokens = TokenKit.tokenize("product abc")
524
+ # => ["product", "abc"]
525
+
526
+ # Fuzzy matching with n-grams
527
+ TokenKit.configure do |config|
528
+ config.strategy = :ngram
529
+ config.min_gram = 2
530
+ config.max_gram = 4
531
+ config.lowercase = true
532
+ end
533
+
534
+ # Index time: generate n-grams
535
+ TokenKit.tokenize("search")
536
+ # => ["se", "ea", "ar", "rc", "ch", "sea", "ear", "arc", "rch", "sear", "earc", "arch"]
537
+
538
+ # Query time: typo "serch" still has significant overlap
539
+ TokenKit.tokenize("serch")
540
+ # => ["se", "er", "rc", "ch", "ser", "erc", "rch", "serc", "erch"]
541
+ # Overlap: ["se", "rc", "ch", "rch"] allows matching despite typo
542
+
543
+ # Autocomplete with edge n-grams
544
+ TokenKit.configure do |config|
545
+ config.strategy = :edge_ngram
546
+ config.min_gram = 2
547
+ config.max_gram = 10
548
+ end
549
+
550
+ TokenKit.tokenize("laptop")
551
+ # => ["la", "lap", "lapt", "lapto", "laptop"]
552
+ # Matches "la", "lap", "lapt" as user types
553
+ ```
554
+
555
+ ## Performance
556
+
557
+ TokenKit has been extensively optimized for production use:
558
+
559
+ - **Unicode tokenization**: ~870K tokens/sec (baseline)
560
+ - **Pattern preservation**: ~410K tokens/sec with 4 patterns (was 3.6K/sec before v0.3.0 optimizations)
561
+ - **Memory efficient**: Pre-allocated buffers and in-place operations
562
+ - **Thread-safe**: Cached instances with mutex protection, safe for concurrent use (see the sketch at the end of this section)
563
+ - **110x speedup**: For pattern-heavy workloads through intelligent caching
564
+
565
+ Key optimizations:
566
+ - Regex patterns compiled once and cached (not per-tokenization)
567
+ - String allocations minimized through index-based operations
568
+ - Tokenizer instances reused across calls
569
+ - In-place post-processing for lowercase and punctuation removal
570
+
571
+ See the [Performance Guide](docs/PERFORMANCE.md) for detailed benchmarks and optimization techniques.
572
+
573
+ ## Integration
574
+
575
+ TokenKit is designed to work with other gems in the scientist-labs ecosystem:
576
+
577
+ - **PhraseKit**: Use TokenKit for consistent phrase extraction
578
+ - **SpellKit**: Tokenize before spell correction
579
+ - **red-candle**: Tokenize before NER/embeddings
580
+
581
+ ## Documentation
582
+
583
+ - [API Documentation](https://rubydoc.info/gems/tokenkit) - Full API reference
584
+ - [Architecture Guide](docs/ARCHITECTURE.md) - Internal design and structure
585
+ - [Performance Guide](docs/PERFORMANCE.md) - Benchmarks and optimization details
586
+
587
+ ### Generating Documentation Locally
588
+
589
+ ```bash
590
+ # Install documentation dependencies
591
+ bundle install
592
+
593
+ # Generate YARD documentation
594
+ bundle exec yard doc
595
+
596
+ # Open documentation in browser
597
+ open doc/index.html
598
+ ```
599
+
600
+ ## Development
601
+
602
+ ```bash
603
+ # Setup
604
+ bundle install
605
+ bundle exec rake compile
606
+
607
+ # Run tests
608
+ bundle exec rspec
609
+
610
+ # Run tests with coverage
611
+ COVERAGE=true bundle exec rspec
612
+
613
+ # Run linter
614
+ bundle exec standardrb
615
+
616
+ # Run benchmarks
617
+ ruby benchmarks/tokenizer_benchmark.rb
618
+
619
+ # Build gem
620
+ gem build tokenkit.gemspec
621
+ ```
622
+
623
+ ## Requirements
624
+
625
+ - Ruby >= 3.1.0
626
+ - Rust toolchain (for building from source)
627
+
628
+ ## License
629
+
630
+ MIT License. See [LICENSE.txt](LICENSE.txt) for details.
631
+
632
+ ## Contributing
633
+
634
+ Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/tokenkit.
635
+
636
+ This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](CODE_OF_CONDUCT.md).
637
+
638
+ ## Credits
639
+
640
+ Built with:
641
+ - [Magnus](https://github.com/matsadler/magnus) for Ruby-Rust bindings
642
+ - [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) for Unicode word boundaries
643
+ - [linkify](https://github.com/robinst/linkify) for robust URL and email detection
644
+ - [regex](https://github.com/rust-lang/regex) for pattern matching
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "bundler/gem_tasks"
4
+ require "rspec/core/rake_task"
5
+
6
+ RSpec::Core::RakeTask.new(:spec)
7
+
8
+ require "standard/rake"
9
+ require "rake/extensiontask"
10
+
11
+ GEMSPEC = Gem::Specification.load("tokenkit.gemspec")
12
+
13
+ Rake::ExtensionTask.new("tokenkit", GEMSPEC) do |ext|
14
+ ext.lib_dir = "lib/tokenkit"
15
+ end
16
+
17
+ task spec: :compile
18
+ task default: %i[clobber compile spec standard]