tokenkit 0.1.0.pre.1

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. checksums.yaml +7 -0
  2. data/.rspec +3 -0
  3. data/.standard.yml +3 -0
  4. data/.yardopts +12 -0
  5. data/CODE_OF_CONDUCT.md +132 -0
  6. data/LICENSE.txt +21 -0
  7. data/README.md +644 -0
  8. data/Rakefile +18 -0
  9. data/benchmarks/cache_test.rb +63 -0
  10. data/benchmarks/final_comparison.rb +83 -0
  11. data/benchmarks/tokenizer_benchmark.rb +250 -0
  12. data/docs/ARCHITECTURE.md +469 -0
  13. data/docs/PERFORMANCE.md +382 -0
  14. data/docs/README.md +118 -0
  15. data/ext/tokenkit/Cargo.toml +21 -0
  16. data/ext/tokenkit/extconf.rb +4 -0
  17. data/ext/tokenkit/src/config.rs +37 -0
  18. data/ext/tokenkit/src/error.rs +67 -0
  19. data/ext/tokenkit/src/lib.rs +346 -0
  20. data/ext/tokenkit/src/tokenizer/base.rs +41 -0
  21. data/ext/tokenkit/src/tokenizer/char_group.rs +62 -0
  22. data/ext/tokenkit/src/tokenizer/edge_ngram.rs +73 -0
  23. data/ext/tokenkit/src/tokenizer/grapheme.rs +26 -0
  24. data/ext/tokenkit/src/tokenizer/keyword.rs +25 -0
  25. data/ext/tokenkit/src/tokenizer/letter.rs +41 -0
  26. data/ext/tokenkit/src/tokenizer/lowercase.rs +51 -0
  27. data/ext/tokenkit/src/tokenizer/mod.rs +254 -0
  28. data/ext/tokenkit/src/tokenizer/ngram.rs +80 -0
  29. data/ext/tokenkit/src/tokenizer/path_hierarchy.rs +187 -0
  30. data/ext/tokenkit/src/tokenizer/pattern.rs +38 -0
  31. data/ext/tokenkit/src/tokenizer/sentence.rs +89 -0
  32. data/ext/tokenkit/src/tokenizer/unicode.rs +36 -0
  33. data/ext/tokenkit/src/tokenizer/url_email.rs +108 -0
  34. data/ext/tokenkit/src/tokenizer/whitespace.rs +31 -0
  35. data/lib/tokenkit/config.rb +74 -0
  36. data/lib/tokenkit/config_builder.rb +209 -0
  37. data/lib/tokenkit/config_compat.rb +52 -0
  38. data/lib/tokenkit/configuration.rb +194 -0
  39. data/lib/tokenkit/regex_converter.rb +58 -0
  40. data/lib/tokenkit/version.rb +5 -0
  41. data/lib/tokenkit.rb +336 -0
  42. data/sig/tokenkit.rbs +4 -0
  43. metadata +172 -0
data/docs/ARCHITECTURE.md
@@ -0,0 +1,469 @@
# Architecture Guide

This guide explains the internal architecture of TokenKit, design decisions, and how the Ruby and Rust components work together.

## Overview

TokenKit is a hybrid Ruby/Rust gem that leverages Rust's performance for tokenization while providing a friendly Ruby API. The architecture prioritizes:

1. **Performance**: Rust implementation with minimal FFI overhead
2. **Safety**: Thread-safe by design with proper error handling
3. **Flexibility**: Multiple tokenization strategies with unified API
4. **Maintainability**: Clear separation of concerns and trait-based design

```
┌─────────────────┐
│   Ruby Layer    │  lib/tokenkit.rb, lib/tokenkit/*.rb
├─────────────────┤
│  Magnus Bridge  │  FFI boundary (automatic serialization)
├─────────────────┤
│   Rust Layer    │  ext/tokenkit/src/*.rs
└─────────────────┘
```

## Component Architecture

### Ruby Layer (`lib/`)

```
lib/
├── tokenkit.rb             # Main module and API
├── tokenkit/
│   ├── version.rb          # Version constant
│   ├── tokenizer.rb        # Instance-based tokenizer
│   └── configuration.rb    # Config object with accessors
```

**Key Components:**

1. **TokenKit Module** (`lib/tokenkit.rb`):
   - Public API methods: `tokenize`, `configure`, `reset`
   - Delegates to Rust via Magnus
   - Handles option merging for per-call overrides (see the usage sketch after this list)

2. **Configuration** (`lib/tokenkit/configuration.rb`):
   - Ruby wrapper around config hash from Rust
   - Provides convenient accessors and predicates
   - Immutable once created

3. **Tokenizer Class** (`lib/tokenkit/tokenizer.rb`):
   - Instance-based API for specific configurations
   - Useful for bulk processing with different settings
   - Wraps module-level functions
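
A usage sketch of this module-level API (only `tokenize`, `configure`, and `reset` are confirmed above; the per-call override keyword is an assumption):

```ruby
require "tokenkit"

# Configure the process-wide default tokenizer once.
TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
end

TokenKit.tokenize("Hello World")
# => ["hello", "world"]

# Hypothetical per-call override, merged over the global configuration.
TokenKit.tokenize("Hello World", lowercase: false)

# Restore defaults.
TokenKit.reset
```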

### Rust Layer (`ext/tokenkit/src/`)

```
ext/tokenkit/src/
├── lib.rs              # Magnus bindings and caching
├── config.rs           # Configuration structs
├── error.rs            # Error types with thiserror
├── tokenizer/
│   ├── mod.rs          # Trait definition and factory
│   ├── base.rs         # Common functionality
│   ├── unicode.rs      # Unicode word boundaries
│   ├── whitespace.rs   # Simple whitespace splitting
│   ├── pattern.rs      # Regex-based tokenization
│   └── ...             # Other tokenizer implementations
```

**Key Components:**

1. **Entry Point** (`lib.rs`):
   - Magnus function exports
   - Tokenizer cache management
   - Configuration parsing and validation
   - Ruby ↔ Rust type conversion

2. **Configuration** (`config.rs`):
   - `TokenizerConfig` struct with all settings
   - `TokenizerStrategy` enum for strategy-specific options (see the sketch after this list)
   - Serde serialization for debugging

3. **Error Handling** (`error.rs`):
   - `TokenizerError` enum with thiserror
   - Automatic conversion to Ruby exceptions
   - Detailed error messages

4. **Tokenizer Trait** (`tokenizer/mod.rs`):
   ```rust
   pub trait Tokenizer: Send + Sync {
       fn tokenize(&self, text: &str) -> Vec<String>;
   }
   ```
   - Simple, focused interface
   - Thread-safe (`Send + Sync`)
   - Returns owned strings for Ruby

5. **Base Functionality** (`tokenizer/base.rs`):
   - `BaseTokenizerFields` for common state
   - Regex compilation for preserve_patterns
   - Shared helper functions
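
The guide does not reproduce `config.rs` itself; the following is a hypothetical sketch of what the struct and enum described above could look like, inferred from the options used elsewhere in this guide (field and variant names are assumptions):

```rust
use serde::{Deserialize, Serialize};

/// Illustrative only: settings shared by every strategy.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct TokenizerConfig {
    pub strategy: TokenizerStrategy,
    pub lowercase: bool,
    pub preserve_patterns: Vec<String>,
}

/// Illustrative only: one variant per tokenization strategy,
/// carrying strategy-specific options where needed.
#[derive(Clone, Debug, Serialize, Deserialize)]
pub enum TokenizerStrategy {
    Unicode,
    Whitespace,
    Pattern { regex: String },
    // ... other strategies
}
```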

## Design Patterns

### 1. Strategy Pattern

Each tokenization strategy implements the `Tokenizer` trait:

```rust
// Factory function creates appropriate tokenizer
pub fn from_config(config: TokenizerConfig) -> Result<Box<dyn Tokenizer>> {
    match config.strategy {
        TokenizerStrategy::Unicode => Ok(Box::new(UnicodeTokenizer::new(config))),
        TokenizerStrategy::Whitespace => Ok(Box::new(WhitespaceTokenizer::new(config))),
        // ... other strategies
    }
}
```

### 2. Composition over Inheritance

Tokenizers compose `BaseTokenizerFields` for common functionality:

```rust
pub struct UnicodeTokenizer {
    base: BaseTokenizerFields,
}

impl UnicodeTokenizer {
    pub fn new(config: TokenizerConfig) -> Self {
        Self {
            base: BaseTokenizerFields::new(config),
        }
    }
}
```

### 3. Lazy Initialization with Caching

Tokenizer instances are created lazily and cached:

```rust
static DEFAULT_CACHE: Lazy<Mutex<TokenizerCache>> = Lazy::new(|| {
    Mutex::new(TokenizerCache {
        config: TokenizerConfig::default(),
        tokenizer: None, // Created on first use
    })
});

fn tokenize(text: String) -> Result<Vec<String>, Error> {
    let mut cache = DEFAULT_CACHE.lock()?;

    if cache.tokenizer.is_none() {
        cache.tokenizer = Some(from_config(cache.config.clone())?);
    }

    Ok(cache.tokenizer.as_ref().unwrap().tokenize(&text))
}
```

### 4. Builder Pattern for Configuration

Configuration uses a builder-like pattern in Ruby:

```ruby
TokenKit.configure do |config|
  config.strategy = :unicode
  config.lowercase = true
  config.preserve_patterns = [...]
end
```

## Pattern Preservation Architecture

Pattern preservation is a key feature that required careful design:

### Problem

Different tokenizers split text differently, but preserve_patterns must work consistently across all strategies.
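
For example, a pattern flagged for preservation should come through unchanged regardless of the strategy that splits the surrounding text. The snippet below is illustrative; the exact pattern syntax accepted by `preserve_patterns` and the exact output are assumptions:

```ruby
TokenKit.configure do |config|
  config.strategy = :whitespace
  config.lowercase = true
  config.preserve_patterns = [/[A-Z]{2,}-\d+/]  # e.g. ticket IDs like "ABC-123"
end

# The preserved match keeps its original casing even though lowercase is on.
TokenKit.tokenize("Fixed ABC-123 in the parser")
# => ["fixed", "ABC-123", "in", "the", "parser"]  (illustrative output)
```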
180
+
181
+ ### Solution
182
+
183
+ Strategy-aware pattern preservation:
184
+
185
+ ```rust
186
+ // Each tokenizer can provide custom tokenization for pattern gaps
187
+ pub fn apply_preserve_patterns_with_tokenizer<F>(
188
+ tokens: Vec<String>,
189
+ preserve_patterns: &[Regex],
190
+ original_text: &str,
191
+ config: &TokenizerConfig,
192
+ tokenizer_fn: F,
193
+ ) -> Vec<String>
194
+ where
195
+ F: Fn(&str) -> Vec<String>
196
+ ```
197
+
198
+ Example for CharGroup tokenizer:
199
+ ```rust
200
+ // CharGroup uses its own delimiter logic for consistency
201
+ let tokens_with_patterns = apply_preserve_patterns_with_tokenizer(
202
+ tokens,
203
+ self.base.preserve_patterns(),
204
+ text,
205
+ &self.base.config,
206
+ |text| self.tokenize_text(text), // Uses CharGroup's split logic
207
+ );
208
+ ```
209
+
210
+ ## Thread Safety
211
+
212
+ TokenKit is thread-safe through careful design:
213
+
214
+ ### Global State Protection
215
+
216
+ ```rust
217
+ // Single mutex protects configuration and tokenizer
218
+ static DEFAULT_CACHE: Lazy<Mutex<TokenizerCache>> = Lazy::new(...)
219
+ ```
220
+
221
+ ### Tokenizer Trait Requirements
222
+
223
+ ```rust
224
+ pub trait Tokenizer: Send + Sync {
225
+ // Send: Can be transferred between threads
226
+ // Sync: Can be shared between threads
227
+ }
228
+ ```
229
+
230
+ ### Immutable Tokenizers
231
+
232
+ Once created, tokenizers are immutable. Configuration changes create new instances:
233
+
234
+ ```rust
235
+ fn configure(config_hash: RHash) -> Result<(), Error> {
236
+ let config = parse_config_from_hash(config_hash)?;
237
+ let mut cache = DEFAULT_CACHE.lock()?;
238
+ cache.config = config;
239
+ cache.tokenizer = None; // Invalidate old tokenizer
240
+ Ok(())
241
+ }
242
+ ```
243
+
244
+ ## Error Handling
245
+
246
+ Errors flow from Rust to Ruby with proper type conversion:
247
+
248
+ ### Rust Side
249
+
250
+ ```rust
251
+ #[derive(Error, Debug)]
252
+ pub enum TokenizerError {
253
+ #[error("Invalid regex pattern '{pattern}': {error}")]
254
+ InvalidRegex { pattern: String, error: String },
255
+
256
+ #[error("Invalid n-gram configuration: min_gram ({min}) must be > 0 and <= max_gram ({max})")]
257
+ InvalidNgramConfig { min: usize, max: usize },
258
+ // ...
259
+ }
260
+ ```
261
+
262
+ ### Conversion to Ruby
263
+
264
+ ```rust
265
+ impl From<TokenizerError> for magnus::Error {
266
+ fn from(e: TokenizerError) -> Self {
267
+ match e {
268
+ TokenizerError::InvalidRegex { .. } => {
269
+ magnus::Error::new(regexp_error_class(), e.to_string())
270
+ }
271
+ TokenizerError::InvalidNgramConfig { .. } => {
272
+ magnus::Error::new(arg_error_class(), e.to_string())
273
+ }
274
+ // ...
275
+ }
276
+ }
277
+ }
278
+ ```
279
+
280
+ ### Ruby Side
281
+
282
+ ```ruby
283
+ begin
284
+ TokenKit.configure do |c|
285
+ c.strategy = :pattern
286
+ c.regex = "[invalid("
287
+ end
288
+ rescue RegexpError => e
289
+ puts "Invalid regex: #{e.message}"
290
+ end
291
+ ```
292
+
293
+ ## Performance Optimizations
294
+
295
### 1. Cached Tokenizer Instances

Caching eliminates regex recompilation on every tokenization call (see the cache sketch below):
- 110x speedup for pattern-heavy workloads
- Tokenizer created once, reused many times
- Cache invalidated only on configuration change
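
The cache pairs the current configuration with a lazily built tokenizer. A minimal sketch of the structure implied by the `Lazy` initializer shown earlier (the struct definition itself is assumed):

```rust
// Illustrative only: the cached pair of config and tokenizer instance.
struct TokenizerCache {
    config: TokenizerConfig,
    // None until the first tokenize() call after a configuration change.
    tokenizer: Option<Box<dyn Tokenizer>>,
}
```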

### 2. Zero-Copy String Slicing

Where possible, we work with string slices:

```rust
let before = &original_text[pos..start]; // No allocation
```

### 3. Capacity Hints

Pre-allocate vectors with estimated sizes:

```rust
let mut result = Vec::with_capacity(tokens.len() + preserved_spans.len());
```

### 4. Lazy String Allocation

Store indices instead of strings until needed:

```rust
// Store only positions during pattern matching
let mut preserved_spans: Vec<(usize, usize)> = Vec::with_capacity(32);

// Allocate strings only when building result
result.push(original_text[start..end].to_string());
```

## Magnus Bridge

Magnus provides seamless Ruby ↔ Rust interop:

### Type Conversions

```rust
// Ruby String → Rust String
fn tokenize(text: String) -> Result<Vec<String>, Error>

// Ruby Hash → Rust Config
fn configure(config_hash: RHash) -> Result<(), Error>

// Rust Vec<String> → Ruby Array
Ok(tokenizer.tokenize(&text)) // Automatic conversion
```

### Function Export

```rust
#[magnus::init]
fn init(_ruby: &magnus::Ruby) -> Result<(), Error> {
    let module = define_module("TokenKit")?;

    module.define_module_function("_tokenize", function!(tokenize, 1))?;
    module.define_module_function("_configure", function!(configure, 1))?;
    // ...
}
```

### Memory Management

Magnus handles memory management automatically:
- Rust strings converted to Ruby strings
- Ruby GC manages Ruby objects
- Rust objects dropped when out of scope

## Testing Strategy

### Ruby Tests (`spec/`)

- Integration tests for the public API (see the example below)
- Test all tokenization strategies
- Verify Ruby-specific behavior
- Edge cases and error conditions
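
A hypothetical spec in this style (the expectations are illustrative, not taken from the gem's actual suite):

```ruby
RSpec.describe TokenKit do
  after { TokenKit.reset }

  it "tokenizes with the whitespace strategy" do
    TokenKit.configure { |c| c.strategy = :whitespace }
    expect(TokenKit.tokenize("hello world")).to eq(["hello", "world"])
  end

  it "raises RegexpError for an invalid pattern regex" do
    expect {
      TokenKit.configure do |c|
        c.strategy = :pattern
        c.regex = "[invalid("
      end
    }.to raise_error(RegexpError)
  end
end
```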

### Rust Tests

Currently, we rely on Ruby tests for coverage. Future work could add:
- Unit tests for tokenizer implementations (see the sketch below)
- Property-based testing for pattern preservation
- Benchmarks in Rust
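
Such a unit test could look like the following sketch (constructor and config names taken from the examples earlier in this guide; the expected output is assumed):

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn whitespace_tokenizer_splits_on_spaces() {
        let tokenizer = WhitespaceTokenizer::new(TokenizerConfig::default());
        assert_eq!(tokenizer.tokenize("hello world"), vec!["hello", "world"]);
    }
}
```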

### Coverage

- 94.12% line coverage
- 87.8% branch coverage
- All tokenizers thoroughly tested
- Error paths verified

## Build System

### Compilation

1. `rake compile` invokes cargo (see the `extconf.rb` sketch after this list)
2. Cargo builds with release optimizations:
   ```toml
   [profile.release]
   lto = true        # Link-time optimization
   codegen-units = 1 # Better optimization
   ```
3. Shared library copied to lib/
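
The gem ships a four-line `ext/tokenkit/extconf.rb` for this step. Its contents are not shown in this document, but a typical Magnus/rb_sys extension's `extconf.rb` looks roughly like this (hypothetical, not copied from the gem):

```ruby
# Hypothetical extconf.rb for a Rust extension built with rb_sys.
require "mkmf"
require "rb_sys/mkmf"

create_rust_makefile("tokenkit/tokenkit")
```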
402
+
403
+ ### Cross-Platform
404
+
405
+ - macOS: `.bundle` extension
406
+ - Linux: `.so` extension
407
+ - Windows: `.dll` extension
408
+
409
+ Magnus handles platform differences automatically.
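
On the Rust side, the extension crate is compiled as a C-compatible dynamic library that Ruby can load on each platform. A sketch of the relevant `Cargo.toml` entries (the gem's actual manifest and the magnus version shown are assumptions):

```toml
[lib]
crate-type = ["cdylib"]   # produce a shared library Ruby can load

[dependencies]
magnus = "0.6"
```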
410
+
411
+ ## Future Architecture Improvements
412
+
413
+ ### 1. Parallel Tokenization
414
+
415
+ For very large texts, parallelize using Rayon:
416
+ ```rust
417
+ use rayon::prelude::*;
418
+
419
+ large_text.par_lines()
420
+ .flat_map(|line| tokenizer.tokenize(line))
421
+ .collect()
422
+ ```
423
+
424
+ ### 2. Streaming API
425
+
426
+ For huge documents:
427
+ ```rust
428
+ pub trait StreamingTokenizer {
429
+ fn tokenize_stream(&self, reader: impl BufRead) -> impl Iterator<Item = String>;
430
+ }
431
+ ```
432
+
433
+ ### 3. Custom Allocators
434
+
435
+ Investigate jemalloc or mimalloc for better performance:
436
+ ```toml
437
+ [dependencies]
438
+ tikv-jemallocator = "0.5"
439
+ ```
440
+
441
+ ### 4. SIMD Optimizations
442
+
443
+ Use SIMD for character scanning:
444
+ ```rust
445
+ use std::simd::*;
446
+ // Vectorized character matching
447
+ ```
448
+
449
+ ### 5. Regex Set Optimization
450
+
451
When multiple patterns are in play, `regex::RegexSet` can match them all in a single pass:

```rust
use regex::RegexSet;

let patterns = RegexSet::new(&[...]).unwrap();
let matches: Vec<_> = patterns.matches(text).into_iter().collect();
```

## Conclusion

TokenKit's architecture balances performance, safety, and usability through:

- Clean separation between Ruby API and Rust implementation
- Trait-based design for extensibility
- Intelligent caching for performance
- Comprehensive error handling
- Thread-safe by design

The hybrid Ruby/Rust approach provides the best of both worlds: Ruby's elegant API and Rust's performance.