tokenkit 0.1.0.pre.1

This diff shows the content of publicly available package versions released to one of the supported registries. It is provided for informational purposes only and reflects the changes between package versions as they appear in their respective public registries.
Files changed (43)
  1. checksums.yaml +7 -0
  2. data/.rspec +3 -0
  3. data/.standard.yml +3 -0
  4. data/.yardopts +12 -0
  5. data/CODE_OF_CONDUCT.md +132 -0
  6. data/LICENSE.txt +21 -0
  7. data/README.md +644 -0
  8. data/Rakefile +18 -0
  9. data/benchmarks/cache_test.rb +63 -0
  10. data/benchmarks/final_comparison.rb +83 -0
  11. data/benchmarks/tokenizer_benchmark.rb +250 -0
  12. data/docs/ARCHITECTURE.md +469 -0
  13. data/docs/PERFORMANCE.md +382 -0
  14. data/docs/README.md +118 -0
  15. data/ext/tokenkit/Cargo.toml +21 -0
  16. data/ext/tokenkit/extconf.rb +4 -0
  17. data/ext/tokenkit/src/config.rs +37 -0
  18. data/ext/tokenkit/src/error.rs +67 -0
  19. data/ext/tokenkit/src/lib.rs +346 -0
  20. data/ext/tokenkit/src/tokenizer/base.rs +41 -0
  21. data/ext/tokenkit/src/tokenizer/char_group.rs +62 -0
  22. data/ext/tokenkit/src/tokenizer/edge_ngram.rs +73 -0
  23. data/ext/tokenkit/src/tokenizer/grapheme.rs +26 -0
  24. data/ext/tokenkit/src/tokenizer/keyword.rs +25 -0
  25. data/ext/tokenkit/src/tokenizer/letter.rs +41 -0
  26. data/ext/tokenkit/src/tokenizer/lowercase.rs +51 -0
  27. data/ext/tokenkit/src/tokenizer/mod.rs +254 -0
  28. data/ext/tokenkit/src/tokenizer/ngram.rs +80 -0
  29. data/ext/tokenkit/src/tokenizer/path_hierarchy.rs +187 -0
  30. data/ext/tokenkit/src/tokenizer/pattern.rs +38 -0
  31. data/ext/tokenkit/src/tokenizer/sentence.rs +89 -0
  32. data/ext/tokenkit/src/tokenizer/unicode.rs +36 -0
  33. data/ext/tokenkit/src/tokenizer/url_email.rs +108 -0
  34. data/ext/tokenkit/src/tokenizer/whitespace.rs +31 -0
  35. data/lib/tokenkit/config.rb +74 -0
  36. data/lib/tokenkit/config_builder.rb +209 -0
  37. data/lib/tokenkit/config_compat.rb +52 -0
  38. data/lib/tokenkit/configuration.rb +194 -0
  39. data/lib/tokenkit/regex_converter.rb +58 -0
  40. data/lib/tokenkit/version.rb +5 -0
  41. data/lib/tokenkit.rb +336 -0
  42. data/sig/tokenkit.rbs +4 -0
  43. metadata +172 -0
data/README.md ADDED
@@ -0,0 +1,644 @@
1
+ # TokenKit
2
+
3
+ Fast, Rust-backed word-level tokenization for Ruby with pattern preservation.
4
+
5
+ TokenKit is a Ruby wrapper around Rust's [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) crate, providing lightweight, Unicode-aware tokenization designed for NLP pipelines, search applications, and text processing where you need consistent, high-quality word segmentation.
6
+
7
+ ## Quickstart
8
+
9
+ ```ruby
10
+ # Install the gem
11
+ gem install tokenkit
12
+
13
+ # Or add to your Gemfile
14
+ gem 'tokenkit'
15
+ ```
16
+
17
+ ```ruby
18
+ require 'tokenkit'
19
+
20
+ # Basic tokenization - handles Unicode, contractions, accents
21
+ TokenKit.tokenize("Hello, world! café can't")
22
+ # => ["hello", "world", "café", "can't"]
23
+
24
+ # Preserve domain-specific terms even when lowercasing
25
+ TokenKit.configure do |config|
26
+ config.lowercase = true
27
+ config.preserve_patterns = [
28
+ /\d+ug/i, # Measurements: 100ug
29
+ /[A-Z][A-Z0-9]+/ # Gene names: BRCA1, TP53
30
+ ]
31
+ end
32
+
33
+ TokenKit.tokenize("Patient received 100ug for BRCA1 study")
34
+ # => ["patient", "received", "100ug", "for", "BRCA1", "study"]
35
+ ```
36
+
37
+ ## Features
38
+
39
+ - **Thirteen tokenization strategies**: whitespace, unicode (recommended), custom regex patterns, sentence, grapheme, keyword, edge n-gram, n-gram, path hierarchy, URL/email-aware, character group, letter, and lowercase
40
+ - **Pattern preservation**: Keep domain-specific terms (gene names, measurements, antibodies) intact even with case normalization
41
+ - **Fast**: Rust-backed implementation (~100K docs/sec)
42
+ - **Thread-safe**: Safe for concurrent use
43
+ - **Simple API**: Configure once, use everywhere
44
+ - **Zero dependencies**: Pure Ruby API backed by a native Rust extension
45
+
46
+ ## Tokenization Strategies
47
+
48
+ ### Unicode (Recommended)
49
+
50
+ Uses Unicode word segmentation for proper handling of contractions, accents, and multi-language text.
51
+
52
+ **✅ Supports `preserve_patterns`**
53
+
54
+ ```ruby
55
+ TokenKit.configure do |config|
56
+ config.strategy = :unicode
57
+ config.lowercase = true
58
+ end
59
+
60
+ TokenKit.tokenize("Don't worry about café!")
61
+ # => ["don't", "worry", "about", "café"]
62
+ ```
63
+
64
+ ### Whitespace
65
+
66
+ Simple whitespace splitting.
67
+
68
+ **✅ Supports `preserve_patterns`**
69
+
70
+ ```ruby
71
+ TokenKit.configure do |config|
72
+ config.strategy = :whitespace
73
+ config.lowercase = true
74
+ end
75
+
76
+ TokenKit.tokenize("hello world")
77
+ # => ["hello", "world"]
78
+ ```
79
+
80
+ ### Pattern (Custom Regex)
81
+
82
+ Custom tokenization using regex patterns.
83
+
84
+ **✅ Supports `preserve_patterns`**
85
+
86
+ ```ruby
87
+ TokenKit.configure do |config|
88
+ config.strategy = :pattern
89
+ config.regex = /[\w-]+/ # Keep words and hyphens
90
+ config.lowercase = true
91
+ end
92
+
93
+ TokenKit.tokenize("anti-CD3 antibody")
94
+ # => ["anti-cd3", "antibody"]
95
+ ```
96
+
97
+ ### Sentence
98
+
99
+ Splits text into sentences using Unicode sentence boundaries.
100
+
101
+ **✅ Supports `preserve_patterns`** (preserves patterns within each sentence)
102
+
103
+ ```ruby
104
+ TokenKit.configure do |config|
105
+ config.strategy = :sentence
106
+ config.lowercase = false
107
+ end
108
+
109
+ TokenKit.tokenize("Hello world! How are you? I am fine.")
110
+ # => ["Hello world! ", "How are you? ", "I am fine."]
111
+ ```
112
+
113
+ Useful for document-level processing, sentence embeddings, or paragraph analysis.
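+
+ A natural two-pass setup is to split into sentences first, then word-tokenize each sentence with a per-call strategy override (a rough sketch; per-call options are described later in this README):
+
+ ```ruby
+ sentences = TokenKit.tokenize("Hello world! How are you?", strategy: :sentence)
+ sentences.map { |s| TokenKit.tokenize(s, strategy: :unicode, lowercase: true) }
+ # => [["hello", "world"], ["how", "are", "you"]]
+ ```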
114
+
115
+ ### Grapheme
116
+
117
+ Splits text into grapheme clusters (user-perceived characters).
118
+
119
+ **⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)
120
+
121
+ ```ruby
122
+ TokenKit.configure do |config|
123
+ config.strategy = :grapheme
124
+ config.grapheme_extended = true # Use extended grapheme clusters (default)
125
+ config.lowercase = false
126
+ end
127
+
128
+ TokenKit.tokenize("👨‍👩‍👧‍👦café")
129
+ # => ["👨‍👩‍👧‍👦", "c", "a", "f", "é"]
130
+ ```
131
+
132
+ Perfect for handling emoji, combining characters, and complex scripts. Set `grapheme_extended = false` for legacy grapheme boundaries.
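+
+ As a quick illustration (a sketch using the per-call options described later), the grapheme token count reflects user-perceived characters rather than code points:
+
+ ```ruby
+ text = "👨‍👩‍👧‍👦café"
+ TokenKit.tokenize(text, strategy: :grapheme, lowercase: false).size
+ # => 5 user-perceived characters (String#length is much larger: the family emoji alone spans 7 code points)
+ ```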
133
+
134
+ ### Keyword
135
+
136
+ Treats entire input as a single token (no splitting).
137
+
138
+ **⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)
139
+
140
+ ```ruby
141
+ TokenKit.configure do |config|
142
+ config.strategy = :keyword
143
+ config.lowercase = false
144
+ end
145
+
146
+ TokenKit.tokenize("PROD-2024-ABC-001")
147
+ # => ["PROD-2024-ABC-001"]
148
+ ```
149
+
150
+ Ideal for exact matching of SKUs, IDs, product codes, or category names where splitting would lose meaning.
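+
+ A common pattern is to mix strategies per field: keep identifier fields whole with `:keyword` while word-tokenizing free text (a sketch using per-call overrides):
+
+ ```ruby
+ TokenKit.tokenize("PROD-2024-ABC-001", strategy: :keyword, lowercase: false)
+ # => ["PROD-2024-ABC-001"]
+
+ TokenKit.tokenize("Wireless ergonomic mouse", strategy: :unicode, lowercase: true)
+ # => ["wireless", "ergonomic", "mouse"]
+ ```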
151
+
152
+ ### Edge N-gram (Search-as-you-type)
153
+
154
+ Generates prefixes from the beginning of words for autocomplete functionality.
155
+
156
+ **⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)
157
+
158
+ ```ruby
159
+ TokenKit.configure do |config|
160
+ config.strategy = :edge_ngram
161
+ config.min_gram = 2 # Minimum prefix length
162
+ config.max_gram = 10 # Maximum prefix length
163
+ config.lowercase = true
164
+ end
165
+
166
+ TokenKit.tokenize("laptop")
167
+ # => ["la", "lap", "lapt", "lapto", "laptop"]
168
+ ```
169
+
170
+ Essential for autocomplete, type-ahead search, and prefix matching. At index time, generate edge n-grams of your product names or search terms.
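+
+ For example, a toy in-memory autocomplete index could map each prefix token to the names that produce it (an illustrative sketch, assuming the `:edge_ngram` configuration above):
+
+ ```ruby
+ index = Hash.new { |h, k| h[k] = [] }
+
+ ["laptop", "lamp"].each do |name|
+   TokenKit.tokenize(name).each { |prefix| index[prefix] << name }
+ end
+
+ index["lap"] # => ["laptop"]
+ index["la"]  # => ["laptop", "lamp"] -- both match as the user types
+ ```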
171
+
172
+ ### N-gram (Fuzzy Matching)
173
+
174
+ Generates all substring n-grams (sliding window) for fuzzy matching and misspelling tolerance.
175
+
176
+ **⚠️ Does NOT support `preserve_patterns`** (patterns will be lowercased if `lowercase: true`)
177
+
178
+ ```ruby
179
+ TokenKit.configure do |config|
180
+ config.strategy = :ngram
181
+ config.min_gram = 2 # Minimum n-gram length
182
+ config.max_gram = 3 # Maximum n-gram length
183
+ config.lowercase = true
184
+ end
185
+
186
+ TokenKit.tokenize("quick")
187
+ # => ["qu", "ui", "ic", "ck", "qui", "uic", "ick"]
188
+ ```
189
+
190
+ Perfect for fuzzy search, typo tolerance, and partial matching. Unlike edge n-grams which only generate prefixes, n-grams generate all possible substrings.
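+
+ N-gram overlap can then serve as a crude fuzzy-match score (a sketch, assuming the `:ngram` configuration above with `min_gram` 2 and `max_gram` 3):
+
+ ```ruby
+ expected = TokenKit.tokenize("quick")
+ typo     = TokenKit.tokenize("quck")
+
+ (expected & typo).size.to_f / (expected | typo).size
+ # => 0.2 -- nonzero overlap lets the misspelling still match
+ ```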
191
+
192
+ ### Path Hierarchy (Hierarchical Navigation)
193
+
194
+ Creates tokens for each level of a path hierarchy.
195
+
196
+ **⚠️ Partially supports `preserve_patterns`** (has limitations with hierarchical structure)
197
+
198
+ ```ruby
199
+ TokenKit.configure do |config|
200
+ config.strategy = :path_hierarchy
201
+ config.delimiter = "/" # Use "\\" for Windows paths
202
+ config.lowercase = false
203
+ end
204
+
205
+ TokenKit.tokenize("/usr/local/bin/ruby")
206
+ # => ["/usr", "/usr/local", "/usr/local/bin", "/usr/local/bin/ruby"]
207
+
208
+ # Works for category hierarchies too
209
+ TokenKit.tokenize("electronics/computers/laptops")
210
+ # => ["electronics", "electronics/computers", "electronics/computers/laptops"]
211
+ ```
212
+
213
+ Perfect for filesystem paths, URL structures, category hierarchies, and breadcrumb navigation.
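+
+ Because every ancestor path becomes a token, per-level facet counts fall out naturally (an illustrative sketch, assuming the configuration above):
+
+ ```ruby
+ facets = Hash.new(0)
+
+ ["electronics/computers/laptops", "electronics/phones"].each do |path|
+   TokenKit.tokenize(path).each { |prefix| facets[prefix] += 1 }
+ end
+
+ facets["electronics"]           # => 2
+ facets["electronics/computers"] # => 1
+ ```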
214
+
215
+ ### URL/Email-Aware (Web Content)
216
+
217
+ Preserves URLs and email addresses as single tokens while tokenizing surrounding text.
218
+
219
+ **✅ Supports `preserve_patterns`** (preserves patterns alongside URLs/emails)
220
+
221
+ ```ruby
222
+ TokenKit.configure do |config|
223
+ config.strategy = :url_email
224
+ config.lowercase = true
225
+ end
226
+
227
+ TokenKit.tokenize("Contact support@example.com or visit https://example.com")
228
+ # => ["contact", "support@example.com", "or", "visit", "https://example.com"]
229
+ ```
230
+
231
+ Essential for user-generated content, customer support messages, product descriptions with links, and social media text.
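+
+ Because URLs and emails survive as single tokens, extracting them afterwards is a simple filter (a sketch, reusing the configuration and text above):
+
+ ```ruby
+ tokens = TokenKit.tokenize("Contact support@example.com or visit https://example.com")
+ tokens.grep(%r{\Ahttps?://|@})
+ # => ["support@example.com", "https://example.com"]
+ ```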
232
+
233
+ ### Character Group (Fast Custom Splitting)
234
+
235
+ Splits text based on a custom set of characters (faster than regex for simple delimiters).
236
+
237
+ **⚠️ Partially supports `preserve_patterns`** (works best with whitespace delimiters; non-whitespace delimiters may have issues)
238
+
239
+ ```ruby
240
+ TokenKit.configure do |config|
241
+ config.strategy = :char_group
242
+ config.split_on_chars = ",;" # Split on commas and semicolons
243
+ config.lowercase = false
244
+ end
245
+
246
+ TokenKit.tokenize("apple,banana;cherry")
247
+ # => ["apple", "banana", "cherry"]
248
+
249
+ # CSV parsing
250
+ TokenKit.tokenize("John Doe,30,Software Engineer")
251
+ # => ["John Doe", "30", "Software Engineer"]
252
+ ```
253
+
254
+ Ideal for structured data (CSV, TSV), log parsing, and custom delimiter-based formats. Default split characters are ` \t\n\r` (whitespace).
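+
+ The delimiter set can also be overridden per call, which is handy for pipe-delimited logs (a sketch using the per-call options described later):
+
+ ```ruby
+ TokenKit.tokenize(
+   "2024-01-01|INFO|Server started",
+   strategy: :char_group,
+   split_on_chars: "|",
+   lowercase: false
+ )
+ # => ["2024-01-01", "INFO", "Server started"]
+ ```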
255
+
256
+ ### Letter (Language-Agnostic)
257
+
258
+ Splits on any non-letter character (simpler than the Unicode tokenizer, with no special handling for contractions).
259
+
260
+ **✅ Supports `preserve_patterns`**
261
+
262
+ ```ruby
263
+ TokenKit.configure do |config|
264
+ config.strategy = :letter
265
+ config.lowercase = true
266
+ end
267
+
268
+ TokenKit.tokenize("hello-world123test")
269
+ # => ["hello", "world", "test"]
270
+
271
+ # Handles multiple scripts
272
+ TokenKit.tokenize("Hello-世界-test")
273
+ # => ["hello", "世界", "test"]
274
+ ```
275
+
276
+ Great for noisy text, mixed scripts, and cases where you want aggressive splitting on any non-letter character.
277
+
278
+ ### Lowercase (Efficient Case Normalization)
279
+
280
+ Like the Letter tokenizer but always lowercases in a single pass (more efficient than letter + lowercase filter).
281
+
282
+ **✅ Supports `preserve_patterns`** (preserved patterns keep their original case even though this tokenizer always lowercases)
283
+
284
+ ```ruby
285
+ TokenKit.configure do |config|
286
+ config.strategy = :lowercase
287
+ # Note: config.lowercase setting is ignored - this tokenizer ALWAYS lowercases
288
+ end
289
+
290
+ TokenKit.tokenize("HELLO-WORLD")
291
+ # => ["hello", "world"]
292
+
293
+ # Case-insensitive search indexing
294
+ TokenKit.tokenize("User-Agent: Mozilla/5.0")
295
+ # => ["user", "agent", "mozilla"]
296
+ ```
297
+
298
+ **⚠️ Important**: The `:lowercase` strategy **always** lowercases text, regardless of the `config.lowercase` setting. If you need control over lowercasing, use the `:letter` strategy instead with `config.lowercase = true/false`.
299
+
300
+ Perfect for case-insensitive search indexing, normalizing product codes, and cleaning social media text. Handles Unicode correctly, including characters that lowercase to multiple characters (e.g., Turkish İ).
301
+
302
+ ## Pattern Preservation
303
+
304
+ Preserve domain-specific terms even when lowercasing.
305
+
306
+ **Fully Supported by:** Unicode, Pattern, Whitespace, Letter, Lowercase, Sentence, and URL/Email tokenizers.
307
+
308
+ **Partially Supported by:** Character Group (works best with whitespace delimiters) and Path Hierarchy (limitations with hierarchical structure) tokenizers.
309
+
310
+ **Not Supported by:** Grapheme, Keyword, Edge N-gram, and N-gram tokenizers.
311
+
312
+ ```ruby
313
+ TokenKit.configure do |config|
314
+ config.strategy = :unicode
315
+ config.lowercase = true
316
+ config.preserve_patterns = [
317
+ /\d+(ug|mg|ml|units)/i, # Measurements: 100ug, 50mg
318
+ /anti-cd\d+/i, # Antibodies: Anti-CD3, anti-CD28
319
+ /[A-Z][A-Z0-9]+/ # Gene names: BRCA1, TP53, EGFR
320
+ ]
321
+ end
322
+
323
+ text = "Patient received 100ug Anti-CD3 with BRCA1 mutation"
324
+ tokens = TokenKit.tokenize(text)
325
+ # => ["patient", "received", "100ug", "Anti-CD3", "with", "BRCA1", "mutation"]
326
+ ```
327
+
328
+ Pattern matches maintain their original case despite `lowercase=true`.
329
+
330
+ ### Regex Flags
331
+
332
+ TokenKit supports Ruby regex flags for both `preserve_patterns` and the `:pattern` strategy:
333
+
334
+ ```ruby
335
+ # Case-insensitive matching (i flag)
336
+ TokenKit.configure do |config|
337
+ config.preserve_patterns = [/gene-\d+/i]
338
+ end
339
+
340
+ TokenKit.tokenize("Found GENE-123 and gene-456")
341
+ # => ["found", "GENE-123", "and", "gene-456"]
342
+
343
+ # Multiline mode (m flag) - dot matches newlines
344
+ TokenKit.configure do |config|
345
+ config.strategy = :pattern
346
+ config.regex = /test./m
347
+ end
348
+
349
+ # Extended mode (x flag) - allows comments and whitespace
350
+ pattern = /
351
+ \w+ # word characters
352
+ @ # at sign
353
+ \w+\.\w+ # domain.tld
354
+ /x
355
+
356
+ TokenKit.configure do |config|
357
+ config.preserve_patterns = [pattern]
358
+ end
359
+
360
+ # Combine flags
361
+ TokenKit.configure do |config|
362
+ config.preserve_patterns = [/code-\d+/im] # case-insensitive + multiline
363
+ end
364
+ ```
365
+
366
+ Supported flags:
367
+ - `i` - Case-insensitive matching
368
+ - `m` - Multiline mode (`.` matches newlines)
369
+ - `x` - Extended mode (ignore whitespace, allow comments)
370
+
371
+ Flags work with both Regexp objects and string patterns passed to the `:pattern` strategy.
372
+
373
+ ## Configuration
374
+
375
+ ### Global Configuration
376
+
377
+ ```ruby
378
+ TokenKit.configure do |config|
379
+ config.strategy = :unicode # :whitespace, :unicode, :pattern, :sentence, :grapheme, :keyword, :edge_ngram, :ngram, :path_hierarchy, :url_email, :char_group, :letter, :lowercase
380
+ config.lowercase = true # Normalize to lowercase
381
+ config.remove_punctuation = false # Remove punctuation from tokens
382
+ config.preserve_patterns = [] # Regex patterns to preserve
383
+
384
+ # Strategy-specific options
385
+ config.regex = /\w+/ # Only for :pattern strategy
386
+ config.grapheme_extended = true # Only for :grapheme strategy (default: true)
387
+ config.min_gram = 2 # For :edge_ngram and :ngram strategies (default: 2)
388
+ config.max_gram = 10 # For :edge_ngram and :ngram strategies (default: 10)
389
+ config.delimiter = "/" # Only for :path_hierarchy strategy (default: "/")
390
+ config.split_on_chars = " \t\n\r" # Only for :char_group strategy (default: whitespace)
391
+ end
392
+ ```
393
+
394
+ ### Per-Call Options
395
+
396
+ Override global config for specific calls:
397
+
398
+ ```ruby
399
+ # Override general options
400
+ TokenKit.tokenize("BRCA1 Gene", lowercase: false)
401
+ # => ["BRCA1", "Gene"]
402
+
403
+ # Override strategy-specific options
404
+ TokenKit.tokenize("laptop", strategy: :edge_ngram, min_gram: 3, max_gram: 5)
405
+ # => ["lap", "lapt", "lapto"]
406
+
407
+ TokenKit.tokenize("C:\\Windows\\System", strategy: :path_hierarchy, delimiter: "\\")
408
+ # => ["C:", "C:\\Windows", "C:\\Windows\\System"]
409
+
410
+ # Combine multiple overrides
411
+ TokenKit.tokenize(
412
+ "TEST",
413
+ strategy: :edge_ngram,
414
+ min_gram: 2,
415
+ max_gram: 3,
416
+ lowercase: false
417
+ )
418
+ # => ["TE", "TES"]
419
+ ```
420
+
421
+ All strategy-specific options can be overridden per-call (a combined sketch follows this list):
422
+ - `:pattern` - `regex: /pattern/`
423
+ - `:grapheme` - `extended: true/false`
424
+ - `:edge_ngram` - `min_gram: n, max_gram: n`
425
+ - `:ngram` - `min_gram: n, max_gram: n`
426
+ - `:path_hierarchy` - `delimiter: "/"`
427
+ - `:char_group` - `split_on_chars: ",;"`
428
+
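+ For instance, the `:grapheme` and `:char_group` options not shown above can be overridden the same way (an illustrative sketch):
+
+ ```ruby
+ # Legacy (non-extended) grapheme boundaries for a single call
+ TokenKit.tokenize("café", strategy: :grapheme, extended: false, lowercase: false)
+
+ # Split a query string on "=" and "&" for a single call
+ TokenKit.tokenize("a=1&b=2", strategy: :char_group, split_on_chars: "=&", lowercase: false)
+ # => ["a", "1", "b", "2"]
+ ```
+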
429
+ ### Get Current Config
430
+
431
+ ```ruby
432
+ config = TokenKit.config_hash
433
+ # Returns a Configuration object with accessor methods
434
+
435
+ config.strategy # => :unicode
436
+ config.lowercase # => true
437
+ config.remove_punctuation # => false
438
+ config.preserve_patterns # => [...]
439
+
440
+ # Strategy predicates
441
+ config.edge_ngram? # => false
442
+ config.ngram? # => false
443
+ config.pattern? # => false
444
+ config.grapheme? # => false
445
+ config.path_hierarchy? # => false
446
+ config.char_group? # => false
447
+ config.letter? # => false
448
+ config.lowercase? # => false
449
+
450
+ # Strategy-specific accessors
451
+ config.min_gram # => 2 (for edge_ngram and ngram)
452
+ config.max_gram # => 10 (for edge_ngram and ngram)
453
+ config.delimiter # => "/" (for path_hierarchy)
454
+ config.split_on_chars # => " \t\n\r" (for char_group)
455
+ config.extended # => true (for grapheme)
456
+ config.regex # => "..." (for pattern)
457
+
458
+ # Convert to hash if needed
459
+ config.to_h
460
+ # => {"strategy" => "unicode", "lowercase" => true, ...}
461
+ ```
462
+
463
+ ### Reset to Defaults
464
+
465
+ ```ruby
466
+ TokenKit.reset
467
+ ```
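+
+ In test suites, one common pattern is to reset after each example so configuration does not leak between specs (a sketch):
+
+ ```ruby
+ # spec/spec_helper.rb (illustrative)
+ RSpec.configure do |rspec|
+   rspec.after(:each) { TokenKit.reset }
+ end
+ ```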
468
+
469
+ ## Use Cases
470
+
471
+ ### Biotech/Life Sciences
472
+
473
+ ```ruby
474
+ TokenKit.configure do |config|
475
+ config.strategy = :unicode
476
+ config.lowercase = true
477
+ config.preserve_patterns = [
478
+ /\d+(ug|mg|ml|ul|units)/i, # Measurements
479
+ /anti-[a-z0-9-]+/i, # Antibodies
480
+ /[A-Z]{2,10}/, # Gene names (CDK10, BRCA1, TP53)
481
+ /cd\d+/i, # Cell markers (CD3, CD4, CD8)
482
+ /ig[gmaed]/i # Immunoglobulins (IgG, IgM)
483
+ ]
484
+ end
485
+
486
+ text = "Anti-CD3 IgG antibody 100ug for BRCA1 research"
487
+ tokens = TokenKit.tokenize(text)
488
+ # => ["Anti-CD3", "IgG", "antibody", "100ug", "for", "BRCA1", "research"]
489
+ ```
490
+
491
+ ### E-commerce/Catalogs
492
+
493
+ ```ruby
494
+ TokenKit.configure do |config|
495
+ config.strategy = :unicode
496
+ config.lowercase = true
497
+ config.preserve_patterns = [
498
+ /\$\d+(\.\d{2})?/, # Prices: $99.99
499
+ /\d+(-\d+)+/, # SKUs: 123-456-789
500
+ /\d+(mm|cm|inch)/i # Dimensions: 10mm, 5cm
501
+ ]
502
+ end
503
+
504
+ text = "Widget $49.99 SKU: 123-456 size: 10cm"
505
+ tokens = TokenKit.tokenize(text)
506
+ # => ["widget", "$49.99", "sku", "123-456", "size", "10cm"]
507
+ ```
508
+
509
+ ### Search Applications
510
+
511
+ ```ruby
512
+ # Exact matching with case normalization
513
+ TokenKit.configure do |config|
514
+ config.strategy = :lowercase
515
+ config.lowercase = true
516
+ end
517
+
518
+ # Index time: normalize documents
519
+ doc_tokens = TokenKit.tokenize("Product Code: ABC-123")
520
+ # => ["product", "code", "abc"]
521
+
522
+ # Query time: normalize user input
523
+ query_tokens = TokenKit.tokenize("product abc")
524
+ # => ["product", "abc"]
525
+
526
+ # Fuzzy matching with n-grams
527
+ TokenKit.configure do |config|
528
+ config.strategy = :ngram
529
+ config.min_gram = 2
530
+ config.max_gram = 4
531
+ config.lowercase = true
532
+ end
533
+
534
+ # Index time: generate n-grams
535
+ TokenKit.tokenize("search")
536
+ # => ["se", "ea", "ar", "rc", "ch", "sea", "ear", "arc", "rch", "sear", "earc", "arch"]
537
+
538
+ # Query time: typo "serch" still has significant overlap
539
+ TokenKit.tokenize("serch")
540
+ # => ["se", "er", "rc", "ch", "ser", "erc", "rch", "serc", "erch"]
541
+ # Overlap: ["se", "rc", "ch", "rch"] allows matching despite typo
542
+
543
+ # Autocomplete with edge n-grams
544
+ TokenKit.configure do |config|
545
+ config.strategy = :edge_ngram
546
+ config.min_gram = 2
547
+ config.max_gram = 10
548
+ end
549
+
550
+ TokenKit.tokenize("laptop")
551
+ # => ["la", "lap", "lapt", "lapto", "laptop"]
552
+ # Matches "la", "lap", "lapt" as user types
553
+ ```
554
+
555
+ ## Performance
556
+
557
+ TokenKit has been extensively optimized for production use:
558
+
559
+ - **Unicode tokenization**: ~870K tokens/sec (baseline)
560
+ - **Pattern preservation**: ~410K tokens/sec with 4 patterns (was 3.6K/sec before v0.3.0 optimizations)
561
+ - **Memory efficient**: Pre-allocated buffers and in-place operations
562
+ - **Thread-safe**: Cached instances with mutex protection, safe for concurrent use (see the sketch at the end of this section)
563
+ - **110x speedup**: For pattern-heavy workloads through intelligent caching
564
+
565
+ Key optimizations:
566
+ - Regex patterns compiled once and cached (not per-tokenization)
567
+ - String allocations minimized through index-based operations
568
+ - Tokenizer instances reused across calls
569
+ - In-place post-processing for lowercase and punctuation removal
570
+
571
+ See the [Performance Guide](docs/PERFORMANCE.md) for detailed benchmarks and optimization techniques.
572
+
573
+ ## Integration
574
+
575
+ TokenKit is designed to work with other gems in the scientist-labs ecosystem:
576
+
577
+ - **PhraseKit**: Use TokenKit for consistent phrase extraction
578
+ - **SpellKit**: Tokenize before spell correction
579
+ - **red-candle**: Tokenize before NER/embeddings
580
+
581
+ ## Documentation
582
+
583
+ - [API Documentation](https://rubydoc.info/gems/tokenkit) - Full API reference
584
+ - [Architecture Guide](docs/ARCHITECTURE.md) - Internal design and structure
585
+ - [Performance Guide](docs/PERFORMANCE.md) - Benchmarks and optimization details
586
+
587
+ ### Generating Documentation Locally
588
+
589
+ ```bash
590
+ # Install documentation dependencies
591
+ bundle install
592
+
593
+ # Generate YARD documentation
594
+ bundle exec yard doc
595
+
596
+ # Open documentation in browser
597
+ open doc/index.html
598
+ ```
599
+
600
+ ## Development
601
+
602
+ ```bash
603
+ # Setup
604
+ bundle install
605
+ bundle exec rake compile
606
+
607
+ # Run tests
608
+ bundle exec rspec
609
+
610
+ # Run tests with coverage
611
+ COVERAGE=true bundle exec rspec
612
+
613
+ # Run linter
614
+ bundle exec standardrb
615
+
616
+ # Run benchmarks
617
+ ruby benchmarks/tokenizer_benchmark.rb
618
+
619
+ # Build gem
620
+ gem build tokenkit.gemspec
621
+ ```
622
+
623
+ ## Requirements
624
+
625
+ - Ruby >= 3.1.0
626
+ - Rust toolchain (for building from source)
627
+
628
+ ## License
629
+
630
+ MIT License. See [LICENSE.txt](LICENSE.txt) for details.
631
+
632
+ ## Contributing
633
+
634
+ Bug reports and pull requests are welcome on GitHub at https://github.com/scientist-labs/tokenkit.
635
+
636
+ This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the [code of conduct](CODE_OF_CONDUCT.md).
637
+
638
+ ## Credits
639
+
640
+ Built with:
641
+ - [Magnus](https://github.com/matsadler/magnus) for Ruby-Rust bindings
642
+ - [unicode-segmentation](https://github.com/unicode-rs/unicode-segmentation) for Unicode word boundaries
643
+ - [linkify](https://github.com/robinst/linkify) for robust URL and email detection
644
+ - [regex](https://github.com/rust-lang/regex) for pattern matching
data/Rakefile ADDED
@@ -0,0 +1,18 @@
1
+ # frozen_string_literal: true
2
+
3
+ require "bundler/gem_tasks"
4
+ require "rspec/core/rake_task"
5
+
6
+ RSpec::Core::RakeTask.new(:spec)
7
+
8
+ require "standard/rake"
9
+ require "rake/extensiontask"
10
+
11
+ GEMSPEC = Gem::Specification.load("tokenkit.gemspec")
12
+
13
+ Rake::ExtensionTask.new("tokenkit", GEMSPEC) do |ext|
14
+ ext.lib_dir = "lib/tokenkit"
15
+ end
16
+
17
+ task spec: :compile
18
+ task default: %i[clobber compile spec standard]