ahocorasick-rust 1.0.2 → 2.0.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 396b47e4a2166e5388e806b60370f1bbc403cd86696a612a82330a320a22d396
4
- data.tar.gz: 43a7063bf08b8d320f3a431861263e923228084e98b1efd038ae324dd07cdce2
3
+ metadata.gz: 040bfa66b5fdcf34e1ac7c7abc0ea45fa52561e83964750c59e54213bb5dbe10
4
+ data.tar.gz: 3e33df04f2ae0972f696085efe98984131eecf17cece98b9ae33c9a497f8b690
5
5
  SHA512:
6
- metadata.gz: 00b5689f3bd8132306f111ef7ebf102571c8cdb1a19fa072e196be6b9a24a60de105ba594a4fefe83efe1a866ca7c76efc2190fb6ad4805b15f3ae8523a30ed7
7
- data.tar.gz: a3361bb6ee20744a1d649c4cb07207e8335c1f469cbfea556afdba5ac3bb292e4213f24b9c9bc8faaa36755a84e4348df05bcc36cce864e012542312357e83fb
6
+ metadata.gz: 58bfcb0c7c05b0b477d1e85e9c9ed7d9ab683967771d2ec512a6f2f36b5c72c6c39beb4ba4f3c0ab3e1c66e3492f53fa0ccdff2b57b0dfe62a75114105810583
7
+ data.tar.gz: 77839a72ffe30a4f80aa6c1626d7770d6a215954f3365c9967b739f968a4b52d88de7d350e672cd4f5feea7c7a293fbea636a7194aaeecdbe085c94410a4af18
data/README.md CHANGED
@@ -19,10 +19,12 @@ Aho-Corasick is a powerful string searching algorithm that can find **multiple p
19
19
 
20
20
  **Why this gem rocks:**
21
21
  - 🦀 Powered by Rust for maximum speed
22
- - 💎 Easy Ruby interface
22
+ - 💎 Clean, intuitive Ruby API with 7+ search methods
23
23
  - 🚀 Up to **67x faster** than pure Ruby implementations
24
24
  - ✨ Precompiled binaries for major platforms
25
- - 🌈 Works with Ruby 2.7+
25
+ - 🎯 Multiple search modes: overlapping, positioned, existence checks
26
+ - 🔄 Find & replace with hash or block-based logic
27
+ - 🌈 Works with Ruby 2.7+ and UTF-8/emoji
26
28
 
27
29
  ## Installation 📦
28
30
 
@@ -44,24 +46,135 @@ Or install it yourself:
44
46
  gem install ahocorasick-rust
45
47
  ```
46
48
 
47
- ## Usage 🎀
49
+ ## Features
48
50
 
49
- It's super simple!
51
+ - **Multiple search modes** - Find all matches, overlapping matches, or just check existence
52
+ - **Position tracking** - Get byte offsets for every match
53
+ - **Case-insensitive matching** - Optional ASCII case-insensitive search
54
+ - **Match strategies** - Control priority when patterns overlap
55
+ - **Find & replace** - Replace patterns with strings or dynamic logic via blocks
56
+ - **Unicode support** - Works seamlessly with UTF-8 text and emoji
57
+ - **Zero-copy where possible** - Efficient memory usage
58
+
59
+ ## Quick Start 🎀
60
+
61
+ ### Basic Pattern Matching
50
62
 
51
63
  ```ruby
52
64
  require 'ahocorasick-rust'
53
65
 
54
- # Create a new matcher with your patterns
55
- animals = ['cat', 'dog', 'bunny', 'fox']
56
- matcher = AhoCorasickRust.new(animals)
66
+ # Create a matcher with your patterns
67
+ matcher = AhoCorasickRust.new(['cat', 'dog', 'fox'])
57
68
 
58
- # Search for all patterns in your text - finds them all in one pass! ✨
59
- text = "The quick brown fox jumps over the lazy dog."
60
- matcher.lookup(text)
69
+ # Find all matches
70
+ matcher.lookup("The quick brown fox jumps over the lazy dog.")
61
71
  # => ["fox", "dog"]
72
+
73
+ # Check if any pattern exists
74
+ matcher.match?("I have a cat")
75
+ # => true
62
76
  ```
63
77
 
64
- **Want more examples?** Check out our [example script](scripts/example.rb) with content filtering, language detection, and more! 🌈
78
+ ### Case-Insensitive Matching
79
+
80
+ ```ruby
81
+ matcher = AhoCorasickRust.new(['Ruby', 'Python'], case_insensitive: true)
82
+
83
+ matcher.lookup('I love RUBY and python!')
84
+ # => ["Ruby", "Python"]
85
+ ```
86
+
87
+ ### Get Match Positions
88
+
89
+ ```ruby
90
+ matcher = AhoCorasickRust.new(['fox', 'dog'])
91
+
92
+ matcher.lookup_with_positions('The fox and dog')
93
+ # => [
94
+ # { pattern: 'fox', start: 4, end: 7 },
95
+ # { pattern: 'dog', start: 12, end: 15 }
96
+ # ]
97
+ ```
98
+
99
+ ### Find & Replace
100
+
101
+ ```ruby
102
+ matcher = AhoCorasickRust.new(['bad', 'worse', 'worst'])
103
+
104
+ # Replace with hash
105
+ matcher.replace_all('This is bad and worse', { 'bad' => 'good', 'worse' => 'better' })
106
+ # => "This is good and better"
107
+
108
+ # Replace with block
109
+ matcher.replace_all('This is bad and worse') { |word| '*' * word.length }
110
+ # => "This is *** and *****"
111
+ ```
112
+
113
+ ### Overlapping Matches
114
+
115
+ ```ruby
116
+ matcher = AhoCorasickRust.new(['abc', 'bcd', 'cde'])
117
+
118
+ # Regular lookup finds non-overlapping matches
119
+ matcher.lookup('abcde')
120
+ # => ["abc"]
121
+
122
+ # Overlapping lookup finds all matches
123
+ matcher.lookup_overlapping('abcde')
124
+ # => ["abc", "bcd", "cde"]
125
+ ```
126
+
127
+ ### Advanced: Match Strategies
128
+
129
+ ```ruby
130
+ # Prefer longest matches
131
+ matcher = AhoCorasickRust.new(
132
+ ['test', 'testing'],
133
+ match_kind: :leftmost_longest
134
+ )
135
+
136
+ matcher.lookup('testing')
137
+ # => ["testing"] # chooses longer match over 'test'
138
+ ```
139
+
140
+ ### Find First (Efficient for Existence Checks)
141
+
142
+ ```ruby
143
+ matcher = AhoCorasickRust.new(['foo', 'bar', 'baz'])
144
+
145
+ # Get just the first match (faster than getting all matches)
146
+ matcher.find_first('hello foo bar baz')
147
+ # => "foo"
148
+
149
+ # Or with position
150
+ matcher.find_first_with_position('hello foo bar')
151
+ # => { pattern: 'foo', start: 6, end: 9 }
152
+ ```
153
+
154
+ ## API Overview 🔍
155
+
156
+ **Constructor:**
157
+ - `AhoCorasickRust.new(patterns, case_insensitive: false, match_kind: :leftmost_first)`
158
+
159
+ **Search Methods:**
160
+ - `#lookup(text)` - Find all non-overlapping matches
161
+ - `#lookup_overlapping(text)` - Find all matches including overlaps
162
+ - `#lookup_with_positions(text)` - Find matches with byte positions
163
+ - `#match?(text)` - Check if any pattern exists (returns boolean)
164
+ - `#find_first(text)` - Get first match only
165
+ - `#find_first_with_position(text)` - Get first match with position
166
+
167
+ **Replace Methods:**
168
+ - `#replace_all(text, hash)` - Replace with hash mapping
169
+ - `#replace_all(text) { |match| ... }` - Replace with block
170
+
171
+ ## Documentation 📖
172
+
173
+ - **[API Reference](docs/reference.md)** - Complete method documentation with examples
174
+ - **[Match Kind Guide](docs/match_kind.md)** - Understanding match strategies
175
+ - **[Example Script](scripts/example.rb)** - Real-world usage examples
176
+
177
+ **Want more examples?** Check out our example script with content filtering, language detection, and more! 🌈
65
178
 
66
179
  ## Benchmark 📊
67
180
 
@@ -0,0 +1,139 @@
1
+ # Match Kind Strategies
2
+
3
+ The `match_kind` option controls how the Aho-Corasick automaton behaves when multiple patterns could match at the same position in the text. This is important when you have overlapping patterns like `'abc'` and `'abcd'`.
4
+
5
+ ## Available Options
6
+
7
+ ### `:leftmost_first` (default)
8
+
9
+ **Priority:** First pattern in the list wins
10
+
11
+ When multiple patterns could match at the same position, the pattern that appears first in your pattern list takes precedence.
12
+
13
+ ```ruby
14
+ matcher = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_first)
15
+ matcher.lookup('abcd')
16
+ # => ['abc'] # 'abc' appears first in the pattern list, so it wins
17
+ ```
18
+
19
+ **Use cases:**
20
+ - Keyword replacement where you want specific patterns to take precedence
21
+ - Content filtering with priority rules
22
+ - When you want explicit control over match priority by ordering your patterns
23
+
24
+ ---
25
+
26
+ ### `:leftmost_longest`
27
+
28
+ **Priority:** Longest match wins
29
+
30
+ When multiple patterns could match at the same position, the longest matching pattern is preferred.
31
+
32
+ ```ruby
33
+ matcher = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_longest)
34
+ matcher.lookup('abcd')
35
+ # => ['abcd'] # 'abcd' is longer, so it wins
36
+ ```
37
+
38
+ **Use cases:**
39
+ - Tokenization where longer tokens are more meaningful
40
+ - Entity recognition where you want the most specific match
41
+ - Parsing structured text where longer patterns indicate more context
42
+
43
+ ---
44
+
45
+ ### `:standard`
46
+
47
+ **Priority:** Standard Aho-Corasick algorithm behavior
48
+
49
+ Reports matches as they're encountered in the automaton's state machine. This is the classical Aho-Corasick behavior.
50
+
51
+ ```ruby
52
+ matcher = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :standard)
53
+ matcher.lookup('abcd')
54
+ # => ['abc'] # Reports matches as the automaton finds them
55
+ ```
56
+
57
+ **Use cases:**
58
+ - When you need strict Aho-Corasick semantics
59
+ - Potentially slightly faster for certain pattern sets
60
+ - Academic or research purposes requiring standard algorithm behavior
61
+
62
+ ## Comparison Example
63
+
64
+ Given patterns `['ab', 'abc', 'abcd']` and text `'abcd'`:
65
+
66
+ | match_kind | Result | Reason |
67
+ |------------|--------|--------|
68
+ | `:leftmost_first` | `['ab']` | First pattern in list |
69
+ | `:leftmost_longest` | `['abcd']` | Longest matching pattern |
70
+ | `:standard` | `['ab']` | First match encountered by automaton |
71
+
72
+ ## Interaction with Other Features
73
+
74
+ ### With `#lookup_overlapping`
75
+
76
+ The `match_kind` option does **not** affect `#lookup_overlapping`, which always returns all overlapping matches regardless of the strategy:
77
+
78
+ ```ruby
79
+ matcher = AhoCorasickRust.new(['abc', 'bcd'], match_kind: :leftmost_first)
80
+
81
+ # match_kind affects non-overlapping lookup
82
+ matcher.lookup('abcd')
83
+ # => ['abc'] (only one match, respects match_kind)
84
+
85
+ # overlapping lookup ignores match_kind
86
+ matcher.lookup_overlapping('abcd')
87
+ # => ['abc', 'bcd'] (all matches, ignores match_kind)
88
+ ```
89
+
90
+ ### With `case_insensitive`
91
+
92
+ The `match_kind` and `case_insensitive` options work together seamlessly:
93
+
94
+ ```ruby
95
+ matcher = AhoCorasickRust.new(
96
+ ['abc', 'abcd'],
97
+ match_kind: :leftmost_longest,
98
+ case_insensitive: true
99
+ )
100
+ matcher.lookup('ABCD')
101
+ # => ['abcd'] # Finds longest match, case-insensitively
102
+ ```
103
+
104
+ ### With `#find_first`
105
+
106
+ The `match_kind` affects which match is returned by `#find_first`:
107
+
108
+ ```ruby
109
+ # With leftmost_first
110
+ matcher1 = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_first)
111
+ matcher1.find_first('abcd') # => 'abc'
112
+
113
+ # With leftmost_longest
114
+ matcher2 = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_longest)
115
+ matcher2.find_first('abcd') # => 'abcd'
116
+ ```
117
+
118
+ ## Choosing the Right Strategy
119
+
120
+ **Use `:leftmost_first` when:**
121
+ - You want explicit control over priority
122
+ - Pattern order is meaningful in your domain
123
+ - You're doing rule-based text processing
124
+
125
+ **Use `:leftmost_longest` when:**
126
+ - Longer matches are more specific/important
127
+ - You're tokenizing or parsing
128
+ - You want the most complete match
129
+
130
+ **Use `:standard` when:**
131
+ - You need classical Aho-Corasick semantics
132
+ - You're comparing against other implementations
133
+ - Performance is critical and you understand the tradeoffs
134
+
135
+ ## Performance Notes
136
+
137
+ All three strategies use the same underlying automaton construction, so performance differences are minimal. The main difference is in the matching logic when choosing between multiple possible matches at the same position.
138
+
139
+ In practice, `:leftmost_first` (the default) provides the best balance of performance, predictability, and control for most use cases.
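The `:leftmost_first` and `:leftmost_longest` rules described in the match kind guide above can be modelled without the gem. The sketch below is a naive pure-Ruby scan, not the gem's automaton: `naive_lookup` and its arguments are hypothetical names introduced only for illustration, and `:standard` is deliberately left out because its behaviour depends on how the automaton reports matches.

```ruby
# Naive reference model of the two leftmost strategies (illustration only;
# the real gem delegates to a Rust automaton and is far faster).
def naive_lookup(patterns, text, match_kind)
  matches = []
  i = 0
  while i < text.length
    # All patterns that match starting exactly at position i, in list order.
    candidates = patterns.select { |p| !p.empty? && text[i, p.length] == p }
    if candidates.empty?
      i += 1
      next
    end

    chosen =
      case match_kind
      when :leftmost_first then candidates.first            # list order wins
      when :leftmost_longest then candidates.max_by(&:length) # length wins
      else raise ArgumentError, "unsupported match_kind: #{match_kind.inspect}"
      end

    matches << chosen
    i += chosen.length # resume after the match, so results never overlap
  end
  matches
end

patterns = %w[ab abc abcd]
p naive_lookup(patterns, 'abcd', :leftmost_first)   # ["ab"]
p naive_lookup(patterns, 'abcd', :leftmost_longest) # ["abcd"]
```

Both calls scan the same text and find the same candidates at position 0; only the tie-break differs, which is the point the comparison table makes.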
data/docs/reference.md ADDED
@@ -0,0 +1,396 @@
1
+ # API Reference
2
+
3
+ Complete reference for all methods and options in the `ahocorasick-rust` gem.
4
+
5
+ ## Table of Contents
6
+
7
+ - [Constructor](#constructor)
8
+ - [Search Methods](#search-methods)
9
+ - [Replace Methods](#replace-methods)
10
+
11
+ ---
12
+
13
+ ## Constructor
14
+
15
+ ### `AhoCorasickRust.new(patterns, **options)`
16
+
17
+ Creates a new Aho-Corasick matcher from an array of pattern strings.
18
+
19
+ **Parameters:**
20
+ - `patterns` (Array<String>) - Array of pattern strings to search for
21
+ - `options` (Hash) - Optional configuration
22
+
23
+ **Options:**
24
+ - `case_insensitive` (Boolean) - Enable ASCII case-insensitive matching (default: `false`)
25
+ - `match_kind` (Symbol) - Strategy for handling overlapping patterns (default: `:leftmost_first`)
26
+ - `:leftmost_first` - First pattern in list wins
27
+ - `:leftmost_longest` - Longest pattern wins
28
+ - `:standard` - Standard Aho-Corasick behavior
29
+
30
+ **Examples:**
31
+
32
+ ```ruby
33
+ # Basic usage
34
+ matcher = AhoCorasickRust.new(['foo', 'bar', 'baz'])
35
+
36
+ # Case-insensitive matching
37
+ matcher = AhoCorasickRust.new(['Ruby', 'Python'], case_insensitive: true)
38
+
39
+ # Control match priority
40
+ matcher = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_longest)
41
+
42
+ # Combine options
43
+ matcher = AhoCorasickRust.new(
44
+ ['test', 'testing'],
45
+ case_insensitive: true,
46
+ match_kind: :leftmost_longest
47
+ )
48
+ ```
49
+
50
+ **Raises:**
51
+ - `TypeError` - If patterns is not an array or contains non-strings
52
+ - `ArgumentError` - If match_kind is invalid
53
+ - `RuntimeError` - If automaton construction fails
54
+
55
+ ---
56
+
57
+ ## Search Methods
58
+
59
+ ### `#lookup(haystack)`
60
+
61
+ Finds all non-overlapping pattern matches in the haystack.
62
+
63
+ **Parameters:**
64
+ - `haystack` (String) - The text to search
65
+
66
+ **Returns:** Array<String> - Array of matched patterns
67
+
68
+ **Examples:**
69
+
70
+ ```ruby
71
+ matcher = AhoCorasickRust.new(['foo', 'bar'])
72
+
73
+ matcher.lookup('foo and bar')
74
+ # => ['foo', 'bar']
75
+
76
+ matcher.lookup('hello world')
77
+ # => []
78
+
79
+ # Non-overlapping: finds 'abc' and stops
80
+ matcher = AhoCorasickRust.new(['abc', 'bcd'])
81
+ matcher.lookup('abcd')
82
+ # => ['abc']
83
+ ```
84
+
85
+ ---
86
+
87
+ ### `#lookup_overlapping(haystack)`
88
+
89
+ Finds all pattern matches, including overlapping ones.
90
+
91
+ **Parameters:**
92
+ - `haystack` (String) - The text to search
93
+
94
+ **Returns:** Array<String> - Array of all matched patterns, including overlaps
95
+
96
+ **Examples:**
97
+
98
+ ```ruby
99
+ matcher = AhoCorasickRust.new(['abc', 'bcd', 'cde'])
100
+
101
+ matcher.lookup_overlapping('abcde')
102
+ # => ['abc', 'bcd', 'cde']
103
+
104
+ # Compare with non-overlapping
105
+ matcher.lookup('abcde')
106
+ # => ['abc']
107
+
108
+ # Finds multiple occurrences of same pattern at different positions
109
+ matcher = AhoCorasickRust.new(['a', 'ab'])
110
+ matcher.lookup_overlapping('aab')
111
+ # => ['a', 'a', 'ab']
112
+ ```
113
+
114
+ ---
115
+
116
+ ### `#lookup_with_positions(haystack)`
117
+
118
+ Finds all non-overlapping matches with their byte positions.
119
+
120
+ **Parameters:**
121
+ - `haystack` (String) - The text to search
122
+
123
+ **Returns:** Array<Hash> - Array of hashes with `:pattern`, `:start`, `:end` keys
124
+
125
+ **Examples:**
126
+
127
+ ```ruby
128
+ matcher = AhoCorasickRust.new(['fox', 'dog'])
129
+
130
+ matcher.lookup_with_positions('The quick brown fox jumps over the lazy dog.')
131
+ # => [
132
+ # { pattern: 'fox', start: 16, end: 19 },
133
+ # { pattern: 'dog', start: 40, end: 43 }
134
+ # ]
135
+
136
+ matcher.lookup_with_positions('hello world')
137
+ # => []
138
+
139
+ # Positions are byte offsets, not character offsets
140
+ matcher = AhoCorasickRust.new(['数据'])
141
+ matcher.lookup_with_positions('金数据工具')
142
+ # => [{ pattern: '数据', start: 3, end: 9 }] # byte positions
143
+ ```
144
+
145
+ ---
146
+
147
+ ### `#match?(haystack)`
148
+
149
+ Checks if any pattern matches in the haystack (predicate method).
150
+
151
+ **Parameters:**
152
+ - `haystack` (String) - The text to search
153
+
154
+ **Returns:** Boolean - `true` if any pattern matches, `false` otherwise
155
+
156
+ **Examples:**
157
+
158
+ ```ruby
159
+ matcher = AhoCorasickRust.new(['foo', 'bar'])
160
+
161
+ matcher.match?('hello foo world')
162
+ # => true
163
+
164
+ matcher.match?('hello world')
165
+ # => false
166
+
167
+ # Works with case-insensitive
168
+ matcher = AhoCorasickRust.new(['Ruby'], case_insensitive: true)
169
+ matcher.match?('I love ruby')
170
+ # => true
171
+ ```
172
+
173
+ ---
174
+
175
+ ### `#find_first(haystack)`
176
+
177
+ Returns the first pattern match found, or `nil` if no match.
178
+
179
+ **Parameters:**
180
+ - `haystack` (String) - The text to search
181
+
182
+ **Returns:** String or nil - First matched pattern, or `nil` if no match
183
+
184
+ **Examples:**
185
+
186
+ ```ruby
187
+ matcher = AhoCorasickRust.new(['foo', 'bar', 'baz'])
188
+
189
+ matcher.find_first('hello foo bar baz')
190
+ # => 'foo'
191
+
192
+ matcher.find_first('hello world')
193
+ # => nil
194
+
195
+ # Stops after first match (more efficient than #lookup)
196
+ matcher = AhoCorasickRust.new(['cat', 'dog', 'bird'])
197
+ matcher.find_first('The cat and dog are friends')
198
+ # => 'cat' (stops, doesn't find 'dog')
199
+ ```
200
+
201
+ ---
202
+
203
+ ### `#find_first_with_position(haystack)`
204
+
205
+ Returns the first pattern match with its position, or `nil` if no match.
206
+
207
+ **Parameters:**
208
+ - `haystack` (String) - The text to search
209
+
210
+ **Returns:** Hash or nil - Hash with `:pattern`, `:start`, `:end` keys, or `nil` if no match
211
+
212
+ **Examples:**
213
+
214
+ ```ruby
215
+ matcher = AhoCorasickRust.new(['foo', 'bar'])
216
+
217
+ matcher.find_first_with_position('hello foo world')
218
+ # => { pattern: 'foo', start: 6, end: 9 }
219
+
220
+ matcher.find_first_with_position('hello world')
221
+ # => nil
222
+
223
+ # Finds earliest match in text, not first in pattern list
224
+ matcher = AhoCorasickRust.new(['bar', 'foo'])
225
+ matcher.find_first_with_position('foo bar baz')
226
+ # => { pattern: 'foo', start: 0, end: 3 } # 'foo' appears first in text
227
+ ```
228
+
229
+ ---
230
+
231
+ ## Replace Methods
232
+
233
+ ### `#replace_all(haystack, replacements)`
234
+
235
+ Replaces all pattern matches with their corresponding replacements.
236
+
237
+ **Parameters:**
238
+ - `haystack` (String) - The text to search
239
+ - `replacements` (Hash or Block) - Replacement mapping or block
240
+
241
+ **Returns:** String - New string with replacements applied
242
+
243
+ **Hash-based replacement:**
244
+
245
+ Maps pattern strings to their replacements. Patterns not in the hash remain unchanged.
246
+
247
+ ```ruby
248
+ matcher = AhoCorasickRust.new(['foo', 'bar', 'baz'])
249
+
250
+ matcher.replace_all('foo and bar', { 'foo' => 'FOO', 'bar' => 'BAR' })
251
+ # => 'FOO and BAR'
252
+
253
+ # Partial replacement - 'baz' not in hash, stays unchanged
254
+ matcher.replace_all('foo bar baz', { 'foo' => 'hello' })
255
+ # => 'hello bar baz'
256
+
257
+ # Empty hash - no replacements
258
+ matcher.replace_all('foo and bar', {})
259
+ # => 'foo and bar'
260
+ ```
261
+
262
+ **Block-based replacement:**
263
+
264
+ Passes each matched pattern to the block, uses return value as replacement.
265
+
266
+ ```ruby
267
+ matcher = AhoCorasickRust.new(['foo', 'bar'])
268
+
269
+ matcher.replace_all('foo and bar') { |match| match.upcase }
270
+ # => 'FOO and BAR'
271
+
272
+ # Dynamic replacement logic
273
+ matcher = AhoCorasickRust.new(['apple', 'banana', 'cherry'])
274
+ text = 'I like apple, banana, and cherry'
275
+ result = matcher.replace_all(text) do |fruit|
276
+ { 'apple' => '🍎', 'banana' => '🍌', 'cherry' => '🍒' }[fruit]
277
+ end
278
+ # => 'I like 🍎, 🍌, and 🍒'
279
+
280
+ # Replacement length can differ from match
281
+ matcher = AhoCorasickRust.new(['a', 'bb'])
282
+ matcher.replace_all('a bb a bb') { |m| m.length.to_s }
283
+ # => '1 2 1 2'
284
+ ```
285
+
286
+ **Raises:**
287
+ - `ArgumentError` - If replacements is neither a Hash nor a block given
288
+
289
+ ---
290
+
291
+ ## Usage Patterns
292
+
293
+ ### Content Filtering
294
+
295
+ ```ruby
296
+ # Filter profanity with asterisks
297
+ bad_words = ['bad', 'worse', 'worst']
298
+ filter = AhoCorasickRust.new(bad_words, case_insensitive: true)
299
+
300
+ filter.replace_all('This is bad and worse') { |word| '*' * word.length }
301
+ # => 'This is *** and *****'
302
+ ```
303
+
304
+ ### Keyword Highlighting
305
+
306
+ ```ruby
307
+ keywords = ['Ruby', 'Python', 'JavaScript']
308
+ matcher = AhoCorasickRust.new(keywords)
309
+
310
+ positions = matcher.lookup_with_positions('I love Ruby and Python')
311
+ # Use positions to add HTML tags, syntax highlighting, etc.
312
+ ```
313
+
314
+ ### Quick Existence Check
315
+
316
+ ```ruby
317
+ # Check if any banned word appears
318
+ banned = ['spam', 'scam', 'fraud']
319
+ checker = AhoCorasickRust.new(banned, case_insensitive: true)
320
+
321
+ if checker.match?(user_input)
322
+ reject_message
323
+ end
324
+ ```
325
+
326
+ ### DNA Sequence Analysis
327
+
328
+ ```ruby
329
+ # Find all overlapping genetic markers
330
+ markers = ['ATCG', 'TCGA', 'CGAT']
331
+ analyzer = AhoCorasickRust.new(markers)
332
+
333
+ sequence = 'ATCGAT'
334
+ analyzer.lookup_overlapping(sequence)
335
+ # => ['ATCG', 'TCGA', 'CGAT'] # all overlapping matches
336
+ ```
337
+
338
+ ### Tokenization
339
+
340
+ ```ruby
341
+ # Prefer longest matches for tokens
342
+ keywords = ['if', 'iffy', 'then', 'end', 'endif']
343
+ tokenizer = AhoCorasickRust.new(keywords, match_kind: :leftmost_longest)
344
+
345
+ tokenizer.lookup('iffy then endif')
346
+ # => ['iffy', 'then', 'endif'] # chooses longer 'iffy' over 'if'
347
+ ```
348
+
349
+ ---
350
+
351
+ ## Type Compatibility
352
+
353
+ ### Accepted Types
354
+
355
+ - **Patterns:** Array of String objects
356
+ - **Haystack:** String objects
357
+ - **Options:** Symbol keys (`:case_insensitive`, `:match_kind`)
358
+ - **Replacements:** Hash with String keys/values, or Block returning String
359
+
360
+ ### Unicode Support
361
+
362
+ All methods support UTF-8 encoded strings:
363
+
364
+ ```ruby
365
+ matcher = AhoCorasickRust.new(['こんにちは', '世界'])
366
+ matcher.lookup('こんにちは世界')
367
+ # => ['こんにちは', '世界']
368
+
369
+ # Emoji support
370
+ matcher = AhoCorasickRust.new(['😊', '🎉'])
371
+ matcher.lookup('I am 😊 today 🎉')
372
+ # => ['😊', '🎉']
373
+ ```
374
+
375
+ **Note:** Position values in `#lookup_with_positions` and `#find_first_with_position` are byte offsets, not character offsets. For multi-byte UTF-8 characters, byte positions will differ from character positions.
376
+
377
+ ---
378
+
379
+ ## Error Handling
380
+
381
+ ```ruby
382
+ # TypeError: patterns must be array of strings
383
+ AhoCorasickRust.new('not an array')
384
+ # TypeError: wrong argument type String (expected Array)
385
+
386
+ AhoCorasickRust.new(['foo', 123])
387
+ # TypeError: wrong argument type Integer (expected String)
388
+
389
+ # ArgumentError: invalid match_kind
390
+ AhoCorasickRust.new(['foo'], match_kind: :invalid)
391
+ # ArgumentError: Invalid match_kind: 'invalid'...
392
+
393
+ # TypeError: haystack must be string
394
+ matcher.lookup(123)
395
+ # TypeError: wrong argument type Integer (expected String)
396
+ ```
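The reference above notes that `:start` and `:end` are byte offsets, which differ from character offsets for multi-byte UTF-8 text. A minimal sketch of converting such a span to character offsets with Ruby's standard `String#byteslice` follows. `byte_span_to_char_span` is a hypothetical helper, not part of the gem, and the span is written out by hand in the shape documented for `#lookup_with_positions`.

```ruby
# Convert a byte span into a character span.
# Hypothetical helper: assumes `text` is valid UTF-8 and that both offsets
# fall on character boundaries (true for spans of whole-pattern matches).
def byte_span_to_char_span(text, byte_start, byte_end)
  char_start = text.byteslice(0, byte_start).length
  char_end = text.byteslice(0, byte_end).length
  [char_start, char_end]
end

text = '金数据工具'
span = { pattern: '数据', start: 3, end: 9 } # values from the example above

char_start, char_end = byte_span_to_char_span(text, span[:start], span[:end])
puts text[char_start...char_end]                           # character indexing
puts text.byteslice(span[:start], span[:end] - span[:start]) # byte indexing
```

Both `puts` lines select the same substring; the second works directly on the byte offsets and needs no conversion when only the matched text is required.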
@@ -1,5 +1,5 @@
1
- use aho_corasick::AhoCorasick;
2
- use magnus::{method, function, prelude::*, Error, Ruby};
1
+ use aho_corasick::{AhoCorasick, AhoCorasickBuilder, MatchKind};
2
+ use magnus::{method, function, prelude::*, Error, Ruby, RHash, RArray, Value, Symbol};
3
3
 
4
4
  #[magnus::wrap(class = "AhoCorasickRust")]
5
5
  pub struct AhoCorasickRust {
@@ -8,12 +8,56 @@ pub struct AhoCorasickRust {
8
8
  }
9
9
 
10
10
  impl AhoCorasickRust {
11
- fn new(ruby: &Ruby, words: Vec<String>) -> Result<Self, Error> {
12
- let ac = AhoCorasick::new(&words)
11
+ fn new_impl(ruby: &Ruby, words: Vec<String>, kwargs: Option<RHash>) -> Result<Self, Error> {
12
+ let mut builder = AhoCorasickBuilder::new();
13
+
14
+ // Check for options if kwargs provided
15
+ if let Some(kwargs) = kwargs {
16
+ // case_insensitive option
17
+ if let Some(val) = kwargs.get(ruby.to_symbol("case_insensitive")) {
18
+ if let Ok(case_insensitive) = bool::try_convert(val) {
19
+ if case_insensitive {
20
+ builder.ascii_case_insensitive(true);
21
+ }
22
+ }
23
+ }
24
+
25
+ // match_kind option
26
+ if let Some(val) = kwargs.get(ruby.to_symbol("match_kind")) {
27
+ if let Some(sym) = Symbol::from_value(val) {
28
+ let kind_str = sym.name()?.to_string();
29
+ let match_kind = match kind_str.as_ref() {
30
+ "standard" => MatchKind::Standard,
31
+ "leftmost_first" => MatchKind::LeftmostFirst,
32
+ "leftmost_longest" => MatchKind::LeftmostLongest,
33
+ _ => {
34
+ return Err(Error::new(
35
+ ruby.exception_arg_error(),
36
+ format!("Invalid match_kind: '{}'. Valid values are :standard, :leftmost_first, :leftmost_longest", kind_str)
37
+ ));
38
+ }
39
+ };
40
+ builder.match_kind(match_kind);
41
+ }
42
+ }
43
+ }
44
+
45
+ let ac = builder.build(&words)
13
46
  .map_err(|e| Error::new(ruby.exception_runtime_error(), format!("Failed to build automaton: {}", e)))?;
14
47
  Ok(Self { words, ac })
15
48
  }
16
49
 
50
+ fn new(ruby: &Ruby, args: &[Value]) -> Result<Self, Error> {
51
+ let args = magnus::scan_args::scan_args::<(Vec<String>,), (), (), (), RHash, ()>(args)?;
52
+ let (words,) = args.required;
53
+ let kwargs = args.keywords;
54
+
55
+ // Only pass kwargs if non-empty
56
+ let kwargs_opt = if kwargs.len() > 0 { Some(kwargs) } else { None };
57
+
58
+ Self::new_impl(ruby, words, kwargs_opt)
59
+ }
60
+
17
61
  fn lookup(&self, haystack: String) -> Vec<String> {
18
62
  let mut matches = vec![];
19
63
  for mat in self.ac.find_iter(&haystack) {
@@ -21,12 +65,127 @@ impl AhoCorasickRust {
21
65
  }
22
66
  matches
23
67
  }
68
+
69
+ fn is_match(&self, haystack: String) -> bool {
70
+ self.ac.is_match(&haystack)
71
+ }
72
+
73
+ fn lookup_overlapping(&self, haystack: String) -> Vec<String> {
74
+ let mut matches = vec![];
75
+ for mat in self.ac.find_overlapping_iter(&haystack) {
76
+ matches.push(self.words[mat.pattern()].clone());
77
+ }
78
+ matches
79
+ }
80
+
81
+ fn find_first(&self, haystack: String) -> Option<String> {
82
+ self.ac.find(&haystack).map(|mat| self.words[mat.pattern()].clone())
83
+ }
84
+
85
+ fn find_first_with_position(&self, haystack: String) -> Result<Option<RHash>, Error> {
86
+ let ruby = Ruby::get().unwrap();
87
+ if let Some(mat) = self.ac.find(&haystack) {
88
+ let hash = ruby.hash_new();
89
+ hash.aset(ruby.to_symbol("pattern"), self.words[mat.pattern()].clone())?;
90
+ hash.aset(ruby.to_symbol("start"), mat.start())?;
91
+ hash.aset(ruby.to_symbol("end"), mat.end())?;
92
+ Ok(Some(hash))
93
+ } else {
94
+ Ok(None)
95
+ }
96
+ }
97
+
98
+ fn lookup_with_positions(&self, haystack: String) -> Result<RArray, Error> {
99
+ let ruby = Ruby::get().unwrap();
100
+ let matches = ruby.ary_new();
101
+ for mat in self.ac.find_iter(&haystack) {
102
+ let hash = ruby.hash_new();
103
+ hash.aset(ruby.to_symbol("pattern"), self.words[mat.pattern()].clone())?;
104
+ hash.aset(ruby.to_symbol("start"), mat.start())?;
105
+ hash.aset(ruby.to_symbol("end"), mat.end())?;
106
+ matches.push(hash)?;
107
+ }
108
+ Ok(matches)
109
+ }
110
+
111
+ fn replace_all(&self, args: &[Value]) -> Result<String, Error> {
112
+ let ruby = Ruby::get().unwrap();
113
+
114
+ // Parse arguments: haystack (required), replacements (optional if block given)
115
+ let haystack: String = if args.is_empty() {
116
+ return Err(Error::new(
117
+ ruby.exception_arg_error(),
118
+ "wrong number of arguments (given 0, expected 1..2)"
119
+ ));
120
+ } else {
121
+ String::try_convert(args[0])?
122
+ };
123
+
124
+ // Check if a block was given
125
+ match ruby.block_proc() {
126
+ Ok(proc) => {
127
+ // Block-based replacement
128
+ let mut result = haystack.clone();
129
+ let mut offset: isize = 0;
130
+
131
+ for mat in self.ac.find_iter(&haystack) {
132
+ let pattern = &self.words[mat.pattern()];
133
+ let replacement: String = proc.call((pattern.clone(),))?;
134
+
135
+ let start = (mat.start() as isize + offset) as usize;
136
+ let end = (mat.end() as isize + offset) as usize;
137
+
138
+ result.replace_range(start..end, &replacement);
139
+ offset += replacement.len() as isize - (mat.end() - mat.start()) as isize;
140
+ }
141
+
142
+ Ok(result)
143
+ }
144
+ Err(_) if args.len() >= 2 => {
145
+ // Hash-based replacement
146
+ if let Some(hash) = RHash::from_value(args[1]) {
147
+ let mut replace_with: Vec<String> = Vec::with_capacity(self.words.len());
148
+
149
+ for word in &self.words {
150
+ if let Some(val) = hash.get(word.clone()) {
151
+ if let Ok(replacement) = String::try_convert(val) {
152
+ replace_with.push(replacement);
153
+ } else {
154
+ replace_with.push(word.clone());
155
+ }
156
+ } else {
157
+ replace_with.push(word.clone());
158
+ }
159
+ }
160
+
161
+ Ok(self.ac.replace_all(&haystack, &replace_with))
162
+ } else {
163
+ Err(Error::new(
164
+ ruby.exception_arg_error(),
165
+ "replace_all requires a Hash or block"
166
+ ))
167
+ }
168
+ }
169
+ Err(_) => {
170
+ Err(Error::new(
171
+ ruby.exception_arg_error(),
172
+ "replace_all requires a Hash or block"
173
+ ))
174
+ }
175
+ }
176
+ }
24
177
  }
25
178
 
26
179
  #[magnus::init]
27
180
  fn main(ruby: &Ruby) -> Result<(), Error> {
28
181
  let class = ruby.define_class("AhoCorasickRust", ruby.class_object())?;
29
- class.define_singleton_method("new", function!(AhoCorasickRust::new, 1))?;
182
+ class.define_singleton_method("new", function!(AhoCorasickRust::new, -1))?;
30
183
  class.define_method("lookup", method!(AhoCorasickRust::lookup, 1))?;
184
+ class.define_method("match?", method!(AhoCorasickRust::is_match, 1))?;
185
+ class.define_method("lookup_overlapping", method!(AhoCorasickRust::lookup_overlapping, 1))?;
186
+ class.define_method("find_first", method!(AhoCorasickRust::find_first, 1))?;
187
+ class.define_method("find_first_with_position", method!(AhoCorasickRust::find_first_with_position, 1))?;
188
+ class.define_method("lookup_with_positions", method!(AhoCorasickRust::lookup_with_positions, 1))?;
189
+ class.define_method("replace_all", method!(AhoCorasickRust::replace_all, -1))?;
31
190
  Ok(())
32
191
  }
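The block branch of `replace_all` above edits a copy of the haystack in place and keeps a running `offset`, because each replacement can change the length of the text that follows it. The same result can be sketched in Ruby by rebuilding the string from left to right, which avoids the offset arithmetic. This is an equivalent formulation for illustration, not the gem's code: `splice_matches` and the hand-written `matches` array are hypothetical, with spans shaped like the `#lookup_with_positions` output.

```ruby
# Rebuild `text` with each match replaced by the block's return value.
# Hypothetical helper: `matches` must be sorted by :start, non-overlapping,
# and use byte offsets into the original `text`.
def splice_matches(text, matches)
  result = +''
  cursor = 0 # byte index of the first not-yet-copied byte in `text`
  matches.each do |m|
    result << text.byteslice(cursor, m[:start] - cursor) # untouched gap
    result << yield(m[:pattern])                         # replacement
    cursor = m[:end]
  end
  result << text.byteslice(cursor, text.bytesize - cursor) # trailing text
end

matches = [
  { pattern: 'a',  start: 0, end: 1 },
  { pattern: 'bb', start: 2, end: 4 },
  { pattern: 'a',  start: 5, end: 6 },
  { pattern: 'bb', start: 7, end: 9 }
]
puts splice_matches('a bb a bb', matches) { |m| m.length.to_s } # 1 2 1 2
```

Because every slice is taken from the original text, replacements of any length cannot shift later spans, which is the problem the `offset` variable solves in the in-place version.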
metadata CHANGED
@@ -1,14 +1,13 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: ahocorasick-rust
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.2
4
+ version: 2.0.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Eric
8
- autorequire:
9
8
  bindir: bin
10
9
  cert_chain: []
11
- date: 2025-11-16 00:00:00.000000000 Z
10
+ date: 1980-01-02 00:00:00.000000000 Z
12
11
  dependencies:
13
12
  - !ruby/object:Gem::Dependency
14
13
  name: rb_sys
@@ -26,8 +25,9 @@ dependencies:
26
25
  version: 0.9.117
27
26
  description: A Ruby gem wrapping the legendary Rust Aho-Corasick algorithm! Aho-Corasick
28
27
  is a powerful string searching algorithm that finds multiple patterns simultaneously
29
- in a text. Perfect for matching dictionaries, filtering words, and other multi-pattern
30
- search tasks at lightning speed! (ノ◕ヮ◕)ノ*:・゚✧
28
+ in a text. Features include overlapping matches, case-insensitive search, find &
29
+ replace, match positions, and configurable match strategies. Perfect for content
30
+ filtering, tokenization, and multi-pattern search at lightning speed! (ノ◕ヮ◕)ノ*:・゚✧
31
31
  email:
32
32
  - eric@ebj.dev
33
33
  executables: []
@@ -37,6 +37,8 @@ extra_rdoc_files: []
37
37
  files:
38
38
  - README.md
39
39
  - Rakefile
40
+ - docs/match_kind.md
41
+ - docs/reference.md
40
42
  - ext/rahocorasick/Cargo.lock
41
43
  - ext/rahocorasick/Cargo.toml
42
44
  - ext/rahocorasick/extconf.rb
@@ -46,7 +48,6 @@ homepage: https://github.com/jetpks/ahocorasick-rust-ruby
46
48
  licenses:
47
49
  - MIT
48
50
  metadata: {}
49
- post_install_message:
50
51
  rdoc_options: []
51
52
  require_paths:
52
53
  - lib
@@ -61,8 +62,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
61
62
  - !ruby/object:Gem::Version
62
63
  version: '0'
63
64
  requirements: []
64
- rubygems_version: 3.5.22
65
- signing_key:
65
+ rubygems_version: 3.6.9
66
66
  specification_version: 4
67
67
  summary: Blazing-fast ✨ Ruby wrapper for the Rust Aho-Corasick string matching algorithm!
68
68
  test_files: []