ahocorasick-rust 2.2.0-aarch64-linux-musl

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: f29b41b6780293db5dec1eea569d7303b5b7aa5db21d50be29d8271226ac9e67
4
+ data.tar.gz: 26d85073da308cd263384679af7e6e88ac9f2e07ed8eb59c211b007bfbce6ad9
5
+ SHA512:
6
+ metadata.gz: cff0104df054af9338d3b11a6fac55343906140c14d870b21137a2c794b6d74a8c533cb2b16fbb1327b64ca354ddd1519cd04658476099cd488598c6b54f8392
7
+ data.tar.gz: 7c2311bebb1f546b67ee35580770bfef000b0969e64f121d9f6ff0a927fde08aebff339f8199de9a205dae778be3c99451cb366311f040148d53bd7d5e5f8fb1
data/README.md ADDED
@@ -0,0 +1,255 @@
1
+ # Aho-Corasick Rust ✨
2
+
3
+ [![Gem Version](https://badge.fury.io/rb/ahocorasick-rust.svg)](https://badge.fury.io/rb/ahocorasick-rust)
4
+
5
+ > **Blazing-fast multi-pattern string matching for Ruby!** (ノ◕ヮ◕)ノ*:・゚✧
6
+
7
+ `ahocorasick-rust` is a Ruby wrapper for the [Aho-Corasick algorithm](https://github.com/BurntSushi/aho-corasick) implemented in Rust! 🦀💎
8
+
9
+ ## What is Aho-Corasick? 🤔
10
+
11
+ Aho-Corasick is a powerful string searching algorithm that can find **multiple patterns simultaneously** in a single pass through your text! Unlike traditional string matching that searches for one pattern at a time, Aho-Corasick builds a finite state machine from your dictionary of patterns and matches them all at once.
12
+
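To make the "finite state machine" idea concrete, here is a minimal pure-Ruby sketch of an Aho-Corasick automaton (a trie plus failure links). It is an illustration only and not the gem's implementation: the class name `TinyAhoCorasick` and its `search` method are hypothetical, and it reports character offsets rather than the byte offsets the gem documents.

```ruby
# Illustrative pure-Ruby Aho-Corasick automaton. Hypothetical names;
# this is NOT the code that ships in the gem.
class TinyAhoCorasick
  # One state of the automaton.
  class Node
    attr_accessor :fail
    attr_reader :children, :outputs

    def initialize
      @children = {} # char => Node
      @outputs = []  # patterns that end at this state
      @fail = nil    # longest proper suffix that is also a trie prefix
    end
  end

  def initialize(patterns)
    @root = Node.new
    patterns.each { |pattern| insert(pattern) }
    link_failures
  end

  # Single pass over the text. Returns [pattern, start_char_offset] pairs
  # for every occurrence, overlapping ones included.
  def search(text)
    node = @root
    matches = []
    text.each_char.with_index do |char, index|
      node = node.fail while !node.equal?(@root) && !node.children.key?(char)
      node = node.children.fetch(char, @root)
      node.outputs.each { |pattern| matches << [pattern, index - pattern.length + 1] }
    end
    matches
  end

  private

  # Add one pattern to the trie.
  def insert(pattern)
    node = @root
    pattern.each_char { |char| node = (node.children[char] ||= Node.new) }
    node.outputs << pattern
  end

  # Breadth-first pass that sets failure links and merges outputs.
  def link_failures
    queue = @root.children.values.each { |child| child.fail = @root }
    until queue.empty?
      current = queue.shift
      current.children.each do |char, child|
        fallback = current.fail
        fallback = fallback.fail while !fallback.equal?(@root) && !fallback.children.key?(char)
        child.fail = fallback.children.fetch(char, @root)
        child.outputs.concat(child.fail.outputs)
        queue << child
      end
    end
  end
end

automaton = TinyAhoCorasick.new(%w[abc bcd cde])
automaton.search('abcde').each { |pattern, start| puts "#{pattern} @ #{start}" }
```

The text is scanned once regardless of how many patterns the automaton holds, which is the property the paragraph above describes.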
13
+ **Perfect for:**
14
+ - 🔍 Content filtering & moderation
15
+ - 📝 Finding keywords in large documents
16
+ - 🚫 Detecting prohibited words or phrases
17
+ - 🏷️ Multi-pattern text analysis
18
+ - ⚡ Any scenario where you need to search for many patterns efficiently!
19
+
20
+ **Why this gem rocks:**
21
+ - 🦀 Powered by Rust for maximum speed
22
+ - 💎 Clean, intuitive Ruby API with 7+ search and replace methods
23
+ - 🚀 Up to **67x faster** than pure Ruby implementations
24
+ - ✨ Precompiled binaries for major platforms
25
+ - 🎯 Multiple search modes: overlapping, positioned, existence checks
26
+ - 🔄 Find & replace with hash or block-based logic
27
+ - 🌈 Works with Ruby 2.7+ and UTF-8/emoji
28
+
29
+ ## Installation 📦
30
+
31
+ Add this gem to your `Gemfile`:
32
+
33
+ ```ruby
34
+ gem 'ahocorasick-rust'
35
+ ```
36
+
37
+ Then execute:
38
+
39
+ ```bash
40
+ bundle install
41
+ ```
42
+
43
+ Or install it yourself:
44
+
45
+ ```bash
46
+ gem install ahocorasick-rust
47
+ ```
48
+
49
+ ## Features ✨
50
+
51
+ - **Multiple search modes** - Find all matches, overlapping matches, or just check existence
52
+ - **Position tracking** - Get byte offsets for every match
53
+ - **Case-insensitive matching** - Optional ASCII case-insensitive search
54
+ - **Match strategies** - Control priority when patterns overlap
55
+ - **Find & replace** - Replace patterns with strings or dynamic logic via blocks
56
+ - **Unicode support** - Works seamlessly with UTF-8 text and emoji
57
+ - **Zero-copy where possible** - Efficient memory usage
58
+
59
+ ## Quick Start 🎀
60
+
61
+ ### Basic Pattern Matching
62
+
63
+ ```ruby
64
+ require 'ahocorasick-rust'
65
+
66
+ # Create a matcher with your patterns
67
+ matcher = AhoCorasickRust.new(['cat', 'dog', 'fox'])
68
+
69
+ # Find all matches
70
+ matcher.lookup("The quick brown fox jumps over the lazy dog.")
71
+ # => ["fox", "dog"]
72
+
73
+ # Check if any pattern exists
74
+ matcher.match?("I have a cat")
75
+ # => true
76
+ ```
77
+
78
+ ### Case-Insensitive Matching
79
+
80
+ ```ruby
81
+ matcher = AhoCorasickRust.new(['Ruby', 'Python'], case_insensitive: true)
82
+
83
+ matcher.lookup('I love RUBY and python!')
84
+ # => ["Ruby", "Python"]
85
+ ```
86
+
87
+ ### Get Match Positions
88
+
89
+ ```ruby
90
+ matcher = AhoCorasickRust.new(['fox', 'dog'])
91
+
92
+ matcher.lookup_with_positions('The fox and dog')
93
+ # => [
94
+ # { pattern: 'fox', start: 4, end: 7 },
95
+ # { pattern: 'dog', start: 12, end: 15 }
96
+ # ]
97
+ ```
98
+
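Because the positions are byte offsets, they can differ from character indexes once the text contains multi-byte UTF-8. The sketch below uses only core `String` methods to show the difference and how to slice a match back out. The helper names `byte_range_for` and `slice_match` are hypothetical, and the match hash is hand-written in the shape shown above rather than produced by the gem.

```ruby
# Byte range of the first occurrence of pattern in text, or nil.
# Hypothetical helper for illustration only.
def byte_range_for(text, pattern)
  char_start = text.index(pattern) # character offset
  return nil unless char_start

  byte_start = text[0...char_start].bytesize # bytes before the match
  byte_start...(byte_start + pattern.bytesize)
end

# Recover the matched substring from a { start:, end: } hash of byte offsets.
def slice_match(text, match)
  text.byteslice(match[:start]...match[:end])
end

text = "caf\u00E9 fox" # "é" is one character but two bytes in UTF-8
range = byte_range_for(text, 'fox')
puts "char offset: #{text.index('fox')}, byte offset: #{range.begin}"

# Hand-written hash mimicking the documented result shape.
puts slice_match(text, { pattern: 'fox', start: range.begin, end: range.end })
```

Use `String#byteslice` rather than `String#[]` when slicing with these offsets, since `[]` counts characters.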
99
+ ### Find & Replace
100
+
101
+ ```ruby
102
+ matcher = AhoCorasickRust.new(['bad', 'worse', 'worst'])
103
+
104
+ # Replace with hash
105
+ matcher.replace_all('This is bad and worse', { 'bad' => 'good', 'worse' => 'better' })
106
+ # => "This is good and better"
107
+
108
+ # Replace with block
109
+ matcher.replace_all('This is bad and worse') { |word| '*' * word.length }
110
+ # => "This is *** and *****"
111
+ ```
112
+
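For cross-checking results, the same two replacement styles can be reproduced with Ruby's standard library alone. This is a baseline sketch built on `Regexp.union` and `String#gsub`, not how the gem works internally. The helper name `gsub_baseline` is hypothetical.

```ruby
# Stdlib-only baseline for hash-based and block-based replacement.
# Hypothetical helper; it does not use or imitate the gem's internals.
def gsub_baseline(text, patterns, mapping = nil, &block)
  raise ArgumentError, 'pass a mapping or a block' if mapping.nil? && block.nil?

  union = Regexp.union(patterns) # escapes each string, tries them in list order
  block ? text.gsub(union, &block) : text.gsub(union, mapping)
end

patterns = %w[bad worse worst]

puts gsub_baseline('This is bad and worse', patterns, { 'bad' => 'good', 'worse' => 'better' })
puts gsub_baseline('This is bad and worse', patterns) { |word| '*' * word.length }
```

Note that with `String#gsub`, a matched word missing from the hash is replaced by an empty string, so the baseline mapping should cover every pattern that can occur.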
113
+ ### Overlapping Matches
114
+
115
+ ```ruby
116
+ matcher = AhoCorasickRust.new(['abc', 'bcd', 'cde'])
117
+
118
+ # Regular lookup finds non-overlapping matches
119
+ matcher.lookup('abcde')
120
+ # => ["abc"]
121
+
122
+ # Overlapping lookup finds all matches
123
+ matcher.lookup_overlapping('abcde')
124
+ # => ["abc", "bcd", "cde"]
125
+ ```
126
+
127
+ ### Advanced: Match Strategies
128
+
129
+ ```ruby
130
+ # Prefer longest matches
131
+ matcher = AhoCorasickRust.new(
132
+ ['test', 'testing'],
133
+ match_kind: :leftmost_longest
134
+ )
135
+
136
+ matcher.lookup('testing')
137
+ # => ["testing"] # chooses longer match over 'test'
138
+ ```
139
+
140
+ ### Find First (Efficient for Existence Checks)
141
+
142
+ ```ruby
143
+ matcher = AhoCorasickRust.new(['foo', 'bar', 'baz'])
144
+
145
+ # Get just the first match (faster than getting all matches)
146
+ matcher.find_first('hello foo bar baz')
147
+ # => "foo"
148
+
149
+ # Or with position
150
+ matcher.find_first_with_position('hello foo bar')
151
+ # => { pattern: 'foo', start: 6, end: 9 }
152
+ ```
153
+
154
+ ## API Overview 🔍
155
+
156
+ **Constructor:**
157
+ - `AhoCorasickRust.new(patterns, case_insensitive: false, match_kind: :leftmost_first)`
158
+
159
+ **Search Methods:**
160
+ - `#lookup(text)` - Find all non-overlapping matches
161
+ - `#lookup_overlapping(text)` - Find all matches including overlaps
162
+ - `#lookup_with_positions(text)` - Find matches with byte positions
163
+ - `#match?(text)` - Check if any pattern exists (returns boolean)
164
+ - `#find_first(text)` - Get first match only
165
+ - `#find_first_with_position(text)` - Get first match with position
166
+
167
+ **Replace Methods:**
168
+ - `#replace_all(text, hash)` - Replace with hash mapping
169
+ - `#replace_all(text) { |match| ... }` - Replace with block
170
+
171
+ ## Documentation 📖
172
+
173
+ - **[API Reference](docs/reference.md)** - Complete method documentation with examples
174
+ - **[Match Kind Guide](docs/match_kind.md)** - Understanding match strategies
175
+ - **[Example Script](scripts/example.rb)** - Real-world usage examples
176
+
177
+ **Want more examples?** Check out our example script with content filtering, language detection, and more! 🌈
178
+
179
+ ## Benchmark 📊
180
+
181
+ **Don't just take our word for it - check out these performance numbers!** 🎉
182
+
183
+ ### Test Setup 1
184
+ - Words: 500 patterns
185
+ - Test cases: 2,000
186
+ - Text length: 3,154 chars (avg), 23,676 (max)
187
+
188
+ ```
189
+ user system total real
190
+ each&include 6.487059 0.185424 6.672483 ( 6.791808)
191
+ ruby_ahoc 4.178672 0.138610 4.317282 ( 4.547964)
192
+ rust_ahoc 0.157662 0.004847 0.162509 ( 0.166964)
193
+ ```
194
+
195
+ > 🎈 **27.2x faster** than pure Ruby implementation!
196
+
197
+ ### Test Setup 2
198
+ - Words: 500 patterns
199
+ - Test cases: 2,000
200
+ - Text length: 49,162 chars (avg), 10,392,056 (max)
201
+
202
+ ```
203
+ user system total real
204
+ each&include 27.903179 0.237389 28.140568 ( 28.563194)
205
+ ruby_ahoc 45.220535 0.363107 45.583642 ( 46.477702)
206
+ rust_ahoc 0.670583 0.007192 0.677775 ( 0.686904)
207
+ ```
208
+
209
+ > 🎈 **67.7x faster** than pure Ruby implementation!
210
+
211
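+ The larger your text, the more this gem shines: with the same 500 patterns, the speedup over the pure Ruby implementation grows from 27.2x to 67.7x as the average text length goes from 3,154 to 49,162 chars! ✨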
212
+
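The headline multipliers are the ratio of the `real` (wall-clock) columns in the tables above. A small sketch of that arithmetic, with the timings copied from the two tables (the constant and method names are only for this example):

```ruby
# Wall-clock ("real") seconds copied from the benchmark tables above.
TIMINGS = {
  'Test Setup 1' => { each_include: 6.791808, ruby_ahoc: 4.547964, rust_ahoc: 0.166964 },
  'Test Setup 2' => { each_include: 28.563194, ruby_ahoc: 46.477702, rust_ahoc: 0.686904 }
}.freeze

# Speedup of rust_ahoc over each other approach, rounded to one decimal.
def speedups(row)
  rust = row.fetch(:rust_ahoc)
  row.reject { |name, _| name == :rust_ahoc }
     .transform_values { |seconds| (seconds / rust).round(1) }
end

TIMINGS.each { |setup, row| puts "#{setup}: #{speedups(row)}" }
```

The `ruby_ahoc` entries correspond to the quoted 27.2x and 67.7x figures.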
213
+ ## Platform Support 🌍
214
+
215
+ Precompiled binaries are available for:
216
+ - 🍎 macOS (ARM64 & x86_64)
217
+ - 🐧 Linux (ARM64 & x86_64)
218
+
219
+ If a precompiled binary isn't available for your platform, the gem will automatically compile the Rust extension during installation.
220
+
221
+ ## Development 🛠️
222
+
223
+ Want to contribute? Yay! 🎉
224
+
225
+ ```bash
226
+ # Install dependencies
227
+ bundle install
228
+
229
+ # Compile the extension
230
+ bundle exec rake dev compile
231
+
232
+ # Run tests
233
+ bundle exec rake test
234
+
235
+ # Build the gem
236
+ gem build ahocorasick-rust.gemspec
237
+ ```
238
+
239
+ ## References 📚
240
+
241
+ - [Aho-Corasick (Rust)](https://github.com/BurntSushi/aho-corasick) - The amazing Rust implementation we wrap
242
+ - [Aho-Corasick Algorithm](https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm) - Learn about the algorithm
243
+ - [Original Ruby Implementation](https://github.com/ahnick/ahocorasick) - Pure Ruby version for comparison
244
+
245
+ ## Contributing 💝
246
+
247
+ Bug reports and pull requests are welcome on GitHub at [https://github.com/jetpks/ahocorasick-rust-ruby](https://github.com/jetpks/ahocorasick-rust-ruby)!
248
+
249
+ ## License 📄
250
+
251
+ This gem is available as open source under the terms of the [MIT License](https://opensource.org/licenses/MIT).
252
+
253
+ ---
254
+
255
+ Made with 💖 and Rust 🦀 by Eric
data/Rakefile ADDED
@@ -0,0 +1,33 @@
1
+ # frozen_string_literal: true
2
+
3
+ require 'rake/testtask'
4
+ require 'rake/extensiontask'
5
+
6
+ CROSS_PLATFORMS = %w[
7
+ aarch64-linux-gnu
8
+ aarch64-linux-musl
9
+ arm64-darwin
10
+ x86_64-darwin
11
+ x86_64-linux-gnu
12
+ x86_64-linux-musl
13
+ ].freeze
14
+
15
+ spec = Bundler.load_gemspec('ahocorasick-rust.gemspec')
16
+
17
+ Rake::ExtensionTask.new('rahocorasick', spec) do |c|
18
+ c.lib_dir = 'lib/rahocorasick'
19
+ c.source_pattern = '*.{rs,toml}'
20
+ c.cross_compile = true
21
+ c.cross_platform = CROSS_PLATFORMS
22
+ end
23
+
24
+ Rake::TestTask.new do |t|
25
+ t.deps << :dev << :compile
26
+ t.test_files = FileList['test/**/*_test.rb']
27
+ end
28
+
29
+ task :dev do
30
+ ENV['RB_SYS_CARGO_PROFILE'] = 'dev'
31
+ end
32
+
33
+ task default: :test
@@ -0,0 +1,139 @@
1
+ # Match Kind Strategies
2
+
3
+ The `match_kind` option controls how the Aho-Corasick automaton behaves when multiple patterns could match at the same position in the text. This is important when you have overlapping patterns like `'abc'` and `'abcd'`.
4
+
5
+ ## Available Options
6
+
7
+ ### `:leftmost_first` (default)
8
+
9
+ **Priority:** First pattern in the list wins
10
+
11
+ When multiple patterns could match at the same position, the pattern that appears first in your pattern list takes precedence.
12
+
13
+ ```ruby
14
+ matcher = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_first)
15
+ matcher.lookup('abcd')
16
+ # => ['abc'] # 'abc' appears first in the pattern list, so it wins
17
+ ```
18
+
19
+ **Use cases:**
20
+ - Keyword replacement where you want specific patterns to take precedence
21
+ - Content filtering with priority rules
22
+ - When you want explicit control over match priority by ordering your patterns
23
+
24
+ ---
25
+
26
+ ### `:leftmost_longest`
27
+
28
+ **Priority:** Longest match wins
29
+
30
+ When multiple patterns could match at the same position, the longest matching pattern is preferred.
31
+
32
+ ```ruby
33
+ matcher = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_longest)
34
+ matcher.lookup('abcd')
35
+ # => ['abcd'] # 'abcd' is longer, so it wins
36
+ ```
37
+
38
+ **Use cases:**
39
+ - Tokenization where longer tokens are more meaningful
40
+ - Entity recognition where you want the most specific match
41
+ - Parsing structured text where longer patterns indicate more context
42
+
43
+ ---
44
+
45
+ ### `:standard`
46
+
47
+ **Priority:** Standard Aho-Corasick algorithm behavior
48
+
49
+ Reports matches as they're encountered in the automaton's state machine. This is the classical Aho-Corasick behavior.
50
+
51
+ ```ruby
52
+ matcher = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :standard)
53
+ matcher.lookup('abcd')
54
+ # => ['abc'] # Reports matches as the automaton finds them
55
+ ```
56
+
57
+ **Use cases:**
58
+ - When you need strict Aho-Corasick semantics
59
+ - Potentially slightly faster for certain pattern sets
60
+ - Academic or research purposes requiring standard algorithm behavior
61
+
62
+ ## Comparison Example
63
+
64
+ Given patterns `['ab', 'abc', 'abcd']` and text `'abcd'`:
65
+
66
+ | match_kind | Result | Reason |
67
+ |------------|--------|--------|
68
+ | `:leftmost_first` | `['ab']` | First pattern in list |
69
+ | `:leftmost_longest` | `['abcd']` | Longest matching pattern |
70
+ | `:standard` | `['ab']` | First match encountered by automaton |
71
+
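The table can be reproduced with a brute-force model of the three rules. This is a simplified sketch of the selection logic as described in this guide, not the automaton the gem uses. The method name `first_match` is hypothetical, and the tie-break for `:standard` (earliest start when two matches end at the same offset) is an arbitrary choice made for this example.

```ruby
# Brute-force model of which match each strategy reports first.
# Hypothetical helper; it only imitates the rules described in this guide.
def first_match(patterns, text, kind)
  if kind == :standard
    # Standard: the occurrence that ends earliest is seen first.
    hits = patterns.filter_map do |pattern|
      start = text.index(pattern)
      [start + pattern.length, start, pattern] if start
    end
    return hits.min_by { |finish, start, _| [finish, start] }&.last
  end

  # Leftmost strategies: find the earliest start, then pick among the
  # patterns that match there.
  (0...text.length).each do |start|
    here = patterns.select { |pattern| text[start, pattern.length] == pattern }
    next if here.empty?

    return kind == :leftmost_longest ? here.max_by(&:length) : here.first
  end
  nil
end

patterns = %w[ab abc abcd]
%i[leftmost_first leftmost_longest standard].each do |kind|
  puts "#{kind}: #{first_match(patterns, 'abcd', kind).inspect}"
end
```

Reordering the list to `%w[abcd abc ab]` changes the `:leftmost_first` result to `'abcd'`, which shows why pattern order matters for that strategy.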
72
+ ## Interaction with Other Features
73
+
74
+ ### With `#lookup_overlapping`
75
+
76
+ The `match_kind` option does **not** affect `#lookup_overlapping`, which always returns all overlapping matches regardless of the strategy:
77
+
78
+ ```ruby
79
+ matcher = AhoCorasickRust.new(['abc', 'bcd'], match_kind: :leftmost_first)
80
+
81
+ # match_kind affects non-overlapping lookup
82
+ matcher.lookup('abcd')
83
+ # => ['abc'] (only one match, respects match_kind)
84
+
85
+ # overlapping lookup ignores match_kind
86
+ matcher.lookup_overlapping('abcd')
87
+ # => ['abc', 'bcd'] (all matches, ignores match_kind)
88
+ ```
89
+
90
+ ### With `case_insensitive`
91
+
92
+ The `match_kind` and `case_insensitive` options work together seamlessly:
93
+
94
+ ```ruby
95
+ matcher = AhoCorasickRust.new(
96
+ ['abc', 'abcd'],
97
+ match_kind: :leftmost_longest,
98
+ case_insensitive: true
99
+ )
100
+ matcher.lookup('ABCD')
101
+ # => ['abcd'] # Finds longest match, case-insensitively
102
+ ```
103
+
104
+ ### With `#find_first`
105
+
106
+ The `match_kind` affects which match is returned by `#find_first`:
107
+
108
+ ```ruby
109
+ # With leftmost_first
110
+ matcher1 = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_first)
111
+ matcher1.find_first('abcd') # => 'abc'
112
+
113
+ # With leftmost_longest
114
+ matcher2 = AhoCorasickRust.new(['abc', 'abcd'], match_kind: :leftmost_longest)
115
+ matcher2.find_first('abcd') # => 'abcd'
116
+ ```
117
+
118
+ ## Choosing the Right Strategy
119
+
120
+ **Use `:leftmost_first` when:**
121
+ - You want explicit control over priority
122
+ - Pattern order is meaningful in your domain
123
+ - You're doing rule-based text processing
124
+
125
+ **Use `:leftmost_longest` when:**
126
+ - Longer matches are more specific/important
127
+ - You're tokenizing or parsing
128
+ - You want the most complete match
129
+
130
+ **Use `:standard` when:**
131
+ - You need classical Aho-Corasick semantics
132
+ - You're comparing against other implementations
133
+ - Performance is critical and you understand the tradeoffs
134
+
135
+ ## Performance Notes
136
+
137
+ All three strategies use the same underlying automaton construction, so performance differences are minimal. The main difference is in the matching logic when choosing between multiple possible matches at the same position.
138
+
139
+ In practice, `:leftmost_first` (the default) provides the best balance of performance, predictability, and control for most use cases.