tre_regex 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: c397cd3765b015545e177b95c5bd349990bd93ff8bb50d93e31d6e8d1888516b
4
- data.tar.gz: d149058f38eb481add948d7f44e23e6c20a49a820c95bf08cda52772313eecc1
3
+ metadata.gz: 71f0e9a1dae495b0c90ea158eb3ca777628ab09bff914b8e93daa1f60bae0eb7
4
+ data.tar.gz: a1dad85bb7991a50043691664b32515868ee0eb74a2f57c012467dadd180da72
5
5
  SHA512:
6
- metadata.gz: f61f4e18c2a67b1e6136976f4558418eefbd18609bf4acb2a7a8e292aa6c6d19c675ac2008c3bc0995d6d15c071a9e4c6b58f37e5894b841632db96b55a27762
7
- data.tar.gz: 5caefb33cd586c165df5f4d3c29cc879d1706b04e974a838429322d91eabe439ad4eadf7c43b6859ed02626eefa0578f8b112e3ed2591ac4c127c31224d62371
6
+ metadata.gz: a2f9c17fff47178e2fa7b6bdab40714593ed163229b0fec6e824b35575f7046afe8fd75fbbb208c3bf54d891268b0bcc07a93aa67c43b8966f06d5d8267bc375
7
+ data.tar.gz: 7fb802f25c70580d0f47488f3f43fc102a1230d2b5a5e47e092b3b6186dba97465fd3e7fd0b0f88fbdf6ff98eeca2daf9680cd86aed83f8d1e8e6939d462c0fa
data/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2026 Oleksii Vasyliev
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,325 @@
1
+ # TreRegex [![Ruby Checks](https://github.com/le0pard/tre_regex/actions/workflows/main.yml/badge.svg)](https://github.com/le0pard/tre_regex/actions/workflows/main.yml)
2
+
3
+ `TreRegex` is a robust Ruby gem that provides a high-performance interface to the [TRE](https://github.com/laurikari/tre) approximate regex matching library. Powered by FFI, it allows you to perform lightning-fast fuzzy string searching while safely handling Ruby's Unicode characters.
4
+
5
+ ## Why?
6
+
7
+ Standard regular expressions are strictly exact. If you are searching text containing typos, OCR errors, or variations in spelling, standard `Regexp` will fail.
8
+
9
+ While Ruby has built-in string distance metrics (like Levenshtein distance), they usually require comparing whole strings against other whole strings. `TreRegex` solves this by allowing you to search for a pattern *within* a larger body of text while permitting a configurable number of errors (insertions, deletions, and substitutions).
10
+
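To make the difference between whole-string distance and in-text search concrete, here is a minimal pure-Ruby sketch. It computes the smallest edit distance between a literal pattern and any substring of a text, using a textbook dynamic-programming approach. `min_substring_distance` is a hypothetical helper written for this illustration; it handles literal patterns only and is not a description of how TRE works internally.

```ruby
# Smallest number of edits (insertions, deletions, substitutions) needed to
# match `pattern` against any substring of `text`.
def min_substring_distance(pattern, text)
  # column[i] = best cost of matching the first i pattern characters
  # against some substring ending at the current text position
  column = (0..pattern.length).to_a
  best = column.last

  text.each_char do |text_char|
    previous_diagonal = column[0]
    column[0] = 0 # a match may start at any position in the text

    pattern.each_char.with_index(1) do |pattern_char, i|
      old_value = column[i]
      column[i] = [
        previous_diagonal + (pattern_char == text_char ? 0 : 1), # match or substitution
        column[i] + 1,     # extra character in the text (insertion)
        column[i - 1] + 1  # pattern character missing from the text (deletion)
      ].min
      previous_diagonal = old_value
    end

    best = [best, column.last].min
  end

  best
end

min_substring_distance('apple', 'I ate an apple today') # => 0
min_substring_distance('apple', 'I ate an aple')        # => 1
```

A whole-string Levenshtein comparison of `'apple'` with `'I ate an aple'` would report a large distance, while the substring variant reports a single missing character.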
11
+ ## Features
12
+
13
+ * **Approximate Matching**: Find matches even if the target string has missing, extra, or substituted characters.
14
+ * **Granular Control**: Set strict limits on `max_errors`, or fine-tune by specific error types (`max_insertions`, `max_deletions`, `max_substitutions`).
15
+ * **Multi-byte Unicode Safety**: Transparently maps underlying C byte-offsets back to native Ruby character indices (e.g., emojis won't break your offsets).
16
+
17
+ ## Installation
18
+
19
+ Add this line to your application's Gemfile:
20
+
21
+ ```ruby
22
+ gem 'tre_regex'
23
+ ```
24
+
25
+ And then execute:
26
+
27
+ ```bash
28
+ $ bundle install
29
+ ```
30
+
31
+ Or install it directly:
32
+
33
+ ```bash
34
+ $ gem install tre_regex
35
+ ```
36
+
37
+ ## Usage
38
+
39
+ ### Basic Matching
40
+
41
+ Create a new `TreRegex::Regex` object and use `exec` or `test?` to search text:
42
+
43
+ ```ruby
44
+ require 'tre_regex'
45
+
46
+ regex = TreRegex::Regex.new('apple', ignore_case: true)
47
+
48
+ # Simple boolean check
49
+ regex.test?('I ate an APPLE today')
50
+ # => true
51
+
52
+ # Get detailed match data
53
+ result = regex.exec('I ate an apple today')
54
+ # => {
55
+ # :match => "apple",
56
+ # :submatches => [],
57
+ # :index => 9,
58
+ # :end_index => 14,
59
+ # :cost => 0,
60
+ # :errors => {:insertions=>0, :deletions=>0, :substitutions=>0}
61
+ # }
62
+ ```
63
+
64
+ ### Fuzzy Matching
65
+
66
+ You can configure fuzziness by passing options directly to the `exec` method (`test?` and `match_all` accept the same options):
67
+
68
+ ```ruby
69
+ regex = TreRegex::Regex.new('apple')
70
+
71
+ # Allow up to 1 error of any kind
72
+ regex.exec('I ate an aple', max_errors: 1)
73
+ # => {match: "aple", submatches: [], index: 9, end_index: 13, cost: 1, errors: {insertions: 0, deletions: 1, substitutions: 0}}
74
+
75
+ # Allow substitutions, but explicitly forbid deletions
76
+ regex.exec('I ate an aple', max_substitutions: 1, max_deletions: 0)
77
+ # => nil
78
+ ```
79
+
80
+ ### Finding All Matches
81
+
82
+ Use `match_all` to find every occurrence of a pattern in a string. It can take a block or return an `Enumerator`:
83
+
84
+ ```ruby
85
+ regex = TreRegex::Regex.new('cat')
86
+
87
+ # Returns an array of match hashes
88
+ regex.match_all('cat, cot, cut', max_errors: 1).to_a
89
+ # => [
90
+ # {match: "cat", submatches: [], index: 0, end_index: 3, cost: 0, errors: {insertions: 0, deletions: 0, substitutions: 0}},
91
+ # {match: "cot", submatches: [], index: 5, end_index: 8, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},
92
+ # {match: "cut", submatches: [], index: 10, end_index: 13, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}}
93
+ # ]
94
+ ```
95
+
96
+ ### Capture Groups (Submatches)
97
+
98
+ `TreRegex` fully supports standard POSIX capture groups using parentheses `()`. Whenever a match is found, any captured data is returned as an array of strings under the `:submatches` key in the result hash.
99
+
100
+ If your pattern does not contain any capture groups, `:submatches` will simply return an empty array `[]`.
101
+
102
+ ```ruby
103
+ regex = TreRegex::Regex.new('I love (ruby|python)')
104
+ result = regex.exec('I love ruby a lot')
105
+
106
+ # The captured group is extracted exactly as it was matched
107
+ result[:submatches] # => ["ruby"]
108
+ ```
109
+
110
+ #### Multiple and Optional Groups
111
+
112
+ You can define multiple capture groups, and they will be returned in the array in the exact order they appear in the pattern.
113
+
114
+ If you use an optional capture group `?` that does not end up matching anything in the target text, `TreRegex` will safely insert a `nil` in its place in the array to maintain the correct index order.
115
+
116
+ ```ruby
117
+ # The first group (cat) is optional. The second group (dog) is required.
118
+ regex = TreRegex::Regex.new('(cat)?(dog)')
119
+
120
+ result = regex.exec('dog')
121
+ # => {match: "dog", submatches: [nil, "dog"], index: 0, end_index: 3, cost: 0, errors: {insertions: 0, deletions: 0, substitutions: 0}}
122
+ ```
123
+
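One consequence is worth noting. In this release, the submatch extraction in `lib/tre_regex.rb` reads the nine group slots and then drops trailing `nil` values. The sketch below mirrors only that cleanup step in plain Ruby; `trim_trailing_nils` is a hypothetical name used for illustration, not part of the gem's API. It suggests that an unmatched optional group is kept as `nil` only when a later group did match.

```ruby
# Drops nil entries from the end of the slot list, keeping nils that are
# followed by a matched group so that positions stay aligned.
def trim_trailing_nils(slots)
  trimmed = slots.dup
  trimmed.pop while !trimmed.empty? && trimmed.last.nil?
  trimmed
end

# '(cat)?(dog)' against 'dog': the unmatched first group is preserved
trim_trailing_nils([nil, 'dog', nil, nil]) # => [nil, "dog"]

# A pattern whose optional group comes last: the unmatched group is dropped
trim_trailing_nils(['dog', nil, nil])      # => ["dog"]
```

If your code indexes into `:submatches` by position, it is safer to treat a missing trailing element the same way as `nil`.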
124
+ #### Fuzzy Capture Groups
125
+
126
+ One of the most powerful features of `TreRegex` is that capture groups respect your fuzzy matching rules! If a typo occurs *inside* a capture group, the `:submatches` array will return the actual typed text with the typo included.
127
+
128
+ ```ruby
129
+ regex = TreRegex::Regex.new('I ate an (apple)')
130
+
131
+ # We allow 1 error. The user typed 'aple' (1 deletion).
132
+ result = regex.exec('I ate an aple', max_errors: 1)
133
+
134
+ result[:submatches] # => ["aple"]
135
+ ```
136
+
137
+ #### The 9-Group Limit
138
+
139
+ For memory safety and performance during FFI allocation, `TreRegex` allocates a strict maximum of 10 slots per match. Because the first slot is always reserved for the full regex match itself, the engine will only extract a maximum of **9 capture groups** per match.
140
+
141
+ If your pattern contains 10 or more capture groups `()`, the regex will still compile and match perfectly, but any captured groups beyond the 9th one will be safely ignored and omitted from the `:submatches` array.
142
+
143
+ ## Configuration Options
144
+
145
+ `TreRegex` provides fine-grained control over how patterns are compiled and how fuzzy matching constraints are applied.
146
+
147
+ ### Initialization Options
148
+
149
+ When creating a new `TreRegex::Regex` object, you can pass options to modify how the pattern is compiled:
150
+
151
+ * **`ignore_case`** *(Boolean)*: If `true`, the regex will match characters regardless of their case (equivalent to the `/i` flag in standard Ruby regex). Default is `false`.
152
+
153
+ ```ruby
154
+ # Fails because case doesn't match
155
+ exact_regex = TreRegex::Regex.new('ruby')
156
+ exact_regex.test?('RUBY') # => false
157
+
158
+ # Succeeds using the ignore_case flag
159
+ case_regex = TreRegex::Regex.new('ruby', ignore_case: true)
160
+ case_regex.test?('RUBY') # => true
161
+ ```
162
+
163
+ ### Fuzzy Matching Options
164
+
165
+ When calling `exec`, `test?`, or `match_all`, you can pass a hash of fuzzy matching options. If no options are provided, `TreRegex` forces an **exact match** (0 errors allowed).
166
+
167
+ #### Error Limits
168
+
169
+ These options strictly limit the number of specific operations required to transform the pattern into the matched string.
170
+
171
+ * **`max_errors`** *(Integer)*: The total maximum number of combined errors (insertions + deletions + substitutions) allowed for a match.
172
+ * **`max_insertions`** *(Integer)*: The maximum number of extra characters allowed in the searched text. *(e.g., Pattern `cat` matching `cart` is 1 insertion)*.
173
+ * **`max_deletions`** *(Integer)*: The maximum number of missing characters in the searched text. *(e.g., Pattern `cat` matching `ct` is 1 deletion)*.
174
+ * **`max_substitutions`** *(Integer)*: The maximum number of swapped characters. *(e.g., Pattern `cat` matching `cot` is 1 substitution)*.
175
+
176
+ > **Note:** If you specify granular limits (like `max_deletions: 1`) but omit `max_errors`, the gem will automatically calculate the maximum allowed errors so you don't accidentally trigger an unlimited fuzzy search.
177
+
178
+ ```ruby
179
+ regex = TreRegex::Regex.new('banana')
180
+
181
+ # Allow up to 2 typos of any kind
182
+ regex.exec('bananana', max_errors: 2) # => matches "bananana" (2 insertions)
183
+ regex.exec('bnnna', max_errors: 2) # => matches "bnnna" (2 deletions)
184
+ regex.exec('bonono', max_errors: 2) # => matches "bonono" (2 substitutions)
185
+
186
+ # Another example
187
+ regex = TreRegex::Regex.new('library')
188
+
189
+ # Allow 1 deletion, but STRICTLY 0 substitutions and 0 insertions
190
+ regex.exec('librry', max_deletions: 1, max_substitutions: 0, max_insertions: 0)
191
+ # => matches "librry"
192
+
193
+ # This fails because 'lubrary' requires a substitution, which we set to 0
194
+ regex.exec('lubrary', max_deletions: 1, max_substitutions: 0, max_insertions: 0)
195
+ # => nil
196
+ ```
197
+
198
+ #### Cost and Weights
199
+
200
+ Instead of hard limits, you can assign different "costs" to different types of errors. This is useful if you want to penalize certain typos more heavily than others.
201
+
202
+ * **`max_cost`** *(Integer)*: The maximum total cost allowed for a match to be considered successful.
203
+ * **`weight_insertion`** *(Integer)*: The cost penalty for each inserted character.
204
+ * **`weight_deletion`** *(Integer)*: The cost penalty for each deleted character.
205
+ * **`weight_substitution`** *(Integer)*: The cost penalty for each substituted character.
206
+
207
+ ```ruby
208
+ regex = TreRegex::Regex.new('algorithm')
209
+
210
+ # We allow a maximum cost of 2.
211
+ # Missing/extra characters cost 1 point.
212
+ # Wrong characters cost 3 points.
213
+ options = {
214
+ max_cost: 2,
215
+ weight_deletion: 1,
216
+ weight_insertion: 1,
217
+ weight_substitution: 3
218
+ }
219
+
220
+ # 'algoritm' has 1 deletion. Cost = 1. (Passes, 1 < 2)
221
+ regex.test?('algoritm', options) # => true
222
+
223
+ # 'algorethm' has 1 substitution. Cost = 3. (Fails, 3 > 2)
224
+ regex.test?('algorethm', options) # => false
225
+ ```
226
+
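The arithmetic behind the example above can be sketched in plain Ruby. This assumes the total cost is the weighted sum of the error counts and that a match is accepted when the cost does not exceed `max_cost`; `weighted_cost` is a hypothetical helper for illustration, not part of the gem.

```ruby
# Total cost of a match as a weighted sum of its error counts.
# `errors` uses the same keys as the :errors hash in a match result.
def weighted_cost(errors, weights)
  (errors[:insertions] * weights[:weight_insertion]) +
    (errors[:deletions] * weights[:weight_deletion]) +
    (errors[:substitutions] * weights[:weight_substitution])
end

weights = { weight_insertion: 1, weight_deletion: 1, weight_substitution: 3 }
max_cost = 2

one_deletion = { insertions: 0, deletions: 1, substitutions: 0 }     # 'algoritm'
one_substitution = { insertions: 0, deletions: 0, substitutions: 1 } # 'algorethm'

weighted_cost(one_deletion, weights) <= max_cost     # => true  (cost 1)
weighted_cost(one_substitution, weights) <= max_cost # => false (cost 3)
```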
227
+ ## Gotchas & Best Practices
228
+
229
+ ### The "Empty Match" Phenomenon
230
+
231
+ Because `TreRegex` relies on strict mathematical edit distances, you must be careful when setting `max_errors` to a value that is **greater than or equal to the length of your pattern**.
232
+
233
+ If you allow 3 errors on a 3-letter word, the engine considers *deleting all 3 characters* to be a valid mathematical match (cost = 3). This will result in an unexpected match against an empty string (`""`).
234
+
235
+ ```ruby
236
+ regex = TreRegex::Regex.new('cat')
237
+
238
+ # We allow 3 errors on a 3-letter word.
239
+ # The engine matches "cow" (2 substitutions)...
240
+ # but it also matches "" at the end of the string (3 deletions)!
241
+ regex.match_all('cot, cow', max_errors: 3).to_a
242
+ # => [
243
+ # {match: "cot", submatches: [], index: 0, end_index: 3, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},
244
+ # {match: "cow", submatches: [], index: 5, end_index: 8, cost: 2, errors: {insertions: 0, deletions: 0, substitutions: 2}},
245
+ # {match: "", submatches: [], index: 8, end_index: 8, cost: 3, errors: {insertions: 0, deletions: 3, substitutions: 0}}
246
+ # ]
247
+ ```
248
+
249
+ **Best Practice**: If you need a high `max_errors` limit but want to prevent the engine from matching empty strings, explicitly cap the `max_deletions` option so that at least one character of your pattern must survive:
250
+
251
+ ```ruby
252
+ # Allow 3 total errors, but strictly forbid the engine from deleting more than 2 characters
253
+ regex.match_all('cot, cow', max_errors: 3, max_deletions: 2).to_a
254
+ # => [
255
+ # {match: "cot", submatches: [], index: 0, end_index: 3, cost: 1, errors: {insertions: 0, deletions: 0, substitutions: 1}},
256
+ # {match: "cow", submatches: [], index: 5, end_index: 8, cost: 2, errors: {insertions: 0, deletions: 0, substitutions: 2}}
257
+ # ] # The empty match is mathematically prevented
258
+ ```
259
+
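For literal patterns, the cap can be derived from the pattern length. The helper below is a hypothetical convenience, not part of the gem, and it assumes the pattern contains no regex operators, so its length equals the number of characters a full match contains.

```ruby
# Builds an options hash that allows `max_errors` total errors while
# capping deletions so at least one pattern character must survive.
def non_empty_limits(literal_pattern, max_errors)
  deletion_cap = [max_errors, literal_pattern.length - 1].min
  { max_errors: max_errors, max_deletions: [deletion_cap, 0].max }
end

non_empty_limits('cat', 3)    # => {max_errors: 3, max_deletions: 2}
non_empty_limits('banana', 2) # => {max_errors: 2, max_deletions: 2}
```

The resulting hash has the shape the fuzzy-matching methods accept, for example `regex.match_all('cot, cow', non_empty_limits('cat', 3))`.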
260
+ ### POSIX vs. PCRE Syntax
261
+
262
+ Ruby’s built-in `Regexp` engine uses a PCRE-like syntax (Onigmo), which supports advanced features like lookaheads `(?=...)`, lookbehinds, and backreferences.
263
+
264
+ The underlying TRE C-library uses **POSIX Extended Regular Expressions (ERE)**. While it supports standard regex features (character classes `[a-z]`, quantifiers `*`, `+`, `?`, and grouping), it **does not** support Perl-specific extensions.
265
+
266
+ ```ruby
267
+ # Valid TRE syntax
268
+ TreRegex::Regex.new('(cat|dog)s?')
269
+
270
+ # INVALID: Lookarounds are not supported by POSIX ERE
271
+ TreRegex::Regex.new('cat(?=s)') # => raises TreRegex::Error (Failed to compile regex pattern: cat(?=s))
272
+ ```
273
+
274
+ ### The Performance Cost of Extreme Fuzziness
275
+
276
+ Fuzzy matching is inherently more computationally expensive than exact matching. The TRE algorithm scales based on the length of the string and the number of allowed errors.
277
+
278
+ If you are searching a massive block of text (like a whole book) and set `max_errors: 10`, the engine has to calculate an enormous number of branching possibilities.
279
+
280
+ **Best Practice**: Keep your error limits tight and realistic. An error limit of 1 to 3 is usually perfect for catching typos. If you need to allow a massive number of errors, consider breaking the target text into smaller chunks (like sentences or words) before matching.
281
+
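One way to chunk a large text is to split it into sentence-like pieces while remembering where each piece starts, so that match positions can be mapped back to the full text. The sketch below is a rough illustration in plain Ruby; `each_chunk` and its splitting rule are assumptions, not part of the gem.

```ruby
# Yields each sentence-like chunk together with the character offset at
# which it starts in the full text.
def each_chunk(text)
  return enum_for(:each_chunk, text) unless block_given?

  text.scan(/[^.!?]+[.!?]*\s*/) do
    match = Regexp.last_match
    yield match[0], match.begin(0)
  end
end

text = 'First one. Second one! Third'
each_chunk(text).to_a
# => [["First one. ", 0], ["Second one! ", 11], ["Third", 23]]
```

A match found inside a chunk can be translated back by adding the chunk offset to its `:index` and `:end_index`. Note that a match spanning a chunk boundary would be missed with this approach.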
282
+ ### Unicode Character Indices vs. Byte Offsets
283
+
284
+ In C, strings are just arrays of bytes. An emoji like 🍎 takes up 4 bytes in UTF-8, which often breaks indexing when C libraries pass data back to Ruby.
285
+
286
+ `TreRegex` handles this for you under the hood. The `:index` and `:end_index` returned in the match hash are strictly mapped to **Ruby character indices**, not raw byte offsets.
287
+
288
+ **Best Practice**: You can safely use the returned indices directly with standard Ruby string slicing, even if the text is filled with emojis or multi-byte characters. Do not use them with `String#byteslice`.
289
+
290
+ ```ruby
291
+ regex = TreRegex::Regex.new('apple')
292
+ target = 'I ate 🍎 and an aple'
293
+
294
+ result = regex.exec(target, max_errors: 1)
295
+ # => {match: "aple", submatches: [], index: 15, end_index: 19, cost: 1, errors: {insertions: 0, deletions: 1, substitutions: 0}}
296
+
297
+ # This is 100% safe and will correctly return "aple"
298
+ target[result[:index]...result[:end_index]]
299
+ ```
300
+
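The difference between the two kinds of offsets can be seen without the gem. The following plain-Ruby sketch finds the byte offset of `"aple"` in the same target string and converts it to a character index by counting the characters in the preceding bytes, which is the same idea the gem's source applies with `byteslice`. It assumes a UTF-8 encoded source file, which is Ruby's default.

```ruby
target = 'I ate 🍎 and an aple'

# Byte offset of "aple", found by searching the binary (byte-wise) view
byte_offset = target.b.index('aple'.b)

# Character index: count the characters in the prefix ending at that byte
char_index = target.byteslice(0, byte_offset).length

byte_offset              # => 18
char_index               # => 15
target[char_index, 4]    # => "aple"
```

The three-byte gap between the two values comes from the emoji, which is one character but four bytes.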
301
+ ### Overlapping Matches in `match_all`
302
+
303
+ When using `match_all`, be aware that the engine consumes the string as it matches. By default, standard regex engines (including TRE) do not return overlapping matches.
304
+
305
+ If you search for `"ana"` in `"banana"`, it will only match the first `"ana"`. Once it consumes those characters, it moves on to the remaining `"na"`.
306
+
307
+ ```ruby
308
+ regex = TreRegex::Regex.new('ana')
309
+
310
+ # Returns 1 match, not 2!
311
+ regex.match_all('banana').to_a
312
+ # => [{match: "ana", submatches: [], index: 1, end_index: 4, cost: 0, errors: {insertions: 0, deletions: 0, substitutions: 0}}]
313
+ ```
314
+
315
+ If you need to find overlapping fuzzy matches, you will need to manually step through the string by advancing your starting index by 1 character after each search.
316
+
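A minimal sketch of that manual stepping is shown below. `overlapping_matches` is a hypothetical helper, not part of the gem. It takes a block that searches a slice and returns `nil` or a hash with `:index` and `:end_index` relative to that slice, which is the shape documented for `exec`. A stand-in matcher built on `String#index` keeps the example runnable without the native library.

```ruby
# Collects overlapping matches by restarting the search one character
# after the start of each previous match.
def overlapping_matches(text)
  results = []
  offset = 0

  while offset <= text.length
    found = yield(text[offset..])
    break unless found

    start = found[:index] + offset
    results << found.merge(index: start, end_index: found[:end_index] + offset)
    offset = start + 1
  end

  results
end

# Stand-in for a real matcher such as `->(slice) { regex.exec(slice) }`
stand_in = lambda do |slice|
  i = slice.index('ana')
  i.nil? ? nil : { match: 'ana', index: i, end_index: i + 3 }
end

matches = overlapping_matches('banana', &stand_in)
matches.map { |m| m[:index] } # => [1, 3]
```

With fuzzy options enabled, stepping one character at a time is likely to report several near-identical matches around the same position, so some de-duplication may be needed.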
317
+ ## Development
318
+
319
+ After checking out the repo, run `bin/setup` to install dependencies. Then, run `rake spec` to run the tests. You can also run `bin/console` for an interactive prompt that will allow you to experiment.
320
+
321
+ To install this gem onto your local machine, run `bundle exec rake install`. To release a new version, update the version number in `version.rb`, and then run `bundle exec rake release`, which will create a git tag for the version, push git commits and the created tag, and push the `.gem` file to [rubygems.org](https://rubygems.org).
322
+
323
+ ## License
324
+
325
+ The gem is available as open source under the terms of the MIT License.
data/ext/tre_regex/extconf.rb CHANGED
@@ -5,10 +5,39 @@ require 'rbconfig'
5
5
  require 'open-uri'
6
6
  require 'net/http'
7
7
  require 'fileutils'
8
+ require 'digest'
8
9
 
9
10
  is_windows = RbConfig::CONFIG['host_os'] =~ /mingw|mswin/
10
11
  is_darwin = RbConfig::CONFIG['host_os'].include?('darwin')
11
12
 
13
+ build_env = {
14
+ 'CC' => RbConfig::CONFIG['CC'],
15
+ 'CFLAGS' => RbConfig::CONFIG['CFLAGS'],
16
+ 'CPPFLAGS' => RbConfig::CONFIG['CPPFLAGS'],
17
+ 'LDFLAGS' => RbConfig::CONFIG['LDFLAGS']
18
+ }
19
+
20
+ # Embed standard C libraries directly into the DLL on Windows so it doesn't crash on bare machines
21
+ if is_windows
22
+ # append static flags directly to CC! Libtool doesn't strip flags from the CC variable.
23
+ build_env['CC'] = "#{build_env['CC']} -static-libgcc -static-libstdc++"
24
+
25
+ # force MinGW to statically link the hidden winpthread dependency
26
+ build_env['LDFLAGS'] = "#{build_env['LDFLAGS']} -Wl,-Bstatic -lpthread -Wl,-Bdynamic"
27
+
28
+ # prevent MinGW from injecting a dependency on libssp-0.dll
29
+ build_env['CFLAGS'] = "#{build_env['CFLAGS']} -fno-stack-protector"
30
+ end
31
+
32
+ gnu_host = RbConfig::CONFIG['host_alias']
33
+ gnu_host = RbConfig::CONFIG['host'] if gnu_host.nil? || gnu_host.empty?
34
+
35
+ # Convert 'arm64' to 'aarch64' (Apple Silicon) and 'x64' to 'x86_64' (Windows)
36
+ gnu_host = gnu_host.sub('arm64', 'aarch64').sub(/^x64/, 'x86_64')
37
+
38
+ # Pass the translated, safe name to configure
39
+ host_flag = "--host=#{gnu_host}"
40
+
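The effect of the two substitutions above can be checked in isolation. In this sketch, `normalize_gnu_host` is a hypothetical wrapper around the same `sub` calls, and the host strings are illustrative examples rather than values taken from the package.

```ruby
# Applies the same rewrites as the build script: 'arm64' becomes 'aarch64'
# anywhere in the string, and a leading 'x64' becomes 'x86_64'.
def normalize_gnu_host(host)
  host.sub('arm64', 'aarch64').sub(/^x64/, 'x86_64')
end

normalize_gnu_host('arm64-apple-darwin23') # => "aarch64-apple-darwin23"
normalize_gnu_host('x64-mingw-ucrt')       # => "x86_64-mingw-ucrt"
normalize_gnu_host('x86_64-pc-linux-gnu')  # => "x86_64-pc-linux-gnu"
```

Because the second pattern is anchored with `^`, a host that already starts with `x86_64` is left unchanged.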
12
41
  root_dir = File.expand_path(__dir__)
13
42
  root_dir = File.dirname(root_dir) until Dir.exist?(File.join(root_dir, 'lib')) || root_dir == '/'
14
43
 
@@ -16,6 +45,7 @@ root_dir = File.dirname(root_dir) until Dir.exist?(File.join(root_dir, 'lib')) |
16
45
  github_repo = 'laurikari/tre'
17
46
  version = '5ac28057f648debda76f9bf4d39dfdfa85b0df18'
18
47
  tarball_url = "https://github.com/#{github_repo}/archive/#{version}.tar.gz"
48
+ expected_tarball_sha256 = '528a8f8a4672cd3a0e5354629323a17d0cfa98b3792a57d764b64db30e2d5e9a'
19
49
  tarball_file = File.expand_path("./tre-#{version}.tar.gz", __dir__)
20
50
  tre_src_dir = File.expand_path("./tre-#{version}", __dir__)
21
51
  dest_lib_dir = File.join(root_dir, 'lib', 'tre_regex', 'bin')
@@ -43,6 +73,18 @@ unless Dir.exist?(tre_src_dir)
43
73
  puts '========== Downloading TRE from GitHub =========='
44
74
  begin
45
75
  content = download_file(tarball_url)
76
+
77
+ actual_sha256 = Digest::SHA256.hexdigest(content)
78
+
79
+ if actual_sha256 != expected_tarball_sha256
80
+ abort [
81
+ 'SECURITY ERROR:',
82
+ 'Checksum mismatch for TRE source!',
83
+ "Expected: #{expected_tarball_sha256}",
84
+ "Actual: #{actual_sha256}"
85
+ ].join("\n")
86
+ end
87
+
46
88
  File.binwrite(tarball_file, content)
47
89
  rescue StandardError => e
48
90
  abort "Error: #{e.message}"
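The checksum gate added in this hunk follows a common pattern: hash the downloaded bytes and refuse to continue unless the digest equals a pinned value. A stand-alone sketch of the same idea is shown below. `verify_sha256!` is a hypothetical helper and raises an exception instead of calling `abort`, which makes it easier to exercise in isolation.

```ruby
require 'digest'

# Returns the content unchanged when its SHA-256 digest matches the pinned
# value, and raises otherwise.
def verify_sha256!(content, expected_hex)
  actual_hex = Digest::SHA256.hexdigest(content)
  return content if actual_hex == expected_hex.downcase

  raise ArgumentError, "Checksum mismatch: expected #{expected_hex}, got #{actual_hex}"
end

payload = 'example tarball bytes'
# In a real build the expected digest is pinned in advance, not computed at run time
pinned = Digest::SHA256.hexdigest(payload)

verify_sha256!(payload, pinned) # returns the payload
```

Verifying before writing the file to disk, as the hunk does, means a tampered download is never extracted.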
@@ -50,33 +92,34 @@ unless Dir.exist?(tre_src_dir)
50
92
 
51
93
  puts '========== Extracting TRE Source =========='
52
94
  # Ensure we use -z for gzip
53
- system("tar -xzf #{tarball_file} -C #{__dir__}") || abort('Extraction failed')
95
+ system('tar', '-xzf', tarball_file, '-C', __dir__) || abort('Extraction failed')
54
96
  end
55
97
 
56
98
  # Build TRE synchronously using Ruby
57
- host_flag = enable_config('cross-build') ? "--host=#{RbConfig::CONFIG['host']} " : ''
58
- RbConfig::CONFIG['SOEXT'] || RbConfig::CONFIG['DLEXT'] || (is_windows ? 'dll' : 'so')
59
99
 
60
100
  puts '========== Building TRE =========='
61
101
  Dir.chdir(tre_src_dir) do
62
- system('./utils/autogen.sh') || raise('autogen.sh failed') unless File.exist?('configure')
63
-
64
- system("./configure #{host_flag} --enable-shared --disable-static --disable-agrep") || raise('configure failed')
65
- system('make') || raise('make failed')
102
+ system(build_env, './utils/autogen.sh') || raise('autogen.sh failed') unless File.exist?('configure')
103
+
104
+ system(
105
+ build_env,
106
+ './configure',
107
+ host_flag,
108
+ '--enable-shared',
109
+ '--disable-static',
110
+ '--disable-agrep',
111
+ '--disable-nls',
112
+ 'ac_cv_header_libintl_h=no',
113
+ 'ac_cv_lib_intl_libintl_gettext=no'
114
+ ) || raise('configure failed')
115
+ system(build_env, 'make') || raise('make failed')
66
116
  end
67
117
 
68
118
  puts '========== Staging Shared Library for FFI =========='
69
119
  FileUtils.mkdir_p(dest_lib_dir)
70
120
 
71
- # Find the REAL physical file, strictly ignoring symlinks and static (.a) archives
121
+ # Find the shared library, ignoring static archives
72
122
  src_lib = Dir.glob("#{tre_src_dir}/lib/.libs/*").find do |f|
73
- (f.include?('.so') || f.include?('.dylib') || f.end_with?('.dll')) &&
74
- !f.end_with?('.a') &&
75
- !File.symlink?(f)
76
- end
77
-
78
- # Fallback just in case libtool behaved differently
79
- src_lib ||= Dir.glob("#{tre_src_dir}/lib/.libs/*").find do |f|
80
123
  (f.include?('.so') || f.include?('.dylib') || f.end_with?('.dll')) && !f.end_with?('.a')
81
124
  end
82
125
 
data/lib/tre_regex/version.rb CHANGED
@@ -1,5 +1,5 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  module TreRegex
4
- VERSION = '0.1.1'
4
+ VERSION = '0.2.0'
5
5
  end
data/lib/tre_regex.rb CHANGED
@@ -5,6 +5,8 @@ require 'rbconfig'
5
5
  require_relative 'tre_regex/version'
6
6
 
7
7
  module TreRegex
8
+ MAX_NMATCH = 10 # 1 full match + 9 capture groups = 10 slots
9
+
8
10
  class Error < StandardError; end
9
11
 
10
12
  # The FFI Native Bridge
@@ -70,7 +72,7 @@ module TreRegex
70
72
 
71
73
  attach_function :tre_regcomp, %i[pointer string int], :int
72
74
  attach_function :tre_regfree, [:pointer], :void
73
- attach_function :tre_regaexec, [:pointer, :string, :pointer, RegAParams.by_value, :int], :int
75
+ attach_function :tre_reganexec, [:pointer, :pointer, :size_t, :pointer, RegAParams.by_value, :int], :int
74
76
  attach_function :tre_regaparams_default, [:pointer], :void
75
77
  end
76
78
 
@@ -101,42 +103,62 @@ module TreRegex
101
103
  end
102
104
  end
103
105
 
104
- def exec(text, options = {})
105
- params = build_params(options)
106
- pmatch = FFI::MemoryPointer.new(Native::RegMatch)
107
- match_data = prepare_match_data(pmatch)
108
-
109
- res = Native.tre_regaexec(@preg, text, match_data, params, 0)
110
- res.zero? ? parse_result(text, match_data, pmatch) : nil
111
- end
112
-
113
106
  def test?(text, options = {})
114
107
  !exec(text, options).nil?
115
108
  end
116
109
 
110
+ def exec(text, options = {})
111
+ ptr = FFI::MemoryPointer.from_string(text)
112
+ m_info = execute_match(ptr, text.bytesize, options)
113
+
114
+ m_info ? extract_match_payload(text, 0, 0, m_info).first : nil
115
+ end
116
+
117
117
  def match_all(text, options = {})
118
118
  return enum_for(:match_all, text, options) unless block_given?
119
119
 
120
- offset = 0
121
- while offset <= text.length
122
- result = exec(text[offset..] || '', options)
123
- break unless result
120
+ ptr = FFI::MemoryPointer.from_string(text)
121
+ b_off = c_off = 0
122
+
123
+ while b_off <= text.bytesize
124
+ m_info = execute_match(ptr + b_off, text.bytesize - b_off, options)
125
+ break unless m_info
126
+
127
+ payload, adv_b, adv_c = extract_match_payload(text, b_off, c_off, m_info)
128
+ yield payload
129
+
130
+ break if adv_b.zero? && b_off == text.bytesize
124
131
 
125
- result[:index] += offset
126
- result[:end_index] += offset
127
- yield result
132
+ # Zero-width infinite loop protection
133
+ if adv_b.zero?
134
+ adv_b = text.byteslice(b_off..).chr.bytesize
135
+ adv_c = 1
136
+ end
128
137
 
129
- advance = (result[:end_index] - result[:index]).clamp(1, Float::INFINITY)
130
- offset = result[:index] + advance
138
+ b_off += adv_b
139
+ c_off += adv_c
131
140
  end
132
141
  end
133
142
 
134
143
  private
135
144
 
145
+ def execute_match(text_ptr, len, options)
146
+ params = build_params(options)
147
+
148
+ # Allocate a contiguous block of memory for 10 RegMatch structs
149
+ pmatch_array = FFI::MemoryPointer.new(Native::RegMatch, MAX_NMATCH)
150
+ match_data = prepare_match_data(pmatch_array, MAX_NMATCH)
151
+
152
+ res = Native.tre_reganexec(@preg, text_ptr, len, match_data, params, 0)
153
+ return nil unless res.zero?
154
+
155
+ # Return the entire array pointer to be parsed
156
+ [pmatch_array, MAX_NMATCH, match_data]
157
+ end
158
+
136
159
  def build_params(opts)
137
160
  params = Native::RegAParams.new
138
161
  Native.tre_regaparams_default(params.to_ptr)
139
- return params.tap { |p| p[:max_err] = 0 } if opts.empty?
140
162
 
141
163
  apply_limits(params, opts)
142
164
  apply_costs(params, opts)
@@ -163,28 +185,58 @@ module TreRegex
163
185
  params[:cost_subst] = opts[:weight_substitution] if opts.key?(:weight_substitution)
164
186
  end
165
187
 
166
- def prepare_match_data(pmatch)
188
+ def prepare_match_data(pmatch_array, nmatch)
167
189
  Native::RegAMatch.new.tap do |m|
168
- m[:nmatch] = 1
169
- m[:pmatch] = pmatch
190
+ m[:nmatch] = nmatch # Tell TRE we have space for 10 matches
191
+ m[:pmatch] = pmatch_array
170
192
  end
171
193
  end
172
194
 
173
- def parse_result(text, match_data, pmatch)
174
- rm = Native::RegMatch.new(pmatch)
175
- byte_match = text.byteslice(rm[:rm_so]...rm[:rm_eo])
176
- char_start = text.byteslice(0...rm[:rm_so]).length
195
+ def extract_match_payload(text, byte_off, char_off, m_info)
196
+ pmatch_array, nmatch, match_data = m_info
177
197
 
178
- {
179
- match: byte_match,
180
- index: char_start,
181
- end_index: char_start + byte_match.length,
198
+ # Read the full match boundaries from index 0
199
+ full_rm = Native::RegMatch.new(pmatch_array)
200
+ rm_so = full_rm[:rm_so]
201
+ rm_eo = full_rm[:rm_eo]
202
+
203
+ prefix_len = (text.byteslice(byte_off, rm_so) || '').length
204
+ match_str = text.byteslice((byte_off + rm_so)...(byte_off + rm_eo))
205
+
206
+ payload = {
207
+ match: match_str,
208
+ submatches: extract_submatches(text, byte_off, pmatch_array, nmatch),
209
+ index: char_off + prefix_len,
210
+ end_index: char_off + prefix_len + match_str.length,
182
211
  cost: match_data[:cost],
183
- errors: {
184
- insertions: match_data[:num_ins],
185
- deletions: match_data[:num_del],
186
- substitutions: match_data[:num_subst]
187
- }
212
+ errors: parse_errors(match_data)
213
+ }
214
+
215
+ [payload, rm_eo, prefix_len + match_str.length]
216
+ end
217
+
218
+ def extract_submatches(text, byte_off, pmatch_array, nmatch)
219
+ submatches = (1...nmatch).map do |i|
220
+ # Advance the memory pointer by the size of the struct for each index
221
+ rm = Native::RegMatch.new(pmatch_array + (i * Native::RegMatch.size))
222
+ sub_so = rm[:rm_so]
223
+ sub_eo = rm[:rm_eo]
224
+
225
+ # Safely extract the group, inserting nil if it was optional and unmatched
226
+ sub_so == -1 ? nil : text.byteslice((byte_off + sub_so)...(byte_off + sub_eo))
227
+ end
228
+
229
+ # Cleanup: Remove trailing nil values (unused capture groups)
230
+ submatches.pop while submatches.last.nil? && !submatches.empty?
231
+
232
+ submatches
233
+ end
234
+
235
+ def parse_errors(match_data)
236
+ {
237
+ insertions: match_data[:num_ins],
238
+ deletions: match_data[:num_del],
239
+ substitutions: match_data[:num_subst]
188
240
  }
189
241
  end
190
242
  end
data/tre_regex.gemspec ADDED
@@ -0,0 +1,50 @@
1
+ # frozen_string_literal: true
2
+
3
+ require_relative 'lib/tre_regex/version'
4
+
5
+ Gem::Specification.new do |spec|
6
+ spec.name = 'tre_regex'
7
+ spec.version = TreRegex::VERSION
8
+ spec.authors = ['Oleksii Vasyliev']
9
+ spec.email = ['leopard.not.a@gmail.com']
10
+ spec.license = 'MIT'
11
+
12
+ spec.summary = 'A fast Ruby FFI wrapper for the TRE approximate regex matching library.'
13
+ spec.description = [
14
+ 'TreRegex provides a high-performance Ruby interface to the TRE C library using FFI.',
15
+ 'It brings robust approximate (fuzzy) regular expression matching to Ruby, featuring',
16
+ 'multi-byte Unicode string safety, granular error limits, and precompiled cross-platform native binaries'
17
+ ].join(' ')
18
+ spec.homepage = 'https://github.com/le0pard/tre_regex'
19
+ spec.required_ruby_version = '>= 3.3.0'
20
+
21
+ spec.metadata['homepage_uri'] = spec.homepage
22
+ spec.metadata['source_code_uri'] = 'https://github.com/le0pard/tre_regex'
23
+ spec.metadata['changelog_uri'] = 'https://github.com/le0pard/tre_regex/releases'
24
+ spec.metadata['bug_tracker_uri'] = 'https://github.com/le0pard/tre_regex/issues'
25
+ spec.metadata['documentation_uri'] = 'https://github.com/le0pard/tre_regex/blob/main/README.md'
26
+ spec.metadata['rubygems_mfa_required'] = 'true'
27
+
28
+ # Specify which files should be added to the gem when it is released.
29
+ # The `git ls-files -z` loads the files in the RubyGem that have been added into git.
30
+
31
+ spec.files = %w[
32
+ lib/**/*.rb
33
+ ext/tre_regex/extconf.rb
34
+ ext/tre_regex/tre_regex.c
35
+ README.md
36
+ LICENSE
37
+ tre_regex.gemspec
38
+ ].flat_map { |p| Dir[p] }
39
+ spec.extensions = ['ext/tre_regex/extconf.rb']
40
+
41
+ spec.bindir = 'exe'
42
+ spec.executables = spec.files.grep(%r{\Aexe/}) { |f| File.basename(f) }
43
+ spec.require_paths = ['lib']
44
+
45
+ # Uncomment to register a new dependency of your gem
46
+ spec.add_dependency 'ffi', '>= 1.0'
47
+
48
+ # For more information and examples about making a new gem, check out our
49
+ # guide at: https://bundler.io/guides/creating_gem.html
50
+ end
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: tre_regex
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.2.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Oleksii Vasyliev
@@ -34,10 +34,13 @@ extensions:
34
34
  - ext/tre_regex/extconf.rb
35
35
  extra_rdoc_files: []
36
36
  files:
37
+ - LICENSE
38
+ - README.md
37
39
  - ext/tre_regex/extconf.rb
38
40
  - ext/tre_regex/tre_regex.c
39
41
  - lib/tre_regex.rb
40
42
  - lib/tre_regex/version.rb
43
+ - tre_regex.gemspec
41
44
  homepage: https://github.com/le0pard/tre_regex
42
45
  licenses:
43
46
  - MIT