fast_regexp 0.5.0 → 0.6.1
- checksums.yaml +4 -4
- data/Cargo.lock +0 -1
- data/README.md +19 -6
- data/ext/fast_regexp/Cargo.toml +0 -1
- data/ext/fast_regexp/src/lib.rs +38 -108
- data/lib/fast_regexp/version.rb +1 -1
- data/lib/fast_regexp.rb +47 -16
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: b7e4b114057d201c62b2d8ec4405b3d2554c1cebf1399433d275e111b93864dd
|
|
4
|
+
data.tar.gz: da33511a662b556bcd9d085d0d3a52e2a138120b040a79b37015970486630ee5
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 6417817f643dd1f81979e02f75b5c211c2921e87fa448fdf33c2a7da4c44980ebe6e53f9a91feb18e494e2feafbbfb03d004d3c97e51a91e3de182d0b09132a1
|
|
7
|
+
data.tar.gz: 6ad069ffca51038fe5f74a6ad8496a672c25c8044e76cb084dd965a8cd380beaebba99908460431e41d223f96ac9168d0f02d6fca2e05f3d3fcf3c5661e8b1e2
|
data/Cargo.lock
CHANGED
data/README.md
CHANGED
|
@@ -5,12 +5,10 @@
|
|
|
5
5
|
|
|
6
6
|
Fast, drop-in regex for Ruby — backed by [rust/regex](https://docs.rs/regex/latest/regex/) with transparent fallback to the stdlib `::Regexp` engine for features rust/regex doesn't support (lookaround, backreferences, possessive quantifiers, etc.).
|
|
7
7
|
|
|
8
|
-
You get rust/regex's speed
|
|
8
|
+
You get rust/regex's speed on the common path, and a single uniform API (`Fast::Regexp`, `Fast::Regexp::MatchData`) regardless of which engine actually ran underneath.
|
|
9
9
|
|
|
10
10
|
## Installation
|
|
11
11
|
|
|
12
|
-
Install [Rust](https://www.rust-lang.org/) via [rustup](https://rustup.rs/) or in any other way.
|
|
13
|
-
|
|
14
12
|
Add as a dependency:
|
|
15
13
|
|
|
16
14
|
```ruby
|
|
@@ -21,6 +19,14 @@ gem "fast_regexp"
|
|
|
21
19
|
gem install fast_regexp
|
|
22
20
|
```
|
|
23
21
|
|
|
22
|
+
Precompiled native gems are published for **arm64-darwin**, **x86_64-linux**,
|
|
23
|
+
and **aarch64-linux** against Ruby **3.3**, **3.4**, and **4.0** — no Rust
|
|
24
|
+
toolchain required on those platforms.
|
|
25
|
+
|
|
26
|
+
On any other platform/Ruby combo, Bundler/RubyGems falls back to the source
|
|
27
|
+
gem and compiles the extension at install time. That path needs
|
|
28
|
+
[Rust](https://www.rust-lang.org/) (install via [rustup](https://rustup.rs/)).
|
|
29
|
+
|
|
24
30
|
Include in your code:
|
|
25
31
|
|
|
26
32
|
```ruby
|
|
@@ -155,6 +161,14 @@ slow.match?("foobar") # => true
|
|
|
155
161
|
/ `#stdlib` accessors. Replacement templates use rust/regex syntax (`$1`,
|
|
156
162
|
`${name}`, `$$`) on both paths; the stdlib fallback translates them for you.
|
|
157
163
|
|
|
164
|
+
You can force a specific engine via the `backend:` kwarg:
|
|
165
|
+
|
|
166
|
+
```ruby
|
|
167
|
+
Fast::Regexp.new('\w+', backend: :fast) # rust/regex only; raises on unsupported
|
|
168
|
+
Fast::Regexp.new(pat, backend: :stdlib) # skip rust/regex; use ::Regexp directly
|
|
169
|
+
Fast::Regexp.new('\w+', backend: :auto) # default — try rust, fall back on reject
|
|
170
|
+
```
|
|
171
|
+
|
|
158
172
|
> [!NOTE]
|
|
159
173
|
> The fast path is byte-based (rust/regex's `regex::bytes`), so `#=~` returns
|
|
160
174
|
> a *byte* offset. The stdlib fallback path returns the byte offset too, for
|
|
@@ -230,7 +244,7 @@ In-depth docs live under [`docs/`](docs/README.md), organized via the
|
|
|
230
244
|
- **Tutorial:** [Getting started](docs/tutorials/getting-started.md)
|
|
231
245
|
- **How-to:** [Migrate from stdlib `::Regexp`](docs/how-to/migrate-from-stdlib-regexp.md), [Handle unsupported syntax](docs/how-to/handle-unsupported-syntax.md)
|
|
232
246
|
- **Reference:** [`Fast::Regexp`](docs/reference/fast-regexp.md), [`MatchData`](docs/reference/fast-regexp-matchdata.md), [`Set`](docs/reference/fast-regexp-set.md)
|
|
233
|
-
- **Explainers:** [Engine fallback](docs/explainers/engine-fallback.md)
|
|
247
|
+
- **Explainers:** [Engine fallback](docs/explainers/engine-fallback.md)
|
|
234
248
|
|
|
235
249
|
## Development
|
|
236
250
|
|
|
@@ -252,8 +266,7 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/jetpks
|
|
|
252
266
|
huge thanks for the original bindings and the clean magnus integration that
|
|
253
267
|
made this work easy to extend. This fork rebrands the gem, reshapes the
|
|
254
268
|
public API (`Fast::Regexp`, real `MatchData`, `sub`/`gsub`, `===`/`=~`,
|
|
255
|
-
`Regexp`-constructor coercion),
|
|
256
|
-
thread/fiber-friendly matching, and adds transparent fallback to stdlib
|
|
269
|
+
`Regexp`-constructor coercion), and adds transparent fallback to stdlib
|
|
257
270
|
`::Regexp` for patterns rust/regex can't compile.
|
|
258
271
|
|
|
259
272
|
## License
|
data/ext/fast_regexp/Cargo.toml
CHANGED
data/ext/fast_regexp/src/lib.rs
CHANGED
|
@@ -6,67 +6,9 @@ use magnus::{
|
|
|
6
6
|
};
|
|
7
7
|
use regex::bytes::{NoExpand, Regex, RegexBuilder, RegexSet, RegexSetBuilder};
|
|
8
8
|
use std::collections::HashMap;
|
|
9
|
-
use std::ffi::c_void;
|
|
10
|
-
use std::ptr;
|
|
11
9
|
use std::sync::Arc;
|
|
12
10
|
|
|
13
|
-
/// Skip GVL release for haystacks smaller than this — the release/reacquire
|
|
14
|
-
/// overhead (~µs) dwarfs the cost of a regex match on a tiny string.
|
|
15
|
-
const GVL_RELEASE_THRESHOLD: usize = 1024;
|
|
16
|
-
|
|
17
|
-
/// Run `func` with the GVL released so other Ruby threads (and the fiber
|
|
18
|
-
/// scheduler on its host thread) can make progress during a long regex match.
|
|
19
|
-
///
|
|
20
|
-
/// The callback runs on the same OS thread — `Send` is not required, and the
|
|
21
|
-
/// borrow checker enforces that any references stay valid for the call.
|
|
22
|
-
fn without_gvl<F, R>(func: F) -> R
|
|
23
|
-
where
|
|
24
|
-
F: FnOnce() -> R,
|
|
25
|
-
{
|
|
26
|
-
struct Pack<F, R> {
|
|
27
|
-
func: Option<F>,
|
|
28
|
-
result: Option<R>,
|
|
29
|
-
}
|
|
30
|
-
|
|
31
|
-
unsafe extern "C" fn trampoline<F, R>(data: *mut c_void) -> *mut c_void
|
|
32
|
-
where
|
|
33
|
-
F: FnOnce() -> R,
|
|
34
|
-
{
|
|
35
|
-
let pack = &mut *(data as *mut Pack<F, R>);
|
|
36
|
-
let func = pack.func.take().expect("trampoline called twice");
|
|
37
|
-
pack.result = Some(func());
|
|
38
|
-
ptr::null_mut()
|
|
39
|
-
}
|
|
40
|
-
|
|
41
|
-
let mut pack: Pack<F, R> = Pack {
|
|
42
|
-
func: Some(func),
|
|
43
|
-
result: None,
|
|
44
|
-
};
|
|
45
|
-
unsafe {
|
|
46
|
-
rb_sys::rb_thread_call_without_gvl(
|
|
47
|
-
Some(trampoline::<F, R>),
|
|
48
|
-
&mut pack as *mut _ as *mut c_void,
|
|
49
|
-
None,
|
|
50
|
-
ptr::null_mut(),
|
|
51
|
-
);
|
|
52
|
-
}
|
|
53
|
-
pack.result.take().expect("callback did not run")
|
|
54
|
-
}
|
|
55
|
-
|
|
56
|
-
fn run_regex<F, R>(haystack_len: usize, func: F) -> R
|
|
57
|
-
where
|
|
58
|
-
F: FnOnce() -> R,
|
|
59
|
-
{
|
|
60
|
-
if haystack_len >= GVL_RELEASE_THRESHOLD {
|
|
61
|
-
without_gvl(func)
|
|
62
|
-
} else {
|
|
63
|
-
func()
|
|
64
|
-
}
|
|
65
|
-
}
|
|
66
|
-
|
|
67
11
|
fn haystack_bytes(haystack: &RString) -> Vec<u8> {
|
|
68
|
-
// Copy out of the Ruby heap so the bytes are safe to read after the GVL is
|
|
69
|
-
// released (Ruby 4's compacting GC can otherwise move the string).
|
|
70
12
|
unsafe { haystack.as_slice() }.to_vec()
|
|
71
13
|
}
|
|
72
14
|
|
|
@@ -157,12 +99,10 @@ impl FastRegexp {
|
|
|
157
99
|
let regex = &self.0.regex;
|
|
158
100
|
let bytes = haystack_bytes(&haystack);
|
|
159
101
|
|
|
160
|
-
let offsets =
|
|
161
|
-
regex.
|
|
162
|
-
(
|
|
163
|
-
|
|
164
|
-
.collect::<Vec<_>>()
|
|
165
|
-
})
|
|
102
|
+
let offsets = regex.captures(&bytes).map(|caps| {
|
|
103
|
+
(0..regex.captures_len())
|
|
104
|
+
.map(|i| caps.get(i).map(|m| (m.start(), m.end())))
|
|
105
|
+
.collect::<Vec<_>>()
|
|
166
106
|
})?;
|
|
167
107
|
|
|
168
108
|
Some(self.build_match_data(Arc::new(bytes), offsets))
|
|
@@ -173,29 +113,25 @@ impl FastRegexp {
|
|
|
173
113
|
let bytes = haystack_bytes(&haystack);
|
|
174
114
|
|
|
175
115
|
if regex.captures_len() == 1 {
|
|
176
|
-
let ranges: Vec<(usize, usize)> =
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
.collect()
|
|
181
|
-
});
|
|
116
|
+
let ranges: Vec<(usize, usize)> = regex
|
|
117
|
+
.find_iter(&bytes)
|
|
118
|
+
.map(|m| (m.start(), m.end()))
|
|
119
|
+
.collect();
|
|
182
120
|
let result = ruby.ary_new_capa(ranges.len());
|
|
183
121
|
for (s, e) in ranges {
|
|
184
122
|
result.push(utf8_string(ruby, &bytes[s..e]))?;
|
|
185
123
|
}
|
|
186
124
|
Ok(result)
|
|
187
125
|
} else {
|
|
188
|
-
let groups: Vec<Vec<CaptureOffset>> =
|
|
189
|
-
|
|
190
|
-
|
|
191
|
-
.
|
|
192
|
-
|
|
193
|
-
|
|
194
|
-
|
|
195
|
-
|
|
196
|
-
|
|
197
|
-
.collect()
|
|
198
|
-
});
|
|
126
|
+
let groups: Vec<Vec<CaptureOffset>> = regex
|
|
127
|
+
.captures_iter(&bytes)
|
|
128
|
+
.map(|caps| {
|
|
129
|
+
caps.iter()
|
|
130
|
+
.skip(1)
|
|
131
|
+
.map(|c| c.map(|m| (m.start(), m.end())))
|
|
132
|
+
.collect()
|
|
133
|
+
})
|
|
134
|
+
.collect();
|
|
199
135
|
let result = ruby.ary_new_capa(groups.len());
|
|
200
136
|
for group_ranges in groups {
|
|
201
137
|
let group = ruby.ary_new_capa(group_ranges.len());
|
|
@@ -216,16 +152,14 @@ impl FastRegexp {
|
|
|
216
152
|
let bytes = haystack_bytes(&haystack);
|
|
217
153
|
let n_groups = regex.captures_len();
|
|
218
154
|
|
|
219
|
-
let all: Vec<Vec<CaptureOffset>> =
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
223
|
-
(
|
|
224
|
-
|
|
225
|
-
|
|
226
|
-
|
|
227
|
-
.collect()
|
|
228
|
-
});
|
|
155
|
+
let all: Vec<Vec<CaptureOffset>> = regex
|
|
156
|
+
.captures_iter(&bytes)
|
|
157
|
+
.map(|caps| {
|
|
158
|
+
(0..n_groups)
|
|
159
|
+
.map(|i| caps.get(i).map(|m| (m.start(), m.end())))
|
|
160
|
+
.collect()
|
|
161
|
+
})
|
|
162
|
+
.collect();
|
|
229
163
|
|
|
230
164
|
let shared = Arc::new(bytes);
|
|
231
165
|
let result = ruby.ary_new_capa(all.len());
|
|
@@ -238,7 +172,7 @@ impl FastRegexp {
|
|
|
238
172
|
pub fn is_match(&self, haystack: RString) -> bool {
|
|
239
173
|
let regex = &self.0.regex;
|
|
240
174
|
let bytes = haystack_bytes(&haystack);
|
|
241
|
-
|
|
175
|
+
regex.is_match(&bytes)
|
|
242
176
|
}
|
|
243
177
|
|
|
244
178
|
pub fn sub_str(
|
|
@@ -251,13 +185,11 @@ impl FastRegexp {
|
|
|
251
185
|
let regex = &rb_self.0.regex;
|
|
252
186
|
let bytes = haystack_bytes(&haystack);
|
|
253
187
|
let repl = haystack_bytes(&replacement);
|
|
254
|
-
let out
|
|
255
|
-
|
|
256
|
-
|
|
257
|
-
|
|
258
|
-
|
|
259
|
-
}
|
|
260
|
-
});
|
|
188
|
+
let out: Vec<u8> = if literal {
|
|
189
|
+
regex.replace(&bytes, NoExpand(&repl)).into_owned()
|
|
190
|
+
} else {
|
|
191
|
+
regex.replace(&bytes, &repl[..]).into_owned()
|
|
192
|
+
};
|
|
261
193
|
utf8_string(ruby, &out)
|
|
262
194
|
}
|
|
263
195
|
|
|
@@ -271,13 +203,11 @@ impl FastRegexp {
|
|
|
271
203
|
let regex = &rb_self.0.regex;
|
|
272
204
|
let bytes = haystack_bytes(&haystack);
|
|
273
205
|
let repl = haystack_bytes(&replacement);
|
|
274
|
-
let out
|
|
275
|
-
|
|
276
|
-
|
|
277
|
-
|
|
278
|
-
|
|
279
|
-
}
|
|
280
|
-
});
|
|
206
|
+
let out: Vec<u8> = if literal {
|
|
207
|
+
regex.replace_all(&bytes, NoExpand(&repl)).into_owned()
|
|
208
|
+
} else {
|
|
209
|
+
regex.replace_all(&bytes, &repl[..]).into_owned()
|
|
210
|
+
};
|
|
281
211
|
utf8_string(ruby, &out)
|
|
282
212
|
}
|
|
283
213
|
|
|
@@ -482,13 +412,13 @@ impl FastRegexpSet {
|
|
|
482
412
|
pub fn matches(&self, haystack: RString) -> Vec<usize> {
|
|
483
413
|
let set = &self.0;
|
|
484
414
|
let bytes = haystack_bytes(&haystack);
|
|
485
|
-
|
|
415
|
+
set.matches(&bytes).iter().collect()
|
|
486
416
|
}
|
|
487
417
|
|
|
488
418
|
pub fn is_match(&self, haystack: RString) -> bool {
|
|
489
419
|
let set = &self.0;
|
|
490
420
|
let bytes = haystack_bytes(&haystack);
|
|
491
|
-
|
|
421
|
+
set.is_match(&bytes)
|
|
492
422
|
}
|
|
493
423
|
|
|
494
424
|
pub fn patterns(&self) -> Vec<String> {
|
data/lib/fast_regexp/version.rb
CHANGED
data/lib/fast_regexp.rb
CHANGED
|
@@ -6,16 +6,27 @@ module Fast
|
|
|
6
6
|
# Façade over rust/regex with a transparent fallback to stdlib `::Regexp`.
|
|
7
7
|
#
|
|
8
8
|
# `Fast::Regexp.new(pattern)` first tries to compile with rust/regex (fast,
|
|
9
|
-
#
|
|
10
|
-
#
|
|
11
|
-
#
|
|
9
|
+
# byte-based). If the pattern uses features rust/regex does not support
|
|
10
|
+
# (lookaround, backreferences, possessive quantifiers, etc.) we fall back
|
|
11
|
+
# to `::Regexp` so consumers don't have to juggle two libraries.
|
|
12
12
|
class Regexp
|
|
13
|
+
NATIVE_EXTENSIONS = %w[.bundle .so .rb].freeze
|
|
14
|
+
|
|
15
|
+
# Precompiled native gems ship per-ABI subdirs (`fast_regexp/4.0/...`),
|
|
16
|
+
# the source-gem `rake compile` build lands flat (`fast_regexp/...`).
|
|
17
|
+
# Pick whichever exists for the current Ruby ABI, with the per-ABI path
|
|
18
|
+
# winning when both are present.
|
|
19
|
+
def self.locate_native(base, ruby_version: RUBY_VERSION)
|
|
20
|
+
abi = ruby_version[/\d+\.\d+/]
|
|
21
|
+
candidates = [File.join(base, abi, "fast_regexp"), File.join(base, "fast_regexp")]
|
|
22
|
+
candidates.find { |stem| NATIVE_EXTENSIONS.any? { |ext| File.exist?(stem + ext) } }
|
|
23
|
+
end
|
|
13
24
|
end
|
|
14
25
|
end
|
|
15
26
|
|
|
16
|
-
|
|
17
|
-
|
|
18
|
-
|
|
27
|
+
native = Fast::Regexp.locate_native(File.expand_path("fast_regexp", __dir__))
|
|
28
|
+
raise LoadError, "could not locate fast_regexp native extension" unless native
|
|
29
|
+
require native
|
|
19
30
|
|
|
20
31
|
module Fast
|
|
21
32
|
class Regexp
|
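The lookup order that `locate_native` describes in the hunk above (per-ABI subdirectory first, flat layout second) can be tried in isolation. The following is a minimal sketch, not the gem's code: `locate_native_sketch` and `demo_locate_native` are hypothetical names, the method body is copied from the diff into a standalone function, and the directory layout is simulated in a temporary directory using empty `.so` files.

```ruby
require "tmpdir"
require "fileutils"

# Standalone copy of the lookup shown in the diff (hypothetical name).
# Returns the extension-less path stem to `require`, or nil if nothing is found.
def locate_native_sketch(base, ruby_version: RUBY_VERSION)
  extensions = %w[.bundle .so .rb]
  abi = ruby_version[/\d+\.\d+/] # "3.4.1" -> "3.4"
  candidates = [File.join(base, abi, "fast_regexp"), File.join(base, "fast_regexp")]
  candidates.find { |stem| extensions.any? { |ext| File.exist?(stem + ext) } }
end

# Builds a fake install tree and reports which candidate wins in three cases.
def demo_locate_native
  Dir.mktmpdir do |base|
    # Case 1: only the flat layout exists (source-gem build).
    FileUtils.touch(File.join(base, "fast_regexp.so"))
    flat = locate_native_sketch(base, ruby_version: "3.4.1")

    # Case 2: a per-ABI build is added; it takes precedence over the flat one.
    FileUtils.mkdir_p(File.join(base, "3.4"))
    FileUtils.touch(File.join(base, "3.4", "fast_regexp.so"))
    per_abi = locate_native_sketch(base, ruby_version: "3.4.1")

    # Case 3: a Ruby with no matching ABI directory falls back to the flat file.
    other = locate_native_sketch(base, ruby_version: "3.3.0")

    [flat, per_abi, other].map { |path| path.sub(base + "/", "") }
  end
end
```

Calling `demo_locate_native` returns the three winning stems relative to the temporary base directory. An empty directory makes `locate_native_sketch` return `nil`, which is the case the new `raise LoadError` guard in the diff covers.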
|
@@ -34,6 +45,15 @@ module Fast
|
|
|
34
45
|
allocate.tap { |re| re.send(:initialize, translated, original: pattern, **opts) }
|
|
35
46
|
end
|
|
36
47
|
|
|
48
|
+
# Bulk-compile a symbol-keyed hash of patterns. Handy for defining a set
|
|
49
|
+
# of regex constants in one shot:
|
|
50
|
+
#
|
|
51
|
+
# RE = Fast::Regexp.create_many(word: '\w+', num: '\d+').freeze
|
|
52
|
+
# RE[:word].match("hello")
|
|
53
|
+
def create_many(**patterns)
|
|
54
|
+
patterns.transform_values { |pat| new(pat) }
|
|
55
|
+
end
|
|
56
|
+
|
|
37
57
|
private
|
|
38
58
|
|
|
39
59
|
def translate_regexp(regexp)
|
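The `create_many` helper added in this hunk is a thin `transform_values` wrapper. A minimal sketch of the same shape follows, assuming stdlib `::Regexp` as a stand-in for `Fast::Regexp.new` so that it runs without the gem installed; `create_many_sketch` and `demo_patterns` are hypothetical names.

```ruby
# Stand-in for the diff's `create_many`: every keyword becomes a hash key,
# and every value is compiled. Here ::Regexp.new replaces Fast::Regexp.new.
def create_many_sketch(**patterns)
  patterns.transform_values { |pat| ::Regexp.new(pat) }
end

# Mirrors the usage in the diff's doc comment; freezing protects the hash
# itself from later modification.
def demo_patterns
  create_many_sketch(word: '\w+', num: '\d+').freeze
end
```

With this sketch, `demo_patterns[:word].match("hello world")[0]` yields `"hello"`. One consequence of the signature as written in the diff: since every keyword is treated as a pattern name and `new(pat)` is called with no options, a `backend:` keyword passed to `create_many` would be compiled as a pattern rather than select an engine.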
|
@@ -46,11 +66,17 @@ module Fast
|
|
|
46
66
|
|
|
47
67
|
attr_reader :pattern, :backend
|
|
48
68
|
|
|
69
|
+
BACKENDS = %i[auto fast stdlib].freeze
|
|
70
|
+
|
|
49
71
|
# Internal — use `Fast::Regexp.new`. `original` is the unmodified input
|
|
50
72
|
# (String or ::Regexp) so we can build an accurate stdlib fallback.
|
|
51
|
-
|
|
73
|
+
# `backend:` forces a specific engine: `:auto` (default) tries rust/regex
|
|
74
|
+
# and falls back to stdlib, `:fast` raises if rust/regex rejects the
|
|
75
|
+
# pattern, `:stdlib` skips rust/regex entirely.
|
|
76
|
+
def initialize(pattern, original: pattern, backend: :auto, **opts)
|
|
77
|
+
raise ArgumentError, "backend must be one of #{BACKENDS.inspect}" unless BACKENDS.include?(backend)
|
|
52
78
|
@pattern = pattern
|
|
53
|
-
@backend = compile_backend(pattern, original, opts)
|
|
79
|
+
@backend = compile_backend(pattern, original, backend, opts)
|
|
54
80
|
end
|
|
55
81
|
|
|
56
82
|
def fast? = @backend.is_a?(Native)
|
|
@@ -178,17 +204,22 @@ module Fast
|
|
|
178
204
|
count
|
|
179
205
|
end
|
|
180
206
|
|
|
181
|
-
def compile_backend(pattern, original, opts)
|
|
207
|
+
def compile_backend(pattern, original, backend, opts)
|
|
208
|
+
return compile_stdlib(pattern, original) if backend == :stdlib
|
|
182
209
|
Native._native_new(pattern, **opts)
|
|
183
210
|
rescue ArgumentError => e
|
|
211
|
+
raise if backend == :fast
|
|
212
|
+
compile_stdlib(pattern, original, fallback_from: e)
|
|
213
|
+
end
|
|
214
|
+
|
|
215
|
+
def compile_stdlib(pattern, original, fallback_from: nil)
|
|
184
216
|
return original if original.is_a?(::Regexp)
|
|
185
|
-
|
|
186
|
-
|
|
187
|
-
|
|
188
|
-
|
|
189
|
-
|
|
190
|
-
|
|
191
|
-
end
|
|
217
|
+
::Regexp.new(pattern)
|
|
218
|
+
rescue ::RegexpError => e
|
|
219
|
+
# Pattern is malformed in both engines (auto path) — surface the
|
|
220
|
+
# original rust/regex error since it's typically more detailed.
|
|
221
|
+
# Otherwise propagate the stdlib error as-is.
|
|
222
|
+
raise(fallback_from ? ArgumentError.new(fallback_from.message) : e)
|
|
192
223
|
end
|
|
193
224
|
|
|
194
225
|
# Maps positional capture indices to names for the stdlib backend so we
|
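The control flow of the new `backend:` option (`initialize`, `compile_backend`, `compile_stdlib`) can be sketched without the native extension. This is an illustration under stated assumptions, not the gem's implementation: `fake_native_compile` is a hypothetical stand-in for `Native._native_new` that rejects lookaround and numeric backreferences with a crude check, all `*_sketch` names are invented here, and the `original` parameter (with its `::Regexp` pass-through) is left out for brevity.

```ruby
# Hypothetical stand-in for Native._native_new. It raises ArgumentError for
# lookaround / numeric backreferences and for patterns stdlib also rejects.
def fake_native_compile(pattern)
  if pattern.match?(/\(\?<?[=!]|\\[1-9]/)
    raise ArgumentError, "fake native engine: unsupported syntax in #{pattern.inspect}"
  end
  begin
    ::Regexp.new(pattern) # used only as a syntax check in this sketch
  rescue ::RegexpError => e
    raise ArgumentError, "fake native engine: #{e.message}"
  end
  [:native, pattern]
end

# Same shape as the diff's compile_stdlib: if the fallback also fails,
# re-raise the native engine's message; otherwise propagate the stdlib error.
def compile_stdlib_sketch(pattern, fallback_from: nil)
  ::Regexp.new(pattern)
rescue ::RegexpError => e
  raise(fallback_from ? ArgumentError.new(fallback_from.message) : e)
end

# Same shape as the diff's compile_backend.
def compile_backend_sketch(pattern, backend)
  return compile_stdlib_sketch(pattern) if backend == :stdlib

  fake_native_compile(pattern)
rescue ArgumentError => e
  raise if backend == :fast # :fast never falls back
  compile_stdlib_sketch(pattern, fallback_from: e)
end

# Validation lives outside compile_backend_sketch, as in the diff, so the
# method-level rescue above cannot swallow the "bad backend" error.
def new_sketch(pattern, backend: :auto)
  backends = %i[auto fast stdlib]
  raise ArgumentError, "backend must be one of #{backends.inspect}" unless backends.include?(backend)

  compile_backend_sketch(pattern, backend)
end
```

In this sketch, `new_sketch('(?<=a)b')` falls back and returns a `::Regexp`, while `new_sketch('(?<=a)b', backend: :fast)` raises `ArgumentError`. A pattern that is malformed for both engines, such as `'('`, raises `ArgumentError` carrying the native-side message under `:auto`, and a plain `RegexpError` under `:stdlib`.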