fast_regexp 0.4.0 → 0.6.0
- checksums.yaml +4 -4
- data/Cargo.lock +0 -1
- data/README.md +61 -10
- data/ext/fast_regexp/Cargo.toml +0 -1
- data/ext/fast_regexp/src/lib.rs +49 -114
- data/lib/fast_regexp/version.rb +1 -1
- data/lib/fast_regexp.rb +214 -41
- metadata +1 -1
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 3512de3c64b392ae48a21422dc3e61423af58ddd4c8141e952a23c80799c2261
|
|
4
|
+
data.tar.gz: cd09cfa3f3299812e19bc043da29f814c340342af4434981b4a47a9cbb7eab53
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: e83cb56946505a54ce6511b0755ecfea0838268f79b51daf26f081490130f4db8049e561d37d136e1666536b10b24b555e7e005889e0d19fba6adca01b3bcc2e
|
|
7
|
+
data.tar.gz: 5f10f02b3fa5accab0732be061bca6c1ea0d5739c72dfa52485cd7306685792d129ba14e0e881a3aa27563347428dc367a725b829a9d493c0cdb9d92dc0f21cd
|
data/Cargo.lock
CHANGED
data/README.md
CHANGED
|
@@ -3,7 +3,9 @@
|
|
|
3
3
|
[](https://badge.fury.io/rb/fast_regexp)
|
|
4
4
|
[](https://github.com/jetpks/fast_regexp/actions)
|
|
5
5
|
|
|
6
|
-
|
|
6
|
+
Fast, drop-in regex for Ruby — backed by [rust/regex](https://docs.rs/regex/latest/regex/) with transparent fallback to the stdlib `::Regexp` engine for features rust/regex doesn't support (lookaround, backreferences, possessive quantifiers, etc.).
|
|
7
|
+
|
|
8
|
+
You get rust/regex's speed on the common path, and a single uniform API (`Fast::Regexp`, `Fast::Regexp::MatchData`) regardless of which engine actually ran underneath.
|
|
7
9
|
|
|
8
10
|
## Installation
|
|
9
11
|
|
|
@@ -128,11 +130,50 @@ Fast::Regexp.new('(?P<n>\w+)').names # => ["n"]
|
|
|
128
130
|
Fast::Regexp.new('(a)(b)').captures_count # => 2
|
|
129
131
|
```
|
|
130
132
|
|
|
133
|
+
### Engine fallback
|
|
134
|
+
|
|
135
|
+
rust/regex doesn't support lookaround, backreferences, or possessive
|
|
136
|
+
quantifiers. Rather than make you manage two regex libraries, `Fast::Regexp`
|
|
137
|
+
silently falls back to stdlib `::Regexp` when it sees something rust/regex
|
|
138
|
+
can't compile. The public API (`#match`, `#sub`, `#gsub`, `#===`, `#=~`,
|
|
139
|
+
`MatchData`) is identical on both paths, so callers don't have to care which
|
|
140
|
+
engine ran — but you can inspect or reach the underlying object when you
|
|
141
|
+
need to:
|
|
142
|
+
|
|
143
|
+
```ruby
|
|
144
|
+
fast = Fast::Regexp.new('\w+')
|
|
145
|
+
fast.fast? # => true
|
|
146
|
+
fast.native # => #<Fast::Regexp::Native ...> (rust-backed)
|
|
147
|
+
|
|
148
|
+
slow = Fast::Regexp.new('foo(?=bar)') # lookahead — rust/regex rejects
|
|
149
|
+
slow.stdlib? # => true
|
|
150
|
+
slow.stdlib # => /foo(?=bar)/ (the real ::Regexp)
|
|
151
|
+
slow.match?("foobar") # => true
|
|
152
|
+
```
|
|
153
|
+
|
|
154
|
+
`Fast::Regexp::MatchData` exposes the same `#native?` / `#stdlib?` / `#native`
|
|
155
|
+
/ `#stdlib` accessors. Replacement templates use rust/regex syntax (`$1`,
|
|
156
|
+
`${name}`, `$$`) on both paths; the stdlib fallback translates them for you.
|
|
157
|
+
|
|
158
|
+
You can force a specific engine via the `backend:` kwarg:
|
|
159
|
+
|
|
160
|
+
```ruby
|
|
161
|
+
Fast::Regexp.new('\w+', backend: :fast) # rust/regex only; raises on unsupported
|
|
162
|
+
Fast::Regexp.new(pat, backend: :stdlib) # skip rust/regex; use ::Regexp directly
|
|
163
|
+
Fast::Regexp.new('\w+', backend: :auto) # default — try rust, fall back on reject
|
|
164
|
+
```
|
|
165
|
+
|
|
166
|
+
> [!NOTE]
|
|
167
|
+
> The fast path is byte-based (rust/regex's `regex::bytes`), so `#=~` returns
|
|
168
|
+
> a *byte* offset. The stdlib fallback path returns the byte offset too, for
|
|
169
|
+
> API consistency.
|
|
170
|
+
|
|
131
171
|
> [!WARNING]
|
|
132
|
-
> `rust/regex`
|
|
133
|
-
> [`Regexp`](https://docs.ruby-lang.org/en/3.4/Regexp.html)
|
|
134
|
-
> [
|
|
135
|
-
>
|
|
172
|
+
> `rust/regex` syntax differs from Ruby's built-in
|
|
173
|
+
> [`Regexp`](https://docs.ruby-lang.org/en/3.4/Regexp.html) — see the
|
|
174
|
+
> [rust/regex syntax page](https://docs.rs/regex/latest/regex/index.html#syntax).
|
|
175
|
+
> When fallback kicks in, your pattern is interpreted by stdlib `::Regexp`
|
|
176
|
+
> instead, so Ruby's syntax applies for that compile.
|
|
136
177
|
|
|
137
178
|
### Searching simultaneously
|
|
138
179
|
|
|
@@ -176,11 +217,11 @@ It also supports parsing of strings with invalid UTF-8 characters by default. It
|
|
|
176
217
|
In case Unicode awareness of matchers should be disabled, both `Fast::Regexp` and `Fast::Regexp::Set` support the `unicode: false` option:
|
|
177
218
|
|
|
178
219
|
```ruby
|
|
179
|
-
Fast::Regexp.new('\w+').match('ю٤夏')
|
|
180
|
-
# =>
|
|
220
|
+
Fast::Regexp.new('\w+').match('ю٤夏')[0]
|
|
221
|
+
# => "ю٤夏"
|
|
181
222
|
|
|
182
223
|
Fast::Regexp.new('\w+', unicode: false).match('ю٤夏')
|
|
183
|
-
# =>
|
|
224
|
+
# => nil
|
|
184
225
|
|
|
185
226
|
Fast::Regexp::Set.new(['\w', '\d', '\s']).match("ю٤\u2000")
|
|
186
227
|
# => [0, 1, 2]
|
|
@@ -189,6 +230,16 @@ Fast::Regexp::Set.new(['\w', '\d', '\s'], unicode: false).match("ю٤\u2000")
|
|
|
189
230
|
# => []
|
|
190
231
|
```
|
|
191
232
|
|
|
233
|
+
## Documentation
|
|
234
|
+
|
|
235
|
+
In-depth docs live under [`docs/`](docs/README.md), organized via the
|
|
236
|
+
[Diátaxis](https://diataxis.fr/) framework:
|
|
237
|
+
|
|
238
|
+
- **Tutorial:** [Getting started](docs/tutorials/getting-started.md)
|
|
239
|
+
- **How-to:** [Migrate from stdlib `::Regexp`](docs/how-to/migrate-from-stdlib-regexp.md), [Handle unsupported syntax](docs/how-to/handle-unsupported-syntax.md)
|
|
240
|
+
- **Reference:** [`Fast::Regexp`](docs/reference/fast-regexp.md), [`MatchData`](docs/reference/fast-regexp-matchdata.md), [`Set`](docs/reference/fast-regexp-set.md)
|
|
241
|
+
- **Explainers:** [Engine fallback](docs/explainers/engine-fallback.md)
|
|
242
|
+
|
|
192
243
|
## Development
|
|
193
244
|
|
|
194
245
|
```sh
|
|
@@ -209,8 +260,8 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/jetpks
|
|
|
209
260
|
huge thanks for the original bindings and the clean magnus integration that
|
|
210
261
|
made this work easy to extend. This fork rebrands the gem, reshapes the
|
|
211
262
|
public API (`Fast::Regexp`, real `MatchData`, `sub`/`gsub`, `===`/`=~`,
|
|
212
|
-
`Regexp`-constructor coercion), and
|
|
213
|
-
for
|
|
263
|
+
`Regexp`-constructor coercion), and adds transparent fallback to stdlib
|
|
264
|
+
`::Regexp` for patterns rust/regex can't compile.
|
|
214
265
|
|
|
215
266
|
## License
|
|
216
267
|
|
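The README note above says `#=~` returns a byte offset on both paths. The difference from a character offset can be seen with stdlib Ruby alone, without fast_regexp installed. This is a minimal sketch: `byte_offset_of_match` is a hypothetical helper name, not part of the gem, and it only imitates the behavior the note describes.

```ruby
# Illustration only: uses stdlib ::Regexp, not fast_regexp itself.
# `byte_offset_of_match` is a hypothetical helper, not part of the gem.
def byte_offset_of_match(regexp, haystack)
  m = regexp.match(haystack)
  # pre_match is everything before the match, so its bytesize is the
  # byte offset at which the match begins.
  m && m.pre_match.bytesize
end

haystack = "ю٤夏 abc"

# stdlib String/Regexp =~ counts characters: 4 characters precede "abc".
puts "char offset: #{/abc/ =~ haystack}"                         # char offset: 4

# The same position counted in bytes: 2 + 2 + 3 + 1 bytes precede "abc".
puts "byte offset: #{byte_offset_of_match(/abc/, haystack)}"     # byte offset: 8
```

For ASCII-only haystacks the two offsets coincide, so the distinction only matters when multibyte characters appear before the match.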
data/ext/fast_regexp/Cargo.toml
CHANGED
data/ext/fast_regexp/src/lib.rs
CHANGED
|
@@ -2,71 +2,13 @@ use magnus::{
|
|
|
2
2
|
function, method,
|
|
3
3
|
scan_args::{get_kwargs, scan_args},
|
|
4
4
|
value::ReprValue,
|
|
5
|
-
Error, Module, Object, RArray, RHash, RString, Ruby, Symbol, TryConvert, Value,
|
|
5
|
+
Error, Module, Object, RArray, RClass, RHash, RString, Ruby, Symbol, TryConvert, Value,
|
|
6
6
|
};
|
|
7
7
|
use regex::bytes::{NoExpand, Regex, RegexBuilder, RegexSet, RegexSetBuilder};
|
|
8
8
|
use std::collections::HashMap;
|
|
9
|
-
use std::ffi::c_void;
|
|
10
|
-
use std::ptr;
|
|
11
9
|
use std::sync::Arc;
|
|
12
10
|
|
|
13
|
-
/// Skip GVL release for haystacks smaller than this — the release/reacquire
|
|
14
|
-
/// overhead (~µs) dwarfs the cost of a regex match on a tiny string.
|
|
15
|
-
const GVL_RELEASE_THRESHOLD: usize = 1024;
|
|
16
|
-
|
|
17
|
-
/// Run `func` with the GVL released so other Ruby threads (and the fiber
|
|
18
|
-
/// scheduler on its host thread) can make progress during a long regex match.
|
|
19
|
-
///
|
|
20
|
-
/// The callback runs on the same OS thread — `Send` is not required, and the
|
|
21
|
-
/// borrow checker enforces that any references stay valid for the call.
|
|
22
|
-
fn without_gvl<F, R>(func: F) -> R
|
|
23
|
-
where
|
|
24
|
-
F: FnOnce() -> R,
|
|
25
|
-
{
|
|
26
|
-
struct Pack<F, R> {
|
|
27
|
-
func: Option<F>,
|
|
28
|
-
result: Option<R>,
|
|
29
|
-
}
|
|
30
|
-
|
|
31
|
-
unsafe extern "C" fn trampoline<F, R>(data: *mut c_void) -> *mut c_void
|
|
32
|
-
where
|
|
33
|
-
F: FnOnce() -> R,
|
|
34
|
-
{
|
|
35
|
-
let pack = &mut *(data as *mut Pack<F, R>);
|
|
36
|
-
let func = pack.func.take().expect("trampoline called twice");
|
|
37
|
-
pack.result = Some(func());
|
|
38
|
-
ptr::null_mut()
|
|
39
|
-
}
|
|
40
|
-
|
|
41
|
-
let mut pack: Pack<F, R> = Pack {
|
|
42
|
-
func: Some(func),
|
|
43
|
-
result: None,
|
|
44
|
-
};
|
|
45
|
-
unsafe {
|
|
46
|
-
rb_sys::rb_thread_call_without_gvl(
|
|
47
|
-
Some(trampoline::<F, R>),
|
|
48
|
-
&mut pack as *mut _ as *mut c_void,
|
|
49
|
-
None,
|
|
50
|
-
ptr::null_mut(),
|
|
51
|
-
);
|
|
52
|
-
}
|
|
53
|
-
pack.result.take().expect("callback did not run")
|
|
54
|
-
}
|
|
55
|
-
|
|
56
|
-
fn run_regex<F, R>(haystack_len: usize, func: F) -> R
|
|
57
|
-
where
|
|
58
|
-
F: FnOnce() -> R,
|
|
59
|
-
{
|
|
60
|
-
if haystack_len >= GVL_RELEASE_THRESHOLD {
|
|
61
|
-
without_gvl(func)
|
|
62
|
-
} else {
|
|
63
|
-
func()
|
|
64
|
-
}
|
|
65
|
-
}
|
|
66
|
-
|
|
67
11
|
fn haystack_bytes(haystack: &RString) -> Vec<u8> {
|
|
68
|
-
// Copy out of the Ruby heap so the bytes are safe to read after the GVL is
|
|
69
|
-
// released (Ruby 4's compacting GC can otherwise move the string).
|
|
70
12
|
unsafe { haystack.as_slice() }.to_vec()
|
|
71
13
|
}
|
|
72
14
|
|
|
@@ -126,7 +68,7 @@ impl RegexInner {
|
|
|
126
68
|
}
|
|
127
69
|
}
|
|
128
70
|
|
|
129
|
-
#[magnus::wrap(class = "Fast::Regexp", free_immediately, size)]
|
|
71
|
+
#[magnus::wrap(class = "Fast::Regexp::Native", free_immediately, size)]
|
|
130
72
|
pub struct FastRegexp(Arc<RegexInner>);
|
|
131
73
|
|
|
132
74
|
impl FastRegexp {
|
|
@@ -157,12 +99,10 @@ impl FastRegexp {
|
|
|
157
99
|
let regex = &self.0.regex;
|
|
158
100
|
let bytes = haystack_bytes(&haystack);
|
|
159
101
|
|
|
160
|
-
let offsets =
|
|
161
|
-
regex.
|
|
162
|
-
(
|
|
163
|
-
|
|
164
|
-
.collect::<Vec<_>>()
|
|
165
|
-
})
|
|
102
|
+
let offsets = regex.captures(&bytes).map(|caps| {
|
|
103
|
+
(0..regex.captures_len())
|
|
104
|
+
.map(|i| caps.get(i).map(|m| (m.start(), m.end())))
|
|
105
|
+
.collect::<Vec<_>>()
|
|
166
106
|
})?;
|
|
167
107
|
|
|
168
108
|
Some(self.build_match_data(Arc::new(bytes), offsets))
|
|
@@ -173,29 +113,25 @@ impl FastRegexp {
|
|
|
173
113
|
let bytes = haystack_bytes(&haystack);
|
|
174
114
|
|
|
175
115
|
if regex.captures_len() == 1 {
|
|
176
|
-
let ranges: Vec<(usize, usize)> =
|
|
177
|
-
|
|
178
|
-
|
|
179
|
-
|
|
180
|
-
.collect()
|
|
181
|
-
});
|
|
116
|
+
let ranges: Vec<(usize, usize)> = regex
|
|
117
|
+
.find_iter(&bytes)
|
|
118
|
+
.map(|m| (m.start(), m.end()))
|
|
119
|
+
.collect();
|
|
182
120
|
let result = ruby.ary_new_capa(ranges.len());
|
|
183
121
|
for (s, e) in ranges {
|
|
184
122
|
result.push(utf8_string(ruby, &bytes[s..e]))?;
|
|
185
123
|
}
|
|
186
124
|
Ok(result)
|
|
187
125
|
} else {
|
|
188
|
-
let groups: Vec<Vec<CaptureOffset>> =
|
|
189
|
-
|
|
190
|
-
|
|
191
|
-
.
|
|
192
|
-
|
|
193
|
-
|
|
194
|
-
|
|
195
|
-
|
|
196
|
-
|
|
197
|
-
.collect()
|
|
198
|
-
});
|
|
126
|
+
let groups: Vec<Vec<CaptureOffset>> = regex
|
|
127
|
+
.captures_iter(&bytes)
|
|
128
|
+
.map(|caps| {
|
|
129
|
+
caps.iter()
|
|
130
|
+
.skip(1)
|
|
131
|
+
.map(|c| c.map(|m| (m.start(), m.end())))
|
|
132
|
+
.collect()
|
|
133
|
+
})
|
|
134
|
+
.collect();
|
|
199
135
|
let result = ruby.ary_new_capa(groups.len());
|
|
200
136
|
for group_ranges in groups {
|
|
201
137
|
let group = ruby.ary_new_capa(group_ranges.len());
|
|
@@ -216,16 +152,14 @@ impl FastRegexp {
|
|
|
216
152
|
let bytes = haystack_bytes(&haystack);
|
|
217
153
|
let n_groups = regex.captures_len();
|
|
218
154
|
|
|
219
|
-
let all: Vec<Vec<CaptureOffset>> =
|
|
220
|
-
|
|
221
|
-
|
|
222
|
-
|
|
223
|
-
(
|
|
224
|
-
|
|
225
|
-
|
|
226
|
-
|
|
227
|
-
.collect()
|
|
228
|
-
});
|
|
155
|
+
let all: Vec<Vec<CaptureOffset>> = regex
|
|
156
|
+
.captures_iter(&bytes)
|
|
157
|
+
.map(|caps| {
|
|
158
|
+
(0..n_groups)
|
|
159
|
+
.map(|i| caps.get(i).map(|m| (m.start(), m.end())))
|
|
160
|
+
.collect()
|
|
161
|
+
})
|
|
162
|
+
.collect();
|
|
229
163
|
|
|
230
164
|
let shared = Arc::new(bytes);
|
|
231
165
|
let result = ruby.ary_new_capa(all.len());
|
|
@@ -238,7 +172,7 @@ impl FastRegexp {
|
|
|
238
172
|
pub fn is_match(&self, haystack: RString) -> bool {
|
|
239
173
|
let regex = &self.0.regex;
|
|
240
174
|
let bytes = haystack_bytes(&haystack);
|
|
241
|
-
|
|
175
|
+
regex.is_match(&bytes)
|
|
242
176
|
}
|
|
243
177
|
|
|
244
178
|
pub fn sub_str(
|
|
@@ -251,13 +185,11 @@ impl FastRegexp {
|
|
|
251
185
|
let regex = &rb_self.0.regex;
|
|
252
186
|
let bytes = haystack_bytes(&haystack);
|
|
253
187
|
let repl = haystack_bytes(&replacement);
|
|
254
|
-
let out
|
|
255
|
-
|
|
256
|
-
|
|
257
|
-
|
|
258
|
-
|
|
259
|
-
}
|
|
260
|
-
});
|
|
188
|
+
let out: Vec<u8> = if literal {
|
|
189
|
+
regex.replace(&bytes, NoExpand(&repl)).into_owned()
|
|
190
|
+
} else {
|
|
191
|
+
regex.replace(&bytes, &repl[..]).into_owned()
|
|
192
|
+
};
|
|
261
193
|
utf8_string(ruby, &out)
|
|
262
194
|
}
|
|
263
195
|
|
|
@@ -271,13 +203,11 @@ impl FastRegexp {
|
|
|
271
203
|
let regex = &rb_self.0.regex;
|
|
272
204
|
let bytes = haystack_bytes(&haystack);
|
|
273
205
|
let repl = haystack_bytes(&replacement);
|
|
274
|
-
let out
|
|
275
|
-
|
|
276
|
-
|
|
277
|
-
|
|
278
|
-
|
|
279
|
-
}
|
|
280
|
-
});
|
|
206
|
+
let out: Vec<u8> = if literal {
|
|
207
|
+
regex.replace_all(&bytes, NoExpand(&repl)).into_owned()
|
|
208
|
+
} else {
|
|
209
|
+
regex.replace_all(&bytes, &repl[..]).into_owned()
|
|
210
|
+
};
|
|
281
211
|
utf8_string(ruby, &out)
|
|
282
212
|
}
|
|
283
213
|
|
|
@@ -299,7 +229,7 @@ impl FastRegexp {
|
|
|
299
229
|
}
|
|
300
230
|
}
|
|
301
231
|
|
|
302
|
-
#[magnus::wrap(class = "Fast::Regexp::MatchData", free_immediately, size)]
|
|
232
|
+
#[magnus::wrap(class = "Fast::Regexp::Native::MatchData", free_immediately, size)]
|
|
303
233
|
pub struct FastMatchData {
|
|
304
234
|
haystack: Arc<Vec<u8>>,
|
|
305
235
|
/// Index 0 is the whole match. Subsequent entries are capture groups in
|
|
@@ -482,13 +412,13 @@ impl FastRegexpSet {
|
|
|
482
412
|
pub fn matches(&self, haystack: RString) -> Vec<usize> {
|
|
483
413
|
let set = &self.0;
|
|
484
414
|
let bytes = haystack_bytes(&haystack);
|
|
485
|
-
|
|
415
|
+
set.matches(&bytes).iter().collect()
|
|
486
416
|
}
|
|
487
417
|
|
|
488
418
|
pub fn is_match(&self, haystack: RString) -> bool {
|
|
489
419
|
let set = &self.0;
|
|
490
420
|
let bytes = haystack_bytes(&haystack);
|
|
491
|
-
|
|
421
|
+
set.is_match(&bytes)
|
|
492
422
|
}
|
|
493
423
|
|
|
494
424
|
pub fn patterns(&self) -> Vec<String> {
|
|
@@ -499,8 +429,12 @@ impl FastRegexpSet {
|
|
|
499
429
|
#[magnus::init]
|
|
500
430
|
pub fn init(ruby: &Ruby) -> Result<(), Error> {
|
|
501
431
|
let object_class = ruby.class_object();
|
|
502
|
-
|
|
503
|
-
|
|
432
|
+
// Fast::Regexp must already be defined as a class by lib/fast_regexp.rb
|
|
433
|
+
// before this extension is required — the Ruby façade owns that constant
|
|
434
|
+
// and delegates to the Native class registered below.
|
|
435
|
+
let fast_module: magnus::RModule = object_class.const_get("Fast")?;
|
|
436
|
+
let regexp_facade: RClass = fast_module.const_get("Regexp")?;
|
|
437
|
+
let regexp_class = regexp_facade.define_class("Native", object_class)?;
|
|
504
438
|
|
|
505
439
|
regexp_class.define_singleton_method("_native_new", function!(FastRegexp::new, -1))?;
|
|
506
440
|
regexp_class.define_method("_native_match", method!(FastRegexp::rmatch, 1))?;
|
|
@@ -531,7 +465,8 @@ pub fn init(ruby: &Ruby) -> Result<(), Error> {
|
|
|
531
465
|
match_data_class.define_method("byte_end", method!(FastMatchData::byte_end, 1))?;
|
|
532
466
|
match_data_class.define_method("inspect", method!(FastMatchData::inspect, 0))?;
|
|
533
467
|
|
|
534
|
-
|
|
468
|
+
// Set stays at Fast::Regexp::Set (no fallback; rust/regex RegexSet is the only backend).
|
|
469
|
+
let regexp_set_class = regexp_facade.define_class("Set", object_class)?;
|
|
535
470
|
|
|
536
471
|
regexp_set_class.define_singleton_method("new", function!(FastRegexpSet::new, -1))?;
|
|
537
472
|
regexp_set_class.define_method("match", method!(FastRegexpSet::matches, 1))?;
|
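The load-order requirement in the `init` comment above (the `Fast::Regexp` shell must exist before the extension registers `Native` and `Set` under it) can be mimicked in plain Ruby. This is only an analogy under stated assumptions: `Demo`, `Facade`, and `register_native_classes` are hypothetical names chosen to avoid clashing with the real gem, and `const_set` stands in for what magnus does through `define_class`.

```ruby
# Hypothetical stand-in names (Demo, Facade); not the gem's real constants.
module Demo
  # Step 1: the Ruby side defines the empty class shell first.
  class Facade
  end
end

# Step 2: roughly what the extension's init does, expressed in Ruby:
# look up the existing constants, then register nested classes under the shell.
# If Demo or Demo::Facade were missing, const_get would raise NameError here.
def register_native_classes
  demo_module = Object.const_get(:Demo)
  facade = demo_module.const_get(:Facade)
  native = facade.const_set(:Native, Class.new(Object))
  set = facade.const_set(:Set, Class.new(Object))
  [native, set]
end

register_native_classes
puts Demo::Facade::Native.name  # Demo::Facade::Native
puts Demo::Facade::Set.name     # Demo::Facade::Set
```

Because the lookup happens at registration time, requiring the extension before the shell is defined fails early with a constant lookup error instead of silently creating a second, unrelated class.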
data/lib/fast_regexp/version.rb
CHANGED
data/lib/fast_regexp.rb
CHANGED
|
@@ -1,12 +1,22 @@
|
|
|
1
1
|
# frozen_string_literal: true
|
|
2
2
|
|
|
3
3
|
require_relative "fast_regexp/version"
|
|
4
|
+
|
|
5
|
+
module Fast
|
|
6
|
+
# Façade over rust/regex with a transparent fallback to stdlib `::Regexp`.
|
|
7
|
+
#
|
|
8
|
+
# `Fast::Regexp.new(pattern)` first tries to compile with rust/regex (fast,
|
|
9
|
+
# byte-based). If the pattern uses features rust/regex does not support
|
|
10
|
+
# (lookaround, backreferences, possessive quantifiers, etc.) we fall back
|
|
11
|
+
# to `::Regexp` so consumers don't have to juggle two libraries.
|
|
12
|
+
class Regexp
|
|
13
|
+
end
|
|
14
|
+
end
|
|
15
|
+
|
|
16
|
+
# Load the native extension AFTER the Fast::Regexp class shell exists — the
|
|
17
|
+
# Rust init() looks up Fast::Regexp and registers Native under it.
|
|
4
18
|
require_relative "fast_regexp/fast_regexp"
|
|
5
19
|
|
|
6
|
-
# Ruby-side conveniences on top of the Rust extension. The native class only
|
|
7
|
-
# exposes raw primitives (`_native_new`, `_native_match`, `_native_sub`,
|
|
8
|
-
# `_native_gsub`); everything user-facing lives here so the API can grow
|
|
9
|
-
# without touching FFI.
|
|
10
20
|
module Fast
|
|
11
21
|
class Regexp
|
|
12
22
|
RUBY_FLAG_MAP = {
|
|
@@ -16,14 +26,21 @@ module Fast
|
|
|
16
26
|
}.freeze
|
|
17
27
|
|
|
18
28
|
class << self
|
|
19
|
-
#
|
|
20
|
-
#
|
|
21
|
-
#
|
|
22
|
-
# (lookaround, backrefs) still raise from the engine with a clear
|
|
23
|
-
# message.
|
|
29
|
+
# Compile a pattern. Accepts a String or a `::Regexp` (flags are
|
|
30
|
+
# translated into a leading `(?...)` group so rust/regex sees them).
|
|
31
|
+
# Falls back to `::Regexp` if rust/regex rejects the pattern.
|
|
24
32
|
def new(pattern, **opts)
|
|
25
|
-
|
|
26
|
-
|
|
33
|
+
translated = pattern.is_a?(::Regexp) ? translate_regexp(pattern) : pattern
|
|
34
|
+
allocate.tap { |re| re.send(:initialize, translated, original: pattern, **opts) }
|
|
35
|
+
end
|
|
36
|
+
|
|
37
|
+
# Bulk-compile a symbol-keyed hash of patterns. Handy for defining a set
|
|
38
|
+
# of regex constants in one shot:
|
|
39
|
+
#
|
|
40
|
+
# RE = Fast::Regexp.create_many(word: '\w+', num: '\d+').freeze
|
|
41
|
+
# RE[:word].match("hello")
|
|
42
|
+
def create_many(**patterns)
|
|
43
|
+
patterns.transform_values { |pat| new(pat) }
|
|
27
44
|
end
|
|
28
45
|
|
|
29
46
|
private
|
|
@@ -36,34 +53,66 @@ module Fast
|
|
|
36
53
|
end
|
|
37
54
|
end
|
|
38
55
|
|
|
39
|
-
|
|
40
|
-
|
|
41
|
-
|
|
56
|
+
attr_reader :pattern, :backend
|
|
57
|
+
|
|
58
|
+
BACKENDS = %i[auto fast stdlib].freeze
|
|
59
|
+
|
|
60
|
+
# Internal — use `Fast::Regexp.new`. `original` is the unmodified input
|
|
61
|
+
# (String or ::Regexp) so we can build an accurate stdlib fallback.
|
|
62
|
+
# `backend:` forces a specific engine: `:auto` (default) tries rust/regex
|
|
63
|
+
# and falls back to stdlib, `:fast` raises if rust/regex rejects the
|
|
64
|
+
# pattern, `:stdlib` skips rust/regex entirely.
|
|
65
|
+
def initialize(pattern, original: pattern, backend: :auto, **opts)
|
|
66
|
+
raise ArgumentError, "backend must be one of #{BACKENDS.inspect}" unless BACKENDS.include?(backend)
|
|
67
|
+
@pattern = pattern
|
|
68
|
+
@backend = compile_backend(pattern, original, backend, opts)
|
|
69
|
+
end
|
|
70
|
+
|
|
71
|
+
def fast? = @backend.is_a?(Native)
|
|
72
|
+
def stdlib? = !fast?
|
|
73
|
+
|
|
74
|
+
# Escape hatches for callers that need the underlying object directly.
|
|
75
|
+
def native = fast? ? @backend : nil
|
|
76
|
+
def stdlib = stdlib? ? @backend : nil
|
|
77
|
+
|
|
42
78
|
def match(haystack)
|
|
43
|
-
|
|
79
|
+
haystack = coerce_string(haystack)
|
|
80
|
+
raw = fast? ? @backend._native_match(haystack) : @backend.match(haystack)
|
|
81
|
+
raw && MatchData.new(raw, haystack)
|
|
82
|
+
end
|
|
83
|
+
|
|
84
|
+
def match?(haystack)
|
|
85
|
+
@backend.match?(coerce_string(haystack))
|
|
44
86
|
end
|
|
45
87
|
|
|
46
|
-
# Case-equality. Enables `case/when` and RSpec's
|
|
47
|
-
# `expect(str).to match(re)`.
|
|
48
88
|
def ===(other)
|
|
49
89
|
return false unless other.respond_to?(:to_str)
|
|
50
90
|
match?(other.to_str)
|
|
51
91
|
end
|
|
52
92
|
|
|
53
|
-
#
|
|
54
|
-
#
|
|
55
|
-
# is byte-based).
|
|
93
|
+
# Byte offset of the first match (rust/regex is byte-based; stdlib path
|
|
94
|
+
# also returns bytes here for API consistency).
|
|
56
95
|
def =~(other)
|
|
57
96
|
return nil unless other.respond_to?(:to_str)
|
|
58
97
|
m = match(other.to_str)
|
|
59
98
|
m && m.byte_begin(0)
|
|
60
99
|
end
|
|
61
100
|
|
|
62
|
-
|
|
63
|
-
|
|
64
|
-
|
|
65
|
-
|
|
66
|
-
|
|
101
|
+
def scan(haystack)
|
|
102
|
+
@backend.scan(coerce_string(haystack))
|
|
103
|
+
end
|
|
104
|
+
|
|
105
|
+
def scan_matches(haystack)
|
|
106
|
+
haystack = coerce_string(haystack)
|
|
107
|
+
if fast?
|
|
108
|
+
@backend.scan_matches(haystack).map { |m| MatchData.new(m, haystack) }
|
|
109
|
+
else
|
|
110
|
+
results = []
|
|
111
|
+
haystack.scan(@backend) { results << MatchData.new(::Regexp.last_match, haystack) }
|
|
112
|
+
results
|
|
113
|
+
end
|
|
114
|
+
end
|
|
115
|
+
|
|
67
116
|
def sub(haystack, replacement = nil, literal: false, &block)
|
|
68
117
|
haystack = coerce_string(haystack)
|
|
69
118
|
if block
|
|
@@ -73,32 +122,123 @@ module Fast
|
|
|
73
122
|
"#{m.pre_match}#{block.call(m)}#{m.post_match}"
|
|
74
123
|
else
|
|
75
124
|
raise ArgumentError, "wrong number of arguments (given 1, expected 2)" if replacement.nil?
|
|
76
|
-
|
|
125
|
+
replacement = coerce_string(replacement)
|
|
126
|
+
if fast?
|
|
127
|
+
@backend._native_sub(haystack, replacement, literal)
|
|
128
|
+
else
|
|
129
|
+
haystack.sub(@backend, stdlib_replacement(replacement, literal))
|
|
130
|
+
end
|
|
77
131
|
end
|
|
78
132
|
end
|
|
79
133
|
|
|
80
|
-
# `gsub(haystack, replacement)` or `gsub(haystack) { |m| ... }`. See
|
|
81
|
-
# `#sub` for the template syntax.
|
|
82
134
|
def gsub(haystack, replacement = nil, literal: false, &block)
|
|
83
135
|
haystack = coerce_string(haystack)
|
|
84
136
|
if block
|
|
85
137
|
raise ArgumentError, "wrong number of arguments (given 2, expected 1 with block)" if replacement
|
|
86
|
-
|
|
138
|
+
fast? ? fast_gsub_with_block(haystack, &block) : stdlib_gsub_with_block(haystack, &block)
|
|
87
139
|
else
|
|
88
140
|
raise ArgumentError, "wrong number of arguments (given 1, expected 2)" if replacement.nil?
|
|
89
|
-
|
|
141
|
+
replacement = coerce_string(replacement)
|
|
142
|
+
if fast?
|
|
143
|
+
@backend._native_gsub(haystack, replacement, literal)
|
|
144
|
+
else
|
|
145
|
+
haystack.gsub(@backend, stdlib_replacement(replacement, literal))
|
|
146
|
+
end
|
|
90
147
|
end
|
|
91
148
|
end
|
|
92
149
|
|
|
93
|
-
|
|
94
|
-
|
|
150
|
+
# On the fast path this comes from rust/regex directly. On the stdlib
|
|
151
|
+
# fallback we walk the source counting capturing groups while honoring
|
|
152
|
+
# escapes, character classes, and non-capturing / lookaround prefixes.
|
|
153
|
+
def captures_count
|
|
154
|
+
return @backend.captures_count if fast?
|
|
155
|
+
count_stdlib_captures(@backend.source)
|
|
95
156
|
end
|
|
96
157
|
|
|
158
|
+
def names
|
|
159
|
+
@backend.names
|
|
160
|
+
end
|
|
161
|
+
|
|
162
|
+
def inspect = "#<Fast::Regexp #{@pattern.inspect}#{stdlib? ? " (stdlib)" : ""}>"
|
|
97
163
|
alias_method :to_s, :pattern
|
|
98
164
|
|
|
99
165
|
private
|
|
100
166
|
|
|
101
|
-
def
|
|
167
|
+
def count_stdlib_captures(source)
|
|
168
|
+
count = 0
|
|
169
|
+
i = 0
|
|
170
|
+
len = source.length
|
|
171
|
+
while i < len
|
|
172
|
+
c = source[i]
|
|
173
|
+
if c == "\\"
|
|
174
|
+
i += 2
|
|
175
|
+
elsif c == "["
|
|
176
|
+
i += 1
|
|
177
|
+
i += 1 while i < len && source[i] != "]"
|
|
178
|
+
i += 1
|
|
179
|
+
elsif c == "("
|
|
180
|
+
# Capturing unless followed by (?:, (?=, (?!, (?#, (?<=, (?<!.
|
|
181
|
+
prefix = source[i + 1, 4] || ""
|
|
182
|
+
if prefix.start_with?("?:") || prefix.start_with?("?=") || prefix.start_with?("?!") ||
|
|
183
|
+
prefix.start_with?("?#") || prefix.start_with?("?<=") || prefix.start_with?("?<!")
|
|
184
|
+
i += 1
|
|
185
|
+
else
|
|
186
|
+
count += 1
|
|
187
|
+
i += 1
|
|
188
|
+
end
|
|
189
|
+
else
|
|
190
|
+
i += 1
|
|
191
|
+
end
|
|
192
|
+
end
|
|
193
|
+
count
|
|
194
|
+
end
|
|
195
|
+
|
|
196
|
+
def compile_backend(pattern, original, backend, opts)
|
|
197
|
+
return compile_stdlib(pattern, original) if backend == :stdlib
|
|
198
|
+
Native._native_new(pattern, **opts)
|
|
199
|
+
rescue ArgumentError => e
|
|
200
|
+
raise if backend == :fast
|
|
201
|
+
compile_stdlib(pattern, original, fallback_from: e)
|
|
202
|
+
end
|
|
203
|
+
|
|
204
|
+
def compile_stdlib(pattern, original, fallback_from: nil)
|
|
205
|
+
return original if original.is_a?(::Regexp)
|
|
206
|
+
::Regexp.new(pattern)
|
|
207
|
+
rescue ::RegexpError => e
|
|
208
|
+
# Pattern is malformed in both engines (auto path) — surface the
|
|
209
|
+
# original rust/regex error since it's typically more detailed.
|
|
210
|
+
# Otherwise propagate the stdlib error as-is.
|
|
211
|
+
raise(fallback_from ? ArgumentError.new(fallback_from.message) : e)
|
|
212
|
+
end
|
|
213
|
+
|
|
214
|
+
# Maps positional capture indices to names for the stdlib backend so we
|
|
215
|
+
# can translate `$N` to `\k<name>` (Ruby's gsub ignores `\N` for named
|
|
216
|
+
# groups). Only invoked from the stdlib path.
|
|
217
|
+
def stdlib_name_by_index
|
|
218
|
+
@stdlib_name_by_index ||= @backend.named_captures.flat_map { |n, idxs| idxs.map { |i| [i, n] } }.to_h
|
|
219
|
+
end
|
|
220
|
+
|
|
221
|
+
# Translate rust/regex replacement syntax ($N, ${name}, $$) into Ruby
|
|
222
|
+
# syntax (\N, \k<name>, $) for the stdlib fallback path.
|
|
223
|
+
def stdlib_replacement(template, literal)
|
|
224
|
+
return template.gsub('\\', '\\\\\\\\') if literal
|
|
225
|
+
|
|
226
|
+
template.gsub(/\$(\$|\d+|\{[^}]+\})/) do
|
|
227
|
+
token = ::Regexp.last_match(1)
|
|
228
|
+
case token
|
|
229
|
+
when "$" then "$"
|
|
230
|
+
when /\A\d+\z/
|
|
231
|
+
name = stdlib_name_by_index[token.to_i]
|
|
232
|
+
name ? "\\k<#{name}>" : "\\#{token}"
|
|
233
|
+
else "\\k<#{token[1..-2]}>"
|
|
234
|
+
end
|
|
235
|
+
end
|
|
236
|
+
end
|
|
237
|
+
|
|
238
|
+
# Fast path: rust/regex has no native iterate-with-replace, so we scan all
|
|
239
|
+
# match positions up-front, then splice the result by byte offset in one
|
|
240
|
+
# pass. Stays UTF-8 since haystack and match slices are.
|
|
241
|
+
def fast_gsub_with_block(haystack)
|
|
102
242
|
matches = scan_matches(haystack)
|
|
103
243
|
return haystack.dup if matches.empty?
|
|
104
244
|
|
|
@@ -114,31 +254,64 @@ module Fast
|
|
|
114
254
|
out
|
|
115
255
|
end
|
|
116
256
|
|
|
257
|
+
# Stdlib path: String#gsub already does single-pass iterate-and-replace
|
|
258
|
+
# and sets $~ inside the block, so wrap the current ::MatchData and yield.
|
|
259
|
+
def stdlib_gsub_with_block(haystack)
|
|
260
|
+
haystack.gsub(@backend) { yield(MatchData.new(::Regexp.last_match, haystack)).to_s }
|
|
261
|
+
end
|
|
262
|
+
|
|
117
263
|
def coerce_string(value)
|
|
118
264
|
return value if value.is_a?(String)
|
|
119
265
|
return value.to_str if value.respond_to?(:to_str)
|
|
120
266
|
raise TypeError, "no implicit conversion of #{value.class} into String"
|
|
121
267
|
end
|
|
122
268
|
|
|
269
|
+
# Wraps either a Fast::Regexp::Native::MatchData or a stdlib ::MatchData
|
|
270
|
+
# so callers see one type regardless of which backend ran.
|
|
123
271
|
class MatchData
|
|
124
272
|
include Enumerable
|
|
125
273
|
|
|
126
|
-
|
|
127
|
-
to_a.each(&block)
|
|
128
|
-
end
|
|
274
|
+
attr_reader :backend, :string
|
|
129
275
|
|
|
130
|
-
def
|
|
131
|
-
|
|
276
|
+
def initialize(backend, haystack)
|
|
277
|
+
@backend = backend
|
|
278
|
+
@string = haystack
|
|
132
279
|
end
|
|
133
280
|
|
|
281
|
+
def native? = @backend.is_a?(Fast::Regexp::Native::MatchData)
|
|
282
|
+
def stdlib? = !native?
|
|
283
|
+
|
|
284
|
+
def native = native? ? @backend : nil
|
|
285
|
+
def stdlib = stdlib? ? @backend : nil
|
|
286
|
+
|
|
287
|
+
def [](key) = @backend[key]
|
|
288
|
+
def to_a = @backend.to_a
|
|
289
|
+
def captures = @backend.captures
|
|
290
|
+
def named_captures = @backend.named_captures
|
|
291
|
+
def names = @backend.names
|
|
292
|
+
def size = @backend.size
|
|
293
|
+
alias_method :length, :size
|
|
294
|
+
def pre_match = @backend.pre_match
|
|
295
|
+
def post_match = @backend.post_match
|
|
296
|
+
def to_s = @backend.to_s
|
|
297
|
+
|
|
298
|
+
# Byte-based offsets. Both backends expose these (stdlib MatchData has
|
|
299
|
+
# `byteoffset` / `byte_begin` / `byte_end` since Ruby 3.2).
|
|
300
|
+
def byteoffset(key) = @backend.byteoffset(key)
|
|
301
|
+
def byte_begin(key) = @backend.byte_begin(key)
|
|
302
|
+
def byte_end(key) = @backend.byte_end(key)
|
|
303
|
+
|
|
304
|
+
def each(&block) = to_a.each(&block)
|
|
305
|
+
def values_at(*indices) = indices.map { |i| self[i] }
|
|
306
|
+
|
|
134
307
|
def ==(other)
|
|
135
308
|
other.is_a?(MatchData) && to_a == other.to_a && string == other.string
|
|
136
309
|
end
|
|
137
310
|
alias_method :eql?, :==
|
|
138
311
|
|
|
139
|
-
def hash
|
|
140
|
-
|
|
141
|
-
|
|
312
|
+
def hash = [to_a, string].hash
|
|
313
|
+
|
|
314
|
+
def inspect = "#<Fast::Regexp::MatchData #{to_s.inspect}>"
|
|
142
315
|
end
|
|
143
316
|
end
|
|
144
317
|
end
|
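The template translation performed by `stdlib_replacement` in the diff above can be tried in isolation. The following is a sketch modeled on that method, not the gem's code: `translate_template` is a hypothetical standalone helper that takes the `::Regexp` as an argument instead of reading `@backend`, and it omits the `literal:` branch.

```ruby
# Sketch: translate rust/regex-style replacement templates ($N, ${name}, $$)
# into the backslash syntax that String#sub / String#gsub understand.
# `translate_template` is a hypothetical helper modeled on the diff above.
def translate_template(template, regexp)
  # Map group index to group name, e.g. {1 => "user", 2 => "host"}.
  name_by_index = regexp.named_captures
                        .flat_map { |name, idxs| idxs.map { |i| [i, name] } }
                        .to_h

  template.gsub(/\$(\$|\d+|\{[^}]+\})/) do
    token = Regexp.last_match(1)
    case token
    when "$" then "$"                         # "$$" is an escaped dollar sign
    when /\A\d+\z/                            # "$N"
      name = name_by_index[token.to_i]
      name ? "\\k<#{name}>" : "\\#{token}"
    else                                      # "${name}": strip the braces
      "\\k<#{token[1..-2]}>"
    end
  end
end

numbered = /(\w+)@(\w+)/
puts "user@example".sub(numbered, translate_template("$2 at $1", numbered))
# example at user

named = /(?<user>\w+)@(?<host>\w+)/
puts "alice@example".sub(named, translate_template("${host}/$1 costs $$5", named))
# example/alice costs $5
```

The block form of `gsub` inserts its return value verbatim, which is why the helper can emit `\k<name>` and `\N` without extra escaping; the backslash sequences are only interpreted later, when the translated template is passed to `String#sub` or `String#gsub`.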