fast_regexp 0.4.0 → 0.6.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: ea6096017d3c6747b5efdd80ac2a1c0feb0eff1b90653cb0b24f5e74295bea75
- data.tar.gz: f22d78881e171014dcd8e02dd2918157faf4bbc0bd36fb9ae86f183ffb8db2cd
+ metadata.gz: 3512de3c64b392ae48a21422dc3e61423af58ddd4c8141e952a23c80799c2261
+ data.tar.gz: cd09cfa3f3299812e19bc043da29f814c340342af4434981b4a47a9cbb7eab53
  SHA512:
- metadata.gz: 83407af617e35aa7892c4b05650304389644d415dfb442b6cc3352bd9c31a2d1c88348f4d74d499f7437ee322404f841b33008ac96eb1cd9685526e3bd8770e8
- data.tar.gz: 919e759dc7b9d9ba21071b9a3e093ef3e0b32fd3bebfc9597650d753b9c40952fed2e401710a603f02aa116314bfb71935158852593a87b421f1b795b9d45caf
+ metadata.gz: e83cb56946505a54ce6511b0755ecfea0838268f79b51daf26f081490130f4db8049e561d37d136e1666536b10b24b555e7e005889e0d19fba6adca01b3bcc2e
+ data.tar.gz: 5f10f02b3fa5accab0732be061bca6c1ea0d5739c72dfa52485cd7306685792d129ba14e0e881a3aa27563347428dc367a725b829a9d493c0cdb9d92dc0f21cd
data/Cargo.lock CHANGED
@@ -72,7 +72,6 @@ name = "fast_regexp"
  version = "0.1.0"
  dependencies = [
  "magnus",
- "rb-sys",
  "regex",
  ]

data/README.md CHANGED
@@ -3,7 +3,9 @@
  [![Gem Version](https://badge.fury.io/rb/fast_regexp.svg)](https://badge.fury.io/rb/fast_regexp)
  [![Test](https://github.com/jetpks/fast_regexp/workflows/CI/badge.svg)](https://github.com/jetpks/fast_regexp/actions)

- Ruby bindings for [rust/regex](https://docs.rs/regex/latest/regex/) library.
+ Fast, drop-in regex for Ruby — backed by [rust/regex](https://docs.rs/regex/latest/regex/) with transparent fallback to the stdlib `::Regexp` engine for features rust/regex doesn't support (lookaround, backreferences, possessive quantifiers, etc.).
+
+ You get rust/regex's speed on the common path, and a single uniform API (`Fast::Regexp`, `Fast::Regexp::MatchData`) regardless of which engine actually ran underneath.

  ## Installation

@@ -128,11 +130,50 @@ Fast::Regexp.new('(?P<n>\w+)').names # => ["n"]
  Fast::Regexp.new('(a)(b)').captures_count # => 2
  ```

+ ### Engine fallback
+
+ rust/regex doesn't support lookaround, backreferences, or possessive
+ quantifiers. Rather than make you manage two regex libraries, `Fast::Regexp`
+ silently falls back to stdlib `::Regexp` when it sees something rust/regex
+ can't compile. The public API (`#match`, `#sub`, `#gsub`, `#===`, `#=~`,
+ `MatchData`) is identical on both paths, so callers don't have to care which
+ engine ran — but you can inspect or reach the underlying object when you
+ need to:
+
+ ```ruby
+ fast = Fast::Regexp.new('\w+')
+ fast.fast? # => true
+ fast.native # => #<Fast::Regexp::Native ...> (rust-backed)
+
+ slow = Fast::Regexp.new('foo(?=bar)') # lookahead — rust/regex rejects
+ slow.stdlib? # => true
+ slow.stdlib # => /foo(?=bar)/ (the real ::Regexp)
+ slow.match?("foobar") # => true
+ ```
+
+ `Fast::Regexp::MatchData` exposes the same `#native?` / `#stdlib?` / `#native`
+ / `#stdlib` accessors. Replacement templates use rust/regex syntax (`$1`,
+ `${name}`, `$$`) on both paths; the stdlib fallback translates them for you.
+
+ You can force a specific engine via the `backend:` kwarg:
+
+ ```ruby
+ Fast::Regexp.new('\w+', backend: :fast) # rust/regex only; raises on unsupported
+ Fast::Regexp.new(pat, backend: :stdlib) # skip rust/regex; use ::Regexp directly
+ Fast::Regexp.new('\w+', backend: :auto) # default — try rust, fall back on reject
+ ```
+
+ > [!NOTE]
+ > The fast path is byte-based (rust/regex's `regex::bytes`), so `#=~` returns
+ > a *byte* offset. The stdlib fallback path returns the byte offset too, for
+ > API consistency.
+
  > [!WARNING]
- > `rust/regex` regular expression syntax differs from Ruby's built-in
- > [`Regexp`](https://docs.ruby-lang.org/en/3.4/Regexp.html) library, see the
- > [official syntax page](https://docs.rs/regex/latest/regex/index.html#syntax) for more
- > details.
+ > `rust/regex` syntax differs from Ruby's built-in
+ > [`Regexp`](https://docs.ruby-lang.org/en/3.4/Regexp.html); see the
+ > [rust/regex syntax page](https://docs.rs/regex/latest/regex/index.html#syntax).
+ > When fallback kicks in, your pattern is interpreted by stdlib `::Regexp`
+ > instead, so Ruby's syntax applies for that compile.

  ### Searching simultaneously

@@ -176,11 +217,11 @@ It also supports parsing of strings with invalid UTF-8 characters by default. It
  In case unicode awareness of matchers should be disabled, both `Fast::Regexp` and `Fast::Regexp::Set` support `unicode: false` option:

  ```ruby
- Fast::Regexp.new('\w+').match('ю٤夏')
- # => ["ю٤夏"]
+ Fast::Regexp.new('\w+').match('ю٤夏')[0]
+ # => "ю٤夏"

  Fast::Regexp.new('\w+', unicode: false).match('ю٤夏')
- # => []
+ # => nil

  Fast::Regexp::Set.new(['\w', '\d', '\s']).match("ю٤\u2000")
  # => [0, 1, 2]
@@ -189,6 +230,16 @@ Fast::Regexp::Set.new(['\w', '\d', '\s'], unicode: false).match("ю٤\u2000")
  # => []
  ```

+ ## Documentation
+
+ In-depth docs live under [`docs/`](docs/README.md), organized via the
+ [Diátaxis](https://diataxis.fr/) framework:
+
+ - **Tutorial:** [Getting started](docs/tutorials/getting-started.md)
+ - **How-to:** [Migrate from stdlib `::Regexp`](docs/how-to/migrate-from-stdlib-regexp.md), [Handle unsupported syntax](docs/how-to/handle-unsupported-syntax.md)
+ - **Reference:** [`Fast::Regexp`](docs/reference/fast-regexp.md), [`MatchData`](docs/reference/fast-regexp-matchdata.md), [`Set`](docs/reference/fast-regexp-set.md)
+ - **Explainers:** [Engine fallback](docs/explainers/engine-fallback.md)
+
  ## Development

  ```sh
@@ -209,8 +260,8 @@ Bug reports and pull requests are welcome on GitHub at https://github.com/jetpks
  huge thanks for the original bindings and the clean magnus integration that
  made this work easy to extend. This fork rebrands the gem, reshapes the
  public API (`Fast::Regexp`, real `MatchData`, `sub`/`gsub`, `===`/`=~`,
- `Regexp`-constructor coercion), and releases the GVL around regex execution
- for thread/fiber-friendly matching.
+ `Regexp`-constructor coercion), and adds transparent fallback to stdlib
+ `::Regexp` for patterns rust/regex can't compile.

  ## License

@@ -9,4 +9,3 @@ crate-type = ["cdylib"]
  [dependencies]
  magnus = "0.8"
  regex = "1"
- rb-sys = "0.9"
@@ -2,71 +2,13 @@ use magnus::{
2
2
  function, method,
3
3
  scan_args::{get_kwargs, scan_args},
4
4
  value::ReprValue,
5
- Error, Module, Object, RArray, RHash, RString, Ruby, Symbol, TryConvert, Value,
5
+ Error, Module, Object, RArray, RClass, RHash, RString, Ruby, Symbol, TryConvert, Value,
6
6
  };
7
7
  use regex::bytes::{NoExpand, Regex, RegexBuilder, RegexSet, RegexSetBuilder};
8
8
  use std::collections::HashMap;
9
- use std::ffi::c_void;
10
- use std::ptr;
11
9
  use std::sync::Arc;
12
10
 
13
- /// Skip GVL release for haystacks smaller than this — the release/reacquire
14
- /// overhead (~µs) dwarfs the cost of a regex match on a tiny string.
15
- const GVL_RELEASE_THRESHOLD: usize = 1024;
16
-
17
- /// Run `func` with the GVL released so other Ruby threads (and the fiber
18
- /// scheduler on its host thread) can make progress during a long regex match.
19
- ///
20
- /// The callback runs on the same OS thread — `Send` is not required, and the
21
- /// borrow checker enforces that any references stay valid for the call.
22
- fn without_gvl<F, R>(func: F) -> R
23
- where
24
- F: FnOnce() -> R,
25
- {
26
- struct Pack<F, R> {
27
- func: Option<F>,
28
- result: Option<R>,
29
- }
30
-
31
- unsafe extern "C" fn trampoline<F, R>(data: *mut c_void) -> *mut c_void
32
- where
33
- F: FnOnce() -> R,
34
- {
35
- let pack = &mut *(data as *mut Pack<F, R>);
36
- let func = pack.func.take().expect("trampoline called twice");
37
- pack.result = Some(func());
38
- ptr::null_mut()
39
- }
40
-
41
- let mut pack: Pack<F, R> = Pack {
42
- func: Some(func),
43
- result: None,
44
- };
45
- unsafe {
46
- rb_sys::rb_thread_call_without_gvl(
47
- Some(trampoline::<F, R>),
48
- &mut pack as *mut _ as *mut c_void,
49
- None,
50
- ptr::null_mut(),
51
- );
52
- }
53
- pack.result.take().expect("callback did not run")
54
- }
55
-
56
- fn run_regex<F, R>(haystack_len: usize, func: F) -> R
57
- where
58
- F: FnOnce() -> R,
59
- {
60
- if haystack_len >= GVL_RELEASE_THRESHOLD {
61
- without_gvl(func)
62
- } else {
63
- func()
64
- }
65
- }
66
-
67
11
  fn haystack_bytes(haystack: &RString) -> Vec<u8> {
68
- // Copy out of the Ruby heap so the bytes are safe to read after the GVL is
69
- // released (Ruby 4's compacting GC can otherwise move the string).
70
12
  unsafe { haystack.as_slice() }.to_vec()
71
13
  }
72
14
 
@@ -126,7 +68,7 @@ impl RegexInner {
126
68
  }
127
69
  }
128
70
 
129
- #[magnus::wrap(class = "Fast::Regexp", free_immediately, size)]
71
+ #[magnus::wrap(class = "Fast::Regexp::Native", free_immediately, size)]
130
72
  pub struct FastRegexp(Arc<RegexInner>);
131
73
 
132
74
  impl FastRegexp {
@@ -157,12 +99,10 @@ impl FastRegexp {
157
99
  let regex = &self.0.regex;
158
100
  let bytes = haystack_bytes(&haystack);
159
101
 
160
- let offsets = run_regex(bytes.len(), || {
161
- regex.captures(&bytes).map(|caps| {
162
- (0..regex.captures_len())
163
- .map(|i| caps.get(i).map(|m| (m.start(), m.end())))
164
- .collect::<Vec<_>>()
165
- })
102
+ let offsets = regex.captures(&bytes).map(|caps| {
103
+ (0..regex.captures_len())
104
+ .map(|i| caps.get(i).map(|m| (m.start(), m.end())))
105
+ .collect::<Vec<_>>()
166
106
  })?;
167
107
 
168
108
  Some(self.build_match_data(Arc::new(bytes), offsets))
@@ -173,29 +113,25 @@ impl FastRegexp {
173
113
  let bytes = haystack_bytes(&haystack);
174
114
 
175
115
  if regex.captures_len() == 1 {
176
- let ranges: Vec<(usize, usize)> = run_regex(bytes.len(), || {
177
- regex
178
- .find_iter(&bytes)
179
- .map(|m| (m.start(), m.end()))
180
- .collect()
181
- });
116
+ let ranges: Vec<(usize, usize)> = regex
117
+ .find_iter(&bytes)
118
+ .map(|m| (m.start(), m.end()))
119
+ .collect();
182
120
  let result = ruby.ary_new_capa(ranges.len());
183
121
  for (s, e) in ranges {
184
122
  result.push(utf8_string(ruby, &bytes[s..e]))?;
185
123
  }
186
124
  Ok(result)
187
125
  } else {
188
- let groups: Vec<Vec<CaptureOffset>> = run_regex(bytes.len(), || {
189
- regex
190
- .captures_iter(&bytes)
191
- .map(|caps| {
192
- caps.iter()
193
- .skip(1)
194
- .map(|c| c.map(|m| (m.start(), m.end())))
195
- .collect()
196
- })
197
- .collect()
198
- });
126
+ let groups: Vec<Vec<CaptureOffset>> = regex
127
+ .captures_iter(&bytes)
128
+ .map(|caps| {
129
+ caps.iter()
130
+ .skip(1)
131
+ .map(|c| c.map(|m| (m.start(), m.end())))
132
+ .collect()
133
+ })
134
+ .collect();
199
135
  let result = ruby.ary_new_capa(groups.len());
200
136
  for group_ranges in groups {
201
137
  let group = ruby.ary_new_capa(group_ranges.len());
@@ -216,16 +152,14 @@ impl FastRegexp {
216
152
  let bytes = haystack_bytes(&haystack);
217
153
  let n_groups = regex.captures_len();
218
154
 
219
- let all: Vec<Vec<CaptureOffset>> = run_regex(bytes.len(), || {
220
- regex
221
- .captures_iter(&bytes)
222
- .map(|caps| {
223
- (0..n_groups)
224
- .map(|i| caps.get(i).map(|m| (m.start(), m.end())))
225
- .collect()
226
- })
227
- .collect()
228
- });
155
+ let all: Vec<Vec<CaptureOffset>> = regex
156
+ .captures_iter(&bytes)
157
+ .map(|caps| {
158
+ (0..n_groups)
159
+ .map(|i| caps.get(i).map(|m| (m.start(), m.end())))
160
+ .collect()
161
+ })
162
+ .collect();
229
163
 
230
164
  let shared = Arc::new(bytes);
231
165
  let result = ruby.ary_new_capa(all.len());
@@ -238,7 +172,7 @@ impl FastRegexp {
238
172
  pub fn is_match(&self, haystack: RString) -> bool {
239
173
  let regex = &self.0.regex;
240
174
  let bytes = haystack_bytes(&haystack);
241
- run_regex(bytes.len(), || regex.is_match(&bytes))
175
+ regex.is_match(&bytes)
242
176
  }
243
177
 
244
178
  pub fn sub_str(
@@ -251,13 +185,11 @@ impl FastRegexp {
251
185
  let regex = &rb_self.0.regex;
252
186
  let bytes = haystack_bytes(&haystack);
253
187
  let repl = haystack_bytes(&replacement);
254
- let out = run_regex(bytes.len(), || -> Vec<u8> {
255
- if literal {
256
- regex.replace(&bytes, NoExpand(&repl)).into_owned()
257
- } else {
258
- regex.replace(&bytes, &repl[..]).into_owned()
259
- }
260
- });
188
+ let out: Vec<u8> = if literal {
189
+ regex.replace(&bytes, NoExpand(&repl)).into_owned()
190
+ } else {
191
+ regex.replace(&bytes, &repl[..]).into_owned()
192
+ };
261
193
  utf8_string(ruby, &out)
262
194
  }
263
195
 
@@ -271,13 +203,11 @@ impl FastRegexp {
271
203
  let regex = &rb_self.0.regex;
272
204
  let bytes = haystack_bytes(&haystack);
273
205
  let repl = haystack_bytes(&replacement);
274
- let out = run_regex(bytes.len(), || -> Vec<u8> {
275
- if literal {
276
- regex.replace_all(&bytes, NoExpand(&repl)).into_owned()
277
- } else {
278
- regex.replace_all(&bytes, &repl[..]).into_owned()
279
- }
280
- });
206
+ let out: Vec<u8> = if literal {
207
+ regex.replace_all(&bytes, NoExpand(&repl)).into_owned()
208
+ } else {
209
+ regex.replace_all(&bytes, &repl[..]).into_owned()
210
+ };
281
211
  utf8_string(ruby, &out)
282
212
  }
283
213
 
@@ -299,7 +229,7 @@ impl FastRegexp {
299
229
  }
300
230
  }
301
231
 
302
- #[magnus::wrap(class = "Fast::Regexp::MatchData", free_immediately, size)]
232
+ #[magnus::wrap(class = "Fast::Regexp::Native::MatchData", free_immediately, size)]
303
233
  pub struct FastMatchData {
304
234
  haystack: Arc<Vec<u8>>,
305
235
  /// Index 0 is the whole match. Subsequent entries are capture groups in
@@ -482,13 +412,13 @@ impl FastRegexpSet {
482
412
  pub fn matches(&self, haystack: RString) -> Vec<usize> {
483
413
  let set = &self.0;
484
414
  let bytes = haystack_bytes(&haystack);
485
- run_regex(bytes.len(), || set.matches(&bytes).iter().collect())
415
+ set.matches(&bytes).iter().collect()
486
416
  }
487
417
 
488
418
  pub fn is_match(&self, haystack: RString) -> bool {
489
419
  let set = &self.0;
490
420
  let bytes = haystack_bytes(&haystack);
491
- run_regex(bytes.len(), || set.is_match(&bytes))
421
+ set.is_match(&bytes)
492
422
  }
493
423
 
494
424
  pub fn patterns(&self) -> Vec<String> {
@@ -499,8 +429,12 @@ impl FastRegexpSet {
499
429
  #[magnus::init]
500
430
  pub fn init(ruby: &Ruby) -> Result<(), Error> {
501
431
  let object_class = ruby.class_object();
502
- let fast_module = ruby.define_module("Fast")?;
503
- let regexp_class = fast_module.define_class("Regexp", object_class)?;
432
+ // Fast::Regexp must already be defined as a class by lib/fast_regexp.rb
433
+ // before this extension is required — the Ruby façade owns that constant
434
+ // and delegates to the Native class registered below.
435
+ let fast_module: magnus::RModule = object_class.const_get("Fast")?;
436
+ let regexp_facade: RClass = fast_module.const_get("Regexp")?;
437
+ let regexp_class = regexp_facade.define_class("Native", object_class)?;
504
438
 
505
439
  regexp_class.define_singleton_method("_native_new", function!(FastRegexp::new, -1))?;
506
440
  regexp_class.define_method("_native_match", method!(FastRegexp::rmatch, 1))?;
@@ -531,7 +465,8 @@ pub fn init(ruby: &Ruby) -> Result<(), Error> {
531
465
  match_data_class.define_method("byte_end", method!(FastMatchData::byte_end, 1))?;
532
466
  match_data_class.define_method("inspect", method!(FastMatchData::inspect, 0))?;
533
467
 
534
- let regexp_set_class = regexp_class.define_class("Set", object_class)?;
468
+ // Set stays at Fast::Regexp::Set (no fallback; rust/regex RegexSet is the only backend).
469
+ let regexp_set_class = regexp_facade.define_class("Set", object_class)?;
535
470
 
536
471
  regexp_set_class.define_singleton_method("new", function!(FastRegexpSet::new, -1))?;
537
472
  regexp_set_class.define_method("match", method!(FastRegexpSet::matches, 1))?;
@@ -2,6 +2,6 @@

  module Fast
  class Regexp
- VERSION = "0.4.0"
+ VERSION = "0.6.0"
  end
  end
data/lib/fast_regexp.rb CHANGED
@@ -1,12 +1,22 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require_relative "fast_regexp/version"
4
+
5
+ module Fast
6
+ # Façade over rust/regex with a transparent fallback to stdlib `::Regexp`.
7
+ #
8
+ # `Fast::Regexp.new(pattern)` first tries to compile with rust/regex (fast,
9
+ # byte-based). If the pattern uses features rust/regex does not support
10
+ # (lookaround, backreferences, possessive quantifiers, etc.) we fall back
11
+ # to `::Regexp` so consumers don't have to juggle two libraries.
12
+ class Regexp
13
+ end
14
+ end
15
+
16
+ # Load the native extension AFTER the Fast::Regexp class shell exists — the
17
+ # Rust init() looks up Fast::Regexp and registers Native under it.
4
18
  require_relative "fast_regexp/fast_regexp"
5
19
 
6
- # Ruby-side conveniences on top of the Rust extension. The native class only
7
- # exposes raw primitives (`_native_new`, `_native_match`, `_native_sub`,
8
- # `_native_gsub`); everything user-facing lives here so the API can grow
9
- # without touching FFI.
10
20
  module Fast
11
21
  class Regexp
12
22
  RUBY_FLAG_MAP = {
@@ -16,14 +26,21 @@ module Fast
16
26
  }.freeze
17
27
 
18
28
  class << self
19
- # Accept either a pattern string or an existing `::Regexp`. When given a
20
- # Regexp the flags (`/i`, `/x`, `/m`) are translated into a leading
21
- # inline group so the rust/regex engine sees them. Unsupported features
22
- # (lookaround, backrefs) still raise from the engine with a clear
23
- # message.
29
+ # Compile a pattern. Accepts a String or a `::Regexp` (flags are
30
+ # translated into a leading `(?...)` group so rust/regex sees them).
31
+ # Falls back to `::Regexp` if rust/regex rejects the pattern.
24
32
  def new(pattern, **opts)
25
- pattern = translate_regexp(pattern) if pattern.is_a?(::Regexp)
26
- _native_new(pattern, **opts)
33
+ translated = pattern.is_a?(::Regexp) ? translate_regexp(pattern) : pattern
34
+ allocate.tap { |re| re.send(:initialize, translated, original: pattern, **opts) }
35
+ end
36
+
37
+ # Bulk-compile a symbol-keyed hash of patterns. Handy for defining a set
38
+ # of regex constants in one shot:
39
+ #
40
+ # RE = Fast::Regexp.create_many(word: '\w+', num: '\d+').freeze
41
+ # RE[:word].match("hello")
42
+ def create_many(**patterns)
43
+ patterns.transform_values { |pat| new(pat) }
27
44
  end
28
45
 
29
46
  private
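
The "falls back to `::Regexp` if rust/regex rejects the pattern" behavior described in this hunk can be pictured with a small stand-alone sketch. It does not load the gem. `StrictEngine` is a hypothetical stand-in for the rust/regex compiler (assumed here to raise `ArgumentError` on lookaround and numeric backreferences), and `compile_with_fallback` is an illustrative name, not part of `Fast::Regexp`.

```ruby
# Hypothetical stand-in for a stricter engine: it refuses lookaround and
# numeric backreferences by raising ArgumentError (an assumption of this sketch).
module StrictEngine
  UNSUPPORTED = /\(\?<?[=!]|\\[1-9]/

  def self.compile(source)
    raise ArgumentError, "unsupported syntax in #{source.inspect}" if source.match?(UNSUPPORTED)

    [:strict, Regexp.new(source)]
  end
end

# Try the strict engine first; on rejection either re-raise (backend: :fast)
# or fall back to the stdlib engine (backend: :auto).
def compile_with_fallback(source, backend: :auto)
  return [:stdlib, Regexp.new(source)] if backend == :stdlib

  StrictEngine.compile(source)
rescue ArgumentError
  raise if backend == :fast # caller asked for the strict engine only

  [:stdlib, Regexp.new(source)]
end

engine, regexp = compile_with_fallback('foo(?=bar)')
puts engine                   # prints "stdlib"
puts regexp.match?("foobar")  # prints "true"
```

The method-level `rescue` keeps the happy path free of branching. Only a rejection from the first engine reaches the fallback, and `backend: :fast` turns that rejection back into an error for the caller.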
@@ -36,34 +53,66 @@ module Fast
36
53
  end
37
54
  end
38
55
 
39
- # Returns a `Fast::Regexp::MatchData` on hit, `nil` on miss — matching
40
- # Ruby's `Regexp#match` shape so the old `[]`-truthy-but-empty trap is
41
- # gone.
56
+ attr_reader :pattern, :backend
57
+
58
+ BACKENDS = %i[auto fast stdlib].freeze
59
+
60
+ # Internal — use `Fast::Regexp.new`. `original` is the unmodified input
61
+ # (String or ::Regexp) so we can build an accurate stdlib fallback.
62
+ # `backend:` forces a specific engine: `:auto` (default) tries rust/regex
63
+ # and falls back to stdlib, `:fast` raises if rust/regex rejects the
64
+ # pattern, `:stdlib` skips rust/regex entirely.
65
+ def initialize(pattern, original: pattern, backend: :auto, **opts)
66
+ raise ArgumentError, "backend must be one of #{BACKENDS.inspect}" unless BACKENDS.include?(backend)
67
+ @pattern = pattern
68
+ @backend = compile_backend(pattern, original, backend, opts)
69
+ end
70
+
71
+ def fast? = @backend.is_a?(Native)
72
+ def stdlib? = !fast?
73
+
74
+ # Escape hatches for callers that need the underlying object directly.
75
+ def native = fast? ? @backend : nil
76
+ def stdlib = stdlib? ? @backend : nil
77
+
42
78
  def match(haystack)
43
- _native_match(coerce_string(haystack))
79
+ haystack = coerce_string(haystack)
80
+ raw = fast? ? @backend._native_match(haystack) : @backend.match(haystack)
81
+ raw && MatchData.new(raw, haystack)
82
+ end
83
+
84
+ def match?(haystack)
85
+ @backend.match?(coerce_string(haystack))
44
86
  end
45
87
 
46
- # Case-equality. Enables `case/when` and RSpec's
47
- # `expect(str).to match(re)`.
48
88
  def ===(other)
49
89
  return false unless other.respond_to?(:to_str)
50
90
  match?(other.to_str)
51
91
  end
52
92
 
53
- # Returns the byte offset of the first match, or nil. Matches the
54
- # semantics of `Regexp#=~` except positions are in **bytes** (rust/regex
55
- # is byte-based).
93
+ # Byte offset of the first match (rust/regex is byte-based; stdlib path
94
+ # also returns bytes here for API consistency).
56
95
  def =~(other)
57
96
  return nil unless other.respond_to?(:to_str)
58
97
  m = match(other.to_str)
59
98
  m && m.byte_begin(0)
60
99
  end
61
100
 
62
- # `sub(haystack, replacement)` or `sub(haystack) { |m| ... }`.
63
- #
64
- # The string form uses rust/regex's native replacement template: `$1`,
65
- # `${name}`, `$$` for a literal `$`. To pass a replacement string that
66
- # contains `$` literally, pass `literal: true`.
101
+ def scan(haystack)
102
+ @backend.scan(coerce_string(haystack))
103
+ end
104
+
105
+ def scan_matches(haystack)
106
+ haystack = coerce_string(haystack)
107
+ if fast?
108
+ @backend.scan_matches(haystack).map { |m| MatchData.new(m, haystack) }
109
+ else
110
+ results = []
111
+ haystack.scan(@backend) { results << MatchData.new(::Regexp.last_match, haystack) }
112
+ results
113
+ end
114
+ end
115
+
67
116
  def sub(haystack, replacement = nil, literal: false, &block)
68
117
  haystack = coerce_string(haystack)
69
118
  if block
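
The byte-offset contract of `#=~` in this hunk is easy to misread next to stdlib's character-based `String#=~`. The sketch below uses only stdlib Ruby to show the difference on a multibyte haystack. As far as I know, stdlib `::MatchData` spells its byte accessors `byteoffset` (Ruby 3.2) and `bytebegin` / `byteend` (Ruby 3.4), so the sketch sticks to `pre_match.bytesize`, which should work on any version.

```ruby
haystack = "ю٤夏 x" # three multibyte characters, a space, then "x"
pattern  = /x/

char_offset = haystack =~ pattern       # stdlib String#=~ counts characters
match       = pattern.match(haystack)
byte_offset = match.pre_match.bytesize  # number of bytes before the match start

puts char_offset                        # prints 4
puts byte_offset                        # prints 8
puts haystack.byteslice(byte_offset, 1) # prints "x"
```

Callers that slice with the returned offset would need `String#byteslice` rather than `String#[]`, since the two indexes diverge as soon as the haystack contains non-ASCII text.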
@@ -73,32 +122,123 @@ module Fast
73
122
  "#{m.pre_match}#{block.call(m)}#{m.post_match}"
74
123
  else
75
124
  raise ArgumentError, "wrong number of arguments (given 1, expected 2)" if replacement.nil?
76
- _native_sub(haystack, coerce_string(replacement), literal)
125
+ replacement = coerce_string(replacement)
126
+ if fast?
127
+ @backend._native_sub(haystack, replacement, literal)
128
+ else
129
+ haystack.sub(@backend, stdlib_replacement(replacement, literal))
130
+ end
77
131
  end
78
132
  end
79
133
 
80
- # `gsub(haystack, replacement)` or `gsub(haystack) { |m| ... }`. See
81
- # `#sub` for the template syntax.
82
134
  def gsub(haystack, replacement = nil, literal: false, &block)
83
135
  haystack = coerce_string(haystack)
84
136
  if block
85
137
  raise ArgumentError, "wrong number of arguments (given 2, expected 1 with block)" if replacement
86
- gsub_with_block(haystack, &block)
138
+ fast? ? fast_gsub_with_block(haystack, &block) : stdlib_gsub_with_block(haystack, &block)
87
139
  else
88
140
  raise ArgumentError, "wrong number of arguments (given 1, expected 2)" if replacement.nil?
89
- _native_gsub(haystack, coerce_string(replacement), literal)
141
+ replacement = coerce_string(replacement)
142
+ if fast?
143
+ @backend._native_gsub(haystack, replacement, literal)
144
+ else
145
+ haystack.gsub(@backend, stdlib_replacement(replacement, literal))
146
+ end
90
147
  end
91
148
  end
92
149
 
93
- def inspect
94
- "#<Fast::Regexp #{pattern.inspect}>"
150
+ # On the fast path this comes from rust/regex directly. On the stdlib
151
+ # fallback we walk the source counting capturing groups while honoring
152
+ # escapes, character classes, and non-capturing / lookaround prefixes.
153
+ def captures_count
154
+ return @backend.captures_count if fast?
155
+ count_stdlib_captures(@backend.source)
95
156
  end
96
157
 
158
+ def names
159
+ @backend.names
160
+ end
161
+
162
+ def inspect = "#<Fast::Regexp #{@pattern.inspect}#{stdlib? ? " (stdlib)" : ""}>"
97
163
  alias_method :to_s, :pattern
98
164
 
99
165
  private
100
166
 
101
- def gsub_with_block(haystack)
167
+ def count_stdlib_captures(source)
168
+ count = 0
169
+ i = 0
170
+ len = source.length
171
+ while i < len
172
+ c = source[i]
173
+ if c == "\\"
174
+ i += 2
175
+ elsif c == "["
176
+ i += 1
177
+ i += 1 while i < len && source[i] != "]"
178
+ i += 1
179
+ elsif c == "("
180
+ # Capturing unless followed by (?:, (?=, (?!, (?#, (?<=, (?<!.
181
+ prefix = source[i + 1, 4] || ""
182
+ if prefix.start_with?("?:") || prefix.start_with?("?=") || prefix.start_with?("?!") ||
183
+ prefix.start_with?("?#") || prefix.start_with?("?<=") || prefix.start_with?("?<!")
184
+ i += 1
185
+ else
186
+ count += 1
187
+ i += 1
188
+ end
189
+ else
190
+ i += 1
191
+ end
192
+ end
193
+ count
194
+ end
195
+
196
+ def compile_backend(pattern, original, backend, opts)
197
+ return compile_stdlib(pattern, original) if backend == :stdlib
198
+ Native._native_new(pattern, **opts)
199
+ rescue ArgumentError => e
200
+ raise if backend == :fast
201
+ compile_stdlib(pattern, original, fallback_from: e)
202
+ end
203
+
204
+ def compile_stdlib(pattern, original, fallback_from: nil)
205
+ return original if original.is_a?(::Regexp)
206
+ ::Regexp.new(pattern)
207
+ rescue ::RegexpError => e
208
+ # Pattern is malformed in both engines (auto path) — surface the
209
+ # original rust/regex error since it's typically more detailed.
210
+ # Otherwise propagate the stdlib error as-is.
211
+ raise(fallback_from ? ArgumentError.new(fallback_from.message) : e)
212
+ end
213
+
214
+ # Maps positional capture indices to names for the stdlib backend so we
215
+ # can translate `$N` to `\k<name>` (Ruby's gsub ignores `\N` for named
216
+ # groups). Only invoked from the stdlib path.
217
+ def stdlib_name_by_index
218
+ @stdlib_name_by_index ||= @backend.named_captures.flat_map { |n, idxs| idxs.map { |i| [i, n] } }.to_h
219
+ end
220
+
221
+ # Translate rust/regex replacement syntax ($N, ${name}, $$) into Ruby
222
+ # syntax (\N, \k<name>, $) for the stdlib fallback path.
223
+ def stdlib_replacement(template, literal)
224
+ return template.gsub('\\', '\\\\\\\\') if literal
225
+
226
+ template.gsub(/\$(\$|\d+|\{[^}]+\})/) do
227
+ token = ::Regexp.last_match(1)
228
+ case token
229
+ when "$" then "$"
230
+ when /\A\d+\z/
231
+ name = stdlib_name_by_index[token.to_i]
232
+ name ? "\\k<#{name}>" : "\\#{token}"
233
+ else "\\k<#{token[1..-2]}>"
234
+ end
235
+ end
236
+ end
237
+
238
+ # Fast path: rust/regex has no native iterate-with-replace, so we scan all
239
+ # match positions up-front, then splice the result by byte offset in one
240
+ # pass. Stays UTF-8 since haystack and match slices are.
241
+ def fast_gsub_with_block(haystack)
102
242
  matches = scan_matches(haystack)
103
243
  return haystack.dup if matches.empty?
104
244
 
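
As a cross-check on the hand-written group counter in this hunk, the stdlib engine can be asked for the count directly. The sketch below is an alternative technique, not what the gem does: it appends an empty alternative so the pattern always matches `""`, then reads `MatchData#size`. The helper name `capture_count` is hypothetical. As I read the walker above, it would count `(?>...)` and `(?i:...)` as captures, which this probe does not.

```ruby
# Count capture groups by letting the regex engine report them.
def capture_count(regexp)
  # "(?:...)|" always matches the empty string via the empty alternative.
  probe = Regexp.new("(?:#{regexp.source})|", regexp.options)
  probe.match("").size - 1 # size includes group 0, the whole match
end

puts capture_count(/(a)(?:b)(c)/)     # prints 2
puts capture_count(/(?>a)(b)/)        # prints 1 (the atomic group is not a capture)
puts capture_count(/(?<y>\d+)-(\d+)/) # prints 1 (unnamed groups are ignored once a named group exists)
```

This assumes the source contains no extended-mode (`/x`) trailing comment that would swallow the appended `)|`.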
@@ -114,31 +254,64 @@ module Fast
114
254
  out
115
255
  end
116
256
 
257
+ # Stdlib path: String#gsub already does single-pass iterate-and-replace
258
+ # and sets $~ inside the block, so wrap the current ::MatchData and yield.
259
+ def stdlib_gsub_with_block(haystack)
260
+ haystack.gsub(@backend) { yield(MatchData.new(::Regexp.last_match, haystack)).to_s }
261
+ end
262
+
117
263
  def coerce_string(value)
118
264
  return value if value.is_a?(String)
119
265
  return value.to_str if value.respond_to?(:to_str)
120
266
  raise TypeError, "no implicit conversion of #{value.class} into String"
121
267
  end
122
268
 
269
+ # Wraps either a Fast::Regexp::Native::MatchData or a stdlib ::MatchData
270
+ # so callers see one type regardless of which backend ran.
123
271
  class MatchData
124
272
  include Enumerable
125
273
 
126
- def each(&block)
127
- to_a.each(&block)
128
- end
274
+ attr_reader :backend, :string
129
275
 
130
- def values_at(*indices)
131
- indices.map { |i| self[i] }
276
+ def initialize(backend, haystack)
277
+ @backend = backend
278
+ @string = haystack
132
279
  end
133
280
 
281
+ def native? = @backend.is_a?(Fast::Regexp::Native::MatchData)
282
+ def stdlib? = !native?
283
+
284
+ def native = native? ? @backend : nil
285
+ def stdlib = stdlib? ? @backend : nil
286
+
287
+ def [](key) = @backend[key]
288
+ def to_a = @backend.to_a
289
+ def captures = @backend.captures
290
+ def named_captures = @backend.named_captures
291
+ def names = @backend.names
292
+ def size = @backend.size
293
+ alias_method :length, :size
294
+ def pre_match = @backend.pre_match
295
+ def post_match = @backend.post_match
296
+ def to_s = @backend.to_s
297
+
298
+ # Byte-based offsets. Both backends expose these (stdlib MatchData has
299
+ # `byteoffset` / `byte_begin` / `byte_end` since Ruby 3.2).
300
+ def byteoffset(key) = @backend.byteoffset(key)
301
+ def byte_begin(key) = @backend.byte_begin(key)
302
+ def byte_end(key) = @backend.byte_end(key)
303
+
304
+ def each(&block) = to_a.each(&block)
305
+ def values_at(*indices) = indices.map { |i| self[i] }
306
+
134
307
  def ==(other)
135
308
  other.is_a?(MatchData) && to_a == other.to_a && string == other.string
136
309
  end
137
310
  alias_method :eql?, :==
138
311
 
139
- def hash
140
- [to_a, string].hash
141
- end
312
+ def hash = [to_a, string].hash
313
+
314
+ def inspect = "#<Fast::Regexp::MatchData #{to_s.inspect}>"
142
315
  end
143
316
  end
144
317
  end
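
The replacement-template translation that `stdlib_replacement` performs earlier in this file can be tried in isolation. This is a simplified sketch with a hypothetical helper name, `to_ruby_template`. It omits the index-to-name lookup the diff adds for patterns with named groups and only covers the three token forms `$N`, `${name}`, and `$$`.

```ruby
# Translate "$1", "${name}" and "$$" into String#sub syntax ("\1", "\k<name>", "$").
def to_ruby_template(template)
  template.gsub(/\$(\$|\d+|\{[^}]+\})/) do
    token = Regexp.last_match(1)
    case token
    when "$"       then "$"                     # "$$" is an escaped dollar sign
    when /\A\d+\z/ then "\\#{token}"            # "$2" becomes "\2"
    else                "\\k<#{token[1..-2]}>"  # strip the braces around the name
    end
  end
end

puts to_ruby_template('$2-$1')                        # prints \2-\1
puts "ab".sub(/(a)(b)/, to_ruby_template('$2-$1'))    # prints b-a
puts "ab".sub(/(?<x>a)b/, to_ruby_template('${x}$$')) # prints a$
```

Using the block form of `gsub` for the translation matters: the block's return value is inserted verbatim, so the backslashes it produces are not themselves interpreted as backreferences.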
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: fast_regexp
  version: !ruby/object:Gem::Version
- version: 0.4.0
+ version: 0.6.0
  platform: ruby
  authors:
  - Eric Jacobs