re2js 2.0.2 → 2.1.1

package/README.md CHANGED
@@ -3,21 +3,15 @@
3
3
 
4
4
  ## [Playground](https://re2js.leopard.in.ua/)
5
5
 
6
- ## TLDR
7
-
8
- The built-in JavaScript regular expression engine can, under certain special combinations, run in exponential time. This situation can trigger what's referred to as a [Regular Expression Denial of Service (ReDoS)](https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS). RE2, a different regular expression engine, can effectively safeguard your Node.js applications from ReDoS attacks. With RE2JS, this protective feature extends to browser environments as well, enabling you to utilize the RE2 engine more comprehensively.
9
-
10
6
  ## What is RE2?
11
7
 
12
- RE2 is a regular expression engine designed to operate in time proportional to the size of the input, ensuring linear time complexity. RE2JS, on the other hand, is a pure JavaScript port of the [RE2 library](https://github.com/google/re2) more specifically, it's a port of the [RE2/J library](https://github.com/google/re2j).
13
-
14
- JavaScript standard regular expression package, [RegExp](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions), and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.
8
+ RE2 is a regular expression engine designed to operate in time proportional to the size of the input, ensuring linear time complexity. RE2JS is a pure JavaScript port that achieves full architectural parity with the [Go regexp implementation](https://pkg.go.dev/regexp).
15
9
 
16
- If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
10
+ JavaScript's standard regular expression engine, [RegExp](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions), and many other widely used packages (Perl, Python, PCRE) use a backtracking implementation strategy. When a pattern presents alternatives like `a|b`, the engine tries to match subpattern `a` first; if that fails, it resets the input and tries subpattern `b`.
17
11
 
18
- In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
12
+ If such choices are deeply nested, this strategy requires an exponential number of passes over the input data, potentially exceeding the lifetime of the universe for large inputs. This creates a security risk known as Regular Expression Denial of Service (ReDoS) when accepting patterns from untrusted sources.
19
13
 
20
- There are certain features of PCRE or Perl regular expressions that cannot be implemented in linear time, for example, backreferences, but the vast majority of regular expressions patterns in practice avoid such features.
14
+ In contrast, RE2JS utilizes a combination of Deterministic Finite Automaton (DFA) and Nondeterministic Finite Automaton (NFA) strategies to explore all matches simultaneously in a single pass over the input data. This approach guarantees $O(n)$ linear time complexity, providing a secure environment for both Node.js and browser applications.
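
To illustrate the single-pass idea described above, here is a minimal sketch of set-of-states NFA simulation in plain JavaScript. It is not part of the package README and not RE2JS code: the `nfa` table and the `simulate` function are hypothetical names, and the automaton is written by hand for the anchored pattern `^(a|b)*abb$`. It shows only why tracking all active states at once bounds the work per input character by the number of states.

```js
// Hand-built NFA for the anchored pattern ^(a|b)*abb$ (hypothetical example).
// State 0 is nondeterministic on 'a': it may stay in the loop or start the "abb" tail.
const nfa = {
  start: 0,
  accept: 3,
  transitions: [
    { a: [0, 1], b: [0] }, // state 0
    { b: [2] },            // state 1
    { b: [3] },            // state 2
    {},                    // state 3 (accepting)
  ],
}

// Advance every active state in lockstep: one pass over the input,
// at most (number of states) units of work per character.
function simulate(nfa, input) {
  let active = new Set([nfa.start])
  for (const ch of input) {
    const next = new Set()
    for (const state of active) {
      for (const target of nfa.transitions[state][ch] ?? []) {
        next.add(target) // the Set deduplicates paths that reach the same state
      }
    }
    if (next.size === 0) return false // no surviving state: reject early
    active = next
  }
  return active.has(nfa.accept)
}

console.log(simulate(nfa, 'ababb')) // true
console.log(simulate(nfa, 'abab'))  // false
```

A backtracking matcher would instead follow one choice at a time and rewind on failure. Because the active set can never hold more entries than there are states, this loop cannot revisit the same input position exponentially often.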
21
15
 
22
16
  ## Installation
23
17
 
@@ -236,6 +230,57 @@ RE2JS.compile(':').split('boo:and:foo', 2) // ['boo', 'and:foo']
236
230
  RE2JS.compile(':').split('boo:and:foo', 5) // ['boo', 'and', 'foo']
237
231
  ```
238
232
 
233
+ ### Multi-Pattern Matching (RE2Set)
234
+
235
+ RE2JS includes a highly optimized `RE2Set` API that allows you to match multiple regular expressions against a single string simultaneously. Instead of running 100 different regexes in a loop (100 separate passes, roughly $100n$ steps for an input of length $n$), `RE2Set` compiles them into a single state machine and finds all matches in a single pass ($O(n)$ linear time).
236
+
237
+ This is incredibly powerful for profanity filters, routing engines, or log parsers.
238
+
239
+ ```js
240
+ import { RE2Set, RE2Flags } from 're2js'
241
+
242
+ // Create a new set. You can optionally pass anchoring and flags.
243
+ // Default: RE2Flags.UNANCHORED, RE2Flags.PERL
244
+ const set = new RE2Set()
245
+
246
+ // Add patterns to the set.
247
+ // The add() method returns the integer ID of the pattern.
248
+ set.add('error') // ID: 0
249
+ set.add('warning') // ID: 1
250
+ set.add('critical') // ID: 2
251
+
252
+ // You must compile the set before matching!
253
+ set.compile()
254
+
255
+ // Match against a string.
256
+ // Returns an array of IDs for all patterns that successfully matched.
257
+ console.log(set.match('The system encountered a critical error.'))
258
+ // Outputs: [0, 2]
259
+
260
+ console.log(set.match('All systems operational.'))
261
+ // Outputs: []
262
+ ```
263
+
264
+ #### Anchoring a Set
265
+
266
+ You can strictly anchor the entire set by passing an anchor flag to the constructor:
267
+
268
+ ```js
269
+ import { RE2Set, RE2Flags } from 're2js'
270
+
271
+ const set = new RE2Set(RE2Flags.ANCHOR_BOTH)
272
+ set.add('foo') // ID: 0
273
+ set.add('bar') // ID: 1
274
+ set.add('.*') // ID: 2
275
+
276
+ set.compile()
277
+
278
+ console.log(set.match('foo')) // [0, 2] (Matches 'foo' and '.*')
279
+ console.log(set.match('foobar')) // [2] (Only '.*' matches the entire string)
280
+ ```
281
+
282
+ ***Performance Note:** `RE2Set` heavily utilizes the high-speed DFA engine to process multi-pattern matches simultaneously. However, if your patterns contain boundaries (e.g., `\b`) or trigger a massive state explosion, it will seamlessly and safely fall back to the bounded NFA engine.*
283
+
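For contrast with the single-pass behaviour described in this section, the sketch below shows the naive loop that a set API is meant to replace. It is not part of the package README and not how `RE2Set` is implemented: `NaiveSet` is a hypothetical class that copies only the `add()` / `compile()` / `match()` shape shown above and runs one native `RegExp` per pattern.

```js
// Hypothetical baseline, NOT the RE2Set implementation: one native RegExp per
// pattern, each scanning the input separately (k patterns => k passes).
class NaiveSet {
  constructor() {
    this.sources = []
    this.compiled = null
  }

  // Returns the integer ID of the pattern, mirroring the add() shape shown above.
  add(pattern) {
    this.sources.push(pattern)
    return this.sources.length - 1
  }

  compile() {
    this.compiled = this.sources.map((source) => new RegExp(source))
  }

  // Returns the IDs of all patterns that match somewhere in the input.
  match(input) {
    if (this.compiled === null) throw new Error('call compile() before match()')
    const ids = []
    this.compiled.forEach((re, id) => {
      if (re.test(input)) ids.push(id)
    })
    return ids
  }
}

const naive = new NaiveSet()
naive.add('error')
naive.add('warning')
naive.add('critical')
naive.compile()
console.log(naive.match('The system encountered a critical error.')) // [0, 2]
console.log(naive.match('All systems operational.')) // []
```

The results have the same shape as in the README example, but the cost grows with the number of patterns because every pattern rescans the whole input.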
239
284
  ### Working with Groups
240
285
 
241
286
  RE2JS supports capturing groups in regex patterns
@@ -314,6 +359,27 @@ if (mString.matches()) {
314
359
  }
315
360
  ```
316
361
 
362
+ #### Extracting All Named Groups
363
+
364
+ If you have multiple named capturing groups, the `getNamedGroups()` method provides a convenient way to retrieve all of them at once as a JavaScript dictionary (object). If an optional group was not matched, its value will be `null`.
365
+
366
+ ```js
367
+ import { RE2JS } from 're2js'
368
+
369
+ const p = RE2JS.compile('(?P<first>\\w+) (?:(?P<middle>\\w+) )?(?P<last>\\w+)')
370
+ const matchString = p.matcher('John Doe')
371
+
372
+ if (matchString.matches()) {
373
+ matchString.getNamedGroups()
374
+ // Returns:
375
+ // {
376
+ // first: 'John',
377
+ // middle: null,
378
+ // last: 'Doe'
379
+ // }
380
+ }
381
+ ```
382
+
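For readers comparing this with native `RegExp`, the sketch below shows the equivalent lookup without RE2JS. It is not part of the package README. Native patterns use `(?<name>...)` rather than `(?P<name>...)` and report an unmatched optional group as `undefined`; the helper `groupsWithNulls` is a hypothetical name that normalizes the result to the `null` convention described above.

```js
// Native RegExp uses (?<name>...) syntax and reports unmatched groups as undefined.
const nativeRe = /^(?<first>\w+) (?:(?<middle>\w+) )?(?<last>\w+)$/
const nativeMatch = nativeRe.exec('John Doe')

// Hypothetical helper: copy the groups, replacing undefined with null.
function groupsWithNulls(match) {
  const out = {}
  for (const [name, value] of Object.entries(match?.groups ?? {})) {
    out[name] = value ?? null
  }
  return out
}

console.log(groupsWithNulls(nativeMatch))
// { first: 'John', middle: null, last: 'Doe' }
```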
317
383
  ### Replacing Matches
318
384
 
319
385
  RE2JS allows you to replace all occurrences or the first occurrence of a pattern match in a string with a specific replacement string
@@ -445,9 +511,23 @@ RE2JS.matches(unicodeRegexp, '😀') // true
445
511
  RE2JS.matches(unicodeRegexp, '😃') // false
446
512
  ```
447
513
 
448
- ## Performance
514
+ ## Performance and Architecture
515
+
516
+ The RE2JS engine provides strict linear-time $O(n)$ safety guarantees against Regular Expression Denial of Service (ReDoS) attacks, a critical vulnerability inherent to native JavaScript `RegExp` objects.
517
+
518
+ Originally, the C++ implementation of the RE2 engine included both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines with highly optimized memory operations. Russ Cox later ported the core engine to Go, and Alan Donovan ported it to Java.
519
+
520
+ `re2js` achieves full architectural parity with the highly optimized Go `regexp` package and incorporates advanced performance features from the original C++ engine. To maximize execution speed on everyday queries without ever sacrificing memory safety, `re2js` intelligently and dynamically routes execution through a highly advanced multi-tiered architecture:
521
+
522
+ * **The Prefilter Engine:** Analyzes the Abstract Syntax Tree (AST) before execution to extract mandatory string literals (e.g., extracting `"error"` and `"critical"` from `/error.*critical/`). It uses blistering-fast native JavaScript `indexOf` to instantly reject mismatches, completely bypassing the regex state-machines.
523
+ * **Aggressive AST Simplification:** Trims impossible match branches and collapses redundant logic prior to compilation, mathematically pruning dead execution paths to dramatically reduce the size of the generated state machine.
524
+ * **Multi-Pattern Sets (`RE2Set`):** Combines hundreds or thousands of regular expressions into a single combined DFA, allowing you to search a string for all patterns simultaneously in strict $O(n)$ linear time.
525
+ * **OnePass DFA:** Provides high-speed capture group extraction for mathematically 1-unambiguous patterns, bypassing thread queues entirely.
526
+ * **Lazy Powerset DFA:** Executes high-speed boolean matches (e.g., `.test()`) by fusing active states dynamically on the fly.
527
+ * **BitState Backtracker:** Avoids heavy object array allocations by using bitwise operations to extract captures on short-to-medium length strings.
528
+ * **Pike VM (NFA):** Acts as the robust, bounded-memory fallback engine for complex, ambiguous patterns that exceed fast-path limits.
449
529
 
450
- The RE2JS engine runs more slowly compared to native RegExp objects for simple queries. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of highly optimized memory operations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang additions + Lazy DFA fast-path) to a pure JS version.
530
+ Thanks to these dynamic fast-paths, `re2js` delivers performance comparable to native engines for simple queries, while remaining completely immune to catastrophic backtracking and stack overflow crashes.
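
The prefilter tier in the list above can be sketched as follows. This is not part of the package README and not RE2JS internals: `makePrefilteredTest` is a hypothetical helper, the mandatory literals are supplied by hand rather than extracted from a parsed pattern, and a native `RegExp` stands in for the real matching engines. It shows only the rejection step: if a required literal is absent, the matcher is never run.

```js
// Hypothetical sketch of literal prefiltering, not RE2JS internals.
// `literals` are substrings every match must contain, listed by hand here;
// the README describes RE2JS deriving them from the parsed pattern.
function makePrefilteredTest(regex, literals) {
  return function prefilteredTest(input) {
    for (const literal of literals) {
      // A missing mandatory literal means no match is possible: skip the regex.
      if (input.indexOf(literal) === -1) return false
    }
    return regex.test(input)
  }
}

const hasCriticalError = makePrefilteredTest(/error.*critical/, ['error', 'critical'])
console.log(hasCriticalError('error: disk is critical')) // true
console.log(hasCriticalError('critical error'))          // false (literals present, wrong order)
console.log(hasCriticalError('all systems operational')) // false (rejected by indexOf alone)
```

The second call shows why the prefilter can only reject: finding every literal does not prove a match, so the full matcher still has to confirm it.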
451
531
 
452
532
  Should you require maximum absolute performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
453
533
 
@@ -456,26 +536,27 @@ Should you require maximum absolute performance on the server side when using RE
456
536
 
457
537
  ### RE2JS vs RE2-Node (C++ Bindings)
458
538
 
459
- Because RE2JS implements a Just-In-Time (JIT) compiled DFA, it can actually perform on par with—and sometimes faster than—native C++ bindings (`re2-node`) by avoiding the cross-boundary serialization costs between JavaScript and C++.
539
+ Because RE2JS's Lazy DFA, Prefilter, and OnePass engines operate efficiently within V8's Just-In-Time (JIT) compiler, they can outperform native C++ bindings (`re2-node`) for many operations by avoiding the cross-boundary serialization costs between JavaScript and C++.
460
540
 
461
- Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths:
541
+ Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths (averages of multiple runs):
462
542
 
463
- | Benchmark Scenario | Pattern Example | RE2JS (Pure JS) | RE2-Node (C++) | Result |
464
- |:--------------------------|:---------------------------|:----------------|:---------------|:----------------------------|
465
- | **Bounded Repetition** | `/[A-Z][a-z]{5,15}/` | **11.80 ms** | 14.79 ms | `re2js` is **1.25x** faster |
466
- | **Massive Alternation** | `/White\|Blue\|Black.../` | **15.42 ms** | 16.02 ms | `re2js` is **1.04x** faster |
467
- | **Deep State Machine** | `/([0-9]+(/[0-9]+)+)/` | 19.34 ms | **17.16 ms** | `re2-node` is 1.13x faster |
468
- | **Case Insensitive** | `/(?i)swamp/` | 20.27 ms | **17.26 ms** | `re2-node` is 1.17x faster |
469
- | **ReDoS Attempt** | `/(a+)+!/` | 20.56 ms | **17.33 ms** | `re2-node` is 1.19x faster |
470
- | **Greedy Wildcard** | `/enters.*battlefield/` | 18.93 ms | **14.33 ms** | `re2-node` is 1.32x faster |
471
- | **Lazy Wildcard** | `/enters.*?battlefield/` | 18.93 ms | **14.16 ms** | `re2-node` is 1.34x faster |
472
- | **Simple Literal** | `/damage/` | 19.54 ms | **14.11 ms** | `re2-node` is 1.39x faster |
473
- | **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` | 296.16 ms | **16.73 ms** | `re2-node` is 17.70x faster |
543
+ | Benchmark Scenario | Pattern Example | RE2JS (Pure JS) | RE2-Node (C++) | Result |
544
+ |:--------------------------|:---------------------------|:----------------|:---------------|:-----------------------------|
545
+ | **Simple Literal** | `/damage/` | **~5.82 ms** | ~14.08 ms | `re2js` is **~2.42x faster** |
546
+ | **Greedy Wildcard** | `/enters.*battlefield/` | **~8.44 ms** | ~13.32 ms | `re2js` is **~1.58x faster** |
547
+ | **Lazy Wildcard** | `/enters.*?battlefield/` | **~8.43 ms** | ~13.33 ms | `re2js` is **~1.58x faster** |
548
+ | **Deep State Machine** | `/([0-9]+(/[0-9]+)+)/` | **~7.71 ms** | ~16.08 ms | `re2js` is **~2.09x faster** |
549
+ | **Massive Alternation** | `/White\|Blue\|Black.../` | **~11.62 ms** | ~14.99 ms | `re2js` is **~1.29x faster** |
550
+ | **Bounded Repetition** | `/[A-Z][a-z]{5,15}/` | **~12.20 ms** | ~13.77 ms | `re2js` is **~1.13x faster** |
551
+ | **ReDoS Attempt** | `/(a+)+!/` | **~5.68 ms** | ~16.25 ms | `re2js` is **~2.86x faster** |
552
+ | **Case Insensitive** | `/(?i)swamp/` | ~18.71 ms | **~16.22 ms** | `re2-node` is ~1.15x faster |
553
+ | **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` | ~57.24 ms | **~15.66 ms** | `re2-node` is ~3.66x faster |
474
554
 
475
555
  **Takeaways:**
476
- * **DFA Strengths:** For state-heavy tasks like massive alternations (`White|Blue|...`) or bounded repetitions (`{5,15}`), RE2JS operates entirely within V8's optimized JIT and actually outpaces C++ bindings.
477
- * **C++ Strengths:** For simple string scanning (like literal or wildcard searches), C++ wins because it can utilize optimized, hardware-level raw memory scanning operations (like `memchr`).
478
- * **The NFA Fallback:** Pure DFA engines mathematically cannot track look-behind context like Word Boundaries (`\b`). When RE2JS encounters these, it safely bails out to the much slower NFA engine, resulting in a large performance gap compared to C++.
556
+ * **The Literal & Prefilter Advantage (JS wins):** For simple text searches like literals and wildcards, RE2JS's Literal Fast-Path and Prefilter Engine leverage highly optimized native JavaScript `indexOf` string scanning. By bypassing the regex state machines completely, pure JavaScript now outperforms native C++ bindings by **~1.6x to 2.4x**.
557
+ * **State-Heavy Tasks (JS wins):** For complex state machines, massive alternations, bounded repetitions, and catastrophic backtracking (ReDoS) attempts, RE2JS operates entirely within V8's highly optimized JIT. Avoiding the JS-to-C++ N-API bridge overhead allows pure JavaScript to beat native bindings by **~1.1x to 2.9x**.
558
+ * **Case Insensitivity (C++ wins):** Case-folded literal matching currently skips the prefilter and requires full DFA state-machine evaluation, giving C++ a slight ~1.15x edge due to raw memory scanning speeds.
559
+ * **The Fallback Engines (C++ wins):** Pure DFA engines mathematically cannot track look-behind context like Word Boundaries (`\b`). When RE2JS encounters these, it safely bails out to its NFA engine. As shown in the benchmarks, the pure JS NFA fallback is slower than the C++ NFA. **For maximum performance in RE2JS, avoid `\b` when doing bulk boolean `.test()` matching.**
479
560
 
480
561
  ### RE2JS vs JavaScript's native RegExp
481
562
 
@@ -508,13 +589,13 @@ RE2JS processed this poison-pill string **30,000 times in just ~454 milliseconds
508
589
 
509
590
  ## Development
510
591
 
511
- Some files like `CharGroup.js` and `UnicodeTables.js` is generated and should be edited in generator files
592
+ Some files like `CharGroup.js` and `UnicodeTables.js` are generated and should be edited in their respective generator files:
512
593
 
513
594
  ```bash
514
595
  ./tools/scripts/make_perl_groups.pl > src/CharGroup.js
515
596
  yarn node ./tools/scripts/genUnicodeTable.js > src/UnicodeTables.js
516
597
  ```
517
598
 
518
- To run `make_perl_groups.pl` you need to have install perl (version inside `.tool-versions`)
599
+ To run `make_perl_groups.pl`, you need to have Perl installed (the required version is specified inside `.tool-versions`).
519
600
 
520
601
  [Playground website](https://re2js.leopard.in.ua/) maintained in `www` branch