re2js 2.0.2 → 2.1.1
- package/README.md +111 -30
- package/build/index.cjs.cjs +1557 -163
- package/build/index.cjs.cjs.map +1 -1
- package/build/index.esm.d.ts +71 -0
- package/build/index.esm.d.ts.map +1 -1
- package/build/index.esm.js +1556 -164
- package/build/index.esm.js.map +1 -1
- package/build/index.umd.js +1557 -163
- package/build/index.umd.js.map +1 -1
- package/package.json +2 -2
package/README.md
CHANGED
|
@@ -3,21 +3,15 @@
|
|
|
3
3
|
|
|
4
4
|
## [Playground](https://re2js.leopard.in.ua/)
|
|
5
5
|
|
|
6
|
-
## TLDR
|
|
7
|
-
|
|
8
|
-
The built-in JavaScript regular expression engine can, under certain special combinations, run in exponential time. This situation can trigger what's referred to as a [Regular Expression Denial of Service (ReDoS)](https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS). RE2, a different regular expression engine, can effectively safeguard your Node.js applications from ReDoS attacks. With RE2JS, this protective feature extends to browser environments as well, enabling you to utilize the RE2 engine more comprehensively.
|
|
9
|
-
|
|
10
6
|
## What is RE2?
|
|
11
7
|
|
|
12
|
-
RE2 is a regular expression engine designed to operate in time proportional to the size of the input, ensuring linear time complexity. RE2JS
|
|
13
|
-
|
|
14
|
-
JavaScript standard regular expression package, [RegExp](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions), and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as a|b, the engine will try to match subpattern a first, and if that yields no match, it will reset the input stream and try to match b instead.
|
|
8
|
+
RE2 is a regular expression engine designed to operate in time proportional to the size of the input, ensuring linear time complexity. RE2JS is a pure JavaScript port that achieves full architectural parity with the [Go regexp implementation](https://pkg.go.dev/regexp).
|
|
15
9
|
|
|
16
|
-
|
|
10
|
+
JavaScript's standard regular expression engine, [RegExp](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions), and many other widely used packages (Perl, Python, PCRE) use a backtracking implementation strategy. When a pattern presents alternatives like `a|b`, the engine tries to match subpattern `a` first; if that fails, it resets the input and tries subpattern `b`.
|
|
17
11
|
|
|
18
|
-
|
|
12
|
+
If such choices are deeply nested, this strategy requires an exponential number of passes over the input data, potentially exceeding the lifetime of the universe for large inputs. This creates a security risk known as Regular Expression Denial of Service (ReDoS) when accepting patterns or input from untrusted sources.
|
|
19
13
|
|
|
20
|
-
|
|
14
|
+
In contrast, RE2JS utilizes a combination of Deterministic Finite Automaton (DFA) and Nondeterministic Finite Automaton (NFA) strategies to explore all matches simultaneously in a single pass over the input data. This approach guarantees $O(n)$ linear time complexity, providing a secure environment for both Node.js and browser applications.
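To make the single-pass idea concrete, here is a minimal sketch of state-set simulation. It is not RE2JS's implementation: the hand-built transition table for the full-match pattern `(a+)+!` and the name `matchesAPlusBang` are assumptions made for illustration only.

```js
// Hypothetical hand-built NFA for the full-match pattern (a+)+!
// State 0: start, state 1: one or more 'a' seen, state 2: accepting.
const transitions = {
  0: { a: [1] },
  1: { a: [1], '!': [2] },
  2: {},
}
const ACCEPT = 2

function matchesAPlusBang(input) {
  // Track every state the automaton could be in, instead of backtracking.
  let current = new Set([0])
  for (const ch of input) {
    const next = new Set()
    for (const state of current) {
      for (const target of transitions[state][ch] ?? []) {
        next.add(target)
      }
    }
    if (next.size === 0) return false // no live states left: reject early
    current = next
  }
  return current.has(ACCEPT)
}

console.log(matchesAPlusBang('aaaa!')) // true
console.log(matchesAPlusBang('a'.repeat(50000))) // false, after a single pass
```

Because the work per character is bounded by the number of states, the total cost grows linearly with the input length, whatever the input looks like.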
|
|
21
15
|
|
|
22
16
|
## Installation
|
|
23
17
|
|
|
@@ -236,6 +230,57 @@ RE2JS.compile(':').split('boo:and:foo', 2) // ['boo', 'and:foo']
|
|
|
236
230
|
RE2JS.compile(':').split('boo:and:foo', 5) // ['boo', 'and', 'foo']
|
|
237
231
|
```
|
|
238
232
|
|
|
233
|
+
### Multi-Pattern Matching (RE2Set)
|
|
234
|
+
|
|
235
|
+
RE2JS includes a highly optimized `RE2Set` API that allows you to match multiple regular expressions against a single string simultaneously. Instead of running 100 different regexes in a loop (100 passes over the input, or $O(kn)$ for $k$ patterns), `RE2Set` compiles them into a single state machine and finds all matches in a single pass ($O(n)$ linear time).
|
|
236
|
+
|
|
237
|
+
This is useful for workloads such as profanity filters, routing engines, and log parsers.
|
|
238
|
+
|
|
239
|
+
```js
|
|
240
|
+
import { RE2Set, RE2Flags } from 're2js'
|
|
241
|
+
|
|
242
|
+
// Create a new set. You can optionally pass anchoring and flags.
|
|
243
|
+
// Default: RE2Flags.UNANCHORED, RE2Flags.PERL
|
|
244
|
+
const set = new RE2Set()
|
|
245
|
+
|
|
246
|
+
// Add patterns to the set.
|
|
247
|
+
// The add() method returns the integer ID of the pattern.
|
|
248
|
+
set.add('error') // ID: 0
|
|
249
|
+
set.add('warning') // ID: 1
|
|
250
|
+
set.add('critical') // ID: 2
|
|
251
|
+
|
|
252
|
+
// You must compile the set before matching!
|
|
253
|
+
set.compile()
|
|
254
|
+
|
|
255
|
+
// Match against a string.
|
|
256
|
+
// Returns an array of IDs for all patterns that successfully matched.
|
|
257
|
+
console.log(set.match('The system encountered a critical error.'))
|
|
258
|
+
// Outputs: [0, 2]
|
|
259
|
+
|
|
260
|
+
console.log(set.match('All systems operational.'))
|
|
261
|
+
// Outputs: []
|
|
262
|
+
```
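For comparison, the loop-based approach that a pattern set is meant to replace might look like the sketch below. It uses native `RegExp` objects rather than RE2JS, and the helper name `matchAllNaive` is hypothetical.

```js
// Baseline sketch: one native RegExp per pattern, tested in a loop.
const patterns = [/error/, /warning/, /critical/]

function matchAllNaive(input) {
  const ids = []
  patterns.forEach((re, id) => {
    if (re.test(input)) ids.push(id) // each test() call rescans the input
  })
  return ids
}

console.log(matchAllNaive('The system encountered a critical error.')) // [0, 2]
console.log(matchAllNaive('All systems operational.')) // []
```

The result has the same shape as the array of IDs described above, but the input is scanned once per pattern instead of once overall.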
|
|
263
|
+
|
|
264
|
+
#### Anchoring a Set
|
|
265
|
+
|
|
266
|
+
You can strictly anchor the entire set by passing an anchor flag to the constructor:
|
|
267
|
+
|
|
268
|
+
```js
|
|
269
|
+
import { RE2Set, RE2Flags } from 're2js'
|
|
270
|
+
|
|
271
|
+
const set = new RE2Set(RE2Flags.ANCHOR_BOTH)
|
|
272
|
+
set.add('foo') // ID: 0
|
|
273
|
+
set.add('bar') // ID: 1
|
|
274
|
+
set.add('.*') // ID: 2
|
|
275
|
+
|
|
276
|
+
set.compile()
|
|
277
|
+
|
|
278
|
+
console.log(set.match('foo')) // [0, 2] (Matches 'foo' and '.*')
|
|
279
|
+
console.log(set.match('foobar')) // [2] (Only '.*' matches the entire string)
|
|
280
|
+
```
|
|
281
|
+
|
|
282
|
+
***Performance Note:** `RE2Set` heavily utilizes the high-speed DFA engine to process multi-pattern matches simultaneously. However, if your patterns contain boundaries (e.g., `\b`) or trigger a massive state explosion, it will seamlessly and safely fall back to the bounded NFA engine.*
|
|
283
|
+
|
|
239
284
|
### Working with Groups
|
|
240
285
|
|
|
241
286
|
RE2JS supports capturing groups in regex patterns
|
|
@@ -314,6 +359,27 @@ if (mString.matches()) {
|
|
|
314
359
|
}
|
|
315
360
|
```
|
|
316
361
|
|
|
362
|
+
#### Extracting All Named Groups
|
|
363
|
+
|
|
364
|
+
If you have multiple named capturing groups, the `getNamedGroups()` method provides a convenient way to retrieve all of them at once as a JavaScript dictionary (object). If an optional group was not matched, its value will be `null`.
|
|
365
|
+
|
|
366
|
+
```js
|
|
367
|
+
import { RE2JS } from 're2js'
|
|
368
|
+
|
|
369
|
+
const p = RE2JS.compile('(?P<first>\\w+) (?:(?P<middle>\\w+) )?(?P<last>\\w+)')
|
|
370
|
+
const matchString = p.matcher('John Doe')
|
|
371
|
+
|
|
372
|
+
if (matchString.matches()) {
|
|
373
|
+
matchString.getNamedGroups()
|
|
374
|
+
// Returns:
|
|
375
|
+
// {
|
|
376
|
+
// first: 'John',
|
|
377
|
+
// middle: null,
|
|
378
|
+
// last: 'Doe'
|
|
379
|
+
// }
|
|
380
|
+
}
|
|
381
|
+
```
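If you are migrating from native `RegExp`, note that native named groups use the `(?<name>...)` syntax and report an unmatched optional group as `undefined` rather than `null`. The sketch below shows the native behavior and one way to normalize it; `normalizeGroups` is a hypothetical helper, not part of RE2JS.

```js
const nativePattern = /(?<first>\w+) (?:(?<middle>\w+) )?(?<last>\w+)/
const result = nativePattern.exec('John Doe')

// Native RegExp leaves an unmatched optional group as undefined.
console.log(result.groups.middle) // undefined

// Normalize undefined to null so the object has the shape described above.
function normalizeGroups(groups) {
  return Object.fromEntries(
    Object.entries(groups).map(([name, value]) => [name, value ?? null])
  )
}

console.log(normalizeGroups(result.groups))
// { first: 'John', middle: null, last: 'Doe' }
```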
|
|
382
|
+
|
|
317
383
|
### Replacing Matches
|
|
318
384
|
|
|
319
385
|
RE2JS allows you to replace all occurrences or the first occurrence of a pattern match in a string with a specific replacement string
|
|
@@ -445,9 +511,23 @@ RE2JS.matches(unicodeRegexp, '😀') // true
|
|
|
445
511
|
RE2JS.matches(unicodeRegexp, '😃') // false
|
|
446
512
|
```
|
|
447
513
|
|
|
448
|
-
## Performance
|
|
514
|
+
## Performance and Architecture
|
|
515
|
+
|
|
516
|
+
The RE2JS engine provides strict linear-time $O(n)$ safety guarantees against Regular Expression Denial of Service (ReDoS) attacks, a critical vulnerability inherent to native JavaScript `RegExp` objects.
|
|
517
|
+
|
|
518
|
+
Originally, the C++ implementation of the RE2 engine included both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines with highly optimized memory operations. Russ Cox later ported the core engine to Go, and Alan Donovan ported it to Java.
|
|
519
|
+
|
|
520
|
+
`re2js` achieves full architectural parity with the highly optimized Go `regexp` package and incorporates advanced performance features from the original C++ engine. To maximize execution speed on everyday queries without ever sacrificing memory safety, `re2js` intelligently and dynamically routes execution through a highly advanced multi-tiered architecture:
|
|
521
|
+
|
|
522
|
+
* **The Prefilter Engine:** Analyzes the Abstract Syntax Tree (AST) before execution to extract mandatory string literals (e.g., extracting `"error"` and `"critical"` from `/error.*critical/`). It uses blistering-fast native JavaScript `indexOf` to instantly reject mismatches, completely bypassing the regex state-machines.
|
|
523
|
+
* **Aggressive AST Simplification:** Trims impossible match branches and collapses redundant logic prior to compilation, mathematically pruning dead execution paths to dramatically reduce the size of the generated state machine.
|
|
524
|
+
* **Multi-Pattern Sets (`RE2Set`):** Combines hundreds or thousands of regular expressions into a single combined DFA, allowing you to search a string for all patterns simultaneously in strict $O(n)$ linear time.
|
|
525
|
+
* **OnePass DFA:** Provides high-speed capture group extraction for mathematically 1-unambiguous patterns, bypassing thread queues entirely.
|
|
526
|
+
* **Lazy Powerset DFA:** Executes high-speed boolean matches (e.g., `.test()`) by fusing active states dynamically on the fly.
|
|
527
|
+
* **BitState Backtracker:** Avoids heavy object array allocations by using bitwise operations to extract captures on short-to-medium length strings.
|
|
528
|
+
* **Pike VM (NFA):** Acts as the robust, bounded-memory fallback engine for complex, ambiguous patterns that exceed fast-path limits.
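The prefilter idea from the list above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: the literals are hard-coded rather than extracted from an AST, a native `RegExp` stands in for the full matcher, and `testWithPrefilter` is a hypothetical name.

```js
// Hypothetical prefilter for /error.*critical/:
// both literals must appear in the input, in this order.
const requiredLiterals = ['error', 'critical']
const fullPattern = /error.*critical/

function testWithPrefilter(input) {
  let from = 0
  for (const literal of requiredLiterals) {
    const at = input.indexOf(literal, from)
    if (at === -1) return false // cheap rejection, the matcher never runs
    from = at + literal.length
  }
  // Literals are present: confirm with the real matcher,
  // since '.' does not cross newlines and the prefilter cannot know that.
  return fullPattern.test(input)
}

console.log(testWithPrefilter('error: disk state critical')) // true
console.log(testWithPrefilter('critical error')) // false, wrong order
console.log(testWithPrefilter('all systems operational')) // false
```

The prefilter only ever rejects inputs that cannot match, so the final answer is always decided by the full matcher when the literals are present.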
|
|
449
529
|
|
|
450
|
-
|
|
530
|
+
Thanks to these dynamic fast-paths, `re2js` delivers performance comparable to native engines for simple queries, while remaining completely immune to catastrophic backtracking and stack overflow crashes.
|
|
451
531
|
|
|
452
532
|
Should you require maximum absolute performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
|
|
453
533
|
|
|
@@ -456,26 +536,27 @@ Should you require maximum absolute performance on the server side when using RE
|
|
|
456
536
|
|
|
457
537
|
### RE2JS vs RE2-Node (C++ Bindings)
|
|
458
538
|
|
|
459
|
-
Because RE2JS
|
|
539
|
+
Because RE2JS's Lazy DFA, Prefilter, and OnePass engines operate efficiently within V8's Just-In-Time (JIT) compiler, they can outperform native C++ bindings (`re2-node`) for many operations by avoiding the cross-boundary serialization costs between JavaScript and C++.
|
|
460
540
|
|
|
461
|
-
Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths:
|
|
541
|
+
Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths (averages of multiple runs):
|
|
462
542
|
|
|
463
|
-
| Benchmark Scenario | Pattern Example | RE2JS (Pure JS) | RE2-Node (C++) | Result
|
|
464
|
-
|
|
465
|
-
| **
|
|
466
|
-
| **
|
|
467
|
-
| **
|
|
468
|
-
| **
|
|
469
|
-
| **
|
|
470
|
-
| **
|
|
471
|
-
| **
|
|
472
|
-
| **
|
|
473
|
-
| **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` |
|
|
543
|
+
| Benchmark Scenario | Pattern Example | RE2JS (Pure JS) | RE2-Node (C++) | Result |
|
|
544
|
+
|:--------------------------|:---------------------------|:----------------|:---------------|:-----------------------------|
|
|
545
|
+
| **Simple Literal** | `/damage/` | **~5.82 ms** | ~14.08 ms | `re2js` is **~2.42x faster** |
|
|
546
|
+
| **Greedy Wildcard** | `/enters.*battlefield/` | **~8.44 ms** | ~13.32 ms | `re2js` is **~1.58x faster** |
|
|
547
|
+
| **Lazy Wildcard** | `/enters.*?battlefield/` | **~8.43 ms** | ~13.33 ms | `re2js` is **~1.58x faster** |
|
|
548
|
+
| **Deep State Machine** | `/([0-9]+(/[0-9]+)+)/` | **~7.71 ms** | ~16.08 ms | `re2js` is **~2.09x faster** |
|
|
549
|
+
| **Massive Alternation** | `/White\|Blue\|Black.../` | **~11.62 ms** | ~14.99 ms | `re2js` is **~1.29x faster** |
|
|
550
|
+
| **Bounded Repetition** | `/[A-Z][a-z]{5,15}/` | **~12.20 ms** | ~13.77 ms | `re2js` is **~1.13x faster** |
|
|
551
|
+
| **ReDoS Attempt** | `/(a+)+!/` | **~5.68 ms** | ~16.25 ms | `re2js` is **~2.86x faster** |
|
|
552
|
+
| **Case Insensitive** | `/(?i)swamp/` | ~18.71 ms | **~16.22 ms** | `re2-node` is ~1.15x faster |
|
|
553
|
+
| **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` | ~57.24 ms | **~15.66 ms** | `re2-node` is ~3.66x faster |
|
|
474
554
|
|
|
475
555
|
**Takeaways:**
|
|
476
|
-
* **
|
|
477
|
-
* **
|
|
478
|
-
* **
|
|
556
|
+
* **The Literal & Prefilter Advantage (JS wins):** For simple text searches like literals and wildcards, RE2JS's Literal Fast-Path and Prefilter Engine leverage highly optimized native JavaScript `indexOf` string scanning. By bypassing the regex state machines completely, pure JavaScript now outperforms native C++ bindings by **~1.5x to 2.4x**.
|
|
557
|
+
* **State-Heavy Tasks (JS wins):** For complex state machines, massive alternations, and catastrophic backtracking (ReDoS) attempts, RE2JS operates entirely within V8's highly optimized JIT. Avoiding the JS-to-C++ N-API bridge overhead allows pure JavaScript to beat native bindings by **~1.1x to 2.9x**.
|
|
558
|
+
* **Case Insensitivity (C++ wins):** Case-folded literal matching currently skips the prefilter and requires full DFA state-machine evaluation, giving C++ a slight ~1.15x edge due to raw memory scanning speeds.
|
|
559
|
+
* **The Fallback Engines (C++ wins):** Pure DFA engines mathematically cannot track look-behind context like Word Boundaries (`\b`). When RE2JS encounters these, it safely bails out to its NFA engine. As shown in the benchmarks, the pure JS NFA fallback is slower than the C++ NFA. **For maximum performance in RE2JS, avoid `\b` when doing bulk boolean `.test()` matching.**
|
|
479
560
|
|
|
480
561
|
### RE2JS vs JavaScript's native RegExp
|
|
481
562
|
|
|
@@ -508,13 +589,13 @@ RE2JS processed this poison-pill string **30,000 times in just ~454 milliseconds
|
|
|
508
589
|
|
|
509
590
|
## Development
|
|
510
591
|
|
|
511
|
-
Some files like `CharGroup.js` and `UnicodeTables.js`
|
|
592
|
+
Some files like `CharGroup.js` and `UnicodeTables.js` are generated and should be edited in their respective generator files:
|
|
512
593
|
|
|
513
594
|
```bash
|
|
514
595
|
./tools/scripts/make_perl_groups.pl > src/CharGroup.js
|
|
515
596
|
yarn node ./tools/scripts/genUnicodeTable.js > src/UnicodeTables.js
|
|
516
597
|
```
|
|
517
598
|
|
|
518
|
-
To run `make_perl_groups.pl
|
|
599
|
+
To run `make_perl_groups.pl`, you need to have Perl installed (the required version is specified inside `.tool-versions`).
|
|
519
600
|
|
|
520
601
|
The [playground website](https://re2js.leopard.in.ua/) is maintained in the `www` branch.
|