re2js 2.0.0 → 2.0.2
- package/README.md +39 -25
- package/build/index.cjs.cjs +381 -309
- package/build/index.cjs.cjs.map +1 -1
- package/build/index.esm.d.ts.map +1 -1
- package/build/index.esm.js +381 -309
- package/build/index.esm.js.map +1 -1
- package/build/index.umd.js +381 -309
- package/build/index.umd.js.map +1 -1
- package/package.json +3 -3
package/README.md
CHANGED
|
@@ -356,10 +356,10 @@ import { RE2JS } from 're2js'
|
|
|
356
356
|
|
|
357
357
|
RE2JS.compile('(\\w+) (\\w+)')
|
|
358
358
|
.matcher('Hello World')
|
|
359
|
-
.replaceAll('
|
|
359
|
+
.replaceAll('$& - $&') // 'Hello World - Hello World'
|
|
360
360
|
RE2JS.compile('(\\w+) (\\w+)')
|
|
361
361
|
.matcher('Hello World')
|
|
362
|
-
.replaceAll('
|
|
362
|
+
.replaceAll('$0 - $0', true) // 'Hello World - Hello World'
|
|
363
363
|
```
|
|
364
364
|
|
|
365
365
|
#### Replacing the First Occurrence
|
|
@@ -447,50 +447,64 @@ RE2JS.matches(unicodeRegexp, '😃') // false
|
|
|
447
447
|
|
|
448
448
|
## Performance
|
|
449
449
|
|
|
450
|
-
The RE2JS engine runs more slowly compared to native RegExp objects. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of
|
|
450
|
+
The RE2JS engine runs more slowly compared to native RegExp objects for simple queries. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of highly optimized memory operations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang additions + Lazy DFA fast-path) to a pure JS version.
|
|
451
451
|
|
|
452
|
-
Should you require
|
|
452
|
+
Should you require maximum absolute performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
|
|
453
453
|
|
|
454
|
-
- [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2
|
|
454
|
+
- [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 C++ binding for Node.js
|
|
455
455
|
- [RE2-WASM](https://github.com/google/re2-wasm/): This package is a WASM wrapper for RE2. Please note, as of now, it does not work in browsers
|
|
456
456
|
|
|
457
|
+
### RE2JS vs RE2-Node (C++ Bindings)
|
|
458
|
+
|
|
459
|
+
Because RE2JS implements a Just-In-Time (JIT) compiled DFA, it can actually perform on par with—and sometimes faster than—native C++ bindings (`re2-node`) by avoiding the cross-boundary serialization costs between JavaScript and C++.
|
|
460
|
+
|
|
461
|
+
Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths:
|
|
462
|
+
|
|
463
|
+
| Benchmark Scenario | Pattern Example | RE2JS (Pure JS) | RE2-Node (C++) | Result |
|
|
464
|
+
|:--------------------------|:---------------------------|:----------------|:---------------|:----------------------------|
|
|
465
|
+
| **Bounded Repetition** | `/[A-Z][a-z]{5,15}/` | **11.80 ms** | 14.79 ms | `re2js` is **1.25x** faster |
|
|
466
|
+
| **Massive Alternation** | `/White\|Blue\|Black.../` | **15.42 ms** | 16.02 ms | `re2js` is **1.04x** faster |
|
|
467
|
+
| **Deep State Machine** | `/([0-9]+(/[0-9]+)+)/` | 19.34 ms | **17.16 ms** | `re2-node` is 1.13x faster |
|
|
468
|
+
| **Case Insensitive** | `/(?i)swamp/` | 20.27 ms | **17.26 ms** | `re2-node` is 1.17x faster |
|
|
469
|
+
| **ReDoS Attempt** | `/(a+)+!/` | 20.56 ms | **17.33 ms** | `re2-node` is 1.19x faster |
|
|
470
|
+
| **Greedy Wildcard** | `/enters.*battlefield/` | 18.93 ms | **14.33 ms** | `re2-node` is 1.32x faster |
|
|
471
|
+
| **Lazy Wildcard** | `/enters.*?battlefield/` | 18.93 ms | **14.16 ms** | `re2-node` is 1.34x faster |
|
|
472
|
+
| **Simple Literal** | `/damage/` | 19.54 ms | **14.11 ms** | `re2-node` is 1.39x faster |
|
|
473
|
+
| **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` | 296.16 ms | **16.73 ms** | `re2-node` is 17.70x faster |
|
|
474
|
+
|
|
475
|
+
**Takeaways:**
|
|
476
|
+
* **DFA Strengths:** For state-heavy tasks like massive alternations (`White|Blue|...`) or bounded repetitions (`{5,15}`), RE2JS operates entirely within V8's optimized JIT and actually outpaces C++ bindings.
|
|
477
|
+
* **C++ Strengths:** For simple string scanning (like literal or wildcard searches), C++ wins because it can utilize optimized, hardware-level raw memory scanning operations (like `memchr`).
|
|
478
|
+
* **The NFA Fallback:** The RE2JS DFA fast-path does not track the look-behind context needed for word boundaries (`\b`). When RE2JS encounters these, it safely bails out to the much slower NFA engine, resulting in a large performance gap compared to C++.
|
|
479
|
+
|
|
457
480
|
### RE2JS vs JavaScript's native RegExp
|
|
458
481
|
|
|
459
|
-
These examples illustrate the performance comparison between the RE2JS library and JavaScript's native RegExp for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario
|
|
482
|
+
These examples illustrate the performance comparison between the RE2JS library and JavaScript's native `RegExp` for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario.
|
|
460
483
|
|
|
461
484
|
```js
|
|
462
485
|
const regex = 'a+'
|
|
463
486
|
const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
|
|
464
487
|
|
|
465
|
-
|
|
466
|
-
|
|
488
|
+
// Running 30,000 iterations
|
|
489
|
+
RE2JS.compile(regex).test(string) // Total time: ~9.87 ms
|
|
490
|
+
new RegExp(regex).test(string) // Total time: ~11.43 ms
|
|
467
491
|
```
|
|
468
492
|
|
|
469
|
-
|
|
493
|
+
For safe, simple patterns, the RE2JS DFA fast-path is heavily optimized and performs at parity with—or even slightly faster than—V8's native RegExp engine.
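
A total-time figure like the ones quoted above could be collected with a loop of roughly this shape. This is a minimal sketch, not the harness that produced the README's numbers: `timeIterations` is a hypothetical helper, and it is exercised with the native `RegExp` only so that the block runs without extra dependencies. It assumes a runtime where `performance` is a global (browsers and recent Node.js versions).

```js
// Hypothetical helper: call `fn` `iterations` times and report the elapsed milliseconds.
function timeIterations(fn, iterations) {
  const start = performance.now()
  let last
  for (let i = 0; i < iterations; i++) {
    last = fn()
  }
  // `last` is returned so the result of the final call can be checked.
  return { ms: performance.now() - start, last }
}

const regex = 'a+'
const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'

// Assumption: the pattern is compiled once, outside the timed loop.
const native = new RegExp(regex)
const result = timeIterations(() => native.test(string), 30000)
console.log(`native RegExp: ${result.ms.toFixed(2)} ms, matched: ${result.last}`)
```

To time RE2JS the same way, the callback would wrap the `RE2JS.compile(regex).test(string)` call from the snippet above. Whether compilation belongs inside or outside the timed loop is a design choice that the README does not specify, and it can change the totals noticeably.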
|
|
470
494
|
|
|
471
495
|
```js
|
|
472
496
|
const regex = '([a-z]+)+$'
|
|
473
497
|
const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
|
|
474
498
|
|
|
475
|
-
|
|
476
|
-
|
|
499
|
+
// Running 30,000 iterations
|
|
500
|
+
RE2JS.compile(regex).test(string) // Total time: ~454.17 ms
|
|
501
|
+
// Running EXACTLY 1 iteration
|
|
502
|
+
new RegExp(regex).test(string) // Total time: ~105802.02 ms (over 105 seconds)
|
|
477
503
|
```
|
|
478
504
|
|
|
479
|
-
In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one
|
|
480
|
-
|
|
481
|
-
The string is the same as in the first example, which does not pose a problem for either RE2JS or RegExp under normal circumstances. However, when dealing with the nested quantifier, RE2JS took around **3.62 ms** to find a match, while RegExp took significantly longer, around **103768.26 ms (~103 seconds)**. This demonstrates that RE2JS is much more efficient in handling potentially harmful regular expressions, thus preventing ReDoS attacks.
|
|
482
|
-
|
|
483
|
-
In conclusion, while JavaScript's native RegExp might be faster for simple regular expressions, RE2JS offers significant performance advantages when dealing with complex or potentially dangerous regular expressions. RE2JS provides protection against excessive backtracking that could lead to performance issues or ReDoS attacks.
|
|
484
|
-
|
|
485
|
-
## Rationale for RE2 JavaScript port
|
|
486
|
-
|
|
487
|
-
There are several reasons that underscore the importance of having an RE2 vanilla JavaScript (JS) port.
|
|
488
|
-
|
|
489
|
-
Firstly, it enables RE2 JS validation on the client side within the browser. This is vital as it allows the implementation and execution of regular expression operations directly in the browser, enhancing performance by reducing the necessity of server-side computations and back-and-forth communication.
|
|
490
|
-
|
|
491
|
-
Secondly, it provides a platform for simple RE2 parsing, specifically for the extraction of regex groups. This feature is particularly useful when dealing with complex regular expressions, as it allows for the breakdown of regex patterns into manageable and identifiable segments or 'groups'.
|
|
505
|
+
In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is potentially problematic because it contains a nested quantifier. In backtracking engines (like JavaScript's native `RegExp`), nested quantifiers can cause catastrophic backtracking. If a malicious user supplies a carefully crafted string, processing time grows exponentially with input length, which can lead to a Denial of Service (DoS) attack.
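
The growth in backtracking work can be observed safely with short inputs. The sketch below is only an illustration using the native `RegExp`. The helper name `timeNativeMatch` is hypothetical, and the input lengths are deliberately small, because each additional character roughly doubles the work in a backtracking engine. No timings are quoted here because they vary by machine and engine version.

```js
// Hypothetical helper: time one native RegExp test on n 'a' characters followed by '!'.
function timeNativeMatch(pattern, n) {
  const input = 'a'.repeat(n) + '!'
  const re = new RegExp(pattern)
  const start = performance.now()
  const matched = re.test(input)
  return { n, matched, ms: performance.now() - start }
}

// Keep n small: the README's longer input took over 105 seconds for a single test.
const rows = [12, 14, 16, 18, 20].map((n) => timeNativeMatch('([a-z]+)+$', n))
for (const row of rows) {
  console.log(`n=${row.n} matched=${row.matched} ms=${row.ms.toFixed(3)}`)
}
```

Every row reports `matched=false`, since the trailing `!` prevents `$` from ever being satisfied. The engine only reaches that answer after exhausting the ways of splitting the run of `a` characters between the inner and outer quantifier.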
|
|
492
506
|
|
|
493
|
-
|
|
507
|
+
RE2JS processed this poison-pill string **30,000 times in just ~454 milliseconds**, while the native RegExp completely locked up the main thread for **over 1 minute and 45 seconds trying to evaluate it just once**. This demonstrates why RE2JS is absolutely essential for securely handling untrusted regular expressions and protecting Node.js and browser applications against ReDoS attacks.
|
|
494
508
|
|
|
495
509
|
## Development
|
|
496
510
|
|