re2js 2.0.0 → 2.0.1
- package/README.md +39 -25
- package/build/index.cjs.cjs +59 -45
- package/build/index.cjs.cjs.map +1 -1
- package/build/index.esm.d.ts.map +1 -1
- package/build/index.esm.js +59 -45
- package/build/index.esm.js.map +1 -1
- package/build/index.umd.js +59 -45
- package/build/index.umd.js.map +1 -1
- package/package.json +1 -1
package/README.md
CHANGED
@@ -356,10 +356,10 @@ import { RE2JS } from 're2js'
 
 RE2JS.compile('(\\w+) (\\w+)')
   .matcher('Hello World')
-  .replaceAll('
+  .replaceAll('$& - $&') // 'Hello World - Hello World'
 RE2JS.compile('(\\w+) (\\w+)')
   .matcher('Hello World')
-  .replaceAll('
+  .replaceAll('$0 - $0', true) // 'Hello World - Hello World'
 ```
 
 #### Replacing the First Occurrence
@@ -447,50 +447,64 @@ RE2JS.matches(unicodeRegexp, '😃') // false
 
 ## Performance
 
-The RE2JS engine runs more slowly compared to native RegExp objects. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of
+The RE2JS engine runs more slowly compared to native RegExp objects for simple queries. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of highly optimized memory operations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang additions + Lazy DFA fast-path) to a pure JS version.
 
-Should you require
+Should you require maximum absolute performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
 
-- [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2
+- [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 C++ binding for Node.js
 - [RE2-WASM](https://github.com/google/re2-wasm/): This package is a WASM wrapper for RE2. Please note, as of now, it does not work in browsers
 
+### RE2JS vs RE2-Node (C++ Bindings)
+
+Because RE2JS implements a Just-In-Time (JIT) compiled DFA, it can actually perform on par with, and sometimes faster than, native C++ bindings (`re2-node`) by avoiding the cross-boundary serialization costs between JavaScript and C++.
+
+Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths:
+
+| Benchmark Scenario | Pattern Example | RE2JS (Pure JS) | RE2-Node (C++) | Result |
+|:--------------------------|:---------------------------|:----------------|:---------------|:----------------------------|
+| **Bounded Repetition** | `/[A-Z][a-z]{5,15}/` | **11.80 ms** | 14.79 ms | `re2js` is **1.25x** faster |
+| **Massive Alternation** | `/White\|Blue\|Black.../` | **15.42 ms** | 16.02 ms | `re2js` is **1.04x** faster |
+| **Deep State Machine** | `/([0-9]+(/[0-9]+)+)/` | 19.34 ms | **17.16 ms** | `re2-node` is 1.13x faster |
+| **Case Insensitive** | `/(?i)swamp/` | 20.27 ms | **17.26 ms** | `re2-node` is 1.17x faster |
+| **ReDoS Attempt** | `/(a+)+!/` | 20.56 ms | **17.33 ms** | `re2-node` is 1.19x faster |
+| **Greedy Wildcard** | `/enters.*battlefield/` | 18.93 ms | **14.33 ms** | `re2-node` is 1.32x faster |
+| **Lazy Wildcard** | `/enters.*?battlefield/` | 18.93 ms | **14.16 ms** | `re2-node` is 1.34x faster |
+| **Simple Literal** | `/damage/` | 19.54 ms | **14.11 ms** | `re2-node` is 1.39x faster |
+| **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` | 296.16 ms | **16.73 ms** | `re2-node` is 17.70x faster |
+
+**Takeaways:**
+* **DFA Strengths:** For state-heavy tasks like massive alternations (`White|Blue|...`) or bounded repetitions (`{5,15}`), RE2JS operates entirely within V8's optimized JIT and actually outpaces C++ bindings.
+* **C++ Strengths:** For simple string scanning (like literal or wildcard searches), C++ wins because it can utilize optimized, hardware-level raw memory scanning operations (like `memchr`).
+* **The NFA Fallback:** Pure DFA engines mathematically cannot track look-behind context like Word Boundaries (`\b`). When RE2JS encounters these, it safely bails out to the much slower NFA engine, resulting in a large performance gap compared to C++.
+
 ### RE2JS vs JavaScript's native RegExp
 
-These examples illustrate the performance comparison between the RE2JS library and JavaScript's native RegExp for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario
+These examples illustrate the performance comparison between the RE2JS library and JavaScript's native `RegExp` for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario.
 
 ```js
 const regex = 'a+'
 const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
 
-
-
+// Running 30,000 iterations
+RE2JS.compile(regex).test(string) // Total time: ~9.87 ms
+new RegExp(regex).test(string) // Total time: ~11.43 ms
 ```
 
-
+For safe, simple patterns, the RE2JS DFA fast-path is heavily optimized and performs at parity with, or even slightly faster than, V8's native RegExp engine.
 
 ```js
 const regex = '([a-z]+)+$'
 const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
 
-
-
+// Running 30,000 iterations
+RE2JS.compile(regex).test(string) // Total time: ~454.17 ms
+// Running EXACTLY 1 iteration
+new RegExp(regex).test(string) // Total time: ~105802.02 ms (over 105 seconds)
 ```
 
-In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one
-
-The string is the same as in the first example, which does not pose a problem for either RE2JS or RegExp under normal circumstances. However, when dealing with the nested quantifier, RE2JS took around **3.62 ms** to find a match, while RegExp took significantly longer, around **103768.26 ms (~103 seconds)**. This demonstrates that RE2JS is much more efficient in handling potentially harmful regular expressions, thus preventing ReDoS attacks.
-
-In conclusion, while JavaScript's native RegExp might be faster for simple regular expressions, RE2JS offers significant performance advantages when dealing with complex or potentially dangerous regular expressions. RE2JS provides protection against excessive backtracking that could lead to performance issues or ReDoS attacks.
-
-## Rationale for RE2 JavaScript port
-
-There are several reasons that underscore the importance of having an RE2 vanilla JavaScript (JS) port.
-
-Firstly, it enables RE2 JS validation on the client side within the browser. This is vital as it allows the implementation and execution of regular expression operations directly in the browser, enhancing performance by reducing the necessity of server-side computations and back-and-forth communication.
-
-Secondly, it provides a platform for simple RE2 parsing, specifically for the extraction of regex groups. This feature is particularly useful when dealing with complex regular expressions, as it allows for the breakdown of regex patterns into manageable and identifiable segments or 'groups'.
+In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one because it contains a nested quantifier. In standard NFA engines (like JavaScript's native `RegExp`), nested quantifiers can cause catastrophic backtracking. If a malicious user inputs a carefully crafted string, it results in exponentially high processing times, leading to a Denial of Service (DoS) attack.
 
-
+RE2JS processed this poison-pill string **30,000 times in just ~454 milliseconds**, while the native RegExp completely locked up the main thread for **over 1 minute and 45 seconds trying to evaluate it just once**. This demonstrates why RE2JS is absolutely essential for securely handling untrusted regular expressions and protecting Node.js and browser applications against ReDoS attacks.
 
 ## Development
 
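The README figures in this hunk quote a total time for 30,000 `.test()` calls, but the measuring harness is not part of the diff. A minimal sketch of how such a total could be taken, assuming Node's global `performance.now()`. `timeIterations` is a hypothetical helper, not something the package ships, and compiling the pattern once outside the loop is also an assumption. Only the native `RegExp` side with the safe `a+` pattern is run here, so the block has no dependencies; timings will differ per machine.

```js
// Hypothetical helper: total wall-clock time for `iterations` calls of `fn`.
function timeIterations(fn, iterations) {
  let matches = 0
  const startedAt = performance.now()
  for (let i = 0; i < iterations; i++) {
    // Count the results so every call contributes to the return value.
    if (fn()) matches++
  }
  return { iterations, matches, totalMs: performance.now() - startedAt }
}

const regex = 'a+'
const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'

// Assumption: the pattern is compiled once and only .test() is timed.
const native = new RegExp(regex)
const result = timeIterations(() => native.test(string), 30000)

console.log(`RegExp: ${result.totalMs.toFixed(2)} ms for ${result.iterations} iterations`)
```

The RE2JS side could presumably be timed the same way by passing a closure around `RE2JS.compile(regex).test(string)`, the call shown in the README snippet. The nested-quantifier pattern should not be looped through native `RegExp`, since the README reports a single evaluation taking over 105 seconds.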
package/build/index.cjs.cjs
CHANGED
@@ -2,7 +2,7 @@
  * re2js
  * RE2JS is the JavaScript port of RE2, a regular expression engine that provides linear time matching
  *
- * @version v2.0.
+ * @version v2.0.1
  * @author Alexey Vasiliev
  * @homepage https://github.com/le0pard/re2js#readme
  * @repository github:le0pard/re2js
@@ -361,6 +361,11 @@ class Unicode {
   // Checked during test.
   static MIN_FOLD = 0x0041;
   static MAX_FOLD = 0x1e943;
+  static MIN_HIGH_SURROGATE = 0xd800;
+  static MAX_HIGH_SURROGATE = 0xdbff;
+  static MIN_LOW_SURROGATE = 0xdc00;
+  static MAX_LOW_SURROGATE = 0xdfff;
+  static MIN_SUPPLEMENTARY_CODE_POINT = 0x10000;
 
   // is32 uses binary search to test whether rune is in the specified
   // slice of 32-bit ranges.
@@ -667,9 +672,9 @@ class Utils {
       } else if (c < 2048) {
         out[p++] = c >> 6 | 192;
         out[p++] = c & 63 | 128;
-      } else if ((c & 0xfc00) ===
+      } else if ((c & 0xfc00) === Unicode.MIN_HIGH_SURROGATE && i + 1 < str.length && (str.charCodeAt(i + 1) & 0xfc00) === Unicode.MIN_LOW_SURROGATE) {
         // Surrogate Pair
-        c =
+        c = Unicode.MIN_SUPPLEMENTARY_CODE_POINT + ((c & 0x03ff) << 10) + (str.charCodeAt(++i) & 0x03ff);
         out[p++] = c >> 18 | 240;
         out[p++] = c >> 12 & 63 | 128;
         out[p++] = c >> 6 & 63 | 128;
@@ -703,9 +708,9 @@ class Utils {
         let c2 = bytes[pos++];
         let c3 = bytes[pos++];
         let c4 = bytes[pos++];
-        let u = ((c1 & 7) << 18 | (c2 & 63) << 12 | (c3 & 63) << 6 | c4 & 63) -
-        out[c++] = String.fromCharCode(
-        out[c++] = String.fromCharCode(
+        let u = ((c1 & 7) << 18 | (c2 & 63) << 12 | (c3 & 63) << 6 | c4 & 63) - Unicode.MIN_SUPPLEMENTARY_CODE_POINT;
+        out[c++] = String.fromCharCode(Unicode.MIN_HIGH_SURROGATE + (u >> 10));
+        out[c++] = String.fromCharCode(Unicode.MIN_LOW_SURROGATE + (u & 1023));
       } else {
         let c2 = bytes[pos++];
         let c3 = bytes[pos++];
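The two `Utils` hunks above swap magic numbers for the new `Unicode` constants and guard the surrogate-pair branch with a bounds check. The arithmetic can be sketched as a standalone round trip. `stringToUtf8` and `toSurrogatePair` are hypothetical names, and the final three-byte branch for an unpaired surrogate is an assumption, since that branch is not visible in the diff.

```js
const MIN_HIGH_SURROGATE = 0xd800
const MIN_LOW_SURROGATE = 0xdc00
const MIN_SUPPLEMENTARY_CODE_POINT = 0x10000

// Encode a JS string as UTF-8 bytes, pairing surrogates only when a low
// surrogate really follows inside the string.
function stringToUtf8(str) {
  const out = []
  for (let i = 0; i < str.length; i++) {
    let c = str.charCodeAt(i)
    if (c < 0x80) {
      out.push(c)
    } else if (c < 0x800) {
      out.push((c >> 6) | 0xc0, (c & 0x3f) | 0x80)
    } else if (
      (c & 0xfc00) === MIN_HIGH_SURROGATE &&
      i + 1 < str.length &&
      (str.charCodeAt(i + 1) & 0xfc00) === MIN_LOW_SURROGATE
    ) {
      // Combine the pair into one code point, then emit four bytes.
      c = MIN_SUPPLEMENTARY_CODE_POINT + ((c & 0x03ff) << 10) + (str.charCodeAt(++i) & 0x03ff)
      out.push((c >> 18) | 0xf0, ((c >> 12) & 0x3f) | 0x80, ((c >> 6) & 0x3f) | 0x80, (c & 0x3f) | 0x80)
    } else {
      // Assumption: everything else, including an unpaired surrogate,
      // is emitted as a three-byte sequence.
      out.push((c >> 12) | 0xe0, ((c >> 6) & 0x3f) | 0x80, (c & 0x3f) | 0x80)
    }
  }
  return out
}

// Split a supplementary code point back into its surrogate pair.
function toSurrogatePair(codePoint) {
  const u = codePoint - MIN_SUPPLEMENTARY_CODE_POINT
  return String.fromCharCode(MIN_HIGH_SURROGATE + (u >> 10)) +
    String.fromCharCode(MIN_LOW_SURROGATE + (u & 1023))
}

console.log(stringToUtf8('😃')) // [ 240, 159, 152, 131 ]
console.log(toSurrogatePair(0x1f603) === '😃') // true
```

Without the `i + 1 < str.length` guard, a high surrogate at the very end of the string would be combined with `charCodeAt` of an out-of-range index, which returns `NaN`.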
@@ -879,38 +884,34 @@ class MachineUTF8Input extends MachineInputBase {
   // the lower 3 bits, and the rune (Unicode code point) in the high
   // bits. Never negative, except for EOF which is represented as -1
   // << 3 | 0.
-  step(
-
-    if (
+  step(pos) {
+    pos += this.start;
+    if (pos >= this.end) {
       return MachineInputBase.EOF();
     }
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
+
+    // Read UTF-8 bytes to extract the Rune and its width
+    const c = this.bytes[pos] & 0xff;
+    if (c < 0x80) {
+      return c << 3 | 1;
+    } else if (c >= 0xc2 && c <= 0xdf && pos + 1 < this.end) {
+      const c1 = this.bytes[pos + 1] & 0xff;
+      const rune = (c & 0x1f) << 6 | c1 & 0x3f;
+      return rune << 3 | 2;
+    } else if (c >= 0xe0 && c <= 0xef && pos + 2 < this.end) {
+      const c1 = this.bytes[pos + 1] & 0xff;
+      const c2 = this.bytes[pos + 2] & 0xff;
+      const rune = (c & 0x0f) << 12 | (c1 & 0x3f) << 6 | c2 & 0x3f;
+      return rune << 3 | 3;
+    } else if (c >= 0xf0 && c <= 0xf4 && pos + 3 < this.end) {
+      const c1 = this.bytes[pos + 1] & 0xff;
+      const c2 = this.bytes[pos + 2] & 0xff;
+      const c3 = this.bytes[pos + 3] & 0xff;
+      const rune = (c & 0x07) << 18 | (c1 & 0x3f) << 12 | (c2 & 0x3f) << 6 | c3 & 0x3f;
+      return rune << 3 | 4;
     } else {
-
-
-      return MachineInputBase.EOF();
-    }
-    x = x << 6 | this.bytes[i++] & 63;
-    x = x << 6 | this.bytes[i++] & 63;
-    x = x << 6 | this.bytes[i++] & 63;
-    return x << 3 | 4;
+      // Invalid sequence fallback
+      return c << 3 | 1;
     }
   }
 
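The new `step` in the hunk above packs the decoded rune into the high bits and its byte width into the low three bits. A standalone sketch of that packing, with the `this.start` offset dropped for brevity. `stepUtf8`, `decodeAll`, and `EOF_MARKER` are hypothetical names; like the hunk, the sketch checks only the lead byte and the remaining length, not the continuation bytes.

```js
// EOF is described in the source comment as -1 << 3 | 0.
const EOF_MARKER = (-1 << 3) | 0

// Decode one rune at `pos`; returns `rune << 3 | width`.
function stepUtf8(bytes, pos, end) {
  if (pos >= end) return EOF_MARKER
  const c = bytes[pos] & 0xff
  if (c < 0x80) {
    return (c << 3) | 1
  } else if (c >= 0xc2 && c <= 0xdf && pos + 1 < end) {
    const rune = ((c & 0x1f) << 6) | (bytes[pos + 1] & 0x3f)
    return (rune << 3) | 2
  } else if (c >= 0xe0 && c <= 0xef && pos + 2 < end) {
    const rune = ((c & 0x0f) << 12) | ((bytes[pos + 1] & 0x3f) << 6) | (bytes[pos + 2] & 0x3f)
    return (rune << 3) | 3
  } else if (c >= 0xf0 && c <= 0xf4 && pos + 3 < end) {
    const rune = ((c & 0x07) << 18) | ((bytes[pos + 1] & 0x3f) << 12) |
      ((bytes[pos + 2] & 0x3f) << 6) | (bytes[pos + 3] & 0x3f)
    return (rune << 3) | 4
  }
  // Invalid or truncated sequence: consume one byte as-is.
  return (c << 3) | 1
}

// Walk a buffer by unpacking the rune (r >> 3) and the width (r & 7).
function decodeAll(bytes) {
  const runes = []
  let pos = 0
  for (;;) {
    const r = stepUtf8(bytes, pos, bytes.length)
    if (r === EOF_MARKER) break
    runes.push(r >> 3)
    pos += r & 7
  }
  return runes
}

// UTF-8 bytes for 'a', 'é', '€', '😃'
const sampleBytes = Uint8Array.of(0x61, 0xc3, 0xa9, 0xe2, 0x82, 0xac, 0xf0, 0x9f, 0x98, 0x83)
console.log(decodeAll(sampleBytes).map((r) => r.toString(16))) // [ '61', 'e9', '20ac', '1f603' ]
```

Because the fallback always reports a width of 1, a malformed lead byte still advances the cursor, so a loop like `decodeAll` cannot stall on bad input.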
@@ -985,12 +986,25 @@ class MachineUTF16Input extends MachineInputBase {
   // << 3 | 0.
   step(pos) {
     pos += this.start;
-    if (pos
-      const rune = this.charSequence.codePointAt(pos);
-      return rune << 3 | Utils.charCount(rune);
-    } else {
+    if (pos >= this.end) {
       return MachineInputBase.EOF();
     }
+    const c1 = this.charSequence.charCodeAt(pos);
+
+    // Fast path: standard BMP character (not a high surrogate)
+    if (c1 < Unicode.MIN_HIGH_SURROGATE || c1 > Unicode.MAX_HIGH_SURROGATE || pos + 1 >= this.end) {
+      return c1 << 3 | 1;
+    }
+
+    // Slow path: Calculate surrogate pair manually
+    const c2 = this.charSequence.charCodeAt(pos + 1);
+    if (c2 >= Unicode.MIN_LOW_SURROGATE && c2 <= Unicode.MAX_LOW_SURROGATE) {
+      const rune = (c1 - Unicode.MIN_HIGH_SURROGATE) * 0x400 + (c2 - Unicode.MIN_LOW_SURROGATE) + Unicode.MIN_SUPPLEMENTARY_CODE_POINT;
+      return rune << 3 | 2;
+    }
+
+    // Invalid surrogate pair fallback
+    return c1 << 3 | 1;
   }
 
   // Returns the index relative to |pos| at which |re2.prefix| is found
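The UTF-16 hunk above replaces `codePointAt` plus `Utils.charCount` with a `charCodeAt` fast path and manual surrogate arithmetic. A standalone sketch of the same decision tree, assuming the constant values from the `Unicode` hunk and ignoring the `this.start` offset. `stepUtf16` and `codePoints` are hypothetical names.

```js
const MIN_HIGH = 0xd800
const MAX_HIGH = 0xdbff
const MIN_LOW = 0xdc00
const MAX_LOW = 0xdfff
const MIN_SUPPLEMENTARY = 0x10000
const EOF_RESULT = (-1 << 3) | 0

// Returns `rune << 3 | width`, where width counts UTF-16 code units.
function stepUtf16(str, pos, end) {
  if (pos >= end) return EOF_RESULT
  const c1 = str.charCodeAt(pos)

  // Fast path: not a high surrogate, or no room left for a second unit.
  if (c1 < MIN_HIGH || c1 > MAX_HIGH || pos + 1 >= end) {
    return (c1 << 3) | 1
  }

  // Slow path: combine a valid high/low pair into one code point.
  const c2 = str.charCodeAt(pos + 1)
  if (c2 >= MIN_LOW && c2 <= MAX_LOW) {
    const rune = (c1 - MIN_HIGH) * 0x400 + (c2 - MIN_LOW) + MIN_SUPPLEMENTARY
    return (rune << 3) | 2
  }

  // Unpaired high surrogate: pass the single code unit through.
  return (c1 << 3) | 1
}

// Collect every code point by advancing `width` units per step.
function codePoints(str) {
  const runes = []
  for (let pos = 0; pos < str.length;) {
    const r = stepUtf16(str, pos, str.length)
    runes.push(r >> 3)
    pos += r & 7
  }
  return runes
}

console.log(codePoints('a😃b').map((r) => r.toString(16))) // [ '61', '1f603', '62' ]
```

For any position that starts a code point, the unpacked rune should agree with `String.prototype.codePointAt`, including for an unpaired high surrogate, where both return the lone code unit.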
@@ -1738,7 +1752,7 @@ class Inst {
     let lo = 0;
     let hi = this.runes.length / 2 | 0;
     while (lo < hi) {
-      const m = lo + ((hi - lo) / 2 | 0)
+      const m = lo + hi >> 1; // native cpu instruction for "lo + (((hi - lo) / 2) | 0)"
       const c = this.runes[2 * m];
       if (c <= r) {
         if (r <= this.runes[2 * m + 1]) {
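In the `Inst` hunk above, `lo + hi >> 1` parses as `(lo + hi) >> 1`, because `+` binds tighter than `>>` in JavaScript. For non-negative integers whose sum stays below 2^31 this gives the same midpoint as the old expression. A sketch of a binary search over flat range pairs using that midpoint. `findRange` and its return value are hypothetical, since the hunk does not show what the original method returns.

```js
// `runes` holds inclusive ranges as flat pairs: [lo0, hi0, lo1, hi1, ...],
// sorted and non-overlapping. Returns the index of the range containing
// `r`, or -1 (the return convention is an assumption for this sketch).
function findRange(runes, r) {
  let lo = 0
  let hi = (runes.length / 2) | 0
  while (lo < hi) {
    const m = (lo + hi) >> 1 // explicit parentheses, same value as `lo + hi >> 1`
    const c = runes[2 * m]
    if (c <= r) {
      if (r <= runes[2 * m + 1]) return m
      lo = m + 1
    } else {
      hi = m
    }
  }
  return -1
}

// Ranges for 0-9, A-Z, a-z
const alnum = [0x30, 0x39, 0x41, 0x5a, 0x61, 0x7a]
console.log(findRange(alnum, 'Q'.charCodeAt(0))) // 1
console.log(findRange(alnum, '!'.charCodeAt(0))) // -1
```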
@@ -1799,10 +1813,10 @@ class Thread {
 // A queue is a 'sparse array' holding pending threads of execution. See:
 // research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html
 class Queue {
-  constructor() {
-    this.sparse =
-    this.densePcs =
-    this.denseThreads =
+  constructor(numInst) {
+    this.sparse = new Int32Array(numInst); // may contain stale but in-bounds values.
+    this.densePcs = new Int32Array(numInst); // may contain stale pc in slots >= size
+    this.denseThreads = new Array(numInst); // may contain stale Thread in slots >= size
     this.size = 0;
   }
   contains(pc) {
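The `Queue` hunk above preallocates typed arrays sized by `numInst`, and the inline comments note that stale values are tolerated. A minimal sketch of the sparse-set idea behind that layout, without the thread array. `SparseSet` and its method bodies are illustrative assumptions; only the `sparse`/dense/`size` layout and the `contains(pc)` name come from the diff.

```js
class SparseSet {
  constructor(capacity) {
    this.sparse = new Int32Array(capacity) // pc -> index into dense; may be stale
    this.dense = new Int32Array(capacity) // insertion-ordered pcs; slots >= size are stale
    this.size = 0
  }

  // A pc is a member only if its sparse slot points at a live dense slot
  // that points back at the same pc. Stale slots fail one of the two checks.
  contains(pc) {
    const j = this.sparse[pc]
    return j < this.size && this.dense[j] === pc
  }

  add(pc) {
    if (this.contains(pc)) return false
    this.sparse[pc] = this.size
    this.dense[this.size++] = pc
    return true
  }

  // O(1) clear: nothing is zeroed, old entries just become stale.
  clear() {
    this.size = 0
  }
}

const set = new SparseSet(8)
set.add(3)
set.add(1)
console.log(set.contains(3), set.contains(0)) // true false
set.clear()
console.log(set.contains(3)) // false
```

Sizing both arrays up front means every stored index is in bounds, which is what lets `contains` read a stale slot safely.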
@@ -2303,7 +2317,7 @@ class DFA {
       if (width === 0) {
         break;
       }
-      currentState = this.step(currentState, rune, anchor);
+      currentState = anchor === RE2Flags.UNANCHORED && rune <= Unicode.MAX_ASCII && currentState.nextAscii[rune] || this.step(currentState, rune, anchor);
 
       // If we hit an unrecoverable DFA error or bailout, signal fallback
       if (currentState === null) return null;
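The new assignment in the `DFA` hunk above relies on `&&` binding tighter than `||`: a truthy cached entry in `nextAscii` short-circuits, and anything falsy falls through to `this.step(...)`. A toy sketch of that lookup shape follows. `ToyState`, `ToyDfa`, the digit-tracking rule, the `0x7f` bound, and the place where the cache is filled are all assumptions for illustration; the diff does not show how `nextAscii` is populated, and the anchor check is omitted here.

```js
const MAX_ASCII = 0x7f // assumed value of Unicode.MAX_ASCII

class ToyState {
  constructor(id) {
    this.id = id
    this.nextAscii = new Array(MAX_ASCII + 1).fill(null) // per-state transition cache
  }
}

class ToyDfa {
  constructor() {
    this.states = [new ToyState(0), new ToyState(1)]
    this.slowSteps = 0
  }

  // Slow path: compute the transition, then memoize it for ASCII runes.
  // Toy rule: state 1 means "the previous rune was an ASCII digit".
  step(state, rune) {
    this.slowSteps++
    const next = this.states[rune >= 0x30 && rune <= 0x39 ? 1 : 0]
    if (rune <= MAX_ASCII) state.nextAscii[rune] = next
    return next
  }

  run(input) {
    let current = this.states[0]
    for (let i = 0; i < input.length; i++) {
      const rune = input.charCodeAt(i)
      // Cached hit short-circuits; a null entry or non-ASCII rune goes slow.
      current = (rune <= MAX_ASCII && current.nextAscii[rune]) || this.step(current, rune)
    }
    return current
  }
}

const dfa = new ToyDfa()
dfa.run('a1b22')
console.log(dfa.slowSteps) // 5
dfa.run('a1b22')
console.log(dfa.slowSteps) // 5, every transition is now served from the cache
```

The pattern works because cached states are objects and therefore truthy; an empty slot must hold a falsy value such as `null` or `undefined` for the fallback to trigger.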