re2js 2.0.0 → 2.0.1

package/README.md CHANGED
@@ -356,10 +356,10 @@ import { RE2JS } from 're2js'
356
356
 
357
357
  RE2JS.compile('(\\w+) (\\w+)')
358
358
  .matcher('Hello World')
359
- .replaceAll('$0 - $0') // 'Hello World - Hello World'
359
+ .replaceAll('$& - $&') // 'Hello World - Hello World'
360
360
  RE2JS.compile('(\\w+) (\\w+)')
361
361
  .matcher('Hello World')
362
- .replaceAll('$& - $&', true) // 'Hello World - Hello World'
362
+ .replaceAll('$0 - $0', true) // 'Hello World - Hello World'
363
363
  ```
364
364
 
365
365
  #### Replacing the First Occurrence
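The corrected README lines pair `$&` with the default `replaceAll` call and `$0` with the call that passes `true` as a second argument. For comparison, here is how native JavaScript treats the same tokens. It is a sketch using only `String.prototype.replace`, not RE2JS itself, and how RE2JS handles `$0` without the flag is not shown in this diff.

```javascript
// Native JavaScript: `$&` in a replacement string inserts the whole match,
// and `$1`, `$2` insert the numbered capture groups.
const pattern = /(\w+) (\w+)/g

const wholeMatch = 'Hello World'.replace(pattern, '$& - $&')
const swapped = 'Hello World'.replace(pattern, '$2 $1')
// `$0` has no special meaning natively, so it is copied through literally.
const literalZero = 'Hello World'.replace(pattern, '$0 - $0')

console.log(wholeMatch) // 'Hello World - Hello World'
console.log(swapped) // 'World Hello'
console.log(literalZero) // '$0 - $0'
```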
@@ -447,50 +447,64 @@ RE2JS.matches(unicodeRegexp, '😃') // false
447
447
 
448
448
  ## Performance
449
449
 
450
- The RE2JS engine runs more slowly compared to native RegExp objects. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of optimizations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang stuff, which are not present in Java implementation, like checks for regular expression complexity) to a pure JS version. This is another reason why the pure JS version will perform more slowly compared to the original RE2 engine.
450
+ The RE2JS engine runs more slowly compared to native RegExp objects for simple queries. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of highly optimized memory operations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang additions + Lazy DFA fast-path) to a pure JS version.
451
451
 
452
- Should you require high performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
452
+ Should you require maximum absolute performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
453
453
 
454
- - [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 package for Node.js
454
+ - [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 C++ binding for Node.js
455
455
  - [RE2-WASM](https://github.com/google/re2-wasm/): This package is a WASM wrapper for RE2. Please note, as of now, it does not work in browsers
456
456
 
457
+ ### RE2JS vs RE2-Node (C++ Bindings)
458
+
459
+ Because RE2JS implements a Just-In-Time (JIT) compiled DFA, it can actually perform on par with—and sometimes faster than—native C++ bindings (`re2-node`) by avoiding the cross-boundary serialization costs between JavaScript and C++.
460
+
461
+ Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths:
462
+
463
+ | Benchmark Scenario | Pattern Example | RE2JS (Pure JS) | RE2-Node (C++) | Result |
464
+ |:--------------------------|:---------------------------|:----------------|:---------------|:----------------------------|
465
+ | **Bounded Repetition** | `/[A-Z][a-z]{5,15}/` | **11.80 ms** | 14.79 ms | `re2js` is **1.25x** faster |
466
+ | **Massive Alternation** | `/White\|Blue\|Black.../` | **15.42 ms** | 16.02 ms | `re2js` is **1.04x** faster |
467
+ | **Deep State Machine** | `/([0-9]+(/[0-9]+)+)/` | 19.34 ms | **17.16 ms** | `re2-node` is 1.13x faster |
468
+ | **Case Insensitive** | `/(?i)swamp/` | 20.27 ms | **17.26 ms** | `re2-node` is 1.17x faster |
469
+ | **ReDoS Attempt** | `/(a+)+!/` | 20.56 ms | **17.33 ms** | `re2-node` is 1.19x faster |
470
+ | **Greedy Wildcard** | `/enters.*battlefield/` | 18.93 ms | **14.33 ms** | `re2-node` is 1.32x faster |
471
+ | **Lazy Wildcard** | `/enters.*?battlefield/` | 18.93 ms | **14.16 ms** | `re2-node` is 1.34x faster |
472
+ | **Simple Literal** | `/damage/` | 19.54 ms | **14.11 ms** | `re2-node` is 1.39x faster |
473
+ | **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` | 296.16 ms | **16.73 ms** | `re2-node` is 17.70x faster |
474
+
475
+ **Takeaways:**
476
+ * **DFA Strengths:** For state-heavy tasks like massive alternations (`White|Blue|...`) or bounded repetitions (`{5,15}`), RE2JS operates entirely within V8's optimized JIT and actually outpaces C++ bindings.
477
+ * **C++ Strengths:** For simple string scanning (like literal or wildcard searches), C++ wins because it can utilize optimized, hardware-level raw memory scanning operations (like `memchr`).
478
+ * **The NFA Fallback:** The RE2JS DFA fast-path does not track the look-behind context that word boundaries (`\b`) require. When RE2JS encounters these, it safely bails out to the much slower NFA engine, which accounts for the 17.70x gap against C++ in the table above.
479
+
457
480
  ### RE2JS vs JavaScript's native RegExp
458
481
 
459
- These examples illustrate the performance comparison between the RE2JS library and JavaScript's native RegExp for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario
482
+ These examples illustrate the performance comparison between the RE2JS library and JavaScript's native `RegExp` for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario.
460
483
 
461
484
  ```js
462
485
  const regex = 'a+'
463
486
  const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
464
487
 
465
- RE2JS.compile(regex).matcher(string).find() // avg: 5.657783601 ms
466
- new RegExp(regex).test(string) // avg: 1.504824999 ms
488
+ // Running 30,000 iterations
489
+ RE2JS.compile(regex).test(string) // Total time: ~9.87 ms
490
+ new RegExp(regex).test(string) // Total time: ~11.43 ms
467
491
  ```
468
492
 
469
- The result shows that the RE2JS library took around **5.66 ms** on average to find a match, while the native RegExp took around **1.50 ms**. This indicates that, in this case, RegExp performed faster than RE2JS
493
+ For safe, simple patterns, the RE2JS DFA fast-path is heavily optimized and performs at parity with—or even slightly faster than—V8's native RegExp engine.
470
494
 
471
495
  ```js
472
496
  const regex = '([a-z]+)+$'
473
497
  const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
474
498
 
475
- RE2JS.compile(regex).matcher(string).find() // avg: 3.6155000030994415 ms
476
- new RegExp(regex).test(string) // avg: 103768.25712499022 ms
499
+ // Running 30,000 iterations
500
+ RE2JS.compile(regex).test(string) // Total time: ~454.17 ms
501
+ // Running EXACTLY 1 iteration
502
+ new RegExp(regex).test(string) // Total time: ~105802.02 ms (over 105 seconds)
477
503
  ```
478
504
 
479
- In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one, as it has a nested quantifier. Nested quantifiers can cause catastrophic backtracking, which results in high processing time, leading to a potential Denial of Service (DoS) attack if a malicious user inputs a carefully crafted string.
480
-
481
- The string is the same as in the first example, which does not pose a problem for either RE2JS or RegExp under normal circumstances. However, when dealing with the nested quantifier, RE2JS took around **3.62 ms** to find a match, while RegExp took significantly longer, around **103768.26 ms (~103 seconds)**. This demonstrates that RE2JS is much more efficient in handling potentially harmful regular expressions, thus preventing ReDoS attacks.
482
-
483
- In conclusion, while JavaScript's native RegExp might be faster for simple regular expressions, RE2JS offers significant performance advantages when dealing with complex or potentially dangerous regular expressions. RE2JS provides protection against excessive backtracking that could lead to performance issues or ReDoS attacks.
484
-
485
- ## Rationale for RE2 JavaScript port
486
-
487
- There are several reasons that underscore the importance of having an RE2 vanilla JavaScript (JS) port.
488
-
489
- Firstly, it enables RE2 JS validation on the client side within the browser. This is vital as it allows the implementation and execution of regular expression operations directly in the browser, enhancing performance by reducing the necessity of server-side computations and back-and-forth communication.
490
-
491
- Secondly, it provides a platform for simple RE2 parsing, specifically for the extraction of regex groups. This feature is particularly useful when dealing with complex regular expressions, as it allows for the breakdown of regex patterns into manageable and identifiable segments or 'groups'.
505
+ In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one because it contains a nested quantifier. In backtracking engines (like JavaScript's native `RegExp`), nested quantifiers can cause catastrophic backtracking. If a malicious user inputs a carefully crafted string, processing time can grow exponentially with the input length, leading to a Denial of Service (DoS) attack.
492
506
 
493
- These factors combined make the RE2 vanilla JS port a valuable tool for developers needing to work with complex regular expressions within a browser environment.
507
+ RE2JS processed this poison-pill string **30,000 times in just ~454 milliseconds**, while the native RegExp completely locked up the main thread for **over 1 minute and 45 seconds trying to evaluate it just once**. This demonstrates why RE2JS is absolutely essential for securely handling untrusted regular expressions and protecting Node.js and browser applications against ReDoS attacks.
494
508
 
495
509
  ## Development
496
510
 
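The new README figures are totals over 30,000 iterations rather than per-call averages. A minimal sketch of that kind of measurement follows, assuming a hypothetical `timeTotal` helper (not part of RE2JS) and using only the native `RegExp` so that it runs without dependencies. Absolute numbers will vary by machine and runtime.

```javascript
// Hypothetical helper: runs `fn` `iterations` times and returns the total
// elapsed wall-clock time in milliseconds.
function timeTotal(fn, iterations) {
  const start = performance.now()
  for (let i = 0; i < iterations; i++) {
    fn()
  }
  return performance.now() - start
}

const input = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
const nativeRe = new RegExp('a+')

let matches = 0
const totalMs = timeTotal(() => {
  if (nativeRe.test(input)) matches++
}, 30000)

console.log(`native RegExp: ${matches} matches in ${totalMs.toFixed(2)} ms total`)
```

Counting the matches inside the loop keeps the result observable, which makes it less likely that the runtime optimizes the call away.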
@@ -2,7 +2,7 @@
2
2
  * re2js
3
3
  * RE2JS is the JavaScript port of RE2, a regular expression engine that provides linear time matching
4
4
  *
5
- * @version v2.0.0
5
+ * @version v2.0.1
6
6
  * @author Alexey Vasiliev
7
7
  * @homepage https://github.com/le0pard/re2js#readme
8
8
  * @repository github:le0pard/re2js
@@ -361,6 +361,11 @@ class Unicode {
361
361
  // Checked during test.
362
362
  static MIN_FOLD = 0x0041;
363
363
  static MAX_FOLD = 0x1e943;
364
+ static MIN_HIGH_SURROGATE = 0xd800;
365
+ static MAX_HIGH_SURROGATE = 0xdbff;
366
+ static MIN_LOW_SURROGATE = 0xdc00;
367
+ static MAX_LOW_SURROGATE = 0xdfff;
368
+ static MIN_SUPPLEMENTARY_CODE_POINT = 0x10000;
364
369
 
365
370
  // is32 uses binary search to test whether rune is in the specified
366
371
  // slice of 32-bit ranges.
@@ -667,9 +672,9 @@ class Utils {
667
672
  } else if (c < 2048) {
668
673
  out[p++] = c >> 6 | 192;
669
674
  out[p++] = c & 63 | 128;
670
- } else if ((c & 0xfc00) === 0xd800 && i + 1 < str.length && (str.charCodeAt(i + 1) & 0xfc00) === 0xdc00) {
675
+ } else if ((c & 0xfc00) === Unicode.MIN_HIGH_SURROGATE && i + 1 < str.length && (str.charCodeAt(i + 1) & 0xfc00) === Unicode.MIN_LOW_SURROGATE) {
671
676
  // Surrogate Pair
672
- c = 0x10000 + ((c & 0x03ff) << 10) + (str.charCodeAt(++i) & 0x03ff);
677
+ c = Unicode.MIN_SUPPLEMENTARY_CODE_POINT + ((c & 0x03ff) << 10) + (str.charCodeAt(++i) & 0x03ff);
673
678
  out[p++] = c >> 18 | 240;
674
679
  out[p++] = c >> 12 & 63 | 128;
675
680
  out[p++] = c >> 6 & 63 | 128;
@@ -703,9 +708,9 @@ class Utils {
703
708
  let c2 = bytes[pos++];
704
709
  let c3 = bytes[pos++];
705
710
  let c4 = bytes[pos++];
706
- let u = ((c1 & 7) << 18 | (c2 & 63) << 12 | (c3 & 63) << 6 | c4 & 63) - 0x10000;
707
- out[c++] = String.fromCharCode(0xd800 + (u >> 10));
708
- out[c++] = String.fromCharCode(0xdc00 + (u & 1023));
711
+ let u = ((c1 & 7) << 18 | (c2 & 63) << 12 | (c3 & 63) << 6 | c4 & 63) - Unicode.MIN_SUPPLEMENTARY_CODE_POINT;
712
+ out[c++] = String.fromCharCode(Unicode.MIN_HIGH_SURROGATE + (u >> 10));
713
+ out[c++] = String.fromCharCode(Unicode.MIN_LOW_SURROGATE + (u & 1023));
709
714
  } else {
710
715
  let c2 = bytes[pos++];
711
716
  let c3 = bytes[pos++];
@@ -879,38 +884,34 @@ class MachineUTF8Input extends MachineInputBase {
879
884
  // the lower 3 bits, and the rune (Unicode code point) in the high
880
885
  // bits. Never negative, except for EOF which is represented as -1
881
886
  // << 3 | 0.
882
- step(i) {
883
- i += this.start;
884
- if (i >= this.end) {
887
+ step(pos) {
888
+ pos += this.start;
889
+ if (pos >= this.end) {
885
890
  return MachineInputBase.EOF();
886
891
  }
887
- let x = this.bytes[i++] & 255;
888
- if ((x & 128) === 0) {
889
- return x << 3 | 1;
890
- } else if ((x & 224) === 192) {
891
- x = x & 31;
892
- if (i >= this.end) {
893
- return MachineInputBase.EOF();
894
- }
895
- x = x << 6 | this.bytes[i++] & 63;
896
- return x << 3 | 2;
897
- } else if ((x & 240) === 224) {
898
- x = x & 15;
899
- if (i + 1 >= this.end) {
900
- return MachineInputBase.EOF();
901
- }
902
- x = x << 6 | this.bytes[i++] & 63;
903
- x = x << 6 | this.bytes[i++] & 63;
904
- return x << 3 | 3;
892
+
893
+ // Read UTF-8 bytes to extract the Rune and its width
894
+ const c = this.bytes[pos] & 0xff;
895
+ if (c < 0x80) {
896
+ return c << 3 | 1;
897
+ } else if (c >= 0xc2 && c <= 0xdf && pos + 1 < this.end) {
898
+ const c1 = this.bytes[pos + 1] & 0xff;
899
+ const rune = (c & 0x1f) << 6 | c1 & 0x3f;
900
+ return rune << 3 | 2;
901
+ } else if (c >= 0xe0 && c <= 0xef && pos + 2 < this.end) {
902
+ const c1 = this.bytes[pos + 1] & 0xff;
903
+ const c2 = this.bytes[pos + 2] & 0xff;
904
+ const rune = (c & 0x0f) << 12 | (c1 & 0x3f) << 6 | c2 & 0x3f;
905
+ return rune << 3 | 3;
906
+ } else if (c >= 0xf0 && c <= 0xf4 && pos + 3 < this.end) {
907
+ const c1 = this.bytes[pos + 1] & 0xff;
908
+ const c2 = this.bytes[pos + 2] & 0xff;
909
+ const c3 = this.bytes[pos + 3] & 0xff;
910
+ const rune = (c & 0x07) << 18 | (c1 & 0x3f) << 12 | (c2 & 0x3f) << 6 | c3 & 0x3f;
911
+ return rune << 3 | 4;
905
912
  } else {
906
- x = x & 7;
907
- if (i + 2 >= this.end) {
908
- return MachineInputBase.EOF();
909
- }
910
- x = x << 6 | this.bytes[i++] & 63;
911
- x = x << 6 | this.bytes[i++] & 63;
912
- x = x << 6 | this.bytes[i++] & 63;
913
- return x << 3 | 4;
913
+ // Invalid sequence fallback
914
+ return c << 3 | 1;
914
915
  }
915
916
  }
916
917
 
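The rewritten `step` packs the decoded rune and its byte width into one integer (`rune << 3 | width`). A standalone sketch of the same decoding logic is below. `stepUtf8` and `decodeAll` are hypothetical names for illustration, and the `EOF` constant mirrors the `-1 << 3 | 0` convention from the comment rather than calling `MachineInputBase.EOF()`.

```javascript
const EOF = -1 << 3 // assumption: same value the "-1 << 3 | 0" comment describes

// Returns (rune << 3) | width for the UTF-8 sequence starting at `pos`.
function stepUtf8(bytes, pos, end = bytes.length) {
  if (pos >= end) return EOF
  const c = bytes[pos] & 0xff
  if (c < 0x80) {
    return (c << 3) | 1
  } else if (c >= 0xc2 && c <= 0xdf && pos + 1 < end) {
    const rune = ((c & 0x1f) << 6) | (bytes[pos + 1] & 0x3f)
    return (rune << 3) | 2
  } else if (c >= 0xe0 && c <= 0xef && pos + 2 < end) {
    const rune = ((c & 0x0f) << 12) | ((bytes[pos + 1] & 0x3f) << 6) | (bytes[pos + 2] & 0x3f)
    return (rune << 3) | 3
  } else if (c >= 0xf0 && c <= 0xf4 && pos + 3 < end) {
    const rune =
      ((c & 0x07) << 18) |
      ((bytes[pos + 1] & 0x3f) << 12) |
      ((bytes[pos + 2] & 0x3f) << 6) |
      (bytes[pos + 3] & 0x3f)
    return (rune << 3) | 4
  }
  // Invalid or truncated sequence: consume one byte as-is.
  return (c << 3) | 1
}

// Walks the buffer by unpacking the rune (high bits) and width (low 3 bits).
function decodeAll(bytes) {
  const runes = []
  for (let pos = 0; pos < bytes.length; ) {
    const packed = stepUtf8(bytes, pos)
    runes.push(packed >> 3)
    pos += packed & 7
  }
  return runes
}

const utf8Bytes = new TextEncoder().encode('aé€😃')
console.log(decodeAll(utf8Bytes).map((r) => r.toString(16))) // [ '61', 'e9', '20ac', '1f603' ]
```

One visible behavior change in the hunk: a sequence truncated at the end of input previously returned EOF, whereas the new code falls through to the single-byte fallback.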
@@ -985,12 +986,25 @@ class MachineUTF16Input extends MachineInputBase {
985
986
  // << 3 | 0.
986
987
  step(pos) {
987
988
  pos += this.start;
988
- if (pos < this.end) {
989
- const rune = this.charSequence.codePointAt(pos);
990
- return rune << 3 | Utils.charCount(rune);
991
- } else {
989
+ if (pos >= this.end) {
992
990
  return MachineInputBase.EOF();
993
991
  }
992
+ const c1 = this.charSequence.charCodeAt(pos);
993
+
994
+ // Fast path: standard BMP character (not a high surrogate)
995
+ if (c1 < Unicode.MIN_HIGH_SURROGATE || c1 > Unicode.MAX_HIGH_SURROGATE || pos + 1 >= this.end) {
996
+ return c1 << 3 | 1;
997
+ }
998
+
999
+ // Slow path: Calculate surrogate pair manually
1000
+ const c2 = this.charSequence.charCodeAt(pos + 1);
1001
+ if (c2 >= Unicode.MIN_LOW_SURROGATE && c2 <= Unicode.MAX_LOW_SURROGATE) {
1002
+ const rune = (c1 - Unicode.MIN_HIGH_SURROGATE) * 0x400 + (c2 - Unicode.MIN_LOW_SURROGATE) + Unicode.MIN_SUPPLEMENTARY_CODE_POINT;
1003
+ return rune << 3 | 2;
1004
+ }
1005
+
1006
+ // Invalid surrogate pair fallback
1007
+ return c1 << 3 | 1;
994
1008
  }
995
1009
 
996
1010
  // Returns the index relative to |pos| at which |re2.prefix| is found
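The UTF-16 `step` now replaces `codePointAt` with manual surrogate arithmetic. The sketch below restates that logic as a free function and checks it against `codePointAt`. `stepUtf16` is a hypothetical name, and the constants are local copies of the values added to `Unicode` in this diff.

```javascript
const MIN_HIGH_SURROGATE = 0xd800
const MAX_HIGH_SURROGATE = 0xdbff
const MIN_LOW_SURROGATE = 0xdc00
const MAX_LOW_SURROGATE = 0xdfff
const MIN_SUPPLEMENTARY_CODE_POINT = 0x10000

// Returns (rune << 3) | width, where width counts UTF-16 code units.
function stepUtf16(str, pos, end = str.length) {
  if (pos >= end) return -1 << 3
  const c1 = str.charCodeAt(pos)
  // Fast path: not a high surrogate, or no room for a second code unit.
  if (c1 < MIN_HIGH_SURROGATE || c1 > MAX_HIGH_SURROGATE || pos + 1 >= end) {
    return (c1 << 3) | 1
  }
  const c2 = str.charCodeAt(pos + 1)
  if (c2 >= MIN_LOW_SURROGATE && c2 <= MAX_LOW_SURROGATE) {
    const rune =
      (c1 - MIN_HIGH_SURROGATE) * 0x400 + (c2 - MIN_LOW_SURROGATE) + MIN_SUPPLEMENTARY_CODE_POINT
    return (rune << 3) | 2
  }
  // Unpaired high surrogate: treat it as a single code unit.
  return (c1 << 3) | 1
}

const smiley = stepUtf16('😃', 0)
console.log((smiley >> 3).toString(16), smiley & 7) // '1f603' 2
```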
@@ -1738,7 +1752,7 @@ class Inst {
1738
1752
  let lo = 0;
1739
1753
  let hi = this.runes.length / 2 | 0;
1740
1754
  while (lo < hi) {
1741
- const m = lo + ((hi - lo) / 2 | 0);
1755
+ const m = lo + hi >> 1; // native cpu instruction for "lo + (((hi - lo) / 2) | 0)"
1742
1756
  const c = this.runes[2 * m];
1743
1757
  if (c <= r) {
1744
1758
  if (r <= this.runes[2 * m + 1]) {
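In the new midpoint expression, `+` binds more tightly than `>>`, so `lo + hi >> 1` evaluates as `(lo + hi) >> 1`. The sketch below checks that this matches the old formula for small non-negative indices and uses it in a range lookup over a flat `[lo, hi, lo, hi, ...]` array. `inRanges` is a hypothetical name for illustration, and the equivalence assumes `lo + hi` stays below 2^31, since `>>` operates on 32-bit integers.

```javascript
// `runes` is a flat, sorted list of inclusive [lo, hi] pairs.
function inRanges(runes, r) {
  let lo = 0
  let hi = (runes.length / 2) | 0
  while (lo < hi) {
    const m = (lo + hi) >> 1 // explicit parentheses; same value as `lo + hi >> 1`
    const c = runes[2 * m]
    if (c <= r) {
      if (r <= runes[2 * m + 1]) return true
      lo = m + 1
    } else {
      hi = m
    }
  }
  return false
}

// Ranges for 0-9, A-Z, a-z.
const alnum = [0x30, 0x39, 0x41, 0x5a, 0x61, 0x7a]
console.log(inRanges(alnum, 'Q'.charCodeAt(0))) // true
console.log(inRanges(alnum, '!'.charCodeAt(0))) // false

// Both midpoint formulas agree for small non-negative bounds.
let midpointsAgree = true
for (let lo = 0; lo <= 50; lo++) {
  for (let hi = lo; hi <= 50; hi++) {
    if (((lo + hi) >> 1) !== lo + (((hi - lo) / 2) | 0)) midpointsAgree = false
  }
}
console.log(midpointsAgree) // true
```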
@@ -1799,10 +1813,10 @@ class Thread {
1799
1813
  // A queue is a 'sparse array' holding pending threads of execution. See:
1800
1814
  // research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html
1801
1815
  class Queue {
1802
- constructor() {
1803
- this.sparse = []; // may contain stale but in-bounds values.
1804
- this.densePcs = []; // may contain stale pc in slots >= size
1805
- this.denseThreads = []; // may contain stale Thread in slots >= size
1816
+ constructor(numInst) {
1817
+ this.sparse = new Int32Array(numInst); // may contain stale but in-bounds values.
1818
+ this.densePcs = new Int32Array(numInst); // may contain stale pc in slots >= size
1819
+ this.denseThreads = new Array(numInst); // may contain stale Thread in slots >= size
1806
1820
  this.size = 0;
1807
1821
  }
1808
1822
  contains(pc) {
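The `Queue` constructor now preallocates typed arrays sized to the instruction count. The sparse-set technique referenced in the comment can be sketched on its own as follows. `SparseSet` is a simplified, hypothetical stand-in that tracks only program counters and omits the `denseThreads` array.

```javascript
// Sparse set over integers in [0, capacity): O(1) add, contains, and clear.
class SparseSet {
  constructor(capacity) {
    this.sparse = new Int32Array(capacity) // index into dense; may be stale
    this.dense = new Int32Array(capacity) // members live in slots [0, size)
    this.size = 0
  }

  contains(pc) {
    const j = this.sparse[pc]
    // A stale sparse entry fails one of these two checks.
    return j < this.size && this.dense[j] === pc
  }

  add(pc) {
    if (this.contains(pc)) return
    this.sparse[pc] = this.size
    this.dense[this.size++] = pc
  }

  clear() {
    this.size = 0 // no need to wipe either array
  }
}

const set = new SparseSet(8)
set.add(3)
set.add(5)
console.log(set.contains(3), set.contains(4)) // true false
set.clear()
set.add(4)
console.log(set.contains(3), set.contains(4)) // false true
```

Fixed-size `Int32Array` storage keeps every element the same type and removes array growth from the hot path, which is a plausible motivation for the change, although the diff itself does not state one.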
@@ -2303,7 +2317,7 @@ class DFA {
2303
2317
  if (width === 0) {
2304
2318
  break;
2305
2319
  }
2306
- currentState = this.step(currentState, rune, anchor);
2320
+ currentState = anchor === RE2Flags.UNANCHORED && rune <= Unicode.MAX_ASCII && currentState.nextAscii[rune] || this.step(currentState, rune, anchor);
2307
2321
 
2308
2322
  // If we hit an unrecoverable DFA error or bailout, signal fallback
2309
2323
  if (currentState === null) return null;
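The changed line consults a per-state `nextAscii` table before calling `this.step`. Because `&&` binds more tightly than `||`, the slow path runs whenever the search is anchored, the rune is non-ASCII, or the cached slot is empty. A toy sketch of that pattern is below. The three-state DFA for `/ab/`, the `slowStep` function, and the choice to fill the cache inside the slow path are all assumptions for illustration, since where RE2JS populates `nextAscii` is not visible in this hunk.

```javascript
const MAX_ASCII = 0x7f

// Toy DFA for the unanchored pattern /ab/:
// state 0 = start, 1 = just saw 'a', 2 = matched (absorbing).
class State {
  constructor(id) {
    this.id = id
    this.nextAscii = new Array(MAX_ASCII + 1).fill(null) // lazily filled cache
  }
}
const states = [new State(0), new State(1), new State(2)]
let slowSteps = 0

// Slow path: computes the transition and caches it for ASCII runes.
function slowStep(state, rune) {
  slowSteps++
  let nextId
  if (state.id === 2) nextId = 2
  else if (rune === 0x61) nextId = 1 // 'a'
  else if (state.id === 1 && rune === 0x62) nextId = 2 // 'b' after 'a'
  else nextId = 0
  const next = states[nextId]
  if (rune <= MAX_ASCII) state.nextAscii[rune] = next
  return next
}

function run(input) {
  let current = states[0]
  for (const ch of input) {
    const rune = ch.codePointAt(0)
    // Cached ASCII transition if present, otherwise fall back to the slow path.
    current = (rune <= MAX_ASCII && current.nextAscii[rune]) || slowStep(current, rune)
  }
  return current.id === 2
}

const firstRun = run('xxxxab')
const stepsAfterFirst = slowSteps
console.log(firstRun, stepsAfterFirst) // true 3
```

Only three of the six transitions in the first run reach the slow path: the repeated `x` transitions from state 0 are served from the cache after the first one.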