re2js 2.0.0 → 2.0.1 (npm)

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (9)

package/README.md CHANGED

@@ -356,10 +356,10 @@ import { RE2JS } from 're2js'
 RE2JS.compile('(\\w+) (\\w+)')
   .matcher('Hello World')
-  .replaceAll('$0 - $0') // 'Hello World - Hello World'
+  .replaceAll('$& - $&') // 'Hello World - Hello World'
 RE2JS.compile('(\\w+) (\\w+)')
   .matcher('Hello World')
-  .replaceAll('$& - $&', true) // 'Hello World - Hello World'
+  .replaceAll('$0 - $0', true) // 'Hello World - Hello World'
 ```
 #### Replacing the First Occurrence
@@ -447,50 +447,64 @@ RE2JS.matches(unicodeRegexp, '😃') // false
 ## Performance
-The RE2JS engine runs more slowly compared to native RegExp objects. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of optimizations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang stuff, which are not present in Java implementation, like checks for regular expression complexity) to a pure JS version. This is another reason why the pure JS version will perform more slowly compared to the original RE2 engine.
+The RE2JS engine runs more slowly compared to native RegExp objects for simple queries. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of highly optimized memory operations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang additions + Lazy DFA fast-path) to a pure JS version.
-Should you require high performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
+Should you require maximum absolute performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
- - [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 package for Node.js
+ - [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 C++ binding for Node.js
  - [RE2-WASM](https://github.com/google/re2-wasm/): This package is a WASM wrapper for RE2. Please note, as of now, it does not work in browsers
+### RE2JS vs RE2-Node (C++ Bindings)
+Because RE2JS implements a Just-In-Time (JIT) compiled DFA, it can actually perform on par with—and sometimes faster than—native C++ bindings (`re2-node`) by avoiding the cross-boundary serialization costs between JavaScript and C++.
+Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths:
+| Benchmark Scenario        | Pattern Example            | RE2JS (Pure JS) | RE2-Node (C++) | Result                      |
+|:--------------------------|:---------------------------|:----------------|:---------------|:----------------------------|
+| **Bounded Repetition**    | `/[A-Z][a-z]{5,15}/`       | **11.80 ms**    | 14.79 ms       | `re2js` is **1.25x** faster |
+| **Massive Alternation**   | `/White\|Blue\|Black.../`  | **15.42 ms**    | 16.02 ms       | `re2js` is **1.04x** faster |
+| **Deep State Machine**    | `/([0-9]+(/[0-9]+)+)/`     | 19.34 ms        | **17.16 ms**   | `re2-node` is 1.13x faster  |
+| **Case Insensitive**      | `/(?i)swamp/`              | 20.27 ms        | **17.26 ms**   | `re2-node` is 1.17x faster  |
+| **ReDoS Attempt**         | `/(a+)+!/`                 | 20.56 ms        | **17.33 ms**   | `re2-node` is 1.19x faster  |
+| **Greedy Wildcard**       | `/enters.*battlefield/`    | 18.93 ms        | **14.33 ms**   | `re2-node` is 1.32x faster  |
+| **Lazy Wildcard**         | `/enters.*?battlefield/`   | 18.93 ms        | **14.16 ms**   | `re2-node` is 1.34x faster  |
+| **Simple Literal**        | `/damage/`                 | 19.54 ms        | **14.11 ms**   | `re2-node` is 1.39x faster  |
+| **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` | 296.16 ms       | **16.73 ms**   | `re2-node` is 17.70x faster |
+**Takeaways:**
+* **DFA Strengths:** For state-heavy tasks like massive alternations (`White|Blue|...`) or bounded repetitions (`{5,15}`), RE2JS operates entirely within V8's optimized JIT and actually outpaces C++ bindings.
+* **C++ Strengths:** For simple string scanning (like literal or wildcard searches), C++ wins because it can utilize optimized, hardware-level raw memory scanning operations (like `memchr`).
+* **The NFA Fallback:** Pure DFA engines mathematically cannot track look-behind context like Word Boundaries (`\b`). When RE2JS encounters these, it safely bails out to the much slower NFA engine, resulting in a large performance gap compared to C++.
 ### RE2JS vs JavaScript's native RegExp
-These examples illustrate the performance comparison between the RE2JS library and JavaScript's native RegExp for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario
+These examples illustrate the performance comparison between the RE2JS library and JavaScript's native `RegExp` for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario.
 ```js
 const regex = 'a+'
 const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
-RE2JS.compile(regex).matcher(string).find() // avg: 5.657783601 ms
-new RegExp(regex).test(string) // avg: 1.504824999 ms
+// Running 30,000 iterations
+RE2JS.compile(regex).test(string) // Total time: ~9.87 ms
+new RegExp(regex).test(string)    // Total time: ~11.43 ms
 ```
-The result shows that the RE2JS library took around **5.66 ms** on average to find a match, while the native RegExp took around **1.50 ms**. This indicates that, in this case, RegExp performed faster than RE2JS
+For safe, simple patterns, the RE2JS DFA fast-path is heavily optimized and performs at parity with—or even slightly faster than—V8's native RegExp engine.
 ```js
 const regex = '([a-z]+)+$'
 const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
-RE2JS.compile(regex).matcher(string).find() // avg: 3.6155000030994415 ms
-new RegExp(regex).test(string) // avg: 103768.25712499022 ms
+// Running 30,000 iterations
+RE2JS.compile(regex).test(string) // Total time: ~454.17 ms
+// Running EXACTLY 1 iteration
+new RegExp(regex).test(string)    // Total time: ~105802.02 ms (over 105 seconds)
 ```
-In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one, as it has a nested quantifier. Nested quantifiers can cause catastrophic backtracking, which results in high processing time, leading to a potential Denial of Service (DoS) attack if a malicious user inputs a carefully crafted string.
-The string is the same as in the first example, which does not pose a problem for either RE2JS or RegExp under normal circumstances. However, when dealing with the nested quantifier, RE2JS took around **3.62 ms** to find a match, while RegExp took significantly longer, around **103768.26 ms (~103 seconds)**. This demonstrates that RE2JS is much more efficient in handling potentially harmful regular expressions, thus preventing ReDoS attacks.
-In conclusion, while JavaScript's native RegExp might be faster for simple regular expressions, RE2JS offers significant performance advantages when dealing with complex or potentially dangerous regular expressions. RE2JS provides protection against excessive backtracking that could lead to performance issues or ReDoS attacks.
-## Rationale for RE2 JavaScript port
-There are several reasons that underscore the importance of having an RE2 vanilla JavaScript (JS) port.
-Firstly, it enables RE2 JS validation on the client side within the browser. This is vital as it allows the implementation and execution of regular expression operations directly in the browser, enhancing performance by reducing the necessity of server-side computations and back-and-forth communication.
-Secondly, it provides a platform for simple RE2 parsing, specifically for the extraction of regex groups. This feature is particularly useful when dealing with complex regular expressions, as it allows for the breakdown of regex patterns into manageable and identifiable segments or 'groups'.
+In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one because it contains a nested quantifier. In standard NFA engines (like JavaScript's native `RegExp`), nested quantifiers can cause catastrophic backtracking. If a malicious user inputs a carefully crafted string, it results in exponentially high processing times, leading to a Denial of Service (DoS) attack.
-These factors combined make the RE2 vanilla JS port a valuable tool for developers needing to work with complex regular expressions within a browser environment.
+RE2JS processed this poison-pill string **30,000 times in just ~454 milliseconds**, while the native RegExp completely locked up the main thread for **over 1 minute and 45 seconds trying to evaluate it just once**. This demonstrates why RE2JS is absolutely essential for securely handling untrusted regular expressions and protecting Node.js and browser applications against ReDoS attacks.
 ## Development
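
Reviewer note: the README's ReDoS claim can be checked without waiting 105 seconds. The sketch below is not part of the package; it times only the native `RegExp` on much shorter inputs, and `timeTest` is a hypothetical helper introduced here. On a backtracking engine the time should grow sharply as `n` increases, although exact numbers depend on the runtime and machine.

```javascript
// Hypothetical helper: time one native RegExp test() call, in milliseconds.
function timeTest(regex, input) {
  const start = performance.now()
  const matched = regex.test(input)
  return { matched, ms: performance.now() - start }
}

// Same nested-quantifier pattern as the README example.
const nested = /([a-z]+)+$/
const results = []

// Inputs are kept short on purpose; the README's 38-character input
// would block the thread for minutes on a backtracking engine.
for (const n of [16, 18, 20, 22]) {
  const input = 'a'.repeat(n) + '!'
  const { matched, ms } = timeTest(nested, input)
  results.push({ n, matched, ms })
  console.log(`n=${n} matched=${matched} time=${ms.toFixed(2)} ms`)
}
```

The trailing `!` is what forces the engine to exhaust every way of splitting the run of `a` characters before it can report that there is no match.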

package/build/index.cjs.cjs CHANGED

@@ -2,7 +2,7 @@
  * re2js
  * RE2JS is the JavaScript port of RE2, a regular expression engine that provides linear time matching
  *
- * @version v2.0.0
+ * @version v2.0.1
  * @author Alexey Vasiliev
  * @homepage https://github.com/le0pard/re2js#readme
  * @repository github:le0pard/re2js
@@ -361,6 +361,11 @@ class Unicode {
   // Checked during test.
   static MIN_FOLD = 0x0041;
   static MAX_FOLD = 0x1e943;
+  static MIN_HIGH_SURROGATE = 0xd800;
+  static MAX_HIGH_SURROGATE = 0xdbff;
+  static MIN_LOW_SURROGATE = 0xdc00;
+  static MAX_LOW_SURROGATE = 0xdfff;
+  static MIN_SUPPLEMENTARY_CODE_POINT = 0x10000;
   // is32 uses binary search to test whether rune is in the specified
   // slice of 32-bit ranges.
@@ -667,9 +672,9 @@ class Utils {
         } else if (c < 2048) {
           out[p++] = c >> 6 | 192;
           out[p++] = c & 63 | 128;
-        } else if ((c & 0xfc00) === 0xd800 && i + 1 < str.length && (str.charCodeAt(i + 1) & 0xfc00) === 0xdc00) {
+        } else if ((c & 0xfc00) === Unicode.MIN_HIGH_SURROGATE && i + 1 < str.length && (str.charCodeAt(i + 1) & 0xfc00) === Unicode.MIN_LOW_SURROGATE) {
           // Surrogate Pair
-          c = 0x10000 + ((c & 0x03ff) << 10) + (str.charCodeAt(++i) & 0x03ff);
+          c = Unicode.MIN_SUPPLEMENTARY_CODE_POINT + ((c & 0x03ff) << 10) + (str.charCodeAt(++i) & 0x03ff);
           out[p++] = c >> 18 | 240;
           out[p++] = c >> 12 & 63 | 128;
           out[p++] = c >> 6 & 63 | 128;
@@ -703,9 +708,9 @@ class Utils {
           let c2 = bytes[pos++];
           let c3 = bytes[pos++];
           let c4 = bytes[pos++];
-          let u = ((c1 & 7) << 18 | (c2 & 63) << 12 | (c3 & 63) << 6 | c4 & 63) - 0x10000;
-          out[c++] = String.fromCharCode(0xd800 + (u >> 10));
-          out[c++] = String.fromCharCode(0xdc00 + (u & 1023));
+          let u = ((c1 & 7) << 18 | (c2 & 63) << 12 | (c3 & 63) << 6 | c4 & 63) - Unicode.MIN_SUPPLEMENTARY_CODE_POINT;
+          out[c++] = String.fromCharCode(Unicode.MIN_HIGH_SURROGATE + (u >> 10));
+          out[c++] = String.fromCharCode(Unicode.MIN_LOW_SURROGATE + (u & 1023));
         } else {
           let c2 = bytes[pos++];
           let c3 = bytes[pos++];
@@ -879,38 +884,34 @@ class MachineUTF8Input extends MachineInputBase {
   // the lower 3 bits, and the rune (Unicode code point) in the high
   // bits.  Never negative, except for EOF which is represented as -1
   // << 3 | 0.
-  step(i) {
-    i += this.start;
-    if (i >= this.end) {
+  step(pos) {
+    pos += this.start;
+    if (pos >= this.end) {
       return MachineInputBase.EOF();
     }
-    let x = this.bytes[i++] & 255;
-    if ((x & 128) === 0) {
-      return x << 3 | 1;
-    } else if ((x & 224) === 192) {
-      x = x & 31;
-      if (i >= this.end) {
-        return MachineInputBase.EOF();
-      }
-      x = x << 6 | this.bytes[i++] & 63;
-      return x << 3 | 2;
-    } else if ((x & 240) === 224) {
-      x = x & 15;
-      if (i + 1 >= this.end) {
-        return MachineInputBase.EOF();
-      }
-      x = x << 6 | this.bytes[i++] & 63;
-      x = x << 6 | this.bytes[i++] & 63;
-      return x << 3 | 3;
+    // Read UTF-8 bytes to extract the Rune and its width
+    const c = this.bytes[pos] & 0xff;
+    if (c < 0x80) {
+      return c << 3 | 1;
+    } else if (c >= 0xc2 && c <= 0xdf && pos + 1 < this.end) {
+      const c1 = this.bytes[pos + 1] & 0xff;
+      const rune = (c & 0x1f) << 6 | c1 & 0x3f;
+      return rune << 3 | 2;
+    } else if (c >= 0xe0 && c <= 0xef && pos + 2 < this.end) {
+      const c1 = this.bytes[pos + 1] & 0xff;
+      const c2 = this.bytes[pos + 2] & 0xff;
+      const rune = (c & 0x0f) << 12 | (c1 & 0x3f) << 6 | c2 & 0x3f;
+      return rune << 3 | 3;
+    } else if (c >= 0xf0 && c <= 0xf4 && pos + 3 < this.end) {
+      const c1 = this.bytes[pos + 1] & 0xff;
+      const c2 = this.bytes[pos + 2] & 0xff;
+      const c3 = this.bytes[pos + 3] & 0xff;
+      const rune = (c & 0x07) << 18 | (c1 & 0x3f) << 12 | (c2 & 0x3f) << 6 | c3 & 0x3f;
+      return rune << 3 | 4;
     } else {
-      x = x & 7;
-      if (i + 2 >= this.end) {
-        return MachineInputBase.EOF();
-      }
-      x = x << 6 | this.bytes[i++] & 63;
-      x = x << 6 | this.bytes[i++] & 63;
-      x = x << 6 | this.bytes[i++] & 63;
-      return x << 3 | 4;
+      // Invalid sequence fallback
+      return c << 3 | 1;
     }
   }
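
Reviewer note: the rewritten `step` packs the decoded rune and its byte width into one integer, `rune << 3 | width`. The standalone sketch below mirrors the new branches outside the class so the packing can be tried in isolation. `stepUtf8` and `unpack` are hypothetical names, and `bytes.length` stands in for `this.end`.

```javascript
// Sketch only: decode one UTF-8 sequence starting at pos and return
// (rune << 3) | width, mirroring the branch structure of the new step().
function stepUtf8(bytes, pos) {
  const end = bytes.length
  if (pos >= end) return -1 << 3 // EOF: rune -1, width 0
  const c = bytes[pos] & 0xff
  if (c < 0x80) {
    return (c << 3) | 1
  } else if (c >= 0xc2 && c <= 0xdf && pos + 1 < end) {
    const rune = ((c & 0x1f) << 6) | (bytes[pos + 1] & 0x3f)
    return (rune << 3) | 2
  } else if (c >= 0xe0 && c <= 0xef && pos + 2 < end) {
    const rune = ((c & 0x0f) << 12) | ((bytes[pos + 1] & 0x3f) << 6) | (bytes[pos + 2] & 0x3f)
    return (rune << 3) | 3
  } else if (c >= 0xf0 && c <= 0xf4 && pos + 3 < end) {
    const rune = ((c & 0x07) << 18) | ((bytes[pos + 1] & 0x3f) << 12) |
      ((bytes[pos + 2] & 0x3f) << 6) | (bytes[pos + 3] & 0x3f)
    return (rune << 3) | 4
  }
  // Invalid or truncated sequence: consume a single byte.
  return (c << 3) | 1
}

// Split the packed value back into its two fields.
function unpack(packed) {
  return { rune: packed >> 3, width: packed & 7 }
}

const utf8Bytes = new TextEncoder().encode('a€😃')
for (let pos = 0; pos < utf8Bytes.length;) {
  const { rune, width } = unpack(stepUtf8(utf8Bytes, pos))
  console.log(`pos=${pos} rune=U+${rune.toString(16).toUpperCase()} width=${width}`)
  pos += width
}
```

One behavioral difference is visible in the diff: a multi-byte lead byte whose continuation bytes run past the end used to yield EOF, whereas the new code falls through to the single-byte fallback.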
@@ -985,12 +986,25 @@ class MachineUTF16Input extends MachineInputBase {
   // << 3 | 0.
   step(pos) {
     pos += this.start;
-    if (pos < this.end) {
-      const rune = this.charSequence.codePointAt(pos);
-      return rune << 3 | Utils.charCount(rune);
-    } else {
+    if (pos >= this.end) {
       return MachineInputBase.EOF();
     }
+    const c1 = this.charSequence.charCodeAt(pos);
+    // Fast path: standard BMP character (not a high surrogate)
+    if (c1 < Unicode.MIN_HIGH_SURROGATE || c1 > Unicode.MAX_HIGH_SURROGATE || pos + 1 >= this.end) {
+      return c1 << 3 | 1;
+    }
+    // Slow path: Calculate surrogate pair manually
+    const c2 = this.charSequence.charCodeAt(pos + 1);
+    if (c2 >= Unicode.MIN_LOW_SURROGATE && c2 <= Unicode.MAX_LOW_SURROGATE) {
+      const rune = (c1 - Unicode.MIN_HIGH_SURROGATE) * 0x400 + (c2 - Unicode.MIN_LOW_SURROGATE) + Unicode.MIN_SUPPLEMENTARY_CODE_POINT;
+      return rune << 3 | 2;
+    }
+    // Invalid surrogate pair fallback
+    return c1 << 3 | 1;
   }
   // Returns the index relative to |pos| at which |re2.prefix| is found
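
Reviewer note: the UTF-16 `step` now combines surrogate pairs by hand instead of calling `codePointAt`. The sketch below reproduces that arithmetic as a free function and compares it with `codePointAt` on a few inputs. `stepUtf16` is a hypothetical name, and `str.length` stands in for `this.end`.

```javascript
const HIGH_MIN = 0xd800
const HIGH_MAX = 0xdbff
const LOW_MIN = 0xdc00
const LOW_MAX = 0xdfff
const SUPPLEMENTARY_MIN = 0x10000

// Sketch only: return (rune << 3) | width, where width counts UTF-16 code units.
function stepUtf16(str, pos) {
  if (pos >= str.length) return -1 << 3 // EOF: rune -1, width 0
  const c1 = str.charCodeAt(pos)
  // Fast path: not a high surrogate, or no code unit follows it.
  if (c1 < HIGH_MIN || c1 > HIGH_MAX || pos + 1 >= str.length) {
    return (c1 << 3) | 1
  }
  const c2 = str.charCodeAt(pos + 1)
  if (c2 >= LOW_MIN && c2 <= LOW_MAX) {
    // Each surrogate carries 10 bits of the offset above U+10000.
    const rune = (c1 - HIGH_MIN) * 0x400 + (c2 - LOW_MIN) + SUPPLEMENTARY_MIN
    return (rune << 3) | 2
  }
  // Unpaired high surrogate: emit it as a single code unit.
  return (c1 << 3) | 1
}

for (const s of ['a', '€', '😃']) {
  const packed = stepUtf16(s, 0)
  console.log(s, (packed >> 3) === s.codePointAt(0), packed & 7)
}
```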
@@ -1738,7 +1752,7 @@ class Inst {
     let lo = 0;
     let hi = this.runes.length / 2 | 0;
     while (lo < hi) {
-      const m = lo + ((hi - lo) / 2 | 0);
+      const m = lo + hi >> 1; // native cpu instruction for "lo + (((hi - lo) / 2) | 0)"
       const c = this.runes[2 * m];
       if (c <= r) {
         if (r <= this.runes[2 * m + 1]) {
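
Reviewer note: `lo + hi >> 1` parses as `(lo + hi) >> 1`, because `+` binds tighter than `>>` in JavaScript. It agrees with the old expression only while `lo + hi` fits in a signed 32-bit integer, which is presumably always true for rune tables of realistic size. The sketch below, using the hypothetical names `midFloor` and `midShift`, checks both the agreement and the overflow case.

```javascript
// Old form from 2.0.0: never overflows for valid array indices.
function midFloor(lo, hi) {
  return lo + (((hi - lo) / 2) | 0)
}

// New form from 2.0.1: parsed as (lo + hi) >> 1.
function midShift(lo, hi) {
  return lo + hi >> 1
}

// The two agree for every small non-negative range.
let agree = true
for (let lo = 0; lo < 50; lo++) {
  for (let hi = lo; hi < 50; hi++) {
    if (midFloor(lo, hi) !== midShift(lo, hi)) agree = false
  }
}
console.log('agree on small ranges:', agree)

// They diverge once lo + hi exceeds 2^31 - 1, because >> first
// converts its left operand to a signed 32-bit integer.
const big = 2 ** 30
console.log(midFloor(big, big + 2), midShift(big, big + 2))
```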
@@ -1799,10 +1813,10 @@ class Thread {
 // A queue is a 'sparse array' holding pending threads of execution.  See:
 // research.swtch.com/2008/03/using-uninitialized-memory-for-fun-and.html
 class Queue {
-  constructor() {
-    this.sparse = []; // may contain stale but in-bounds values.
-    this.densePcs = []; // may contain stale pc in slots >= size
-    this.denseThreads = []; // may contain stale Thread in slots >= size
+  constructor(numInst) {
+    this.sparse = new Int32Array(numInst); // may contain stale but in-bounds values.
+    this.densePcs = new Int32Array(numInst); // may contain stale pc in slots >= size
+    this.denseThreads = new Array(numInst); // may contain stale Thread in slots >= size
     this.size = 0;
   }
   contains(pc) {
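
Reviewer note: the constructor now preallocates typed arrays sized to the instruction count. The sparse-set idea behind `Queue` (from the article linked in the class comment) can be sketched on its own as below. `SparseSet` is a hypothetical, simplified class that tracks only integers, not the package's `Queue`, which also stores a thread per slot.

```javascript
// Minimal sparse set over integers in [0, capacity).
class SparseSet {
  constructor(capacity) {
    this.sparse = new Int32Array(capacity) // value -> index into dense (may be stale)
    this.dense = new Int32Array(capacity) // insertion-ordered members in [0, size)
    this.size = 0
  }

  // A value is a member only if its sparse slot points at a live
  // dense slot that points back at the same value.
  contains(v) {
    const j = this.sparse[v]
    return j < this.size && this.dense[j] === v
  }

  add(v) {
    if (this.contains(v)) return false
    const j = this.size++
    this.sparse[v] = j
    this.dense[j] = v
    return true
  }

  // O(1) clear: stale entries remain but no longer pass contains().
  clear() {
    this.size = 0
  }
}

const set = new SparseSet(8)
set.add(3)
set.add(1)
console.log(set.contains(3), set.contains(2)) // true false
set.clear()
console.log(set.contains(3)) // false
```

Typed arrays are zero-filled, so the "stale but in-bounds" property noted in the source comments holds from the moment of construction.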
@@ -2303,7 +2317,7 @@ class DFA {
       if (width === 0) {
         break;
       }
-      currentState = this.step(currentState, rune, anchor);
+      currentState = anchor === RE2Flags.UNANCHORED && rune <= Unicode.MAX_ASCII && currentState.nextAscii[rune] || this.step(currentState, rune, anchor);
       // If we hit an unrecoverable DFA error or bailout, signal fallback
       if (currentState === null) return null;