re2js 2.0.0 → 2.0.2

package/README.md CHANGED
@@ -356,10 +356,10 @@ import { RE2JS } from 're2js'
 
  RE2JS.compile('(\\w+) (\\w+)')
  .matcher('Hello World')
- .replaceAll('$0 - $0') // 'Hello World - Hello World'
+ .replaceAll('$& - $&') // 'Hello World - Hello World'
  RE2JS.compile('(\\w+) (\\w+)')
  .matcher('Hello World')
- .replaceAll('$& - $&', true) // 'Hello World - Hello World'
+ .replaceAll('$0 - $0', true) // 'Hello World - Hello World'
  ```
 
  #### Replacing the First Occurrence
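The hunk above swaps the README's two examples so that `$&` appears in the default `replaceAll` call and `$0` in the call that passes `true`. For comparison only, here is a minimal sketch of how the same tokens behave in JavaScript's native `String.prototype.replace`. It uses no re2js API, and the variable names are illustrative.

```js
// Native JavaScript only; re2js is deliberately not imported here.
const pattern = /(\w+) (\w+)/g
const input = 'Hello World'

// `$&` expands to the entire match.
const wholeMatch = input.replace(pattern, '$& - $&')

// `$1` and `$2` expand to the numbered capture groups.
const swapped = input.replace(pattern, '$2 $1')

// `$0` is not a native substitution token, so it is copied literally.
const literalZero = input.replace(pattern, '$0 - $0')

console.log(wholeMatch)  // 'Hello World - Hello World'
console.log(swapped)     // 'World Hello'
console.log(literalZero) // '$0 - $0'
```

The native behaviour matches the corrected README comment for the default mode: `$&` is the whole-match token in JavaScript-style replacement strings, while `$0` has no special meaning there.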
@@ -447,50 +447,64 @@ RE2JS.matches(unicodeRegexp, '😃') // false
 
  ## Performance
 
- The RE2JS engine runs more slowly compared to native RegExp objects. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of optimizations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang stuff, which are not present in Java implementation, like checks for regular expression complexity) to a pure JS version. This is another reason why the pure JS version will perform more slowly compared to the original RE2 engine.
+ The RE2JS engine runs more slowly compared to native RegExp objects for simple queries. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of highly optimized memory operations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation (plus Golang additions + Lazy DFA fast-path) to a pure JS version.
 
- Should you require high performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
+ Should you require maximum absolute performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
 
- - [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 package for Node.js
+ - [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 C++ binding for Node.js
  - [RE2-WASM](https://github.com/google/re2-wasm/): This package is a WASM wrapper for RE2. Please note, as of now, it does not work in browsers
 
+ ### RE2JS vs RE2-Node (C++ Bindings)
+
+ Because RE2JS implements a Just-In-Time (JIT) compiled DFA, it can actually perform on par with—and sometimes faster than—native C++ bindings (`re2-node`) by avoiding the cross-boundary serialization costs between JavaScript and C++.
+
+ Here is a benchmark running 30,000 items through both engines using their respective `.test()` fast-paths:
+
+ | Benchmark Scenario | Pattern Example | RE2JS (Pure JS) | RE2-Node (C++) | Result |
+ |:--------------------------|:---------------------------|:----------------|:---------------|:----------------------------|
+ | **Bounded Repetition** | `/[A-Z][a-z]{5,15}/` | **11.80 ms** | 14.79 ms | `re2js` is **1.25x** faster |
+ | **Massive Alternation** | `/White\|Blue\|Black.../` | **15.42 ms** | 16.02 ms | `re2js` is **1.04x** faster |
+ | **Deep State Machine** | `/([0-9]+(/[0-9]+)+)/` | 19.34 ms | **17.16 ms** | `re2-node` is 1.13x faster |
+ | **Case Insensitive** | `/(?i)swamp/` | 20.27 ms | **17.26 ms** | `re2-node` is 1.17x faster |
+ | **ReDoS Attempt** | `/(a+)+!/` | 20.56 ms | **17.33 ms** | `re2-node` is 1.19x faster |
+ | **Greedy Wildcard** | `/enters.*battlefield/` | 18.93 ms | **14.33 ms** | `re2-node` is 1.32x faster |
+ | **Lazy Wildcard** | `/enters.*?battlefield/` | 18.93 ms | **14.16 ms** | `re2-node` is 1.34x faster |
+ | **Simple Literal** | `/damage/` | 19.54 ms | **14.11 ms** | `re2-node` is 1.39x faster |
+ | **Word Boundaries (NFA)** | `/\b(Flying\|First...)\b/` | 296.16 ms | **16.73 ms** | `re2-node` is 17.70x faster |
+
+ **Takeaways:**
+ * **DFA Strengths:** For state-heavy tasks like massive alternations (`White|Blue|...`) or bounded repetitions (`{5,15}`), RE2JS operates entirely within V8's optimized JIT and actually outpaces C++ bindings.
+ * **C++ Strengths:** For simple string scanning (like literal or wildcard searches), C++ wins because it can utilize optimized, hardware-level raw memory scanning operations (like `memchr`).
+ * **The NFA Fallback:** Pure DFA engines mathematically cannot track look-behind context like Word Boundaries (`\b`). When RE2JS encounters these, it safely bails out to the much slower NFA engine, resulting in a large performance gap compared to C++.
+
  ### RE2JS vs JavaScript's native RegExp
 
- These examples illustrate the performance comparison between the RE2JS library and JavaScript's native RegExp for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario
+ These examples illustrate the performance comparison between the RE2JS library and JavaScript's native `RegExp` for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario.
 
  ```js
  const regex = 'a+'
  const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
 
- RE2JS.compile(regex).matcher(string).find() // avg: 5.657783601 ms
- new RegExp(regex).test(string) // avg: 1.504824999 ms
+ // Running 30,000 iterations
+ RE2JS.compile(regex).test(string) // Total time: ~9.87 ms
+ new RegExp(regex).test(string) // Total time: ~11.43 ms
  ```
 
- The result shows that the RE2JS library took around **5.66 ms** on average to find a match, while the native RegExp took around **1.50 ms**. This indicates that, in this case, RegExp performed faster than RE2JS
+ For safe, simple patterns, the RE2JS DFA fast-path is heavily optimized and performs at parity with—or even slightly faster than—V8's native RegExp engine.
 
  ```js
  const regex = '([a-z]+)+$'
  const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
 
- RE2JS.compile(regex).matcher(string).find() // avg: 3.6155000030994415 ms
- new RegExp(regex).test(string) // avg: 103768.25712499022 ms
+ // Running 30,000 iterations
+ RE2JS.compile(regex).test(string) // Total time: ~454.17 ms
+ // Running EXACTLY 1 iteration
+ new RegExp(regex).test(string) // Total time: ~105802.02 ms (over 105 seconds)
  ```
 
- In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one, as it has a nested quantifier. Nested quantifiers can cause catastrophic backtracking, which results in high processing time, leading to a potential Denial of Service (DoS) attack if a malicious user inputs a carefully crafted string.
-
- The string is the same as in the first example, which does not pose a problem for either RE2JS or RegExp under normal circumstances. However, when dealing with the nested quantifier, RE2JS took around **3.62 ms** to find a match, while RegExp took significantly longer, around **103768.26 ms (~103 seconds)**. This demonstrates that RE2JS is much more efficient in handling potentially harmful regular expressions, thus preventing ReDoS attacks.
-
- In conclusion, while JavaScript's native RegExp might be faster for simple regular expressions, RE2JS offers significant performance advantages when dealing with complex or potentially dangerous regular expressions. RE2JS provides protection against excessive backtracking that could lead to performance issues or ReDoS attacks.
-
- ## Rationale for RE2 JavaScript port
-
- There are several reasons that underscore the importance of having an RE2 vanilla JavaScript (JS) port.
-
- Firstly, it enables RE2 JS validation on the client side within the browser. This is vital as it allows the implementation and execution of regular expression operations directly in the browser, enhancing performance by reducing the necessity of server-side computations and back-and-forth communication.
-
- Secondly, it provides a platform for simple RE2 parsing, specifically for the extraction of regex groups. This feature is particularly useful when dealing with complex regular expressions, as it allows for the breakdown of regex patterns into manageable and identifiable segments or 'groups'.
+ In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one because it contains a nested quantifier. In standard NFA engines (like JavaScript's native `RegExp`), nested quantifiers can cause catastrophic backtracking. If a malicious user inputs a carefully crafted string, it results in exponentially high processing times, leading to a Denial of Service (DoS) attack.
 
- These factors combined make the RE2 vanilla JS port a valuable tool for developers needing to work with complex regular expressions within a browser environment.
+ RE2JS processed this poison-pill string **30,000 times in just ~454 milliseconds**, while the native RegExp completely locked up the main thread for **over 1 minute and 45 seconds trying to evaluate it just once**. This demonstrates why RE2JS is absolutely essential for securely handling untrusted regular expressions and protecting Node.js and browser applications against ReDoS attacks.
 
  ## Development
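The added README text reports total times over 30,000 iterations but does not include the harness that produced them. Below is a minimal sketch of how such totals could be measured, assuming a runtime with a global `performance` object (recent Node.js or a browser). `timeTotal` is a hypothetical helper, not part of re2js, and only the native `RegExp` is exercised. The ReDoS pattern runs on much shorter inputs than the README's string so the script finishes quickly.

```js
// Hypothetical helper (not part of re2js): total wall-clock milliseconds
// spent on `iterations` calls of `fn`.
function timeTotal(fn, iterations) {
  const start = performance.now()
  for (let i = 0; i < iterations; i++) fn()
  return performance.now() - start
}

// Simple case: the pattern is compiled once, outside the timed loop.
// This is an assumption; the README does not say where compilation happens.
const simple = new RegExp('a+')
const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
const simpleMs = timeTotal(() => simple.test(string), 30000)
console.log(`a+ x 30,000 iterations: ${simpleMs.toFixed(2)} ms`)

// ReDoS case, scaled down: one run per input, on short strings of 'a'
// followed by '!' so the nested quantifier can never reach the end anchor.
const nested = new RegExp('([a-z]+)+$')
const growth = []
for (let n = 14; n <= 22; n += 2) {
  const input = 'a'.repeat(n) + '!'
  let matched
  const ms = timeTotal(() => { matched = nested.test(input) }, 1)
  growth.push({ n, matched, ms })
}
console.table(growth)
```

On a backtracking engine the `ms` column would be expected to climb steeply as `n` grows, which is the behaviour the README's single-iteration measurement illustrates. Absolute numbers depend on the machine and runtime, so none are claimed here.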