re2js 0.1.0 → 0.2.0

package/README.md CHANGED
@@ -1,6 +1,8 @@
  # RE2JS is the JavaScript port of RE2, a regular expression engine that provides linear time matching
  [![Test/Build/Deploy](https://github.com/le0pard/re2js/actions/workflows/tests.yml/badge.svg)](https://github.com/le0pard/re2js/actions/workflows/tests.yml)
 
+ ## [Playground](https://re2js.leopard.in.ua/)
+
  ## TLDR
 
  The built-in JavaScript regular expression engine can, under certain special combinations, run in exponential time. This situation can trigger what's referred to as a [Regular Expression Denial of Service (ReDoS)](https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS). RE2, a different regular expression engine, can effectively safeguard your Node.js applications from ReDoS attacks. With RE2JS, this protective feature extends to browser environments as well, enabling you to utilize the RE2 engine more comprehensively.
@@ -23,7 +25,7 @@ This document provides a series of examples demonstrating how to use RE2JS in yo
 
  ### Compiling Patterns
 
- You can compile a regex pattern using the `RE2JS.compile()` function:
+ You can compile a regex pattern using the `compile()` function:
 
  ```js
  import { RE2JS } from 're2js'
@@ -33,7 +35,7 @@ console.log(p.pattern()); // Outputs: 'abc'
  console.log(p.flags()); // Outputs: 0
  ```
 
- The `RE2JS.compile()` function also supports flags:
+ The `compile()` function also supports flags:
 
  ```js
  import { RE2JS } from 're2js'
@@ -64,14 +66,14 @@ RE2JS.MULTILINE
  */
  RE2JS.DISABLE_UNICODE_GROUPS
  /**
- * Flag: matches longest possible string.
+ * Flag: matches longest possible string (changes the match semantics to leftmost-longest).
  */
  RE2JS.LONGEST_MATCH
  ```
 
  ### Checking for Matches
 
- RE2JS allows you to check if a string matches a given regex pattern using the `RE2JS.matches()` function
+ RE2JS allows you to check if a string matches a given regex pattern using the `matches()` function
 
  ```js
  import { RE2JS } from 're2js'
@@ -83,6 +85,10 @@ RE2JS.compile('ab+c').matches('abbbc') // true
  RE2JS.compile('ab+c').matches('cbbba') // false
  // with flags
  RE2JS.compile('ab+c', RE2JS.CASE_INSENSITIVE).matches('AbBBc') // true
+ RE2JS.compile(
+   '^ab.*c$',
+   RE2JS.DOTALL | RE2JS.MULTILINE | RE2JS.CASE_INSENSITIVE
+ ).matches('AB\nc') // true
  ```
 
  ### Finding Matches
@@ -114,6 +120,19 @@ matchString.group() // 'e'
  matchString.find(7) // false
  ```
 
+ ### Checking Initial Match
+
+ The `lookingAt()` method determines whether the start of the given string matches the pattern
+
+ ```js
+ import { RE2JS } from 're2js'
+
+ RE2JS.compile('abc').matcher('abcdef').lookingAt() // true
+ RE2JS.compile('abc').matcher('ab').lookingAt() // false
+ ```
+
+ Note that the `lookingAt` method only checks the start of the string. It does not search the entire string for a match
+
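For readers more familiar with the built-in engine, the start-anchored behaviour described above can be approximated with a native sticky (`y`) regular expression. This is only a rough sketch for comparison: `startsWithMatch` is a hypothetical helper, not part of RE2JS, and it uses the native backtracking engine rather than RE2.

```js
// Hypothetical helper, not part of RE2JS: reports whether `input` begins with
// a match of `source`, using the native sticky flag to pin the match to index 0.
const startsWithMatch = (source, input) => {
  const re = new RegExp(source, 'y') // 'y' = sticky: match only at lastIndex
  re.lastIndex = 0
  return re.test(input)
}

const prefixHit = startsWithMatch('abc', 'abcdef') // true: 'abc' is a prefix
const tooShort = startsWithMatch('abc', 'ab') // false: the input ends too early
const notAtStart = startsWithMatch('abc', 'xabc') // false: a match exists, but not at index 0
```

The last case is the one that separates a start-anchored check from a search: the pattern occurs in the string, yet the check still fails.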
  ### Splitting Strings
 
  You can split a string based on a regex pattern using the `split()` function
@@ -225,6 +244,33 @@ RE2JS.compile('(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)')
 
  Note that the replacement string can include references to capturing groups from the pattern
 
+ Parameters:
+ - `replacement (String)`: The string that replaces the substrings found. Capture groups and special characters in the replacement string have special behavior. For example:
+   - `$0` refers to the entire matched substring
+   - `$1, $2, ...` refer to the corresponding capture groups in the pattern
+   - `\$` inserts a literal `$`
+   - `${name}` can be used to reference named capture groups
+   - an invalid group reference throws an exception
+ - `perlMode (Boolean)`: If set to `true`, the replacement follows Perl/JS's rules for replacement. Defaults to `false`. When `perlMode` is `true`, the rules for capture groups and special characters change:
+   - `$&` refers to the entire matched substring
+   - `$1, $2, ...` refer to the corresponding capture groups in the pattern
+   - `$$` inserts a literal `$`
+   - `$<name>` can be used to reference named capture groups
+   - an invalid group reference is ignored
+
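Because `perlMode` is described as following Perl/JS rules, the native `String.prototype.replace` tokens are a useful point of reference. The sketch below uses only the built-in engine, on the assumption that `perlMode = true` mirrors these tokens as listed above; it does not call RE2JS, and it does not cover how invalid group references are handled.

```js
// Native JavaScript replacement tokens, shown as a reference for the
// perlMode rules. No RE2JS calls are made here.
const input = 'Hello World'

// `$&` inserts the entire matched substring
const whole = input.replace(/(\w+) (\w+)/, '$& - $&') // 'Hello World - Hello World'

// `$1`, `$2` insert the corresponding capture groups
const swapped = input.replace(/(\w+) (\w+)/, '$2 $1') // 'World Hello'

// `$$` inserts a literal `$`
const dollar = input.replace(/World/, '$$5') // 'Hello $5'

// `$<name>` inserts a named capture group
const named = input.replace(/(?<first>\w+) (?<second>\w+)/, '$<second>, $<first>') // 'World, Hello'
```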
+ Examples:
+
+ ```js
+ import { RE2JS } from 're2js'
+
+ RE2JS.compile('(\\w+) (\\w+)')
+   .matcher('Hello World')
+   .replaceAll('$0 - $0') // 'Hello World - Hello World'
+ RE2JS.compile('(\\w+) (\\w+)')
+   .matcher('Hello World')
+   .replaceAll('$& - $&', true) // 'Hello World - Hello World'
+ ```
+
  #### Replacing the First Occurrence
 
  The `replaceFirst()` method replaces the first occurrence of a pattern match in a string with the given replacement
@@ -240,18 +286,59 @@ RE2JS.compile('(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)')
    .replaceFirst('$10$20') // 'jb0nopqrstuvwxyz123'
  ```
 
- ## Performance
+ The function supports a second argument, `perlMode`, which works the same way as for the `replaceAll` function
+
+ ### Escaping Special Characters
+
+ The `quote()` method returns a literal pattern string for the specified string. This can be useful if you want to search for a literal string pattern that may contain special characters
+
+ ```js
+ import { RE2JS } from 're2js'
+
+ const regexp = RE2JS.quote('ab+c') // 'ab\\+c'
 
- The RE2JS engine runs more slowly compared to native RegExp objects. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The primary reason behind this is the lack of a synchronous threads solution within the browser environment. This deficiency is significant because the regex engine requires a synchronous API to operate optimally.
+ RE2JS.matches(regexp, 'ab+c') // true
+ RE2JS.matches(regexp, 'abc') // false
+ ```
+
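To illustrate what escaping special characters means in practice, here is a minimal native-JavaScript sketch of the same idea. `escapeRegExp` is a hypothetical helper based on the commonly used escaping idiom; it is not RE2JS's implementation of `quote()` and may escape a different set of characters.

```js
// Hypothetical helper, not RE2JS's implementation: backslash-escape every
// character that has a special meaning in a regular expression.
const escapeRegExp = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')

const escaped = escapeRegExp('ab+c') // 'ab\\+c'

// Anchored so the whole input must equal the literal text
const literalHit = new RegExp(`^${escaped}$`).test('ab+c') // true
const literalMiss = new RegExp(`^${escaped}$`).test('abc') // false
```

Without escaping, `ab+c` would treat `+` as a quantifier and match `abc` or `abbc` instead of the literal text.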
+ ## Performance
 
- The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of optimizations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation to a pure JS version. This is another reason why the pure JS version will perform more slowly compared to the original RE2 engine.
+ The RE2JS engine runs more slowly compared to native RegExp objects. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of optimizations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation to a pure JS version. This is why the pure JS version will perform more slowly compared to the original RE2 engine.
 
  Should you require high performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
 
  - [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 package for Node.js
  - [RE2-WASM](https://github.com/google/re2-wasm/): This package is a WASM wrapper for RE2. Please note, as of now, it does not work in browsers
 
- ## Justification for this JS port existence
+ ### RE2JS vs JavaScript's native RegExp
+
+ These examples illustrate the performance comparison between the RE2JS library and JavaScript's native RegExp for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario
+
+ ```js
+ const regex = 'a+'
+ const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
+
+ RE2JS.compile(regex).matcher(string).find() // avg: 5.657783601 ms
+ new RegExp(regex).test(string) // avg: 1.504824999 ms
+ ```
+
+ The result shows that the RE2JS library took around **5.66 ms** on average to find a match, while the native RegExp took around **1.50 ms**. This indicates that, in this case, RegExp performed faster than RE2JS
+
+ ```js
+ const regex = '([a-z]+)+$'
+ const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
+
+ RE2JS.compile(regex).matcher(string).find() // avg: 3.6155000030994415 ms
+ new RegExp(regex).test(string) // avg: 103768.25712499022 ms
+ ```
+
+ In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one, as it has a nested quantifier. Nested quantifiers can cause catastrophic backtracking, which results in high processing time, leading to a potential Denial of Service (DoS) attack if a malicious user inputs a carefully crafted string.
+
+ The string is the same as in the first example, which does not pose a problem for either RE2JS or RegExp under normal circumstances. However, with the nested quantifier the trailing `!` prevents any match, so a backtracking engine tries every way of splitting the run of `a` characters before giving up. RE2JS took around **3.62 ms** to complete the search, while RegExp took significantly longer, around **103768.26 ms (~103 seconds)**. This demonstrates that RE2JS is much more efficient in handling potentially harmful regular expressions, thus preventing ReDoS attacks.
+
+ In conclusion, while JavaScript's native RegExp might be faster for simple regular expressions, RE2JS offers significant performance advantages when dealing with complex or potentially dangerous regular expressions. RE2JS provides protection against excessive backtracking that could lead to performance issues or ReDoS attacks.
+
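The README does not state how the averages above were measured. One plausible way to collect such numbers is a small timing loop like the sketch below; `averageMs` and the run count are assumptions for illustration, not the author's benchmark. Only the simple `a+` case is timed here with the native engine, since running the nested-quantifier pattern natively can take minutes.

```js
// Hypothetical benchmark helper (not from the RE2JS repository): runs `fn`
// `runs` times and returns the mean wall-clock time per call in milliseconds.
const averageMs = (fn, runs = 100) => {
  const start = performance.now()
  for (let i = 0; i < runs; i++) fn()
  return (performance.now() - start) / runs
}

const regex = 'a+'
const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'

// Native engine, simple pattern only
const nativeAvg = averageMs(() => new RegExp(regex).test(string))
console.log(`native RegExp avg: ${nativeAvg} ms`)

// With re2js installed and imported, the same helper could time the port:
// averageMs(() => RE2JS.compile(regex).matcher(string).find())
```

Absolute numbers depend heavily on the machine, the runtime, and JIT warm-up, so results from a loop like this are best read as relative comparisons.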
+ ## Rationale for RE2 JavaScript port
 
  There are several reasons that underscore the importance of having an RE2 vanilla JavaScript (JS) port.
 
@@ -271,3 +358,5 @@ yarn node ./tools/scripts/genUnicodeTable.js > src/UnicodeTables.js
  ```
 
  To run `make_perl_groups.pl` you need to have Perl installed (version specified in `.tool-versions`)
+
+ The [Playground website](https://re2js.leopard.in.ua/) is maintained in the `www` branch