re2js 0.1.0 → 0.3.0
- package/README.md +97 -8
- package/build/index.cjs.js +5763 -2
- package/build/index.cjs.js.map +1 -1
- package/build/index.esm.d.ts +360 -0
- package/build/index.esm.d.ts.map +1 -0
- package/build/index.esm.js +5756 -2
- package/build/index.esm.js.map +1 -1
- package/build/index.umd.js +5769 -2
- package/build/index.umd.js.map +1 -1
- package/package.json +17 -14
package/README.md
CHANGED
|
@@ -1,6 +1,8 @@
|
|
|
1
1
|
# RE2JS is the JavaScript port of RE2, a regular expression engine that provides linear time matching
|
|
2
2
|
[](https://github.com/le0pard/re2js/actions/workflows/tests.yml)
|
|
3
3
|
|
|
4
|
+
## [Playground](https://re2js.leopard.in.ua/)
|
|
5
|
+
|
|
4
6
|
## TLDR
|
|
5
7
|
|
|
6
8
|
The built-in JavaScript regular expression engine can, under certain special combinations, run in exponential time. This situation can trigger what's referred to as a [Regular Expression Denial of Service (ReDoS)](https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS). RE2, a different regular expression engine, can effectively safeguard your Node.js applications from ReDoS attacks. With RE2JS, this protective feature extends to browser environments as well, enabling you to utilize the RE2 engine more comprehensively.
|
|
@@ -23,7 +25,7 @@ This document provides a series of examples demonstrating how to use RE2JS in yo
|
|
|
23
25
|
|
|
24
26
|
### Compiling Patterns
|
|
25
27
|
|
|
26
|
-
You can compile a regex pattern using the `
|
|
28
|
+
You can compile a regex pattern using the `compile()` function:
|
|
27
29
|
|
|
28
30
|
```js
|
|
29
31
|
import { RE2JS } from 're2js'
|
|
@@ -33,7 +35,7 @@ console.log(p.pattern()); // Outputs: 'abc'
|
|
|
33
35
|
console.log(p.flags()); // Outputs: 0
|
|
34
36
|
```
|
|
35
37
|
|
|
36
|
-
The `
|
|
38
|
+
The `compile()` function also supports flags:
|
|
37
39
|
|
|
38
40
|
```js
|
|
39
41
|
import { RE2JS } from 're2js'
|
|
@@ -64,14 +66,14 @@ RE2JS.MULTILINE
|
|
|
64
66
|
*/
|
|
65
67
|
RE2JS.DISABLE_UNICODE_GROUPS
|
|
66
68
|
/**
|
|
67
|
-
* Flag: matches longest possible string.
|
|
69
|
+
* Flag: matches longest possible string (changes the match semantics to leftmost-longest).
|
|
68
70
|
*/
|
|
69
71
|
RE2JS.LONGEST_MATCH
|
|
70
72
|
```
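
To make the "leftmost-longest" wording concrete, here is a minimal sketch that uses only the native `RegExp`, not RE2JS itself. Native alternation is leftmost-first, and the loop emulates leftmost-longest by hand for a plain list of literal alternatives. The `alternatives` array and the emulation loop are illustrative assumptions, not how RE2JS implements `LONGEST_MATCH`.

```js
const input = 'xab'
const alternatives = ['a', 'ab'] // hypothetical literal alternatives

// Leftmost-first (native RegExp): the first alternative that matches wins.
const first = new RegExp(alternatives.join('|')).exec(input)
const leftmostFirst = first[0]

// Leftmost-longest, emulated: at the same start position,
// try every alternative and keep the longest one.
let leftmostLongest = ''
for (const alt of alternatives) {
  const sticky = new RegExp(alt, 'y') // 'y' anchors the match at lastIndex
  sticky.lastIndex = first.index
  const m = sticky.exec(input)
  if (m && m[0].length > leftmostLongest.length) leftmostLongest = m[0]
}

console.log(leftmostFirst) // 'a'
console.log(leftmostLongest) // 'ab'
```

Both results start at the same position; only the choice among candidates at that position differs.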
|
|
71
73
|
|
|
72
74
|
### Checking for Matches
|
|
73
75
|
|
|
74
|
-
RE2JS allows you to check if a string matches a given regex pattern using the `
|
|
76
|
+
RE2JS allows you to check if a string matches a given regex pattern using the `matches()` function
|
|
75
77
|
|
|
76
78
|
```js
|
|
77
79
|
import { RE2JS } from 're2js'
|
|
@@ -83,6 +85,10 @@ RE2JS.compile('ab+c').matches('abbbc') // true
|
|
|
83
85
|
RE2JS.compile('ab+c').matches('cbbba') // false
|
|
84
86
|
// with flags
|
|
85
87
|
RE2JS.compile('ab+c', RE2JS.CASE_INSENSITIVE).matches('AbBBc') // true
|
|
88
|
+
RE2JS.compile(
|
|
89
|
+
'^ab.*c$',
|
|
90
|
+
RE2JS.DOTALL | RE2JS.MULTILINE | RE2JS.CASE_INSENSITIVE
|
|
91
|
+
).matches('AB\nc') // true
|
|
86
92
|
```
|
|
87
93
|
|
|
88
94
|
### Finding Matches
|
|
@@ -114,6 +120,19 @@ matchString.group() // 'e'
|
|
|
114
120
|
matchString.find(7) // false
|
|
115
121
|
```
|
|
116
122
|
|
|
123
|
+
### Checking Initial Match
|
|
124
|
+
|
|
125
|
+
The `lookingAt()` method determines whether the start of the given string matches the pattern
|
|
126
|
+
|
|
127
|
+
```js
|
|
128
|
+
import { RE2JS } from 're2js'
|
|
129
|
+
|
|
130
|
+
RE2JS.compile('abc').matcher('abcdef').lookingAt() // true
|
|
131
|
+
RE2JS.compile('abc').matcher('ab').lookingAt() // false
|
|
132
|
+
```
|
|
133
|
+
|
|
134
|
+
Note that the `lookingAt` method only checks the start of the string. It does not search the entire string for a match
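
As a rough native analogy for the distinction in this note, the sketch below contrasts a start-anchored check with a search, using the native `RegExp` sticky flag. It does not call RE2JS, and the helper names `startsWithAbc` and `containsAbc` are made up for illustration.

```js
// Sticky flag: the match must begin exactly at lastIndex (0 for a fresh RegExp).
const startsWithAbc = (s) => /abc/y.test(s)
// No sticky flag: the engine may search anywhere in the string.
const containsAbc = (s) => /abc/.test(s)

console.log(startsWithAbc('abcdef')) // true
console.log(startsWithAbc('ab')) // false
console.log(startsWithAbc('xabc')) // false
console.log(containsAbc('xabc')) // true
```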
|
|
135
|
+
|
|
117
136
|
### Splitting Strings
|
|
118
137
|
|
|
119
138
|
You can split a string based on a regex pattern using the `split()` function
|
|
@@ -225,6 +244,33 @@ RE2JS.compile('(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)')
|
|
|
225
244
|
|
|
226
245
|
Note that the replacement string can include references to capturing groups from the pattern
|
|
227
246
|
|
|
247
|
+
Parameters:
|
|
248
|
+
- `replacement (String)`: The string that replaces the substrings found. Capture groups and special characters in the replacement string have special behavior. For example:
|
|
249
|
+
- `$0` refers to the entire matched substring
|
|
250
|
+
- `$1, $2, ...` refer to the corresponding capture groups in the pattern
|
|
251
|
+
- `\$` inserts a literal `$`
|
|
252
|
+
- `${name}` can be used to reference named capture groups
|
|
253
|
+
- an invalid group reference throws an exception
|
|
254
|
+
- `perlMode (Boolean)`: If set to `true`, the replacement string follows Perl/JS rules. Defaults to `false`. With `perlMode = true`, the rules for capture groups and special characters change:
|
|
255
|
+
- `$&` refers to the entire matched substring
|
|
256
|
+
- `$1, $2, ...` refer to the corresponding capture groups in the pattern
|
|
257
|
+
- `$$` inserts a literal `$`
|
|
258
|
+
- `$<name>` can be used to reference named capture groups
|
|
259
|
+
- an invalid group reference is ignored
|
|
260
|
+
|
|
261
|
+
Examples:
|
|
262
|
+
|
|
263
|
+
```js
|
|
264
|
+
import { RE2JS } from 're2js'
|
|
265
|
+
|
|
266
|
+
RE2JS.compile('(\\w+) (\\w+)')
|
|
267
|
+
.matcher('Hello World')
|
|
268
|
+
.replaceAll('$0 - $0') // 'Hello World - Hello World'
|
|
269
|
+
RE2JS.compile('(\\w+) (\\w+)')
|
|
270
|
+
.matcher('Hello World')
|
|
271
|
+
.replaceAll('$& - $&', true) // 'Hello World - Hello World'
|
|
272
|
+
```
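
Since `perlMode` is described as following Perl/JS replacement rules, the native `String.prototype.replace` gives a quick way to see those tokens in action. The sketch below uses only native JavaScript, not RE2JS, and the group names `first` and `second` are chosen for illustration.

```js
const input = 'Hello World'
const pattern = /(?<first>\w+) (?<second>\w+)/

// $& is the entire matched substring.
const whole = input.replace(pattern, '$& - $&')
// $1, $2 are numbered capture groups.
const swapped = input.replace(pattern, '$2 $1')
// $<name> is a named capture group.
const named = input.replace(pattern, '$<second>, $<first>')
// $$ is a literal dollar sign, so '$$1' yields the text '$1'.
const dollar = input.replace(pattern, '$$1')

console.log(whole) // 'Hello World - Hello World'
console.log(swapped) // 'World Hello'
console.log(named) // 'World, Hello'
console.log(dollar) // '$1'
```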
|
|
273
|
+
|
|
228
274
|
#### Replacing the First Occurrence
|
|
229
275
|
|
|
230
276
|
The `replaceFirst()` method replaces the first occurrence of a pattern match in a string with the given replacement
|
|
@@ -240,18 +286,59 @@ RE2JS.compile('(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)')
|
|
|
240
286
|
.replaceFirst('$10$20') // 'jb0nopqrstuvwxyz123'
|
|
241
287
|
```
|
|
242
288
|
|
|
243
|
-
|
|
289
|
+
The function supports a second argument, `perlMode`, which works the same way as in the `replaceAll()` function
|
|
290
|
+
|
|
291
|
+
### Escaping Special Characters
|
|
292
|
+
|
|
293
|
+
The `quote()` method returns a literal pattern string for the specified string. This can be useful if you want to search for a literal string pattern that may contain special characters
|
|
294
|
+
|
|
295
|
+
```js
|
|
296
|
+
import { RE2JS } from 're2js'
|
|
297
|
+
|
|
298
|
+
const regexp = RE2JS.quote('ab+c') // 'ab\\+c'
|
|
244
299
|
|
|
245
|
-
|
|
300
|
+
RE2JS.matches(regexp, 'ab+c') // true
|
|
301
|
+
RE2JS.matches(regexp, 'abc') // false
|
|
302
|
+
```
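
For intuition about what quoting does, here is a minimal sketch of the same idea with the native `RegExp`. The `escapeLiteral` helper is hypothetical and is not the implementation of `RE2JS.quote()`. It backslash-escapes the common regex metacharacters, so the resulting pattern matches the input text literally.

```js
// Hypothetical helper, for illustration only (not RE2JS.quote()).
function escapeLiteral(text) {
  // '$&' in the replacement re-inserts the matched metacharacter after a backslash.
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

const escaped = escapeLiteral('ab+c')
console.log(escaped === 'ab\\+c') // true

const literal = new RegExp('^' + escaped + '$')
console.log(literal.test('ab+c')) // true
console.log(literal.test('abc')) // false

// Without escaping, '+' is a quantifier and the literal text no longer matches.
console.log(new RegExp('^ab+c$').test('ab+c')) // false
```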
|
|
303
|
+
|
|
304
|
+
## Performance
|
|
246
305
|
|
|
247
|
-
The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of optimizations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation to a pure JS version. This is another reason why the pure JS version will perform more slowly compared to the original RE2 engine.
|
|
306
|
+
The RE2JS engine runs more slowly compared to native RegExp objects. This reduced speed is also noticeable when comparing RE2JS to the original RE2 engine. The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of optimizations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation to a pure JS version. This is another reason why the pure JS version will perform more slowly compared to the original RE2 engine.
|
|
248
307
|
|
|
249
308
|
Should you require high performance on the server side when using RE2, it would be beneficial to consider the following packages for JS:
|
|
250
309
|
|
|
251
310
|
- [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 package for Node.js
|
|
252
311
|
- [RE2-WASM](https://github.com/google/re2-wasm/): This package is a WASM wrapper for RE2. Please note, as of now, it does not work in browsers
|
|
253
312
|
|
|
254
|
-
|
|
313
|
+
### RE2JS vs JavaScript's native RegExp
|
|
314
|
+
|
|
315
|
+
These examples illustrate the performance comparison between the RE2JS library and JavaScript's native RegExp for both a simple case and a ReDoS (Regular Expression Denial of Service) scenario
|
|
316
|
+
|
|
317
|
+
```js
|
|
318
|
+
const regex = 'a+'
|
|
319
|
+
const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
|
|
320
|
+
|
|
321
|
+
RE2JS.compile(regex).matcher(string).find() // avg: 5.657783601 ms
|
|
322
|
+
new RegExp(regex).test(string) // avg: 1.504824999 ms
|
|
323
|
+
```
|
|
324
|
+
|
|
325
|
+
The result shows that the RE2JS library took around **5.66 ms** on average to find a match, while the native RegExp took around **1.50 ms**. This indicates that, in this case, RegExp performed faster than RE2JS
|
|
326
|
+
|
|
327
|
+
```js
|
|
328
|
+
const regex = '([a-z]+)+$'
|
|
329
|
+
const string = 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!'
|
|
330
|
+
|
|
331
|
+
RE2JS.compile(regex).matcher(string).find() // avg: 3.6155000030994415 ms
|
|
332
|
+
new RegExp(regex).test(string) // avg: 103768.25712499022 ms
|
|
333
|
+
```
|
|
334
|
+
|
|
335
|
+
In the second example, a ReDoS scenario is depicted. The regular expression `([a-z]+)+$` is a potentially problematic one, as it has a nested quantifier. Nested quantifiers can cause catastrophic backtracking, which results in high processing time, leading to a potential Denial of Service (DoS) attack if a malicious user inputs a carefully crafted string.
|
|
336
|
+
|
|
337
|
+
The string is the same as in the first example, which does not pose a problem for either RE2JS or RegExp under normal circumstances. However, when dealing with the nested quantifier, RE2JS took around **3.62 ms** to find a match, while RegExp took significantly longer, around **103768.26 ms (~104 seconds)**. This demonstrates that RE2JS is much more efficient in handling potentially harmful regular expressions, thus preventing ReDoS attacks.
|
|
338
|
+
|
|
339
|
+
In conclusion, while JavaScript's native RegExp might be faster for simple regular expressions, RE2JS offers significant performance advantages when dealing with complex or potentially dangerous regular expressions. RE2JS provides protection against excessive backtracking that could lead to performance issues or ReDoS attacks.
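
The text does not say how the averages above were measured. One way to take comparable measurements is sketched below. The `averageMs` helper is hypothetical, and the inputs are deliberately much shorter than the README's string so that the native engine finishes quickly. The sketch assumes a runtime with a global `performance` object (browsers and recent Node.js versions).

```js
// Hypothetical helper: average wall-clock time of fn over `runs` calls, in ms.
function averageMs(fn, runs = 20) {
  const start = performance.now()
  for (let i = 0; i < runs; i++) fn()
  return (performance.now() - start) / runs
}

// The nested quantifier from the ReDoS example, run on the native engine.
const nested = /([a-z]+)+$/

// Short inputs only: each extra character makes the failing match slower.
for (const length of [10, 14, 18]) {
  const input = 'a'.repeat(length) + '!'
  const ms = averageMs(() => nested.test(input), 5)
  console.log(`length ${length}: ${ms.toFixed(3)} ms`)
}
```

Passing a closure such as `() => RE2JS.compile(regex).matcher(string).find()` to the same helper would time the RE2JS side under identical conditions.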
|
|
340
|
+
|
|
341
|
+
## Rationale for RE2 JavaScript port
|
|
255
342
|
|
|
256
343
|
There are several reasons that underscore the importance of having an RE2 vanilla JavaScript (JS) port.
|
|
257
344
|
|
|
@@ -271,3 +358,5 @@ yarn node ./tools/scripts/genUnicodeTable.js > src/UnicodeTables.js
|
|
|
271
358
|
```
|
|
272
359
|
|
|
273
360
|
To run `make_perl_groups.pl`, you need to have Perl installed (the version is listed in `.tool-versions`)
|
|
361
|
+
|
|
362
|
+
The [Playground website](https://re2js.leopard.in.ua/) is maintained in the `www` branch
|