re2js 0.1.0

package/LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2023 Alexey Vasiliev
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
package/README.md ADDED
@@ -0,0 +1,273 @@
1
+ # RE2JS is the JavaScript port of RE2, a regular expression engine that provides linear time matching
2
+ [![Test/Build/Deploy](https://github.com/le0pard/re2js/actions/workflows/tests.yml/badge.svg)](https://github.com/le0pard/re2js/actions/workflows/tests.yml)
3
+
4
+ ## TLDR
5
+
6
+ The built-in JavaScript regular expression engine can run in exponential time for certain combinations of pattern and input. This can be exploited in what is referred to as a [Regular Expression Denial of Service (ReDoS)](https://www.owasp.org/index.php/Regular_expression_Denial_of_Service_-_ReDoS) attack. RE2, a different regular expression engine, can effectively safeguard your Node.js applications from ReDoS attacks. RE2JS extends this protection to browser environments as well.
7
+
8
+ ## What is RE2?
9
+
10
+ RE2 is a regular expression engine designed to operate in time proportional to the size of the input, ensuring linear time complexity. RE2JS is a pure JavaScript port of the [RE2 library](https://github.com/google/re2); more specifically, it is a port of the [RE2/J library](https://github.com/google/re2j).
11
+
12
+ JavaScript's standard regular expression package, [RegExp](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_expressions), and many other widely used regular expression packages such as PCRE, Perl and Python use a backtracking implementation strategy: when a pattern presents two alternatives such as `a|b`, the engine will try to match subpattern `a` first, and if that yields no match, it will reset the input stream and try to match `b` instead.
13
+
14
+ If such choices are deeply nested, this strategy requires an exponential number of passes over the input data before it can detect whether the input matches. If the input is large, it is easy to construct a pattern whose running time would exceed the lifetime of the universe. This creates a security risk when accepting regular expression patterns from untrusted sources, such as users of a web application.
15
+
16
+ In contrast, the RE2 algorithm explores all matches simultaneously in a single pass over the input data by using a nondeterministic finite automaton.
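To make the difference concrete, here is a toy sketch that is not RE2JS code. It handles only patterns built from single literal characters that are either required or optional, and it counts the steps taken by a backtracking matcher and by a state-set (NFA-style) matcher on the classic worst case `a?a?...a?aa...a` against a string of `a`s. All function names (`buildPattern`, `backtrackMatch`, `nfaMatch`) are hypothetical and exist only for this illustration.

```js
// Toy pattern: n optional 'a' items followed by n required 'a' items.
function buildPattern(n) {
  const items = []
  for (let i = 0; i < n; i++) items.push({ ch: 'a', optional: true })
  for (let i = 0; i < n; i++) items.push({ ch: 'a', optional: false })
  return items
}

// Backtracking: try to consume a character first; on failure, rewind and
// try skipping the item if it is optional.
function backtrackMatch(items, input) {
  let steps = 0
  const go = (pi, si) => {
    steps++
    if (pi === items.length) return si === input.length
    const item = items[pi]
    if (si < input.length && input[si] === item.ch && go(pi + 1, si + 1)) return true
    return item.optional && go(pi + 1, si)
  }
  return { matched: go(0, 0), steps }
}

// NFA-style simulation: keep the set of all pattern positions that are
// reachable so far and advance the whole set once per input character.
function nfaMatch(items, input) {
  let steps = 0
  // Add position pi plus every position reachable by skipping optional items.
  const addState = (set, pi) => {
    while (true) {
      steps++
      if (set.has(pi)) return
      set.add(pi)
      if (pi < items.length && items[pi].optional) pi++
      else return
    }
  }
  let current = new Set()
  addState(current, 0)
  for (const ch of input) {
    const next = new Set()
    for (const pi of current) {
      steps++
      if (pi < items.length && items[pi].ch === ch) addState(next, pi + 1)
    }
    current = next
  }
  return { matched: current.has(items.length), steps }
}

const n = 16
const items = buildPattern(n)
const input = 'a'.repeat(n)
console.log('backtracking:', backtrackMatch(items, input))
console.log('state set:   ', nfaMatch(items, input))
```

In this toy setup, the backtracking step count roughly doubles each time `n` grows by one, while the state-set step count grows with the product of pattern length and input length. Real engines add many optimizations, so treat the numbers only as an illustration of the growth rates.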
17
+
18
+ There are certain features of PCRE or Perl regular expressions that cannot be implemented in linear time, for example, backreferences, but the vast majority of regular expression patterns in practice avoid such features.
19
+
20
+ ## Usage
21
+
22
+ This document provides a series of examples demonstrating how to use RE2JS in your code. For more detailed information about regex syntax, please visit this page: [Google RE2 Syntax Documentation](https://github.com/google/re2/wiki/Syntax).
23
+
24
+ ### Compiling Patterns
25
+
26
+ You can compile a regex pattern using the `RE2JS.compile()` function:
27
+
28
+ ```js
29
+ import { RE2JS } from 're2js'
30
+
31
+ const p = RE2JS.compile('abc');
32
+ console.log(p.pattern()); // Outputs: 'abc'
33
+ console.log(p.flags()); // Outputs: 0
34
+ ```
35
+
36
+ The `RE2JS.compile()` function also supports flags:
37
+
38
+ ```js
39
+ import { RE2JS } from 're2js'
40
+
41
+ const p = RE2JS.compile('abc', RE2JS.CASE_INSENSITIVE | RE2JS.MULTILINE);
42
+ console.log(p.pattern()); // Outputs: 'abc'
43
+ console.log(p.flags()); // Outputs: 5
44
+ ```
45
+
46
+ Supported flags:
47
+
48
+ ```js
49
+ /**
50
+ * Flag: case insensitive matching.
51
+ */
52
+ RE2JS.CASE_INSENSITIVE
53
+ /**
54
+ * Flag: dot ({@code .}) matches all characters, including newline.
55
+ */
56
+ RE2JS.DOTALL
57
+ /**
58
+ * Flag: multiline matching: {@code ^} and {@code $} match at beginning and end of line, not just
59
+ * beginning and end of input.
60
+ */
61
+ RE2JS.MULTILINE
62
+ /**
63
+ * Flag: Unicode groups (e.g. {@code \p{Greek}}) will be syntax errors.
64
+ */
65
+ RE2JS.DISABLE_UNICODE_GROUPS
66
+ /**
67
+ * Flag: matches longest possible string.
68
+ */
69
+ RE2JS.LONGEST_MATCH
70
+ ```
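Flags are combined with bitwise OR, which is why `CASE_INSENSITIVE | MULTILINE` printed `5` earlier. The sketch below shows how such a bitmask can be built and decoded. The `FLAGS` object and `describeFlags` helper are hypothetical stand-ins, and the numeric values are an assumption based on RE2/J's constants (they are consistent with the `5` shown above); in real code, use the `RE2JS.*` constants rather than hard-coded numbers.

```js
// Hypothetical stand-ins for the RE2JS flag constants (values assumed from RE2/J).
const FLAGS = {
  CASE_INSENSITIVE: 1,
  DOTALL: 2,
  MULTILINE: 4,
  DISABLE_UNICODE_GROUPS: 8,
  LONGEST_MATCH: 16
}

// List the names of all flags that are set in a bitmask.
const describeFlags = (mask) =>
  Object.keys(FLAGS).filter((name) => (mask & FLAGS[name]) !== 0)

const combined = FLAGS.CASE_INSENSITIVE | FLAGS.MULTILINE
console.log(combined) // 5
console.log(describeFlags(combined)) // [ 'CASE_INSENSITIVE', 'MULTILINE' ]
```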
71
+
72
+ ### Checking for Matches
73
+
74
+ RE2JS allows you to check whether a string matches a given regex pattern using the `RE2JS.matches()` function:
75
+
76
+ ```js
77
+ import { RE2JS } from 're2js'
78
+
79
+ RE2JS.matches('ab+c', 'abbbc') // true
80
+ RE2JS.matches('ab+c', 'cbbba') // false
81
+ // or
82
+ RE2JS.compile('ab+c').matches('abbbc') // true
83
+ RE2JS.compile('ab+c').matches('cbbba') // false
84
+ // with flags
85
+ RE2JS.compile('ab+c', RE2JS.CASE_INSENSITIVE).matches('AbBBc') // true
86
+ ```
87
+
88
+ ### Finding Matches
89
+
90
+ To find a match for a given regex pattern in a string, you can use the `find()` function:
91
+
92
+ ```js
93
+ import { RE2JS } from 're2js'
94
+
95
+ RE2JS.compile('ab+c').matcher('xxabbbc').find() // true
96
+ RE2JS.compile('ab+c').matcher('cbbba').find() // false
97
+ // with flags
98
+ RE2JS.compile('ab+c', RE2JS.CASE_INSENSITIVE).matcher('abBBc').find() // true
99
+ ```
100
+
101
+ The `find()` method also accepts a start index and searches for a pattern match from that position in the string:
102
+
103
+ ```js
104
+ import { RE2JS } from 're2js'
105
+
106
+ const p = RE2JS.compile('.*[aeiou]')
107
+ const matchString = p.matcher('abcdefgh')
108
+ matchString.find(0) // true
109
+ matchString.group() // 'abcde'
110
+ matchString.find(1) // true
111
+ matchString.group() // 'bcde'
112
+ matchString.find(4) // true
113
+ matchString.group() // 'e'
114
+ matchString.find(7) // false
115
+ ```
116
+
117
+ ### Splitting Strings
118
+
119
+ You can split a string based on a regex pattern using the `split()` function:
120
+
121
+ ```js
122
+ import { RE2JS } from 're2js'
123
+
124
+ RE2JS.compile('/').split('abcde') // ['abcde']
125
+ RE2JS.compile('/').split('a/b/cc//d/e//') // ['a', 'b', 'cc', '', 'd', 'e']
126
+ RE2JS.compile(':').split(':a::b') // ['', 'a', '', 'b']
127
+ ```
128
+
129
+ The `split()` function also supports a limit parameter:
130
+
131
+ ```js
132
+ import { RE2JS } from 're2js'
133
+
134
+ RE2JS.compile('/').split('a/b/cc//d/e//', 3) // ['a', 'b', 'cc//d/e//']
135
+ RE2JS.compile('/').split('a/b/cc//d/e//', 4) // ['a', 'b', 'cc', '/d/e//']
136
+ RE2JS.compile('/').split('a/b/cc//d/e//', 9) // ['a', 'b', 'cc', '', 'd', 'e', '', '']
137
+ RE2JS.compile(':').split('boo:and:foo', 2) // ['boo', 'and:foo']
138
+ RE2JS.compile(':').split('boo:and:foo', 5) // ['boo', 'and', 'foo']
139
+ ```
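The outputs above follow Java-style limit semantics: with no limit, trailing empty strings are dropped; with a positive limit `n`, at most `n` pieces are returned and the last piece holds the unsplit remainder. The sketch below reproduces that behavior for a literal separator only. `splitWithLimit` is a hypothetical helper written for illustration, not the RE2JS implementation, and it does not cover negative limits or regex separators.

```js
// Hypothetical helper: split on a literal, non-empty separator using the
// limit rules suggested by the examples above.
function splitWithLimit(input, sep, limit = 0) {
  if (sep === '') throw new Error('separator must not be empty')
  const parts = []
  let start = 0
  // With a positive limit, stop after limit - 1 splits.
  while (limit === 0 || parts.length < limit - 1) {
    const idx = input.indexOf(sep, start)
    if (idx === -1) break
    parts.push(input.slice(start, idx))
    start = idx + sep.length
  }
  // The remainder (possibly empty) is always the last piece.
  parts.push(input.slice(start))
  // Without a limit, trailing empty strings are discarded.
  if (limit === 0) {
    while (parts.length > 1 && parts[parts.length - 1] === '') parts.pop()
  }
  return parts
}

console.log(splitWithLimit('a/b/cc//d/e//', '/')) // [ 'a', 'b', 'cc', '', 'd', 'e' ]
console.log(splitWithLimit('a/b/cc//d/e//', '/', 3)) // [ 'a', 'b', 'cc//d/e//' ]
```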
140
+
141
+ ### Working with Groups
142
+
143
+ RE2JS supports capturing groups in regex patterns.
144
+
145
+ #### Group Count
146
+
147
+ You can get the number of capturing groups in a pattern using the `groupCount()` function:
148
+
149
+ ```js
150
+ import { RE2JS } from 're2js'
151
+
152
+ RE2JS.compile('(.*)ab(.*)a').groupCount() // 2
153
+ RE2JS.compile('(.*)((a)b)(.*)a').groupCount() // 4
154
+ RE2JS.compile('(.*)(\\(a\\)b)(.*)a').groupCount() // 3
155
+ ```
156
+
157
+ #### Named Groups
158
+
159
+ You can access the named groups in a pattern, mapped to their group indexes, using the `namedGroups()` function:
160
+
161
+ ```js
162
+ import { RE2JS } from 're2js'
163
+
164
+ RE2JS.compile('(?P<foo>\\d{2})').namedGroups() // { foo: 1 }
165
+ RE2JS.compile('\\d{2}').namedGroups() // {}
166
+ RE2JS.compile('(?P<foo>.*)(?P<bar>.*)').namedGroups() // { foo: 1, bar: 2 }
167
+ ```
168
+
169
+ #### Group Content
170
+
171
+ The `group()` method retrieves the content matched by a specific capturing group, or `null` if that group did not participate in the match:
172
+
173
+ ```js
174
+ import { RE2JS } from 're2js'
175
+
176
+ const p = RE2JS.compile('(a)(b(c)?)d?(e)')
177
+ const matchString = p.matcher('xabdez')
178
+ if (matchString.find()) {
179
+ matchString.group(0) // 'abde'
180
+ matchString.group(1) // 'a'
181
+ matchString.group(2) // 'b'
182
+ matchString.group(3) // null
183
+ matchString.group(4) // 'e'
184
+ }
185
+ ```
186
+
187
+ #### Named Group Content
188
+
189
+ The `group()` method also accepts a group name and retrieves the content matched by that named capturing group:
190
+
191
+ ```js
192
+ import { RE2JS } from 're2js'
193
+
194
+ const p = RE2JS.compile(
195
+ '(?P<baz>f(?P<foo>b*a(?P<another>r+)){0,10})(?P<bag>bag)?(?P<nomatch>zzz)?'
196
+ )
197
+ const matchString = p.matcher('fbbarrrrrbag')
198
+ if (matchString.matches()) {
199
+ matchString.group('baz') // 'fbbarrrrr'
200
+ matchString.group('foo') // 'bbarrrrr'
201
+ matchString.group('another') // 'rrrrr'
202
+ matchString.group('bag') // 'bag'
203
+ matchString.group('nomatch') // null
204
+ }
205
+ ```
206
+
207
+ ### Replacing Matches
208
+
209
+ RE2JS allows you to replace either all occurrences or only the first occurrence of a pattern match in a string with a replacement string.
210
+
211
+ #### Replacing All Occurrences
212
+
213
+ The `replaceAll()` method replaces all occurrences of a pattern match in a string with the given replacement:
214
+
215
+ ```js
216
+ import { RE2JS } from 're2js'
217
+
218
+ RE2JS.compile('Frog')
219
+ .matcher("What the Frog's Eye Tells the Frog's Brain")
220
+ .replaceAll('Lizard') // "What the Lizard's Eye Tells the Lizard's Brain"
221
+ RE2JS.compile('(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)')
222
+ .matcher('abcdefghijklmnopqrstuvwxyz123')
223
+ .replaceAll('$10$20') // 'jb0wo0123'
224
+ ```
225
+
226
+ Note that the replacement string can include references to capturing groups from the pattern, such as `$10` in the example above.
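The `'$10$20'` result may look surprising: the pattern has only 13 groups, so `$20` cannot name a group. The output is consistent with a Java-style rule in which digits after `$` are consumed only while the number still refers to an existing group, so `$20` is read as group 2 followed by a literal `0`. The sketch below illustrates that rule. `expandReplacement` is a hypothetical helper, not the RE2JS implementation, and it ignores other replacement syntax such as escapes or named references.

```js
// Hypothetical helper: expand `$N` references in a replacement template.
// groups[0] is the whole match, groups[i] is the text of capturing group i.
function expandReplacement(template, groups) {
  const maxGroup = groups.length - 1
  const isDigit = (c) => c >= '0' && c <= '9'
  let out = ''
  let i = 0
  while (i < template.length) {
    if (template[i] === '$' && i + 1 < template.length && isDigit(template[i + 1])) {
      let n = Number(template[i + 1])
      let j = i + 2
      // Consume further digits only while the number still names an existing group.
      while (j < template.length && isDigit(template[j])) {
        const candidate = n * 10 + Number(template[j])
        if (candidate > maxGroup) break
        n = candidate
        j++
      }
      if (n > maxGroup) throw new RangeError(`no group ${n}`)
      // A group that did not participate in the match contributes nothing.
      out += groups[n] == null ? '' : groups[n]
      i = j
    } else {
      out += template[i]
      i++
    }
  }
  return out
}

// First match from the example above: 13 one-character groups 'a' to 'm'.
const groups = ['abcdefghijklm', ...'abcdefghijklm']
console.log(expandReplacement('$10$20', groups)) // 'jb0'
```

Applying the same rule to the second match, `nopqrstuvwxyz`, gives `wo0`, and the unmatched tail `123` is copied unchanged, which agrees with the `'jb0wo0123'` result shown above.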
227
+
228
+ #### Replacing the First Occurrence
229
+
230
+ The `replaceFirst()` method replaces only the first occurrence of a pattern match in a string with the given replacement:
231
+
232
+ ```js
233
+ import { RE2JS } from 're2js'
234
+
235
+ RE2JS.compile('Frog')
236
+ .matcher("What the Frog's Eye Tells the Frog's Brain")
237
+ .replaceFirst('Lizard') // "What the Lizard's Eye Tells the Frog's Brain"
238
+ RE2JS.compile('(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)(.)')
239
+ .matcher('abcdefghijklmnopqrstuvwxyz123')
240
+ .replaceFirst('$10$20') // 'jb0nopqrstuvwxyz123'
241
+ ```
242
+
243
+ ## Performance
244
+
245
+ The RE2JS engine runs more slowly than native RegExp objects, and also more slowly than the original RE2 engine. The primary reason is the lack of a synchronous threads solution within the browser environment. This matters because the regex engine requires a synchronous API to operate optimally.
246
+
247
+ The C++ implementation of the RE2 engine includes both NFA (Nondeterministic Finite Automaton) and DFA (Deterministic Finite Automaton) engines, as well as a variety of optimizations. Russ Cox ported a simplified version of the NFA engine to Go. Later, Alan Donovan ported the NFA-based Go implementation to Java. I then ported the NFA-based Java implementation to a pure JS version. This is another reason why the pure JS version will perform more slowly compared to the original RE2 engine.
248
+
249
+ If you need high performance on the server side when using RE2, consider the following packages for JS:
250
+
251
+ - [Node-RE2](https://github.com/uhop/node-re2/): A powerful RE2 package for Node.js
252
+ - [RE2-WASM](https://github.com/google/re2-wasm/): This package is a WASM wrapper for RE2. Please note, as of now, it does not work in browsers
253
+
254
+ ## Justification for this JS port's existence
255
+
256
+ There are several reasons that underscore the importance of having an RE2 vanilla JavaScript (JS) port.
257
+
258
+ Firstly, it enables RE2 JS validation on the client side within the browser. This is vital as it allows the implementation and execution of regular expression operations directly in the browser, enhancing performance by reducing the necessity of server-side computations and back-and-forth communication.
259
+
260
+ Secondly, it provides a platform for simple RE2 parsing, specifically for the extraction of regex groups. This feature is particularly useful when dealing with complex regular expressions, as it allows for the breakdown of regex patterns into manageable and identifiable segments or 'groups'.
261
+
262
+ These factors combined make the RE2 vanilla JS port a valuable tool for developers needing to work with complex regular expressions within a browser environment.
263
+
264
+ ## Development
265
+
266
+ Some files, such as `CharGroup.js` and `UnicodeTables.js`, are generated. Do not edit them directly; change the generator scripts and regenerate them instead:
267
+
268
+ ```bash
269
+ ./tools/scripts/make_perl_groups.pl > src/CharGroup.js
270
+ yarn node ./tools/scripts/genUnicodeTable.js > src/UnicodeTables.js
271
+ ```
272
+
273
+ To run `make_perl_groups.pl`, you need to have Perl installed (the version is listed in `.tool-versions`).