regexador 0.4.5

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: b16d02292d2ae9d888a09c04d2e3330c72be9836
4
+ data.tar.gz: d4b472b343ece2a984b30ceecae7968fefdb6c93
5
+ SHA512:
6
+ metadata.gz: 1f12498dd028fb55dcef7f927d81a1057af6441824c56d2c6fb8dfe9ee9c99e3c6fb7caa9c94069e848489b737e5d025196a9bf74f4fa844014a61509d361adb
7
+ data.tar.gz: 896ba315badba2f4bdfce8d731c3a3802111ce712ac1bbb768bb4fa8c0f54d1306eabbba896214ccd97fef74c9439f66abcd66a5c9612eb494959dc67d2f256f
@@ -0,0 +1,385 @@
1
+ **UPDATING for 2019**
2
+ - create gemspec
3
+ - update code for Ruby 2.6
4
+ - convert RSpec to MiniTest
5
+ - add more tests
6
+ - add more examples
7
+ - add a tutorial
8
+ - begin work on translating Ruby regexes
9
+ - investigate Python/Perl/Elixir compatibility
10
+ - investigate possibility of engine mockup with debugger
11
+
12
+
13
+ # regexador
14
+
15
+ An external DSL for Ruby that tries to make regular expressions readable and maintainable.
16
+
17
+ **PLEASE NOTE**: This README may not be as up-to-date
18
+ as [the wiki](http://github.com/Hal9000/regexador/wiki).
19
+
20
+ ### The Basic Concept
21
+
22
+ Many people are intimidated or confused by regular expressions.
23
+ A large part of this is the confusing syntax.
24
+
25
+ Regexador is a mini-language purely for building regular expressions.
26
+ It's a Ruby-only project for now, though in theory it could be
27
+ implemented in/for other languages.
28
+
29
+ For an analogy, think of how we sometimes manipulate databases by
30
+ constructing SQL queries and passing them into the appropriate
31
+ methods. Regexador works much the same way.
32
+
33
+ ### A Short Example
34
+
35
+ Suppose we want to match a string consisting of a single IP address.
36
+ (Remember that the numbers can only range as high as 255.)
37
+
38
+ Here is traditional regular expression notation:
39
+
40
+ /^(25[0-5]|2[0-4]\d|([01])?(\d){1,2})\.(25[0-5]|2[0-4]\d|([01])?(\d){1,2})\.(25[0-5]|2[0-4]\d|([01])?(\d){1,2})\.(25[0-5]|2[0-4]\d|([01])?(\d){1,2})$/
41
+
42
+ And here is Regexador notation:
43
+
44
+ dot = "."
45
+ num = "25" D5 | `2 D4 D | maybe D1 1,2*D
46
+ match BOS num dot num dot num dot num EOS end
47
+
48
+ In your Ruby code, you can create a Regexador "script" or "program"
49
+ (probably by means of a here-document) that you can then pass into
50
+ the Regexador class. At minimum, you can convert this into a "real"
51
+ Ruby regular expression; there are a few other features and functions,
52
+ and more may be added.
53
+
54
+ So here is a complete Ruby program:
55
+
56
+ require 'regexador'
57
+
58
+ program = <<-EOS
59
+ dot = "."
60
+ num = "25" D5 | `2 D4 D | maybe D1 1,2*D
61
+ match WB num dot num dot num dot num WB end
62
+ EOS
63
+
64
+ pattern = Regexador.new(program)
65
+
66
+ puts "Give me an IP address"
67
+ str = gets.chomp
68
+
69
+ rx = pattern.to_regex # Can retrieve the actual regex
70
+
71
+ if pattern.match?(str) # ...or use in other direct ways
72
+ puts "Valid"
73
+ else
74
+ puts "Invalid"
75
+ end
76
+
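For comparison, the same reuse-by-name idea can be sketched in plain Ruby by interpolating a smaller `Regexp` into a larger one. This does not use Regexador at all; the names `OCTET`, `IP_ADDRESS`, and `valid_ip?` are made up for illustration, and the pattern follows the traditional regex shown earlier rather than anything Regexador is known to emit.

```ruby
# Plain-Ruby sketch, no Regexador required.
# OCTET mirrors `num` in the script: 250-255, 200-249, or 0-199.
OCTET = /25[0-5]|2[0-4]\d|[01]?\d{1,2}/

# Interpolating a Regexp wraps it in its own (?-mix:...) group,
# so the alternation inside OCTET stays contained.
IP_ADDRESS = /\A#{OCTET}\.#{OCTET}\.#{OCTET}\.#{OCTET}\z/

# Hypothetical helper: true when the whole string is a dotted quad.
def valid_ip?(str)
  IP_ADDRESS.match?(str)
end

puts valid_ip?("192.168.1.1")
puts valid_ip?("256.1.1.1")
```

Anchoring with `\A` and `\z` checks the whole string, which corresponds to the BOS/EOS form of the script rather than the WB form.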
77
+
78
+
79
+ **Traditional Syntax: Things I Personally Dislike**
80
+
81
+ - There are no keywords -- only punctuation.
82
+ These symbols all have special meanings: ^$.\[]()+\*? (and others)
83
+ - ^ has at least three different meanings
84
+ - [ and ] each have two or three different meanings
85
+ - Parentheses aren't just for grouping, but for specifying captures
86
+ - Character literals are "naked"
87
+ - Excessive punctuation makes use of backslash common
88
+ - Repetition is strictly postfix form
89
+ - Typically (except for Ruby's /x): They're not multi-line, they don't allow comments, and whitespace is highly significant.
90
+ - There's no way to avoid duplication (e.g., by assigning subexpressions to variables).
91
+ - And other things I'm forgetting
92
+
93
+
94
+ ### Regexador at a Glance
95
+
96
+ I'm attracted to old-fashioned line-oriented syntax; but I don't want
97
+ to lock myself into that completely.
98
+
99
+ In general, useful definitions (variables) will come first. Many things
100
+ are predefined already, such as all the usual anchors and the POSIX
101
+ character classes. These are in all caps and are considered constants.
102
+
103
+ At the end, a *match* clause drives the actual building of the final
104
+ regular expression. Within this clause, names may be assigned to the
105
+ individual sub-matches (using variables that start with "@"). These will
106
+ naturally be available externally as named captures.
107
+
108
+ Because this is really just a "builder," and because we don't have "hooks"
109
+ into the regular expression engine itself, a Regexador script will not
110
+ look or act much like a "real program." There will be no arithmetic, no
111
+ function calls, no looping or branching. Also there can be no printing
112
+ of debug information "at matching time"; in principle, printing could be
113
+ done during parsing/compilation, but I don't see any value in this.
114
+
115
+ Of course, syntax errors in Regexador will be found and made available
116
+ to the caller.
117
+
118
+
119
+ **Beginning at the Beginning**
120
+
121
+ I've tried to "think ahead" so as not to paint myself into a corner
122
+ too much.
123
+
124
+ However, probably not all of this can be implemented in the first
125
+ version. The current "working version" (0.2.7) has been implemented
126
+ over a period of nine weeks.
127
+
128
+ Therefore some of the syntax described in the following will not be
129
+ available right away.
130
+
131
+ Features still postponed:
132
+ - intra-line comments: #{...}
133
+ - case/end
134
+ - unsure about upto, thru
135
+ - unsure about next, last
136
+ - pos/neg lookahead/behind
137
+
138
+
139
+ **Syntax notes:**
140
+
141
+ "abc" A char string /abc/
142
+ `a A single character /a/
143
+ &2345 Unicode char U+2345
144
+ ~`a Negated char class /[^a]/
145
+ 'abc' One of class a, b, c /[abc]/
146
+ `a-`z Char range /[a-z]/
147
+ `a~`z Negated char range /[^a-z]/
148
+ p1 | p2 Alternative
149
+ maybe PAT Optional pattern PAT?
150
+ any PAT Zero or more of pattern PAT\*
151
+ many PAT One or more of pattern PAT+
152
+ nocase PAT Case-insensitive PAT (?i)PAT
153
+ 0,1 * PAT Same as maybe PAT?
154
+ 1,3 * PAT One to three of PAT PAT{1,3}
155
+ 5 * PAT Five of PAT PAT{5}
156
+ @var A named capture \g<var>{0}
157
+ :var A parameter passed in
158
+ %alpha POSIX or Ruby char class [[:alpha:]]
159
+ var = val Assign value to local var
160
+ match Start assembling the regex
161
+ # ... Comment
162
+ D Digit /[0-9]/
163
+ D1, D2, ... 0 through whatever /[0-1]/ /[0-2]/ ...
164
+ X Any character /./
165
+ WB Word boundary /\b/
166
+ CR Carriage return "\r" /\r/
167
+ LF Linefeed "\n" /\n/
168
+ NL Newline "\n" /\n/
169
+ START Start of the string /\A/
170
+ END End of the string /\Z/
171
+
172
+
173
+ "On hold" for now...
174
+
175
+ upto `a All non-a chars until a /([^a]\*?a)/
176
+ thru `a All chars including next a /(.\*?a)/
177
+ last PAT Greedy (.\*)PAT
178
+ next PAT Non-greedy (default) (.\*?)PAT
179
+ #{...} Inline comment
180
+ case/when/end Complex alternatives
181
+
182
+
183
+ ### Notes, precedence, etc.
184
+
185
+ any, many, maybe, nocase ... These refer to the very next pattern (but parentheses are legal):
186
+
187
+ maybe "abc" many "xyz" /(abc)?(xyz)+/
188
+ maybe many "def" /((def)+)?/
189
+ maybe ("abc" many "xyz") /(abc(xyz)+)?/
190
+ "abc" nocase "def" "ghi" /abc((?i)def)ghi/
191
+
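The scoping of `nocase` to a single sub-pattern can be checked against an ordinary Ruby regex. This is a sketch in plain Ruby, not Regexador output: it uses the `(?i:...)` group form, which is assumed here to be equivalent to the `((?i)def)` form shown above, and the constant names are illustrative only.

```ruby
# Case-insensitivity scoped to the middle part only.
SCOPED_NOCASE = /\Aabc(?i:def)ghi\z/

# Whole-pattern /i flag, for contrast.
GLOBAL_NOCASE = /\Aabcdefghi\z/i

puts SCOPED_NOCASE.match?("abcDEFghi")  # middle part may vary in case
puts SCOPED_NOCASE.match?("ABCdefghi")  # outer parts may not
puts GLOBAL_NOCASE.match?("ABCdefghi")  # the /i flag covers everything
```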
192
+ String concatenation is implied:
193
+
194
+ str = "abc" NL "def" /abc\ndef/
195
+
196
+ Strings don't interpolate and the backslash is not special (unsure?):
197
+
198
+ str = "lm\nop" /lm\\nop/
199
+
200
+ A character literal is essentially the same as a one-character string.
201
+
202
+ c1 = `$ /\$/
203
+ s1 = "$" /\$/
204
+
205
+ However, a character can be negated, while a string (at present) cannot.
206
+
207
+ n1 = ~`$ /[^$]/
208
+
209
+ It is possible to use the "ampersand" notation (with four hex digits)
210
+ to specify a Unicode codepoint explicitly.
211
+
212
+ &20ac /€/
213
+
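As a rough illustration of what the ampersand notation amounts to, the following plain-Ruby sketch turns four hex digits into a regex for that single character. The helper name `codepoint_regex` is hypothetical and is not part of Regexador.

```ruby
# Hypothetical helper: turn four hex digits (as in &20ac) into a regex
# that matches that one Unicode character.
def codepoint_regex(hex4)
  raise ArgumentError, "expected four hex digits" unless hex4.match?(/\A\h{4}\z/)
  char = [hex4.hex].pack("U")        # UTF-8 string holding the codepoint
  Regexp.new(Regexp.escape(char))
end

EURO_RX = codepoint_regex("20ac")
puts EURO_RX.match?("price: 5\u20ac")
```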
214
+ The encoding is assumed to be UTF-8. Characters used as literals are limited
215
+ only by the editor and the current Ruby encoding.
216
+
217
+ str = "æßçöñ" /æßçöñ/
218
+
219
+ Tokens such as any, many, match, (etc.) are keywords, and as such cannot be local variable names.
220
+
221
+ However, parameters (starting with colon) and named matches (starting with @) can be named @any, :many, and so on.
222
+
223
+ Capitalized predefined matches such as WB (word boundary) are really keywords also.
224
+
225
+ Alternation binds very loosely:
226
+
227
+ many "abc" | "xyz" /(abc)+|xyz/
228
+ (many "abc") | "xyz" /(abc)+|xyz/ # Same as above
229
+ many ("abc" | "xyz") /(abc|xyz)+/ # Different!
230
+
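The difference between the loose and the parenthesized forms can be checked against plain Ruby regexes. This is only a sketch of the target regexes listed above, anchored so that whole strings are compared; the constant names are illustrative.

```ruby
# many "abc" | "xyz"    -> repetition applies to "abc" only
LOOSE = /\A(?:(abc)+|xyz)\z/

# many ("abc" | "xyz")  -> repetition applies to the whole alternation
GROUPED = /\A(abc|xyz)+\z/

%w[abcabc xyz xyzxyz xyzabc].each do |s|
  puts "#{s}: loose=#{LOOSE.match?(s)} grouped=#{GROUPED.match?(s)}"
end
```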
231
+ A variable may refer to a string, a number, or a pattern:
232
+
233
+ var1 = 3
234
+ var2 = "abc" # Really a string is a pattern too
235
+ var3 = maybe many D
236
+
237
+ There is no arithmetic, but variables may be used where numbers may:
238
+
239
+ m = 3
240
+ n = 5
241
+ m,n * "xyz" /(xyz){3,5}/
242
+
243
+ Parameters may be used the same way:
244
+
245
+ # Assuming params :m, :n are 2 and 4
246
+ :m,:n * "xyz" /(xyz){2,4}/
247
+
248
+ But data type matters, of course:
249
+
250
+ m = 3
251
+ n = "foo"
252
+ m,n * "def" # Syntax error!
253
+
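The way numeric variables or parameters feed into a repetition count, and why a string in that position has to be rejected, can be sketched in plain Ruby. The helper `repeat_pattern` is hypothetical; Regexador reports a syntax error in this situation, whereas this sketch simply raises a `TypeError`.

```ruby
# Hypothetical helper: build /(literal){min,max}/ from numeric bounds.
def repeat_pattern(literal, min, max)
  unless min.is_a?(Integer) && max.is_a?(Integer)
    raise TypeError, "repetition bounds must be integers"
  end
  Regexp.new("(#{Regexp.escape(literal)}){#{min},#{max}}")
end

puts repeat_pattern("xyz", 2, 4).source

begin
  repeat_pattern("def", 3, "foo")
rescue TypeError => e
  puts "rejected: #{e.message}"
end
```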
254
+ The "match clause" uses all previous definitions to finally build the regular expression. It starts with "match" and ends with "end":
255
+
256
+ match "abc" | "def" | many `x end
257
+
258
+ Named matches are only used inside the match clause; anywhere a pattern may be used, "@var = pattern" may also be used.
259
+
260
+ match @first = (many %alpha) SPACES @last = (many %alpha) end
261
+
262
+ Multiple lines are fine (and more readable):
263
+
264
+ match
265
+ @first = many %alpha
266
+ SPACES
267
+ @last = many %alpha
268
+ end
269
+
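On the Ruby side, named matches correspond to ordinary named captures. The sketch below uses a hand-written regex that is assumed to be roughly what the match clause above would produce (it has not been checked against Regexador's actual output); `NAME_RX` and `split_name` are illustrative names.

```ruby
# Assumed equivalent of: @first = many %alpha  SPACES  @last = many %alpha
NAME_RX = /(?<first>[[:alpha:]]+)\s+(?<last>[[:alpha:]]+)/

# Hypothetical helper: returns [first, last], or nil when nothing matches.
def split_name(str)
  m = NAME_RX.match(str)
  m && [m[:first], m[:last]]
end

first, last = split_name("Ada Lovelace")
puts "first=#{first} last=#{last}"
```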
270
+ A "case" may be used for more complex alternatives (needed??):
271
+
272
+ case
273
+ when "abc" ...
274
+ when "def" ...
275
+ when "xyz" ...
276
+ end
277
+
278
+ Multiple "programs" can be concatenated, assuming the initial ones are all definitions and there is only one match clause at the end.
279
+
280
+ # Ruby code
281
+ defs = "..."
282
+ prog = "..."
283
+ matcher = Regexador.new(defs + prog)
284
+
285
+ Pass in parameters this way:
286
+
287
+ # Ruby code
288
+ prog = "..."
289
+ matcher = Regexador.new(prog, this: 3, that: "foo")
290
+
291
+ Possibly invoke "on its own" (compile to regex internally) or explicitly compile?
292
+
293
+ result = matcher.match(str)
294
+ if result.ok?
295
+ alpha, beta = result[:alpha, :beta] # Captured matches
296
+ end
297
+
298
+ # Alternatively:
299
+ rx = matcher.regexp # Return a Ruby regex, use however
300
+
301
+ ### Examples
302
+
303
+ Match a signed float /[-+]?[0-9]+\.[0-9]+([Ee][-+]?[0-9]+)?/
304
+
305
+ sign = '+-'
306
+ digits = many D
307
+ match
308
+ @sign = maybe sign
309
+ @left = digits
310
+ `.
311
+ @right = digits
312
+ maybe ('Ee' @exp=(maybe sign digits))
313
+ end
314
+
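To see which parts the named matches would pick up, here is a plain-Ruby sketch with named captures. The regex is written by hand under the assumption that the exponent may carry its own sign, as the script suggests; `FLOAT_RX` and `float_parts` are hypothetical names.

```ruby
# Hand-written regex with the same capture names as the script.
FLOAT_RX = /\A(?<sign>[-+])?(?<left>\d+)\.(?<right>\d+)(?:[Ee](?<exp>[-+]?\d+))?\z/

# Hypothetical helper: hash of captures (string keys), or nil on no match.
def float_parts(str)
  m = FLOAT_RX.match(str)
  m && m.named_captures
end

p float_parts("+3.14e-10")
p float_parts("42")
```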
315
+ Match balanced HTML tags and capture cdata /\<TAG\b[^\>]\*\>(.\*?)\<\/TAG\>/
316
+
317
+ # Note that :tag is a parameter, so for example,
318
+ # TABLE or BODY might be passed in
319
+ match
320
+ `< :tag WB
321
+ @cdata = (upto `>)
322
+ "</" :tag `>
323
+ end
324
+
325
+
326
+ Match IP address (honoring 255 limit) Regex: /\b(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\b/
327
+
328
+ dot = "."
329
+ num = "25" D5 | `2 D4 D | maybe D1 1,2*D
330
+ match WB num dot num dot num dot num WB end
331
+
332
+ Determine whether a credit card number is valid Regex: /^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$/
333
+
334
+ # Warning: This one likely has errors!
335
+ # Assuming no spaces
336
+
337
+ # Visa: ^4[0-9]{12}(?:[0-9]{3})?$
338
+ # All Visa card numbers start with a 4. New cards have 16 digits. Old cards have 13.
339
+ # MasterCard: ^5[1-5][0-9]{14}$
340
+ # All MasterCard numbers start with the numbers 51 through 55. All have 16 digits.
341
+ # American Express: ^3[47][0-9]{13}$
342
+ # American Express card numbers start with 34 or 37 and have 15 digits.
343
+ # Diners Club: ^3(?:0[0-5]|[68][0-9])[0-9]{11}$
344
+ # Diners Club card numbers begin with 300 through 305, 36 or 38. All have 14 digits.
345
+ # There are Diners Club cards that begin with 5 and have 16 digits. These are a
346
+ # joint venture between Diners Club and MasterCard, and should be processed like
347
+ # a MasterCard.
348
+ # Discover: ^6(?:011|5[0-9]{2})[0-9]{12}$
349
+ # Discover card numbers begin with 6011 or 65. All have 16 digits.
350
+ # JCB: ^(?:2131|1800|35\d{3})\d{11}$
351
+ # JCB cards beginning with 2131 or 1800 have 15 digits. JCB cards beginning with 35 have 16 digits.
352
+
353
+ visa = `4 12\*D maybe 3\*D
354
+ mc = `5 `1-`5 14\*D
355
+ amex = `3 '47' 13\*D
356
+ diners = `3 (`0 D5 | '68' D) 11\*D
357
+ discover = `6 ("011" | `5 2\*D) 12\*D
358
+ jcb = ("2131"|"1800"|"35" 3\*D) 11\*D
359
+
360
+ match visa | mc | amex | diners | discover | jcb end
361
+
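The per-issuer breakdown can also be expressed in plain Ruby by keeping one small regex per card type. This sketch copies the issuer regexes from the comments above; `CARD_PATTERNS` and `card_type` are illustrative names, and the check covers the number format only (no checksum).

```ruby
# One regex per issuer, taken from the commented patterns above.
CARD_PATTERNS = {
  visa:       /4\d{12}(?:\d{3})?/,
  mastercard: /5[1-5]\d{14}/,
  amex:       /3[47]\d{13}/,
  diners:     /3(?:0[0-5]|[68]\d)\d{11}/,
  discover:   /6(?:011|5\d{2})\d{12}/,
  jcb:        /(?:2131|1800|35\d{3})\d{11}/
}.freeze

# Hypothetical helper: first issuer whose pattern matches the whole string.
def card_type(number)
  CARD_PATTERNS.each do |name, rx|
    return name if /\A#{rx}\z/.match?(number)
  end
  nil
end

p card_type("4111111111111111")
p card_type("1234")
```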
362
+ ### Open Questions
363
+
364
+ 1. What about pos/neg lookahead/lookbehind, possessive matches? Laziness??
365
+ 2. Do upto and thru really make sense?
366
+ 3. Do next and last really make sense?
367
+ 4. How to handle /m? /o?
368
+ 5. What special symbols/anchors do we need to predefine?
369
+ 6. Possibly allow postfix repetition as well as prefix? (e.g.: pattern \* 1,3)
370
+ 7. Other issues...
371
+
372
+ ### Update history
373
+
374
+ This history has been maintained only since version 0.4.2
375
+
376
+ *0.4.3*
377
+ - Experimenting with lookarounds (pos/neg lookahead/behind)
378
+ - Rearranged tests
379
+ - Added "escaping" keyword
380
+ *0.4.2*
381
+ - UTF-8 encoding is assumed
382
+ - &xxxx notation can specify an arbitrary Unicode codepoint
383
+ - Backreferences work as expected
384
+ - Backreferences now can be inlined and parenthesized
385
+ - The nocase qualifier permits case-insensitive sub-expressions
@@ -0,0 +1,22 @@
1
+ abort "Require out of order" if ! defined? Regexador
2
+
3
+ class Regexador::Parser
4
+ rule(:cSQUOTE) { str("'") }
5
+ rule(:cQUOTE) { str('"') }
6
+ rule(:cAMPERSAND) { str('&') }
7
+ rule(:cTICK) { str('`') }
8
+ rule(:cBAR) { str('|') }
9
+ rule(:cPERCENT) { str('%') }
10
+ rule(:cCOMMA) { str(',') }
11
+ rule(:cHYPHEN) { str('-') }
12
+ rule(:cTILDE) { str('~') }
13
+ rule(:cUNDERSCORE) { str('_') }
14
+ rule(:cEQUAL) { str('=') }
15
+ rule(:cHASH) { str('#') }
16
+ rule(:cTIMES) { str('*') }
17
+ rule(:cAT) { str("@") }
18
+ rule(:cCOLON) { str(":") }
19
+ rule(:cLPAREN) { str('(') }
20
+ rule(:cRPAREN) { str(')') }
21
+ rule(:cNEWLINE) { str("\n") }
22
+ end
@@ -0,0 +1,22 @@
1
+
2
+ abort "Require out of order" if ! defined? Regexador
3
+
4
+ class Regexador::Parser
5
+
6
+ rule(:kANY) { str("any") }
7
+ rule(:kMANY) { str("many") }
8
+ rule(:kMAYBE) { str("maybe") }
9
+ rule(:kMATCH) { str("match") }
10
+ rule(:kEND) { str("end") }
11
+ rule(:kNOCASE) { str("nocase") }
12
+
13
+ rule(:kWITH) { str("with") }
14
+ rule(:kWITHOUT) { str("without") }
15
+ rule(:kFIND) { str("find") }
16
+
17
+ rule(:kWITHIN) { str("within") }
18
+ rule(:kESCAPING) { str("escaping") }
19
+
20
+ rule(:keyword) { kANY | kMANY | kMAYBE | kMATCH | kEND | kNOCASE |
21
+ kWITHOUT | kWITHIN | kWITH | kFIND | kESCAPING } # longer "with..." keywords before kWITH
22
+ end
@@ -0,0 +1,49 @@
1
+
2
+ abort "Require out of order" if ! defined? Regexador
3
+
4
+ class Regexador::Parser
5
+
6
+ Predef2Regex = {
7
+ pD: "\\d",
8
+ pD0: "0",
9
+ pD1: "[01]",
10
+ pD2: "[0-2]",
11
+ pD3: "[0-3]",
12
+ pD4: "[0-4]",
13
+ pD5: "[0-5]",
14
+ pD6: "[0-6]",
15
+ pD7: "[0-7]",
16
+ pD8: "[0-8]",
17
+ pD9: "\\d",
18
+ pX: ".",
19
+
20
+ pCR: "\r",
21
+ pLF: "\n",
22
+ pNL: "\n",
23
+ pCRLF: "\r\n",
24
+
25
+ pSPACE: "\s", # ?
26
+ pSPACES: "\s+",
27
+ pBLANK: "\s",
28
+ pBLANKS: "\s+",
29
+
30
+ pWB: "\\b",
31
+ pBOS: "^",
32
+ pEOS: "$",
33
+ pSTART: "\\A",
34
+ pEND: "\\Z"
35
+ }
36
+
37
+ # We need to reverse sort the keys so that longer keys are used before
38
+ # shorter keys (i.e., D0 is tried before D).
39
+ syms = Predef2Regex.keys.sort.reverse
40
+
41
+ syms.each do |sym|
42
+ # rule(:WB) { str('WB') }
43
+ rule(sym) { str(sym.to_s[1..-1]) } # strip leading "p"
44
+ end
45
+
46
+ # rule(:predef) { (pD | pD0 | ...).as(:predef) }
47
+ rule(:predef) {
48
+ syms.map { |s| self.send(s) }.reduce(&:|).as(:predef) }
49
+ end
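Two details of the table above are easy to miss, so here is a small plain-Ruby sketch of both. It uses `Regexp.union` as a stand-in for ordered choice in the parser (both try alternatives left to right), and every name in it is illustrative rather than part of the gem. The first half shows why the keys are reverse-sorted; the second shows how double-quoted strings treat backslash sequences, which is why regex escapes need a doubled backslash.

```ruby
NAMES = %w[D D0 D1 WB].freeze

# Alternatives are tried left to right, so "D" shadows "D0".
NAIVE  = Regexp.union(NAMES)
# Reverse lexical sort puts "D1" and "D0" ahead of their prefix "D".
SORTED = Regexp.union(NAMES.sort.reverse)

# Hypothetical helper: the first token the pattern finds in the input.
def first_token(rx, input)
  input[rx]
end

p first_token(NAIVE, "D0")
p first_token(SORTED, "D0")

# Double-quoted string escapes:
p "\\b".length   # backslash plus "b": two characters, a regex word boundary
p "\A" == "A"    # unrecognized escape: the backslash is dropped
p "\s" == " "    # "\s" is a literal space in a string, not the regex class
```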