regexador 0.4.5

@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: b16d02292d2ae9d888a09c04d2e3330c72be9836
4
+ data.tar.gz: d4b472b343ece2a984b30ceecae7968fefdb6c93
5
+ SHA512:
6
+ metadata.gz: 1f12498dd028fb55dcef7f927d81a1057af6441824c56d2c6fb8dfe9ee9c99e3c6fb7caa9c94069e848489b737e5d025196a9bf74f4fa844014a61509d361adb
7
+ data.tar.gz: 896ba315badba2f4bdfce8d731c3a3802111ce712ac1bbb768bb4fa8c0f54d1306eabbba896214ccd97fef74c9439f66abcd66a5c9612eb494959dc67d2f256f
@@ -0,0 +1,385 @@
1
+ **UPDATING for 2019**
2
+ - create gemspec
3
+ - update code for Ruby 2.6
4
+ - convert RSpec to MiniTest
5
+ - add more tests
6
+ - add more examples
7
+ - add a tutorial
8
+ - begin work on translating Ruby regexes
9
+ - investigate Python/Perl/Elixir compatibility
10
+ - investigate possibility of engine mockup with debugger
11
+
12
+
13
+ # regexador
14
+
15
+ An external DSL for Ruby that tries to make regular expressions readable and maintainable.
16
+
17
+ **PLEASE NOTE**: This README may not be as up-to-date
18
+ as [the wiki](http://github.com/Hal9000/regexador/wiki).
19
+
20
+ ### The Basic Concept
21
+
22
+ Many people are intimidated or confused by regular expressions.
23
+ A large part of this is the confusing syntax.
24
+
25
+ Regexador is a mini-language purely for building regular expressions.
26
+ It's a Ruby-only project for now, though in theory it could be
27
+ implemented in/for other languages.
28
+
29
+ For an analogy, think of how we sometimes manipulate databases by
30
+ constructing SQL queries and passing them into the appropriate
31
+ methods. Regexador works much the same way.
32
+
33
+ ### A Short Example
34
+
35
+ Suppose we want to match a string consisting of a single IP address.
36
+ (Remember that the numbers can only range as high as 255.)
37
+
38
+ Here is traditional regular expression notation:
39
+
40
+ /^(25[0-5]|2[0-4]\d|([01])?(\d){1,2})\.(25[0-5]|2[0-4]\d|([01])?(\d){1,2})\.(25[0-5]|2[0-4]\d|([01])?(\d){1,2})\.(25[0-5]|2[0-4]\d|([01])?(\d){1,2})$/
41
+
42
+ And here is Regexador notation:
43
+
44
+ dot = "."
45
+ num = "25" D5 | `2 D4 D | maybe D1 1,2*D
46
+ match BOS num dot num dot num dot num EOS end
47
+
48
+ In your Ruby code, you can create a Regexador "script" or "program"
49
+ (probably by means of a here-document) that you can then pass into
50
+ the Regexador class. At minimum, you can convert this into a "real"
51
+ Ruby regular expression; there are a few other features and functions,
52
+ and more may be added.
53
+
54
+ So here is a complete Ruby program:
55
+
56
+ require 'regexador'
57
+
58
+ program = <<-EOS
59
+ dot = "."
60
+ num = "25" D5 | `2 D4 D | maybe D1 1,2*D
61
+ match WB num dot num dot num dot num WB end
62
+ EOS
63
+
64
+ pattern = Regexador.new(program)
65
+
66
+ puts "Give me an IP address"
67
+ str = gets.chomp
68
+
69
+ rx = pattern.to_regex # Can retrieve the actual regex
70
+
71
+ if pattern.match?(str) # ...or use in other direct ways
72
+ puts "Valid"
73
+ else
74
+ puts "Invalid"
75
+ end
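For comparison, the same idea of naming a sub-pattern once and reusing it can be sketched in plain Ruby by interpolating one Regexp into another. This is only an illustration; the names `OCTET`, `IP_ADDRESS`, and `valid_ip?` are made up here, and the resulting regex is not necessarily what Regexador generates.

```ruby
# One octet: 250-255, 200-249, or one to three digits with an optional
# leading 0 or 1.
OCTET = /25[0-5]|2[0-4]\d|[01]?\d{1,2}/

# Interpolating a Regexp wraps it in a non-capturing group, so the
# alternation inside OCTET stays contained.
IP_ADDRESS = /\A#{OCTET}\.#{OCTET}\.#{OCTET}\.#{OCTET}\z/

# Hypothetical helper: true when the whole string is a dotted quad.
def valid_ip?(str)
  IP_ADDRESS.match?(str)
end

puts valid_ip?("192.168.0.1")
puts valid_ip?("256.1.1.1")
```

The repetition of `OCTET` is still spelled out four times, but the octet rule itself lives in one place.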
76
+
77
+
78
+
79
+ **Traditional Syntax: Things I Personally Dislike**
80
+
81
+ - There are no keywords -- only punctuation.
82
+ These symbols all have special meanings: ^$.\[]()+\*? (and others)
83
+ - ^ has at least three different meanings
84
+ - [ and ] each have two or three different meanings
85
+ - Parentheses aren't just for grouping, but for specifying captures
86
+ - Character literals are "naked"
87
+ - Excessive punctuation makes use of backslash common
88
+ - Repetition is strictly postfix form
89
+ - Typically (except for Ruby's /x): They're not multi-line, they don't allow comments, and whitespace is highly significant.
90
+ - There's no way to avoid duplication (e.g.) by assigning subexpressions to variables.
91
+ - And other things I'm forgetting
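The `/x` exception mentioned in the list can be seen in a small plain-Ruby sketch. The date pattern and the name `DATE` are invented for illustration; the point is only that extended mode ignores whitespace and allows comments, while captures still rely on punctuation.

```ruby
# Extended mode: layout whitespace is ignored and "#" starts a comment.
DATE = /
  \A
  (?<year>  \d{4} ) -   # four-digit year
  (?<month> \d{2} ) -   # two-digit month
  (?<day>   \d{2} )     # two-digit day
  \z
/x

m = DATE.match("2019-03-14")
puts m[:year]
puts m[:month]
puts m[:day]
```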
92
+
93
+
94
+ ### Regexador at a Glance
95
+
96
+ I'm attracted to old-fashioned line-oriented syntax; but I don't want
97
+ to lock myself into that completely.
98
+
99
+ In general, useful definitions (variables) will come first. Many things
100
+ are predefined already, such as all the usual anchors and the POSIX
101
+ character classes. These are in all caps and are considered constants.
102
+
103
+ At the end, a *match* clause drives the actual building of the final
104
+ regular expression. Within this clause, names may be assigned to the
105
+ individual sub-matches (using variables that start with "@"). These will
106
+ naturally be available externally as named captures.
107
+
108
+ Because this is really just a "builder," and because we don't have "hooks"
109
+ into the regular expression engine itself, a Regexador script will not
110
+ look or act much like a "real program." There will be no arithmetic, no
111
+ function calls, no looping or branching. Also there can be no printing
112
+ of debug information "at matching time"; in principle, printing could be
113
+ done during parsing/compilation, but I don't see any value in this.
114
+
115
+ Of course, syntax errors in Regexador will be found and made available
116
+ to the caller.
117
+
118
+
119
+ **Beginning at the Beginning**
120
+
121
+ I've tried to "think ahead" so as not to paint myself into a corner
122
+ too much.
123
+
124
+ However, probably not all of this can be implemented in the first
125
+ version. The current "working version" (0.2.7) has been implemented
126
+ over a period of nine weeks.
127
+
128
+ Therefore some of the syntax described in the following will not be
129
+ available right away.
130
+
131
+ Features still postponed:
132
+ - intra-line comments: #{...}
133
+ - case/end
134
+ - unsure about upto, thru
135
+ - unsure about next, last
136
+ - pos/neg lookahead/behind
137
+
138
+
139
+ **Syntax notes:**
140
+
141
+ "abc"         A char string               /abc/
142
+ `a            A single character          /a/
143
+ &2345         Unicode char U+2345
144
+ ~`a           Negated char class          /[^a]/
145
+ 'abc'         One of class a, b, c        /[abc]/
146
+ `a-`z         Char range                  /[a-z]/
147
+ `a~`z         Negated char range          /[^a-z]/
148
+ p1 | p2       Alternative
149
+ maybe PAT     Optional pattern            PAT?
150
+ any PAT       Zero or more of pattern     PAT*
151
+ many PAT      One or more of pattern      PAT+
152
+ nocase PAT    Case-insensitive PAT        (?i)PAT
153
+ 0,1 * PAT     Same as maybe               PAT?
154
+ 1,3 * PAT     One to three of PAT         PAT{1,3}
155
+ 5 * PAT       Five of PAT                 PAT{5}
156
+ @var          A named capture             \g<var>{0}
157
+ :var          A parameter passed in
158
+ %alpha        POSIX or Ruby char class    [[:alpha:]]
159
+ var = val     Assign value to local var
160
+ match         Start assembling the regex
161
+ # ...         Comment
162
+ D             Digit                       /[0-9]/
163
+ D1, D2, ...   0 through whatever          /[0-1]/ /[0-2]/ ...
164
+ X             Any character               /./
165
+ WB            Word boundary               /\b/
166
+ CR            Carriage return "\r"        /\r/
167
+ LF            Linefeed "\n"               /\n/
168
+ NL            Newline "\n"                /\n/
169
+ START         Start of the string         /\A/
170
+ END           End of the string           /\Z/
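A few of the regex equivalents listed in the right-hand column can be checked directly in plain Ruby. The sketch below does not use Regexador at all; it only confirms what the traditional notation means, using made-up sample strings.

```ruby
# Each entry: [regex from the right-hand column, a string that should match,
#              a string that should not match]
samples = [
  [/\A[^a]\z/,        "b",    "a"],         # ~`a
  [/\A[abc]\z/,       "c",    "d"],         # 'abc'
  [/\A[^a-z]\z/,      "Q",    "q"],         # `a~`z
  [/\A(?:xy){1,3}\z/, "xyxy", "xyxyxyxy"],  # 1,3 * "xy"
  [/\A[[:alpha:]]\z/, "z",    "7"],         # %alpha
  [/\A[0-4]\z/,       "4",    "5"],         # D4
]

# A row passes when the good sample matches and the bad one does not.
results = samples.map do |rx, good, bad|
  rx.match?(good) && !rx.match?(bad)
end

puts results.all?
```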
171
+
172
+
173
+ "On hold" for now...
174
+
175
+ upto `a All non-a chars until a /([^a]\*?a)/
176
+ thru `a All chars including next a /(.\*?a)/
177
+ last PAT Greedy (.\*)PAT
178
+ next PAT Non-greedy (default) (.\*?)PAT
179
+ #{...} Inline comment
180
+ case/when/end Complex alternatives
181
+
182
+
183
+ ### Notes, precedence, etc.
184
+
185
+ any, many, maybe, nocase ... These refer to the very next pattern (but parentheses are legal):
186
+
187
+ maybe "abc" many "xyz" /(abc)?(xyz)+/
188
+ maybe many "def" /((def)+)?/
189
+ maybe ("abc" many "xyz") /(abc(xyz)+)?/
190
+ "abc" nocase "def" "ghi" /abc((?i)def)ghi/
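The scoping of these qualifiers can be checked against plain Ruby regexes. This is a sketch of the traditional-notation side only, reading `maybe many "def"` as an optional run of "def"; the variable names are invented.

```ruby
# nocase applies only to the pattern that follows it: the inline (?i)
# option lasts until the end of its enclosing group.
nocase_mid = /\Aabc((?i)def)ghi\z/
puts nocase_mid.match?("abcDEFghi")
puts nocase_mid.match?("ABCdefghi")

# maybe many "def": zero or more whole repetitions, as one optional group.
optional_run = /\A((def)+)?\z/
puts optional_run.match?("")
puts optional_run.match?("defdef")
```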
191
+
192
+ String concatenation is implied:
193
+
194
+ str = "abc" NL "def" /abc\ndef/
195
+
196
+ Strings don't interpolate and the backslash is not special (unsure?):
197
+
198
+ str = "lm\nop" /lm\\nop/
199
+
200
+ A character literal is essentially the same as a one-character string.
201
+
202
+ c1 = `$ /\$/
203
+ s1 = "$" /\$/
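Ruby's own `Regexp.escape` shows the same kind of translation from literal text to regex source. Whether Regexador uses `Regexp.escape` internally is an assumption not confirmed by this README; the sketch only illustrates the expected results.

```ruby
# Single-quoted Ruby string: backslash followed by n stays as two characters.
literal = 'lm\nop'
escaped = Regexp.escape(literal)
puts escaped

rx = Regexp.new(escaped)
puts rx.match?('lm\nop')   # text containing a literal backslash and n
puts rx.match?("lm\nop")   # text containing a real newline

# Metacharacters are escaped so they match themselves.
puts Regexp.escape("$")
```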
204
+
205
+ However, a character can be negated, while a string (at present) cannot.
206
+
207
+ n1 = ~`$ /[^$]/
208
+
209
+ It is possible to use the "ampersand" notation (with four hex digits)
210
+ to specify a Unicode codepoint explicitly.
211
+
212
+ &20ac /€/
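In plain Ruby, the same four hex digits map to a character as follows. This is a sketch of the underlying conversion, not of Regexador's implementation.

```ruby
codepoint = "20ac".hex            # the hex digits as an integer
euro = [codepoint].pack("U")      # UTF-8 string holding that codepoint

puts euro
puts euro == "\u20ac"
puts(/\u20ac/.match?("price: 5\u20ac"))
```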
213
+
214
+ The encoding is assumed to be UTF-8. Characters used as literals are limited
215
+ only by the editor and the current Ruby encoding.
216
+
217
+ str = "æßçöñ" /æßçöñ/
218
+
219
+ Tokens such as any, many, match, etc. are keywords, and as such cannot be used as local variable names.
220
+
221
+ However, parameters (starting with colon) and named matches (starting with @) can be named @any, :many, and so on.
222
+
223
+ Capitalized predefined matches such as WB (word boundary) are really keywords also.
224
+
225
+ Alternation binds very loosely:
226
+
227
+ many "abc" | "xyz" /(abc)+|xyz/
228
+ (many "abc") | "xyz" /(abc)+|xyz/ # Same as above
229
+ many ("abc" | "xyz") /(abc|xyz)+/ # Different!
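The difference between the two groupings shows up on a mixed input. The sketch uses plain Ruby regexes anchored to the whole string; the names `loose` and `tight` are invented.

```ruby
loose = /\A(?:(abc)+|xyz)\z/   # many "abc" | "xyz"
tight = /\A(abc|xyz)+\z/       # many ("abc" | "xyz")

# "abcxyz" mixes both alternatives, so only the grouped form accepts it.
["abcabc", "xyz", "abcxyz"].each do |s|
  puts "#{s}: loose=#{loose.match?(s)} tight=#{tight.match?(s)}"
end
```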
230
+
231
+ A variable may refer to a string, a number, or a pattern:
232
+
233
+ var1 = 3
234
+ var2 = "abc" # Really a string is a pattern too
235
+ var3 = maybe many D
236
+
237
+ There is no arithmetic, but variables may be used where numbers may:
238
+
239
+ m = 3
240
+ n = 5
241
+ m,n * "xyz" /(xyz){3,5}/
242
+
243
+ Parameters may be used the same way:
244
+
245
+ # Assuming params :m, :n are 2 and 4
246
+ :m,:n * "xyz" /(xyz){2,4}/
247
+
248
+ But data type matters, of course:
249
+
250
+ m = 3
251
+ n = "foo"
252
+ m,n * "def" # Syntax error!
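A rough plain-Ruby analogue of counted repetition with a type check might look like the following. The helper `repeat` is hypothetical and is not part of Regexador; it only illustrates why a non-numeric count has to be rejected before the regex is built.

```ruby
# Hypothetical helper: build "(pattern){min,max}" from integer counts.
def repeat(pattern, min, max)
  unless min.is_a?(Integer) && max.is_a?(Integer)
    raise TypeError, "repetition counts must be integers"
  end
  Regexp.new("(#{Regexp.escape(pattern)}){#{min},#{max}}")
end

rx = repeat("xyz", 3, 5)
puts rx.source

# A string where a number is expected is refused up front.
begin
  repeat("def", 3, "foo")
rescue TypeError => e
  puts "rejected: #{e.message}"
end
```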
253
+
254
+ The "match clause" uses all previous definitions to finally build the regular expression. It starts with "match" and ends with "end":
255
+
256
+ match "abc" | "def" | many `x end
257
+
258
+ Named matches are only used inside the match clause; anywhere a pattern may be used, "@var = pattern" may also be used.
259
+
260
+ match @first = (many %alpha) SPACES @last = (many %alpha) end
261
+
262
+ Multiple lines are fine (and more readable):
263
+
264
+ match
265
+ @first = many %alpha
266
+ SPACES
267
+ @last = many %alpha
268
+ end
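On the Ruby side, named captures are read from the `MatchData` by name. The regex below is a guess at a traditional-notation equivalent of the clause above (with `\s+` standing in for SPACES); what the gem actually generates may differ.

```ruby
name = /(?<first>[[:alpha:]]+)\s+(?<last>[[:alpha:]]+)/

m = name.match("Ada Lovelace")
puts m[:first]
puts m[:last]
puts m.names.inspect   # capture names in order of appearance
```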
269
+
270
+ A "case" may be used for more complex alternatives (needed??):
271
+
272
+ case
273
+ when "abc" ...
274
+ when "def" ...
275
+ when "xyz" ...
276
+ end
277
+
278
+ Multiple "programs" can be concatenated, assuming the initial ones are all definitions and there is only one match clause at the end.
279
+
280
+ # Ruby code
281
+ defs = "..."
282
+ prog = "..."
283
+ matcher = Regexador.new(defs + prog)
284
+
285
+ Pass in parameters this way:
286
+
287
+ # Ruby code
288
+ prog = "..."
289
+ matcher = Regexador.new(prog, this: 3, that: "foo")
290
+
291
+ Possibly invoke "on its own" (compile to regex internally) or explicitly compile?
292
+
293
+ result = matcher.match(str)
294
+ if result.ok?
295
+ alpha, beta = result[:alpha, :beta] # Captured matches
296
+ end
297
+
298
+ # Alternatively:
299
+ rx = matcher.regexp # Return a Ruby regex, use however
300
+
301
+ ### Examples
302
+
303
+ Match a signed float /[-+]?[0-9]+\.[0-9]+([Ee][-+]?[0-9]+)?/
304
+
305
+ sign = '+-'
306
+ digits = many D
307
+ match
308
+ @sign = maybe sign
309
+ @left = digits
310
+ `.
311
+ @right = digits
312
+ maybe ('Ee' @exp=(maybe sign digits))
313
+ end
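As a cross-check, here is a hand-written Ruby regex with the same four named captures, assuming the exponent may carry its own sign as the match clause suggests. The constant name `FLOAT` is invented, and this is not output produced by Regexador.

```ruby
FLOAT = /\A(?<sign>[-+])?(?<left>\d+)\.(?<right>\d+)(?:[Ee](?<exp>[-+]?\d+))?\z/

m = FLOAT.match("-12.5e+3")
puts m[:sign], m[:left], m[:right], m[:exp]

# An absent optional capture comes back as nil.
puts FLOAT.match("3.14")[:exp].inspect
```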
314
+
315
+ Match balanced HTML tags and capture cdata /\<TAG\b[^\>]\*\>(.\*?)\<\/TAG\>/
316
+
317
+ # Note that :tag is a parameter, so for example,
318
+ # TABLE or BODY might be passed in
319
+ match
320
+ `< :tag WB
321
+ @cdata = (upto `>)
322
+ "</" :tag `>
323
+ end
324
+
325
+
326
+ Match IP address (honoring 255 limit) Regex: /\b(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})\b/
327
+
328
+ dot = "."
329
+ num = "25" D5 | `2 D4 D | maybe D1 1,2*D
330
+ match WB num dot num dot num dot num WB end
331
+
332
+ Determine whether a credit card number is valid Regex: /^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$/
333
+
334
+ # Warning: This one likely has errors!
335
+ # Assuming no spaces
336
+
337
+ # Visa: ^4[0-9]{12}(?:[0-9]{3})?$
338
+ # All Visa card numbers start with a 4. New cards have 16 digits. Old cards have 13.
339
+ # MasterCard: ^5[1-5][0-9]{14}$
340
+ # All MasterCard numbers start with the numbers 51 through 55. All have 16 digits.
341
+ # American Express: ^3[47][0-9]{13}$
342
+ # American Express card numbers start with 34 or 37 and have 15 digits.
343
+ # Diners Club: ^3(?:0[0-5]|[68][0-9])[0-9]{11}$
344
+ # Diners Club card numbers begin with 300 through 305, 36 or 38. All have 14 digits.
345
+ # There are Diners Club cards that begin with 5 and have 16 digits. These are a
346
+ # joint venture between Diners Club and MasterCard, and should be processed like
347
+ # a MasterCard.
348
+ # Discover: ^6(?:011|5[0-9]{2})[0-9]{12}$
349
+ # Discover card numbers begin with 6011 or 65. All have 16 digits.
350
+ # JCB: ^(?:2131|1800|35\d{3})\d{11}$
351
+ # JCB cards beginning with 2131 or 1800 have 15 digits. JCB cards beginning with 35 have 16 digits.
352
+
353
+ visa = `4 12*D maybe 3*D
354
+ mc = `5 `1-`5 14*D
355
+ amex = `3 '47' 13*D
356
+ diners = `3 (`0 D5 | '68' D) 11*D
357
+ discover = `6 ("011" | `5 2*D) 12*D
358
+ jcb = ("2131"|"1800"|"35" 3*D) 11*D
359
+
360
+ match visa | mc | amex | diners | discover | jcb end
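The per-issuer regexes quoted in the comments can be combined in plain Ruby with `Regexp.union`, which mirrors the final match clause. This sketch checks the number format only (no checksum), and the variable and constant names are invented.

```ruby
visa     = /4\d{12}(?:\d{3})?/
mc       = /5[1-5]\d{14}/
amex     = /3[47]\d{13}/
diners   = /3(?:0[0-5]|[68]\d)\d{11}/
discover = /6(?:011|5\d{2})\d{12}/
jcb      = /(?:2131|1800|35\d{3})\d{11}/

# Union builds one alternation; the outer group keeps the anchors
# applying to every alternative.
any_card = Regexp.union(visa, mc, amex, diners, discover, jcb)
CARD = /\A(?:#{any_card})\z/

puts CARD.match?("4111111111111111")
puts CARD.match?("1234567890123456")
```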
361
+
362
+ ### Open Questions
363
+
364
+ 1. What about pos/neg lookahead/lookbehind, possessive matches? Laziness??
365
+ 2. Do upto and thru really make sense?
366
+ 3. Do next and last really make sense?
367
+ 4. How to handle /m? /o?
368
+ 5. What special symbols/anchors do we need to predefine?
369
+ 6. Possibly allow postfix repetition as well as prefix? (e.g.: pattern \* 1,3)
370
+ 7. Other issues...
371
+
372
+ ### Update history
373
+
374
+ This history has been maintained only since version 0.4.2
375
+
376
+ *0.4.3*
377
+ - Experimenting with lookarounds (pos/neg lookahead/behind)
378
+ - Rearranged tests
379
+ - Added "escaping" keyword
380
+ *0.4.2*
381
+ - UTF-8 encoding is assumed
382
+ - &xxxx notation can specify an arbitrary Unicode codepoint
383
+ - Backreferences work as expected
384
+ - Backreferences now can be inlined and parenthesized
385
+ - The nocase qualifier permits case-insensitive sub-expressions
@@ -0,0 +1,22 @@
1
+ abort "Require out of order" if ! defined? Regexador
2
+
3
+ class Regexador::Parser
4
+ rule(:cSQUOTE) { str("'") }
5
+ rule(:cQUOTE) { str('"') }
6
+ rule(:cAMPERSAND) { str('&') }
7
+ rule(:cTICK) { str('`') }
8
+ rule(:cBAR) { str('|') }
9
+ rule(:cPERCENT) { str('%') }
10
+ rule(:cCOMMA) { str(',') }
11
+ rule(:cHYPHEN) { str('-') }
12
+ rule(:cTILDE) { str('~') }
13
+ rule(:cUNDERSCORE) { str('_') }
14
+ rule(:cEQUAL) { str('=') }
15
+ rule(:cHASH) { str('#') }
16
+ rule(:cTIMES) { str('*') }
17
+ rule(:cAT) { str("@") }
18
+ rule(:cCOLON) { str(":") }
19
+ rule(:cLPAREN) { str('(') }
20
+ rule(:cRPAREN) { str(')') }
21
+ rule(:cNEWLINE) { str("\n") }
22
+ end
@@ -0,0 +1,22 @@
1
+
2
+ abort "Require out of order" if ! defined? Regexador
3
+
4
+ class Regexador::Parser
5
+
6
+ rule(:kANY) { str("any") }
7
+ rule(:kMANY) { str("many") }
8
+ rule(:kMAYBE) { str("maybe") }
9
+ rule(:kMATCH) { str("match") }
10
+ rule(:kEND) { str("end") }
11
+ rule(:kNOCASE) { str("nocase") }
12
+
13
+ rule(:kWITH) { str("with") }
14
+ rule(:kWITHOUT) { str("without") }
15
+ rule(:kFIND) { str("find") }
16
+
17
+ rule(:kWITHIN) { str("within") }
18
+ rule(:kESCAPING) { str("escaping") }
19
+
20
+ rule(:keyword) { kANY | kMANY | kMAYBE | kMATCH | kEND | kNOCASE |
21
+ kWITH | kWITHOUT | kFIND | kWITHIN | kESCAPING }
22
+ end
@@ -0,0 +1,49 @@
1
+
2
+ abort "Require out of order" if ! defined? Regexador
3
+
4
+ class Regexador::Parser
5
+
6
+ Predef2Regex = {
7
+ pD: "\\d",
8
+ pD0: "0",
9
+ pD1: "[01]",
10
+ pD2: "[0-2]",
11
+ pD3: "[0-3]",
12
+ pD4: "[0-4]",
13
+ pD5: "[0-5]",
14
+ pD6: "[0-6]",
15
+ pD7: "[0-7]",
16
+ pD8: "[0-8]",
17
+ pD9: "\\d",
18
+ pX: ".",
19
+
20
+ pCR: "\r",
21
+ pLF: "\n",
22
+ pNL: "\n",
23
+ pCRLF: "\r\n",
24
+
25
+ pSPACE: "\s", # ?
26
+ pSPACES: "\s+",
27
+ pBLANK: "\s",
28
+ pBLANKS: "\s+",
29
+
30
+ pWB: "\\b",
31
+ pBOS: "^",
32
+ pEOS: "$",
33
+ pSTART: "\\A",
34
+ pEND: "\\Z"
35
+ }
36
+
37
+ # We need to reverse sort the keys so that longer keys are used before
38
+ # shorter keys. (ie D0 vs. D)
39
+ syms = Predef2Regex.keys.sort.reverse
40
+
41
+ syms.each do |sym|
42
+ # rule(:WB) { str('WB') }
43
+ rule(sym) { str(sym.to_s[1..-1]) } # strip leading "p"
44
+ end
45
+
46
+ # rule(:predef) { (pD | pD0 | ...).as(:predef) }
47
+ rule(:predef) {
48
+ syms.map { |s| self.send(s) }.reduce(&:|).as(:predef) }
49
+ end
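Two details of this file can be checked in isolation with plain Ruby: the effect of the reverse sort on key order, and how double-quoted strings treat backslash escapes. The helper `first_match` is hypothetical and only imitates ordered choice; it is not how Parslet is invoked.

```ruby
# Reverse sort puts "pD0" ahead of its prefix "pD", and "pCRLF" ahead of "pCR".
keys = %i[pD pD0 pD1 pCR pCRLF]
ordered = keys.sort.reverse
p ordered

# Hypothetical stand-in for ordered choice: the first name that matches
# at the start of the input wins.
def first_match(names, input)
  names.map { |k| k.to_s[1..-1] }.find { |name| input.start_with?(name) }
end

puts first_match(ordered, "D0 rest")   # tries "D0" before "D"
puts first_match(keys, "D0 rest")      # unsorted: "D" wins too early

# In double-quoted strings only a doubled backslash survives as a backslash.
p "\\d".length
p "\s" == " "
p "\A" == "A"
```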