list_matcher 1.0.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: a4f9013c830b1d516bbf895732abce4ee0fd327f
4
+ data.tar.gz: a6f49385a4dc17ffe42fe706bfce7e2cef2e4135
5
+ SHA512:
6
+ metadata.gz: d97d5937836d59422ec61c8f4daa1e859ea8b62aa8598f74d875b7dbde4426a6d24368cd94ed0936baedb5dde357cfc10492ab173fe6f7e0b67748856aabf7f8
7
+ data.tar.gz: 343cbde71b7cb693b4dc67965a5334a5cdc05ebcf34d103bfd0be21633abd80aa6950c6718360d0a05fbea7a55ff3e553edcfdaa7d06f0777773a9e623a162a6
data/.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ *.bundle
11
+ *.so
12
+ *.o
13
+ *.a
14
+ mkmf.log
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in list_matcher.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2015 dfhoughton
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,444 @@
1
+ # ListMatcher
2
+
3
+ For creating compact, non-backtracking regular expressions from a list of strings.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'list_matcher'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install list_matcher
20
+
21
+ ## Usage
22
+
23
+ ```ruby
24
+ require 'list_matcher'
25
+
26
+ m = List::Matcher.new
27
+ puts m.pattern %w( cat dog ) # (?:cat|dog)
28
+ puts m.pattern %w( cat rat ) # (?:[cr]at)
29
+ puts m.pattern %w( cat camel ) # (?:ca(?:mel|t))
30
+ puts m.pattern %w( cat flat sprat ) # (?:(?:c|fl|spr)at)
31
+ puts m.pattern %w( catttttttttt ) # (?:cat{10})
32
+ puts m.pattern %w( cat-t-t-t-t-t-t-t-t-t ) # (?:ca(?:t-){9}t)
33
+ puts m.pattern %w( catttttttttt batttttttttt ) # (?:[bc]at{10})
34
+ puts m.pattern %w( cad bad dad ) # (?:[b-d]ad)
35
+ puts m.pattern %w( cat catalog ) # (?:cat(?:alog)?+)
36
+ puts m.pattern (1..31).to_a # (?:[4-9]|1\d?+|2\d?+|3[01]?+)
37
+ ```
38
+
39
+ ## Description
40
+
41
+ `List::Matcher` facilitates generating efficient regexen programmatically. This is useful, for example, when looking for
42
+ occurrences of particular words or phrases in free-form text. `List::Matcher` will automatically generate regular expressions
43
+ that minimize backtracking, so they tend to be as fast as one could hope a regular expression to be. (The general strategy is
44
+ to represent the items in the list as a trie.)
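The trie strategy mentioned above can be illustrated with a minimal sketch in plain Ruby. This is not the gem's actual implementation: `build_trie`, `trie_to_pattern`, and `trie_pattern` are hypothetical helper names, and the sketch leaves out character classes, repetition compression, and possessive quantifiers.

```ruby
# Hypothetical sketch: share common prefixes by walking a trie.
# Not List::Matcher's real code.
END_MARK = :end # marks a node where some word terminates

def build_trie(words)
  root = {}
  words.each do |word|
    node = root
    word.each_char { |char| node = (node[char] ||= {}) }
    node[END_MARK] = true
  end
  root
end

def trie_to_pattern(node)
  branches = node.reject { |key, _| key == END_MARK }
                 .sort
                 .map { |char, child| Regexp.escape(char) + trie_to_pattern(child) }
  return '' if branches.empty?

  optional = node.key?(END_MARK) # a word ends here, so the rest is optional
  if branches.size == 1 && !optional
    branches.first
  else
    "(?:#{branches.join('|')})#{'?' if optional}"
  end
end

def trie_pattern(words)
  "(?:#{trie_to_pattern(build_trie(words))})"
end

puts trie_pattern(%w[cat camel])   # (?:ca(?:mel|t))
puts trie_pattern(%w[cat catalog]) # (?:cat(?:alog)?)
```

Because each character position offers at most one branch per leading character, a regex engine walking such a pattern rarely has to retry an alternative it has already rejected.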
45
+
46
+ `List::Matcher` has many options and the initialization of a matcher for pattern generation is somewhat complex, so various methods
47
+ are provided to minimize initializations and the number of times you specify options. For one-off patterns, you may as well call
48
+ class methods: either `pattern`, which generates a string, or `rx`, which returns a `Regexp` object:
49
+
50
+ ```ruby
51
+ List::Matcher.pattern %w( cat dog ) # "(?:cat|dog)"
52
+ List::Matcher.rx %w( cat dog ) # /(?:cat|dog)/
53
+ ```
54
+
55
+ If you plan to generate multiple regexen, or have complicated options which you always use, you should generate a configured
56
+ instance first:
57
+
58
+ ```ruby
59
+ m = List::Matcher.new normalize_whitespace: true, bound: true, case_insensitive: true, multiline: true, atomic: false, symbols: { num: '\d++' }
60
+ m.pattern method_that_gets_a_long_list
61
+ m.rx method_that_gets_a_long_list
62
+ ...
63
+ ```
64
+
65
+ If you have a basic set of options and you need to modify these in particular cases, you can:
66
+
67
+ ```ruby
68
+ m.pattern list, case_insensitive: false
69
+ ```
70
+
71
+ You can also generate a prototype list matcher with a particular variation and bud off children with their own properties:
72
+
73
+ ```ruby
74
+ m = List::Matcher.new normalize_whitespace: true, bound: true, case_insensitive: true, multiline: true, atomic: false, symbols: { num: '\d++' }
75
+ m2 = m.bud case_insensitive: false
76
+ ```
77
+
78
+ Basically, you can mix in options in whatever way suits you. Constructing configured instances gives you a tiny bit of efficiency, but
79
+ mostly it saves you from specifying these options in multiple places.
80
+
81
+ ## Options
82
+
83
+ The options one can provide to `new`, `bud`, `pattern`, or `rx` are all the same. These are:
84
+
85
+ ### atomic
86
+
87
+ ```ruby
88
+ default: true
89
+ ```
90
+
91
+ If true, the returned expression is always wrapped in some grouping expression -- `(?:...)`, `(?>...)`, `(?i:...)`, etc.; whatever
92
+ is appropriate given the other options and defaults -- so it can receive a quantification suffix.
93
+
94
+ ```ruby
95
+ List::Matcher.pattern %w(cat dog), atomic: false # "cat|dog"
96
+ List::Matcher.pattern %w(cat dog), atomic: true # "(?:cat|dog)"
97
+ ```
98
+
99
+ ### backtracking
100
+
101
+ ```ruby
102
+ default: true
103
+ ```
104
+
105
+ If true, the default non-capturing grouping expression is `(?:...)` rather than `(?>...)`, and the optional quantifier is
106
+ `?` rather than `?+`.
107
+
108
+ ```ruby
109
+ List::Matcher.pattern %w( cat dog ) # "(?:cat|dog)"
110
+ List::Matcher.pattern %w( cat dog ), backtracking: false # "(?>cat|dog)"
111
+ ```
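The practical difference between `(?:...)` and `(?>...)` can be seen with hand-written regexes in plain Ruby. The patterns below are illustrative assumptions, not output captured from the gem: an atomic group commits to the first alternative that matches, so it is only safe when the alternatives have been factored so that none is a prefix of another.

```ruby
# Hand-written patterns for illustration; no gem required.
BACKTRACKING = /\A(?:cat|catalog)s\z/      # may retry the second alternative
ATOMIC_NAIVE = /\A(?>cat|catalog)s\z/      # commits to "cat" and cannot retry
ATOMIC_TRIE  = /\A(?>cat(?:alog)?+)s\z/    # prefix factored out, possessive "?+"

puts BACKTRACKING.match?('catalogs') # true
puts ATOMIC_NAIVE.match?('catalogs') # false
puts ATOMIC_TRIE.match?('catalogs')  # true
```

All three patterns accept `cats`; only the naive atomic version loses `catalogs`.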
112
+
113
+ ### bound
114
+
115
+ ```ruby
116
+ default: false
117
+ ```
118
+
119
+ Whether boundary expressions should be attached to the margins of every expression in the list. If this value is simply true, this means
120
+ each item's marginal characters, the first and the last, are tested to see whether they are word characters and if so the word
121
+ boundary symbol, `\b`, is appended to them where appropriate. There are several variants on this, however:
122
+
123
+ ```ruby
124
+ bound: :word
125
+ ```
126
+
127
+ This is the same as `bound: true`.
128
+
129
+ ```ruby
130
+ List::Matcher.pattern %w(cat), bound: :word # "(?:\\bcat\\b)"
131
+ List::Matcher.pattern %w(cat), bound: true # "(?:\\bcat\\b)"
132
+ ```
133
+
134
+ ```ruby
135
+ bound: :line
136
+ ```
137
+
138
+ Each item should take up an entire line, so the boundary symbols are `^` and `$`.
139
+
140
+ ```ruby
141
+ List::Matcher.pattern %w(cat), bound: :line # "(?:^cat$)"
142
+ ```
143
+
144
+ ```ruby
145
+ bound: :string
146
+ ```
147
+
148
+ Each item should match the entire string compared against, so the boundary symbols are `\A` and `\z`.
149
+
150
+ ```ruby
151
+ List::Matcher.pattern %w(cat), bound: :string # "(?:\\Acat\\z)"
152
+ ```
153
+
154
+ ```ruby
155
+ bound: { test: /\d/, left: '(?<!\d)', right: '(?!\d)'}
156
+ ```
157
+
158
+ If you have an ad hoc boundary definition -- here it is a digit/non-digit boundary -- you may specify it so. The test parameter
159
+ identifies marginal characters that require the boundary tests and the `:left` and `:right` symbols identify the boundary conditions.
160
+
161
+ ```ruby
162
+ List::Matcher.pattern (1...1000).to_a, bound: { test: /\d/, left: '(?<!\d)', right: '(?!\d)'}
163
+ # "(?:(?<!\\d)[1-9](?:\\d\\d?)?(?!\\d))"
164
+ ```
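To see what the digit boundary buys you, the pattern shown above can be tried directly in plain Ruby. `DIGIT_BOUND` and `bounded_numbers` are hypothetical names used only for this sketch.

```ruby
# The pattern is copied from the example above; the helper is illustrative.
DIGIT_BOUND = /(?:(?<!\d)[1-9](?:\d\d?)?(?!\d))/

def bounded_numbers(text)
  text.scan(DIGIT_BOUND)
end

p bounded_numbers('room 42, code 12345, item 7') # ["42", "7"]
```

The lookarounds reject `12345` entirely, because no one-to-three-digit slice of it is free of neighboring digits.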
165
+
166
+ ### strip
167
+
168
+ ```ruby
169
+ default: false
170
+ ```
171
+
172
+ Strip whitespace off the margins of items in the list.
173
+
174
+ ```ruby
175
+ List::Matcher.pattern ['     cat     '] # "(?:(?:\\ ){5}cat(?:\\ ){5})"
176
+ List::Matcher.pattern ['     cat     '], strip: true # "(?:cat)"
177
+ ```
178
+
179
+ ### case_insensitive
180
+
181
+ ```ruby
182
+ default: false
183
+ ```
184
+
185
+ Generate a case-insensitive regular expression.
186
+
187
+ ```ruby
188
+ List::Matcher.pattern %w( Cat cat CAT ) # "(?:C(?:AT|at)|cat)"
189
+ List::Matcher.pattern %w( Cat cat CAT ), case_insensitive: true # "(?i:cat)"
190
+ ```
191
+
192
+ ### multiline
193
+
194
+ ```ruby
195
+ default: false
196
+ ```
197
+
198
+ Generate a multi-line regex.
199
+
200
+ ```ruby
201
+ List::Matcher.pattern %w(cat), multiline: true # "(?m:cat)"
202
+ ```
203
+
204
+ The special feature of a multi-line regular expression is that `.` can grab newline characters. Because `List::Matcher`
205
+ never produces `.` on its own, this option is only useful in conjunction with the `symbols` option, which lets one
206
+ inject snippets of regex into the one generated.
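The effect of the multi-line flag on `.` can be checked with two hand-written regexes in plain Ruby (these are illustrative patterns, not gem output):

```ruby
SINGLE_LINE = /cat.dog/       # "." does not match a newline
MULTI_LINE  = /(?m:cat.dog)/  # "." matches any character, newline included

puts SINGLE_LINE.match?("cat\ndog") # false
puts MULTI_LINE.match?("cat\ndog")  # true
```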
207
+
208
+ ### normalize_whitespace
209
+
210
+ ```ruby
211
+ default: false
212
+ ```
213
+
214
+ This strips whitespace from items in the list and treats all internal whitespace as equivalent.
215
+
216
+ ```ruby
217
+ List::Matcher.pattern [ ' cat  walker ', '  dog walker', 'camel  walker' ]
218
+ # "(?:\\ (?:\\ dog\\ walker|cat\\ \\ walker\\ )|camel\\ \\ walker)"
219
+ List::Matcher.pattern [ ' cat  walker ', '  dog walker', 'camel  walker' ], normalize_whitespace: true
220
+ # "(?:(?:ca(?:mel|t)|dog)\\s++walker)"
221
+ ```
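The idea behind whitespace normalization can be sketched in plain Ruby. `normalize_item` is a hypothetical helper, and this is only an assumption about the general approach, not the gem's code: strip the margins, then replace every internal run of whitespace with `\s++`.

```ruby
# Illustrative only: trim the item and treat any internal whitespace run alike.
def normalize_item(item)
  item.strip.split(/\s+/).map { |word| Regexp.escape(word) }.join('\s++')
end

fragment = normalize_item(" cat   walker ")
puts fragment                                             # cat\s++walker
puts Regexp.new(fragment).match?("cat \t walker")         # true
```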
222
+
223
+ ### symbols
224
+
225
+ You can tell `List::Matcher` that certain character sequences should be regarded as "symbols". It will then leave
226
+ these unmolested, replacing them in the generated regex with whatever you map the symbol sequences to. The keys in
227
+ the symbol hash are expected to be strings, symbols, or `Regexps`. Symbol keys are converted to their
228
+ sequence by stringification. `Regexp` keys convert any sequence they match.
229
+
230
+ ```ruby
231
+ List::Matcher.pattern [ 'Catch 22', '1984', 'Fahrenheit 451' ], symbols: { /\d+/ => '\d++' }
232
+ # "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
233
+ List::Matcher.pattern [ 'Catch foo', 'foo', 'Fahrenheit foo' ], symbols: { 'foo' => '\d++' }
234
+ # "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
235
+ List::Matcher.pattern [ 'Catch foo', 'foo', 'Fahrenheit foo' ], symbols: { foo: '\d++' }
236
+ # "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
237
+ ```
238
+
239
+ Because it is possible for symbol sequences to overlap, sequences with string or symbol keys are evaluated before `Regexps`, and longer keys are
240
+ evaluated before shorter ones.
241
+
242
+ ### name
243
+
244
+ If you assign your pattern a name, it will be constructed with a named group such that you can extract
245
+ the substring matched.
246
+
247
+ ```ruby
248
+ List::Matcher.pattern %w(cat), name: :cat # "(?<cat>cat)"
249
+ ```
250
+
251
+ This is mostly useful if you are using `List::Matcher` to compose complex regexen incrementally. E.g., from the examples directory,
252
+
253
+ ```ruby
254
+ require 'list_matcher'
255
+
256
+ m = List::Matcher.new atomic: false, bound: true
257
+
258
+ year = m.pattern( (1901..2000).to_a, name: :year )
259
+ mday = m.pattern( (1..31).to_a, name: :mday )
260
+ weekdays = %w( Monday Tuesday Wednesday Thursday Friday Saturday Sunday )
261
+ weekdays += weekdays.map{ |w| w[0...3] }
262
+ wday = m.pattern weekdays, case_insensitive: true, name: :wday
263
+ months = %w( January February March April May June July August September October November December )
264
+ months += months.map{ |w| w[0...3] }
265
+ mo = m.pattern months, case_insensitive: true, name: :mo
266
+
267
+ date_20th_century = m.rx(
268
+ [
269
+ 'wday, mo mday',
270
+ 'wday, mo mday year',
271
+ 'mo mday, year',
272
+ 'mo year',
273
+ 'mday mo year',
274
+ 'wday',
275
+ 'year',
276
+ 'mday mo',
277
+ 'mo mday',
278
+ 'mo mday year'
279
+ ],
280
+ normalize_whitespace: true,
281
+ atomic: true,
282
+ symbols: {
283
+ year: year,
284
+ mday: mday,
285
+ wday: wday,
286
+ mo: mo
287
+ }
288
+ )
289
+
290
+ [
291
+ 'Friday',
292
+ 'August 27',
293
+ 'May 6, 1969',
294
+ '1 Jan 2000',
295
+ 'this is not actually a date'
296
+ ].each do |candidate|
297
+ if m = date_20th_century.match(candidate)
298
+ puts "candidate: #{candidate}; year: #{m[:year]}; month: #{m[:mo]}; weekday: #{m[:wday]}; day of the month: #{m[:mday]}"
299
+ else
300
+ puts "#{candidate} does not look like a plausible date in the 20th century"
301
+ end
302
+ end
303
+ ```
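The same incremental composition can be tried without the gem, using hand-written named groups in plain Ruby. The constants and `parse_date` below are hypothetical and far less thorough than the example above; they only show how named sub-patterns combine and how the captures are read back.

```ruby
# Hand-written fragments standing in for generated ones.
YEAR = '(?<year>19\d\d|2000)'
MO   = '(?<mo>Jan|Feb|Mar)'
DATE = Regexp.new("\\b#{MO}\\s+#{YEAR}\\b", Regexp::IGNORECASE)

def parse_date(text)
  match = DATE.match(text)
  match && { mo: match[:mo], year: match[:year] }
end

p parse_date('due feb 1969')   # {mo: "feb", year: "1969"} (Hash#inspect format varies by Ruby version)
p parse_date('no date here')   # nil
```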
304
+
305
+ ### vet
306
+
307
+ ```ruby
308
+ default: false
309
+ ```
310
+
311
+ If true, all patterns associated with symbols will be tested upon initialization to make sure they will
312
+ create legitimate regular expressions. If you are prone to doing this, for example:
313
+
314
+ ```ruby
315
+ List::Matcher.new symbols: { aw_nuts: '+++' }
316
+ ```
317
+
318
+ then you may want to vet your symbols. Vetting is not done by default because one assumes you've worked out
319
+ your substitutions on your own time and we need not waste runtime checking them.
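A check of this kind can be sketched with the standard library alone. `invalid_symbol_patterns` is a hypothetical helper, not part of the gem; it simply tries to compile each substitution and collects the ones that raise `RegexpError`.

```ruby
# Illustrative vetting: compile every substitution and report failures.
def invalid_symbol_patterns(symbols)
  symbols.each_with_object({}) do |(name, pattern), errors|
    begin
      Regexp.new(pattern)
    rescue RegexpError => e
      errors[name] = e.message
    end
  end
end

bad = invalid_symbol_patterns({ aw_nuts: '+++', num: '\d++' })
puts bad.keys.inspect # [:aw_nuts]
```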
320
+
321
+ ## Benchmarks
322
+
323
+ Efficiency isn't the principal purpose of List::Matcher, but in almost all cases List::Matcher
324
+ regular expressions are more efficient than a regular expression generated by simply joining alternates
325
+ with `|`. The following results were extracted from the output of the benchmark script included with this
326
+ distribution. Sets are provided as a baseline for comparison, though there are many things one can do
327
+ with a regular expression that one cannot do with a set.
328
+
329
+ ```
330
+ RANDOM WORDS, VARIABLE LENGTH
331
+
332
+ number of words: 100
333
+
334
+ set good: 53360.1 i/s
335
+ List::Matcher good: 22211.7 i/s - 2.40x slower
336
+ simple rx good: 13086.6 i/s - 4.08x slower
337
+ list good: 4748.0 i/s - 11.24x slower
338
+
339
+ set bad: 57387.1 i/s
340
+ List::Matcher bad: 14398.7 i/s - 3.99x slower
341
+ simple rx bad: 7347.1 i/s - 7.81x slower
342
+ list bad: 2583.1 i/s - 22.22x slower
343
+
344
+
345
+ number of words: 1000
346
+
347
+ set good: 5380.5 i/s
348
+ List::Matcher good: 1665.3 i/s - 3.23x slower
349
+ simple rx good: 166.7 i/s - 32.27x slower
350
+ list good: 52.8 i/s - 101.98x slower
351
+
352
+ set bad: 5294.8 i/s
353
+ List::Matcher bad: 1061.1 i/s - 4.99x slower
354
+ simple rx bad: 81.0 i/s - 65.34x slower
355
+ list bad: 26.1 i/s - 202.51x slower
356
+
357
+
358
+ number of words: 10000
359
+
360
+ set good: 361.3 i/s
361
+ List::Matcher good: 146.4 i/s - 2.47x slower
362
+ simple rx good: 1.7 i/s - 210.46x slower
363
+ list good: 0.4 i/s - 1027.74x slower
364
+
365
+ set bad: 370.3 i/s
366
+ List::Matcher bad: 82.2 i/s - 4.51x slower
367
+ simple rx bad: 0.8 i/s - 447.85x slower
368
+ list bad: 0.2 i/s - 1882.35x slower
369
+
370
+
371
+ FIXED LENGTH, FULL RANGE
372
+
373
+ number of words: 10; List::Matcher rx: (?-mix:\A\d\z)
374
+
375
+ set good: 520144.5 i/s
376
+ List::Matcher good: 382968.0 i/s - 1.36x slower
377
+ list good: 323052.6 i/s - 1.61x slower
378
+ simple rx good: 316058.3 i/s - 1.65x slower
379
+
380
+ set bad: 624424.8 i/s
381
+ List::Matcher bad: 270882.3 i/s - 2.31x slower
382
+ simple rx bad: 266277.3 i/s - 2.35x slower
383
+ list bad: 175058.3 i/s - 3.57x slower
384
+
385
+
386
+ number of words: 100; List::Matcher rx: (?-mix:\A\d\d\z)
387
+
388
+ set creation: 20.3 i/s
389
+ simple rx creation: 15.9 i/s - 1.28x slower
390
+ List::Matcher creation: 15.9 i/s - 1.28x slower
391
+
392
+ set good: 52058.4 i/s
393
+ List::Matcher good: 41841.7 i/s - 1.24x slower
394
+ simple rx good: 15095.6 i/s - 3.45x slower
395
+ list good: 4350.1 i/s - 11.97x slower
396
+
397
+ set bad: 59315.4 i/s
398
+ simple rx bad: 28063.6 i/s - 2.11x slower
399
+ List::Matcher bad: 27823.9 i/s - 2.13x slower
400
+ list bad: 2083.9 i/s - 28.46x slower
401
+
402
+
403
+ number of words: 1000; List::Matcher rx: (?-mix:\A\d{3}\z)
404
+
405
+ set creation: 2.1 i/s
406
+ List::Matcher creation: 1.5 i/s - 1.40x slower
407
+ simple rx creation: 1.5 i/s - 1.41x slower
408
+
409
+ set good: 4664.2 i/s
410
+ List::Matcher good: 3514.1 i/s - 1.33x slower
411
+ simple rx good: 225.6 i/s - 20.67x slower
412
+ list good: 44.2 i/s - 105.57x slower
413
+
414
+ set bad: 5830.5 i/s
415
+ simple rx bad: 2802.5 i/s - 2.08x slower
416
+ List::Matcher bad: 2717.0 i/s - 2.15x slower
417
+ list bad: 20.0 i/s - 291.10x slower
418
+
419
+
420
+ number of words: 10000; List::Matcher rx: (?-mix:\A\d{4}\z)
421
+
422
+ set creation: 0.2 i/s
423
+ simple rx creation: 0.1 i/s - 1.21x slower
424
+ List::Matcher creation: 0.1 i/s - 1.31x slower
425
+
426
+ set good: 369.4 i/s
427
+ List::Matcher good: 326.2 i/s - 1.13x slower
428
+ simple rx good: 2.3 i/s - 159.07x slower
429
+ list good: 0.4 i/s - 966.48x slower
430
+
431
+ set bad: 426.6 i/s
432
+ simple rx bad: 285.6 i/s - 1.49x slower
433
+ List::Matcher bad: 277.1 i/s - 1.54x slower
434
+ list bad: 0.2 i/s - 2236.24x slower
435
+
436
+ ```
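A comparison in the spirit of the figures above can be sketched with the standard library. This is not the benchmark script shipped with the distribution: `WORDS`, `WORD_SET`, `SIMPLE_RX`, and `time_it` are hypothetical names, the word list is tiny, and the timings it prints will vary by machine.

```ruby
require 'set'

WORDS     = %w[cat camel catalog dog dogma].freeze
WORD_SET  = WORDS.to_set
# The "simple rx" baseline: alternates joined with "|".
SIMPLE_RX = Regexp.new("\\A(?:#{WORDS.map { |w| Regexp.escape(w) }.join('|')})\\z")

# Run the block repeatedly and return elapsed seconds.
def time_it(iterations = 10_000)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

set_seconds = time_it { WORD_SET.include?('dogma') }
rx_seconds  = time_it { SIMPLE_RX.match?('dogma') }
puts format('set: %.4fs, simple rx: %.4fs', set_seconds, rx_seconds)
```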
437
+
438
+ ## Contributing
439
+
440
+ 1. Fork it ( https://github.com/[my-github-username]/list_matcher/fork )
441
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
442
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
443
+ 4. Push to the branch (`git push origin my-new-feature`)
444
+ 5. Create a new Pull Request
data/Rakefile ADDED
@@ -0,0 +1,9 @@
1
+ require "bundler/gem_tasks"
2
+
3
+ require "rake/testtask"
4
+
5
+ Rake::TestTask.new do |t|
6
+ t.test_files = FileList['test/*_test.rb']
7
+ end
8
+
9
+ task default: :test
@@ -0,0 +1,49 @@
1
+ require 'list_matcher'
2
+
3
+ m = List::Matcher.new atomic: false, bound: true
4
+
5
+ year = m.pattern( (1901..2000).to_a, name: :year )
6
+ mday = m.pattern( (1..31).to_a, name: :mday )
7
+ weekdays = %w( Monday Tuesday Wednesday Thursday Friday Saturday Sunday )
8
+ weekdays += weekdays.map{ |w| w[0...3] }
9
+ wday = m.pattern weekdays, case_insensitive: true, name: :wday
10
+ months = %w( January February March April May June July August September October November December )
11
+ months += months.map{ |w| w[0...3] }
12
+ mo = m.pattern months, case_insensitive: true, name: :mo
13
+
14
+ date_20th_century = m.rx(
15
+ [
16
+ 'wday, mo mday',
17
+ 'wday, mo mday year',
18
+ 'mo mday, year',
19
+ 'mo year',
20
+ 'mday mo year',
21
+ 'wday',
22
+ 'year',
23
+ 'mday mo',
24
+ 'mo mday',
25
+ 'mo mday year'
26
+ ],
27
+ normalize_whitespace: true,
28
+ atomic: true,
29
+ symbols: {
30
+ year: year,
31
+ mday: mday,
32
+ wday: wday,
33
+ mo: mo
34
+ }
35
+ )
36
+
37
+ [
38
+ 'Friday',
39
+ 'August 27',
40
+ 'May 6, 1969',
41
+ '1 Jan 2000',
42
+ 'this is not actually a date'
43
+ ].each do |candidate|
44
+ if m = date_20th_century.match(candidate)
45
+ puts "candidate: #{candidate}; year: #{m[:year]}; month: #{m[:mo]}; weekday: #{m[:wday]}; day of the month: #{m[:mday]}"
46
+ else
47
+ puts "#{candidate} does not look like a plausible date in the 20th century"
48
+ end
49
+ end
@@ -0,0 +1,3 @@
1
+ module ListMatcher
2
+ VERSION = "1.0.0"
3
+ end