list_matcher 1.0.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: a4f9013c830b1d516bbf895732abce4ee0fd327f
4
+ data.tar.gz: a6f49385a4dc17ffe42fe706bfce7e2cef2e4135
5
+ SHA512:
6
+ metadata.gz: d97d5937836d59422ec61c8f4daa1e859ea8b62aa8598f74d875b7dbde4426a6d24368cd94ed0936baedb5dde357cfc10492ab173fe6f7e0b67748856aabf7f8
7
+ data.tar.gz: 343cbde71b7cb693b4dc67965a5334a5cdc05ebcf34d103bfd0be21633abd80aa6950c6718360d0a05fbea7a55ff3e553edcfdaa7d06f0777773a9e623a162a6
data/.gitignore ADDED
@@ -0,0 +1,14 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ *.bundle
11
+ *.so
12
+ *.o
13
+ *.a
14
+ mkmf.log
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in list_matcher.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2015 dfhoughton
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,444 @@
1
+ # ListMatcher
2
+
3
+ For creating compact, non-backtracking regular expressions from a list of strings.
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ ```ruby
10
+ gem 'list_matcher'
11
+ ```
12
+
13
+ And then execute:
14
+
15
+ $ bundle
16
+
17
+ Or install it yourself as:
18
+
19
+ $ gem install list_matcher
20
+
21
+ ## Usage
22
+
23
+ ```ruby
24
+ require 'list_matcher'
25
+
26
+ m = List::Matcher.new
27
+ puts m.pattern %w( cat dog ) # (?:cat|dog)
28
+ puts m.pattern %w( cat rat ) # (?:[cr]at)
29
+ puts m.pattern %w( cat camel ) # (?:ca(?:mel|t))
30
+ puts m.pattern %w( cat flat sprat ) # (?:(?:c|fl|spr)at)
31
+ puts m.pattern %w( catttttttttt ) # (?:cat{10})
32
+ puts m.pattern %w( cat-t-t-t-t-t-t-t-t-t ) # (?:ca(?:t-){9}t)
33
+ puts m.pattern %w( catttttttttt batttttttttt ) # (?:[bc]at{10})
34
+ puts m.pattern %w( cad bad dad ) # (?:[b-d]ad)
35
+ puts m.pattern %w( cat catalog ) # (?:cat(?:alog)?+)
36
+ puts m.pattern (1..31).to_a # (?:[4-9]|1\d?+|2\d?+|3[01]?+)
37
+ ```
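As a quick sanity check, one of the patterns above can be compiled with plain `Regexp` and compared against the list it was generated from. This sketch does not load `list_matcher`; the pattern string is copied from the comment on the `(1..31).to_a` line, and the `\A`/`\z` anchors are an addition made here only so that each candidate has to match in full.

```ruby
# Pattern copied from the README comment for (1..31).to_a.
source = '(?:[4-9]|1\d?+|2\d?+|3[01]?+)'

# Anchors added for this check only: a candidate must match in full.
rx = Regexp.new("\\A#{source}\\z")

# Collect every number from 0 to 99 whose decimal form matches.
matched = (0..99).select { |n| rx.match?(n.to_s) }

puts matched == (1..31).to_a
```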
38
+
39
+ ## Description
40
+
41
+ `List::Matcher` facilitates generating efficient regexen programmatically. This is useful, for example, when looking for
42
+ occurrences of particular words or phrases in free-form text. `List::Matcher` will automatically generate regular expressions
43
+ that minimize backtracking, so they tend to be as fast as one could hope a regular expression to be. (The general strategy is
44
+ to represent the items in the list as a trie.)
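To make the trie idea concrete, here is a minimal sketch that builds a nested-hash trie from a word list and renders it as a regex source string. The helper names `trie_pattern` and `render` are hypothetical, and this is not the gem's implementation: it does no sorting, character-class merging, or repetition compression, so its output differs in form from the patterns shown in the usage section.

```ruby
# Hypothetical helper, not part of list_matcher: builds a nested-hash trie
# from the words and renders it as a regular expression source string.
def trie_pattern(words)
  trie = {}
  words.each do |word|
    node = trie
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true # marks the end of a complete word
  end
  render(trie)
end

# Renders one trie node: shared prefixes are emitted once, and the
# differing continuations become alternatives.
def render(node)
  branches = node.reject { |key, _| key == :end }.map do |ch, child|
    Regexp.escape(ch) + render(child)
  end
  return '' if branches.empty?

  body = branches.length == 1 ? branches.first : "(?:#{branches.join('|')})"
  # A word ends here but longer words continue, so the rest is optional.
  body = "(?:#{body})?" if node[:end]
  body
end

puts trie_pattern(%w[cat camel])   # => ca(?:t|mel)
puts trie_pattern(%w[cat catalog]) # => cat(?:alog)?
```

Because every branch point in the trie corresponds to a distinct next character, the alternatives at any one position never share a first character, which is what keeps backtracking low.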
45
+
46
+ `List::Matcher` has many options and the initialization of a matcher for pattern generation is somewhat complex, so various methods
47
+ are provided to minimize initializations and the number of times you specify options. For one-off patterns, you may as well call
48
+ class methods, either `pattern` which generates a string, or `rx`, which returns a `Regexp` object:
49
+
50
+ ```ruby
51
+ List::Matcher.pattern %w( cat dog ) # "(?:cat|dog)"
52
+ List::Matcher.rx %w( cat dog ) # /(?:cat|dog)/
53
+ ```
54
+
55
+ If you plan to generate multiple regexen, or have complicated options which you always use, you should generate a configured
56
+ instance first:
57
+
58
+ ```ruby
59
+ m = List::Matcher.new normalize_whitespace: true, bound: true, case_insensitive: true, multiline: true, atomic: false, symbols: { num: '\d++' }
60
+ m.pattern method_that_gets_a_long_list
61
+ m.rx method_that_gets_a_long_list
62
+ ...
63
+ ```
64
+
65
+ If you have a basic set of options and you need to modify these in particular cases, you can:
66
+
67
+ ```ruby
68
+ m.pattern list, case_insensitive: false
69
+ ```
70
+
71
+ You can also generate a prototype list matcher with a particular variation and bud off children with their own properties:
72
+
73
+ ```ruby
74
+ m = List::Matcher.new normalize_whitespace: true, bound: true, case_insensitive: true, multiline: true, atomic: false, symbols: { num: '\d++' }
75
+ m2 = m.bud case_insensitive: false
76
+ ```
77
+
78
+ Basically, you can mix in options in whatever way suits you. Constructing configured instances gives you a tiny bit of efficiency, but
79
+ mostly it saves you from specifying these options in multiple places.
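The "bud" idea is a small prototype pattern: a child copies the parent's options and overrides only what it is given. The sketch below illustrates that with a hypothetical `OptionPrototype` class that only stores options; it says nothing about how `List::Matcher` stores or merges its options internally.

```ruby
# Hypothetical stand-in for a configured matcher; it only stores options.
class OptionPrototype
  attr_reader :options

  def initialize(**options)
    @options = options.freeze
  end

  # Return a new instance whose options are the parent's plus the overrides.
  def bud(**overrides)
    self.class.new(**options.merge(overrides))
  end
end

parent = OptionPrototype.new(bound: true, case_insensitive: true)
child  = parent.bud(case_insensitive: false)

p parent.options
p child.options
```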
80
+
81
+ ## Options
82
+
83
+ The options one can provide to `new`, `bud`, `pattern`, or `rx` are all the same. These are:
84
+
85
+ ### atomic
86
+
87
+ ```ruby
88
+ default: true
89
+ ```
90
+
91
+ If true, the returned expression is always wrapped in some grouping expression -- `(?:...)`, `(?>...)`, `(?i:...)`, etc.; whatever
92
+ is appropriate given the other options and defaults -- so it can receive a quantification suffix.
93
+
94
+ ```ruby
95
+ List::Matcher.pattern %w(cat dog), atomic: false # "cat|dog"
96
+ List::Matcher.pattern %w(cat dog), atomic: true # "(?:cat|dog)"
97
+ ```
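The reason the grouping matters shows up as soon as a quantifier or anchor is attached to the result. The sketch below uses only plain `Regexp` and the two pattern strings shown above; the anchors and the `+` suffix are additions made here for illustration.

```ruby
bare    = 'cat|dog'     # the atomic: false result shown above
grouped = '(?:cat|dog)' # the atomic: true result shown above

# Without a group, the suffix binds only to the final "g" and the
# anchors bind to one alternative each: \Acat | dog+\z
loose = Regexp.new("\\A#{bare}+\\z")
# With a group, the suffix applies to the whole alternation.
tight = Regexp.new("\\A#{grouped}+\\z")

puts loose.match?('catfish')   # => true, the \Acat branch matches alone
puts tight.match?('catfish')   # => false
puts tight.match?('catdogcat') # => true
```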
98
+
99
+ ### backtracking
100
+
101
+ ```ruby
102
+ default: true
103
+ ```
104
+
105
+ If true, the default non-capturing grouping expression is `(?:...)` rather than `(?>...)`, and the optional quantifier is
106
+ `?` rather than `?+`.
107
+
108
+ ```ruby
109
+ List::Matcher.pattern %w( cat dog ) # "(?:cat|dog)"
110
+ List::Matcher.pattern %w( cat dog ), backtracking: false # "(?>cat|dog)"
111
+ ```
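The difference between the two grouping forms, and between `?` and `?+`, can be seen with plain `Regexp` on a deliberately awkward pattern. The patterns below are illustrative and are not generated by `List::Matcher`.

```ruby
# Ordinary group: when "a" does not lead to "c", the engine retries with "ab".
backtracking = /\A(?:a|ab)c\z/
# Atomic group: once "a" has matched, the alternative "ab" is never retried.
atomic = /\A(?>a|ab)c\z/

puts backtracking.match?('abc') # => true
puts atomic.match?('abc')       # => false

# The same contrast for the optional quantifier.
greedy     = /\Aa?a\z/
possessive = /\Aa?+a\z/

puts greedy.match?('a')     # => true
puts possessive.match?('a') # => false
```

In a trie-shaped pattern the alternatives at a branch point never share a first character, so committing to a branch should not discard a viable match within the list itself. That is an inference from the examples in this README rather than a documented guarantee.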
112
+
113
+ ### bound
114
+
115
+ ```ruby
116
+ default: false
117
+ ```
118
+
119
+ Whether boundary expressions should be attached to the margins of every expression in the list. If this value is simply true,
120
+ each item's marginal characters, the first and the last, are tested to see whether they are word characters, and if so the word
121
+ boundary symbol, `\b`, is attached to them where appropriate. There are several variants on this, however:
122
+
123
+ ```ruby
124
+ bound: :word
125
+ ```
126
+
127
+ This is the same as `bound: true`.
128
+
129
+ ```ruby
130
+ List::Matcher.pattern %w(cat), bound: :word # "(?:\\bcat\\b)"
131
+ List::Matcher.pattern %w(cat), bound: true # "(?:\\bcat\\b)"
132
+ ```
133
+
134
+ ```ruby
135
+ bound: :line
136
+ ```
137
+
138
+ Each item should take up an entire line, so the boundary symbols are `^` and `$`.
139
+
140
+ ```ruby
141
+ List::Matcher.pattern %w(cat), bound: :line # "(?:^cat$)"
142
+ ```
143
+
144
+ ```ruby
145
+ bound: :string
146
+ ```
147
+
148
+ Each item should match the entire string compared against, so the boundary symbols are `\A` and `\z`.
149
+
150
+ ```ruby
151
+ List::Matcher.pattern %w(cat), bound: :string # "(?:\\Acat\\z)"
152
+ ```
153
+
154
+ ```ruby
155
+ bound: { test: /\d/, left: '(?<!\d)', right: '(?!\d)'}
156
+ ```
157
+
158
+ If you have an ad hoc boundary definition -- here it is a digit/non-digit boundary -- you may specify it so. The test parameter
159
+ identifies marginal characters that require the boundary tests and the `:left` and `:right` symbols identify the boundary conditions.
160
+
161
+ ```ruby
162
+ List::Matcher.pattern (1...1000).to_a, bound: { test: /\d/, left: '(?<!\d)', right: '(?!\d)'}
163
+ # "(?:(?<!\\d)[1-9](?:\\d\\d?)?(?!\\d))"
164
+ ```
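The four kinds of boundary behave differently on the same input. The following sketch writes the boundary expressions by hand with plain `Regexp` (it does not call `List::Matcher`) to show what each variant accepts; the sample strings are made up for illustration.

```ruby
word   = /\bcat\b/         # what bound: true or bound: :word attaches
line   = /^cat$/           # what bound: :line attaches
string = /\Acat\z/         # what bound: :string attaches
digit  = /(?<!\d)42(?!\d)/ # the ad hoc digit/non-digit boundary

puts word.match?('a cat sat')       # => true
puts word.match?('concatenate')     # => false
puts line.match?("dog\ncat\nemu")   # => true
puts string.match?("dog\ncat\nemu") # => false
puts digit.match?('x42y')           # => true
puts digit.match?('1425')           # => false

# A word boundary would not do for the digit case: "x" and "4" are both
# word characters, so there is no \b between them.
puts(/\b42\b/.match?('x42y'))       # => false
```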
165
+
166
+ ### strip
167
+
168
+ ```ruby
169
+ default: false
170
+ ```
171
+
172
+ Strip whitespace off the margins of items in the list.
173
+
174
+ ```ruby
175
+ List::Matcher.pattern ['     cat     '] # "(?:(?:\\ ){5}cat(?:\\ ){5})"
176
+ List::Matcher.pattern ['     cat     '], strip: true # "(?:cat)"
177
+ ```
178
+
179
+ ### case_insensitive
180
+
181
+ ```ruby
182
+ default: false
183
+ ```
184
+
185
+ Generate a case-insensitive regular expression.
186
+
187
+ ```ruby
188
+ List::Matcher.pattern %w( Cat cat CAT ) # "(?:C(?:AT|at)|cat)"
189
+ List::Matcher.pattern %w( Cat cat CAT ), case_insensitive: true # "(?i:cat)"
190
+ ```
191
+
192
+ ### multiline
193
+
194
+ ```ruby
195
+ default: false
196
+ ```
197
+
198
+ Generate a multi-line regex.
199
+
200
+ ```ruby
201
+ List::Matcher.pattern %w(cat), multiline: true # "(?m:cat)"
202
+ ```
203
+
204
+ The special feature of a multi-line regular expression is that `.` can grab newline characters. Because `List::Matcher`
205
+ never produces `.` on its own, this option is only useful in conjunction with the `symbols` option, which lets one
206
+ inject snippets of regex into the one generated.
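The effect of the multi-line flag on `.` can be checked with plain `Regexp`. The patterns here are hand-written to contain a `.`, which, as noted above, `List::Matcher` would only emit by way of a symbol substitution.

```ruby
plain     = /cat.dog/
multiline = /cat.dog/m
inline    = /(?m:cat.dog)/ # the inline form used in the generated pattern

text = "cat\ndog"

puts plain.match?(text)     # => false, "." stops at the newline
puts multiline.match?(text) # => true
puts inline.match?(text)    # => true
```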
207
+
208
+ ### normalize_whitespace
209
+
210
+ ```ruby
211
+ default: false
212
+ ```
213
+
214
+ This strips whitespace from items in the list and treats all internal whitespace as equivalent.
215
+
216
+ ```ruby
217
+ List::Matcher.pattern [ ' cat  walker ', '  dog walker', 'camel  walker' ]
218
+ # "(?:\\ (?:\\ dog\\ walker|cat\\ \\ walker\\ )|camel\\ \\ walker)"
219
+ List::Matcher.pattern [ ' cat  walker ', '  dog walker', 'camel  walker' ], normalize_whitespace: true
220
+ # "(?:(?:ca(?:mel|t)|dog)\\s++walker)"
221
+ ```
222
+
223
+ ### symbols
224
+
225
+ You can tell `List::Matcher` that certain character sequences should be regarded as "symbols". It will then leave
226
+ these unmolested, replacing them in the generated regex with whatever you map the symbol sequences to. The keys in
227
+ the symbol hash are expected to be strings, symbols, or `Regexps`. Symbol keys are converted to their
228
+ sequence by stringification. `Regexp` keys convert any sequence they match.
229
+
230
+ ```ruby
231
+ List::Matcher.pattern [ 'Catch 22', '1984', 'Fahrenheit 451' ], symbols: { /\d+/ => '\d++' }
232
+ # "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
233
+ List::Matcher.pattern [ 'Catch foo', 'foo', 'Fahrenheit foo' ], symbols: { 'foo' => '\d++' }
234
+ # "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
235
+ List::Matcher.pattern [ 'Catch foo', 'foo', 'Fahrenheit foo' ], symbols: { foo: '\d++' }
236
+ # "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
237
+ ```
238
+
239
+ Because it is possible for symbol sequences to overlap, sequences with string or symbol keys are evaluated before `Regexps`, and longer keys are
240
+ evaluated before shorter ones.
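The ordering rule can be sketched as a sort over the keys of the symbol hash. The helper `evaluation_order` below is hypothetical and is not taken from the gem; in particular, measuring a `Regexp` key by the length of its source is an assumption made here, since the text above does not say how the length of a `Regexp` is determined.

```ruby
# Hypothetical helper: orders symbol keys the way the paragraph above
# describes. It is not part of list_matcher.
def evaluation_order(symbols)
  literals, regexps = symbols.keys.partition { |key| !key.is_a?(Regexp) }
  # Longer string or symbol keys first, so "foobar" is tried before "foo".
  literals = literals.sort_by { |key| -key.to_s.length }
  # Assumption: a Regexp's length is taken to be the length of its source.
  regexps = regexps.sort_by { |key| -key.source.length }
  literals + regexps
end

order = evaluation_order({ 'foo' => 'A', /\d+/ => 'B', foobar: 'C', 'fo' => 'D' })
p order
```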
241
+
242
+ ### name
243
+
244
+ If you assign your pattern a name, it will be constructed with a named group such that you can extract
245
+ the substring matched.
246
+
247
+ ```ruby
248
+ List::Matcher.pattern %w(cat), name: :cat # "(?<cat>cat)"
249
+ ```
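Once a pattern carries a named group, the matched substring can be read back from the `MatchData` by name. The sketch below uses plain `Regexp`: the first pattern string is the one shown above, while the `mday` and `year` sub-patterns are hand-written, hypothetical stand-ins for what `List::Matcher` would generate.

```ruby
# Pattern string copied from the example above.
cat = Regexp.new('(?<cat>cat)')
m = cat.match('concatenate')
puts m[:cat]       # => cat
puts m.begin(:cat) # => 3

# Hypothetical hand-written sub-patterns, composed by interpolation.
mday = '(?<mday>[12]\d|3[01]|[1-9])'
year = '(?<year>(?:19|20)\d\d)'
date = Regexp.new("\\b#{mday} Jan #{year}\\b")

d = date.match('on 1 Jan 2000 at noon')
puts d[:mday] # => 1
puts d[:year] # => 2000
```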
250
+
251
+ This is mostly useful if you are using `List::Matcher` to compose complex regexen incrementally. E.g., from the examples directory,
252
+
253
+ ```ruby
254
+ require 'list_matcher'
255
+
256
+ m = List::Matcher.new atomic: false, bound: true
257
+
258
+ year = m.pattern( (1901..2000).to_a, name: :year )
259
+ mday = m.pattern( (1..31).to_a, name: :mday )
260
+ weekdays = %w( Monday Tuesday Wednesday Thursday Friday Saturday Sunday )
261
+ weekdays += weekdays.map{ |w| w[0...3] }
262
+ wday = m.pattern weekdays, case_insensitive: true, name: :wday
263
+ months = %w( January February March April May June July August September October November December )
264
+ months += months.map{ |w| w[0...3] }
265
+ mo = m.pattern months, case_insensitive: true, name: :mo
266
+
267
+ date_20th_century = m.rx(
268
+ [
269
+ 'wday, mo mday',
270
+ 'wday, mo mday year',
271
+ 'mo mday, year',
272
+ 'mo year',
273
+ 'mday mo year',
274
+ 'wday',
275
+ 'year',
276
+ 'mday mo',
277
+ 'mo mday',
278
+ 'mo mday year'
279
+ ],
280
+ normalize_whitespace: true,
281
+ atomic: true,
282
+ symbols: {
283
+ year: year,
284
+ mday: mday,
285
+ wday: wday,
286
+ mo: mo
287
+ }
288
+ )
289
+
290
+ [
291
+ 'Friday',
292
+ 'August 27',
293
+ 'May 6, 1969',
294
+ '1 Jan 2000',
295
+ 'this is not actually a date'
296
+ ].each do |candidate|
297
+ if m = date_20th_century.match(candidate)
298
+ puts "candidate: #{candidate}; year: #{m[:year]}; month: #{m[:mo]}; weekday: #{m[:wday]}; day of the month: #{m[:mday]}"
299
+ else
300
+ puts "#{candidate} does not look like a plausible date in the 20th century"
301
+ end
302
+ end
303
+ ```
304
+
305
+ ### vet
306
+
307
+ ```ruby
308
+ default: false
309
+ ```
310
+
311
+ If true, all patterns associated with symbols will be tested upon initialization to make sure they will
312
+ create legitimate regular expressions. If you are prone to doing this, for example:
313
+
314
+ ```ruby
315
+ List::Matcher.new symbols: { aw_nuts: '+++' }
316
+ ```
317
+
318
+ then you may want to vet your symbols. Vetting is not done by default because one assumes you've worked out
319
+ your substitutions on your own time and we need not waste runtime checking them.
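Vetting amounts to trying to compile each substitution and reporting the ones that fail. A minimal sketch of that idea with plain `Regexp` follows; the helper `vet_symbols` is hypothetical, and how the gem itself reports a bad symbol is not shown in this README.

```ruby
# Hypothetical helper: returns a hash of symbol key => error message for
# every substitution that does not compile as a regular expression.
def vet_symbols(symbols)
  symbols.each_with_object({}) do |(key, source), errors|
    begin
      Regexp.new(source)
    rescue RegexpError => e
      errors[key] = e.message
    end
  end
end

errors = vet_symbols({ aw_nuts: '+++', num: '\d++' })
errors.each { |key, message| puts "#{key}: #{message}" }
```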
320
+
321
+ ## Benchmarks
322
+
323
+ Efficiency isn't the principal purpose of List::Matcher, but in almost all cases List::Matcher
324
+ regular expressions are more efficient than a regular expression generated by simply joining alternates
325
+ with `|`. The following results were extracted from the output of the benchmark script included with this
326
+ distribution. Sets are provided as a baseline for comparison, though there are many things one can do
327
+ with a regular expression that one cannot do with a set.
328
+
329
+ ```
330
+ RANDOM WORDS, VARIABLE LENGTH
331
+
332
+ number of words: 100
333
+
334
+ set good: 53360.1 i/s
335
+ List::Matcher good: 22211.7 i/s - 2.40x slower
336
+ simple rx good: 13086.6 i/s - 4.08x slower
337
+ list good: 4748.0 i/s - 11.24x slower
338
+
339
+ set bad: 57387.1 i/s
340
+ List::Matcher bad: 14398.7 i/s - 3.99x slower
341
+ simple rx bad: 7347.1 i/s - 7.81x slower
342
+ list bad: 2583.1 i/s - 22.22x slower
343
+
344
+
345
+ number of words: 1000
346
+
347
+ set good: 5380.5 i/s
348
+ List::Matcher good: 1665.3 i/s - 3.23x slower
349
+ simple rx good: 166.7 i/s - 32.27x slower
350
+ list good: 52.8 i/s - 101.98x slower
351
+
352
+ set bad: 5294.8 i/s
353
+ List::Matcher bad: 1061.1 i/s - 4.99x slower
354
+ simple rx bad: 81.0 i/s - 65.34x slower
355
+ list bad: 26.1 i/s - 202.51x slower
356
+
357
+
358
+ number of words: 10000
359
+
360
+ set good: 361.3 i/s
361
+ List::Matcher good: 146.4 i/s - 2.47x slower
362
+ simple rx good: 1.7 i/s - 210.46x slower
363
+ list good: 0.4 i/s - 1027.74x slower
364
+
365
+ set bad: 370.3 i/s
366
+ List::Matcher bad: 82.2 i/s - 4.51x slower
367
+ simple rx bad: 0.8 i/s - 447.85x slower
368
+ list bad: 0.2 i/s - 1882.35x slower
369
+
370
+
371
+ FIXED LENGTH, FULL RANGE
372
+
373
+ number of words: 10; List::Matcher rx: (?-mix:\A\d\z)
374
+
375
+ set good: 520144.5 i/s
376
+ List::Matcher good: 382968.0 i/s - 1.36x slower
377
+ list good: 323052.6 i/s - 1.61x slower
378
+ simple rx good: 316058.3 i/s - 1.65x slower
379
+
380
+ set bad: 624424.8 i/s
381
+ List::Matcher bad: 270882.3 i/s - 2.31x slower
382
+ simple rx bad: 266277.3 i/s - 2.35x slower
383
+ list bad: 175058.3 i/s - 3.57x slower
384
+
385
+
386
+ number of words: 100; List::Matcher rx: (?-mix:\A\d\d\z)
387
+
388
+ set creation: 20.3 i/s
389
+ simple rx creation: 15.9 i/s - 1.28x slower
390
+ List::Matcher creation: 15.9 i/s - 1.28x slower
391
+
392
+ set good: 52058.4 i/s
393
+ List::Matcher good: 41841.7 i/s - 1.24x slower
394
+ simple rx good: 15095.6 i/s - 3.45x slower
395
+ list good: 4350.1 i/s - 11.97x slower
396
+
397
+ set bad: 59315.4 i/s
398
+ simple rx bad: 28063.6 i/s - 2.11x slower
399
+ List::Matcher bad: 27823.9 i/s - 2.13x slower
400
+ list bad: 2083.9 i/s - 28.46x slower
401
+
402
+
403
+ number of words: 1000; List::Matcher rx: (?-mix:\A\d{3}\z)
404
+
405
+ set creation: 2.1 i/s
406
+ List::Matcher creation: 1.5 i/s - 1.40x slower
407
+ simple rx creation: 1.5 i/s - 1.41x slower
408
+
409
+ set good: 4664.2 i/s
410
+ List::Matcher good: 3514.1 i/s - 1.33x slower
411
+ simple rx good: 225.6 i/s - 20.67x slower
412
+ list good: 44.2 i/s - 105.57x slower
413
+
414
+ set bad: 5830.5 i/s
415
+ simple rx bad: 2802.5 i/s - 2.08x slower
416
+ List::Matcher bad: 2717.0 i/s - 2.15x slower
417
+ list bad: 20.0 i/s - 291.10x slower
418
+
419
+
420
+ number of words: 10000; List::Matcher rx: (?-mix:\A\d{4}\z)
421
+
422
+ set creation: 0.2 i/s
423
+ simple rx creation: 0.1 i/s - 1.21x slower
424
+ List::Matcher creation: 0.1 i/s - 1.31x slower
425
+
426
+ set good: 369.4 i/s
427
+ List::Matcher good: 326.2 i/s - 1.13x slower
428
+ simple rx good: 2.3 i/s - 159.07x slower
429
+ list good: 0.4 i/s - 966.48x slower
430
+
431
+ set bad: 426.6 i/s
432
+ simple rx bad: 285.6 i/s - 1.49x slower
433
+ List::Matcher bad: 277.1 i/s - 1.54x slower
434
+ list bad: 0.2 i/s - 2236.24x slower
435
+
436
+ ```
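For readers who want to reproduce the shape of this comparison without the bundled benchmark script, here is a rough sketch using only the standard library. It is not the script that produced the numbers above: it reports elapsed seconds rather than iterations per second, the candidate lists are made up, and the compact pattern is written by hand to match the `\A\d{3}\z` reported for the 1000-word fixed-length case.

```ruby
require 'set'
require 'benchmark'

# 1000 fixed-length words, "000" through "999".
words = (0...1000).map { |i| format('%03d', i) }

set       = words.to_set
simple_rx = Regexp.new("\\A(?:#{Regexp.union(words).source})\\z")
compact   = /\A\d{3}\z/ # hand-written; the rx reported above for this case

good = words.sample(200)        # strings that are in the list
bad  = good.map { |w| w + 'x' } # made-up strings that are not

checks = {
  'set'        => ->(s) { set.include?(s) },
  'compact rx' => ->(s) { compact.match?(s) },
  'simple rx'  => ->(s) { simple_rx.match?(s) },
  'list'       => ->(s) { words.include?(s) }
}

checks.each do |label, check|
  seconds = Benchmark.realtime do
    20.times { (good + bad).each { |s| check.call(s) } }
  end
  puts format('%-12s %.4f s', label, seconds)
end
```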
437
+
438
+ ## Contributing
439
+
440
+ 1. Fork it ( https://github.com/[my-github-username]/list_matcher/fork )
441
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
442
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
443
+ 4. Push to the branch (`git push origin my-new-feature`)
444
+ 5. Create a new Pull Request
data/Rakefile ADDED
@@ -0,0 +1,9 @@
1
+ require "bundler/gem_tasks"
2
+
3
+ require "rake/testtask"
4
+
5
+ Rake::TestTask.new do |t|
6
+ t.test_files = FileList['test/*_test.rb']
7
+ end
8
+
9
+ task default: :test
@@ -0,0 +1,49 @@
1
+ require 'list_matcher'
2
+
3
+ m = List::Matcher.new atomic: false, bound: true
4
+
5
+ year = m.pattern( (1901..2000).to_a, name: :year )
6
+ mday = m.pattern( (1..31).to_a, name: :mday )
7
+ weekdays = %w( Monday Tuesday Wednesday Thursday Friday Saturday Sunday )
8
+ weekdays += weekdays.map{ |w| w[0...3] }
9
+ wday = m.pattern weekdays, case_insensitive: true, name: :wday
10
+ months = %w( January February March April May June July August September October November December )
11
+ months += months.map{ |w| w[0...3] }
12
+ mo = m.pattern months, case_insensitive: true, name: :mo
13
+
14
+ date_20th_century = m.rx(
15
+ [
16
+ 'wday, mo mday',
17
+ 'wday, mo mday year',
18
+ 'mo mday, year',
19
+ 'mo year',
20
+ 'mday mo year',
21
+ 'wday',
22
+ 'year',
23
+ 'mday mo',
24
+ 'mo mday',
25
+ 'mo mday year'
26
+ ],
27
+ normalize_whitespace: true,
28
+ atomic: true,
29
+ symbols: {
30
+ year: year,
31
+ mday: mday,
32
+ wday: wday,
33
+ mo: mo
34
+ }
35
+ )
36
+
37
+ [
38
+ 'Friday',
39
+ 'August 27',
40
+ 'May 6, 1969',
41
+ '1 Jan 2000',
42
+ 'this is not actually a date'
43
+ ].each do |candidate|
44
+ if m = date_20th_century.match(candidate)
45
+ puts "candidate: #{candidate}; year: #{m[:year]}; month: #{m[:mo]}; weekday: #{m[:wday]}; day of the month: #{m[:mday]}"
46
+ else
47
+ puts "#{candidate} does not look like a plausible date in the 20th century"
48
+ end
49
+ end
@@ -0,0 +1,3 @@
1
+ module ListMatcher
2
+ VERSION = "1.0.0"
3
+ end