RubyGems - list_matcher - Versions diffs - 1.0.0 - Mend

list_matcher 1.0.0

Files changed (14)

checksums.yaml ADDED

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: a4f9013c830b1d516bbf895732abce4ee0fd327f
+  data.tar.gz: a6f49385a4dc17ffe42fe706bfce7e2cef2e4135
+SHA512:
+  metadata.gz: d97d5937836d59422ec61c8f4daa1e859ea8b62aa8598f74d875b7dbde4426a6d24368cd94ed0936baedb5dde357cfc10492ab173fe6f7e0b67748856aabf7f8
+  data.tar.gz: 343cbde71b7cb693b4dc67965a5334a5cdc05ebcf34d103bfd0be21633abd80aa6950c6718360d0a05fbea7a55ff3e553edcfdaa7d06f0777773a9e623a162a6

data/.gitignore ADDED

@@ -0,0 +1,14 @@
+/.bundle/
+/.yardoc
+/Gemfile.lock
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+*.bundle
+*.so
+*.o
+*.a
+mkmf.log

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in list_matcher.gemspec
+gemspec

data/LICENSE.txt ADDED

@@ -0,0 +1,22 @@
+Copyright (c) 2015 dfhoughton
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,444 @@
+# ListMatcher
+For creating compact, non-backtracking regular expressions from a list of strings.
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'list_matcher'
+```
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install list_matcher
+## Usage
+```ruby
+require 'list_matcher'
+m = List::Matcher.new
+puts m.pattern %w( cat dog )                               # (?:cat|dog)
+puts m.pattern %w( cat rat )                               # (?:[cr]at)
+puts m.pattern %w( cat camel )                             # (?:ca(?:mel|t))
+puts m.pattern %w( cat flat sprat )                        # (?:(?:c|fl|spr)at)
+puts m.pattern %w( catttttttttt )                          # (?:cat{10})
+puts m.pattern %w( cat-t-t-t-t-t-t-t-t-t )                 # (?:ca(?:t-){9}t)
+puts m.pattern %w( catttttttttt batttttttttt )             # (?:[bc]at{10})
+puts m.pattern %w( cad bad dad )                           # (?:[b-d]ad)
+puts m.pattern %w( cat catalog )                           # (?:cat(?:alog)?+)
+puts m.pattern (1..31).to_a                                # (?:[4-9]|1\d?+|2\d?+|3[01]?+)
+```
+## Description
+`List::Matcher` facilitates generating efficient regexen programmatically. This is useful, for example, when looking for
+occurrences of particular words or phrases in free-form text. `List::Matcher` will automatically generate regular expressions
+that minimize backtracking, so they tend to be as fast as one could hope a regular expression to be. (The general strategy is
+to represent the items in the list as a trie.)
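The trie strategy can be illustrated with a minimal sketch. This is not the gem's implementation: the helper names `build_trie`, `trie_to_pattern`, and `trie_pattern` are hypothetical, and the sketch leaves out the character classes, repetition counts, and possessive quantifiers seen in the patterns under Usage. It only shows how shared prefixes collapse so that no prefix is tried twice.

```ruby
# Build a nested-hash trie; the :end key marks the end of a complete item.
def build_trie(words)
  words.each_with_object({}) do |word, root|
    node = root
    word.to_s.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end
end

# Turn a trie node into a regex fragment covering everything below it.
def trie_to_pattern(node)
  branches = node.reject { |key, _| key == :end }.map do |ch, child|
    Regexp.escape(ch) + trie_to_pattern(child)
  end
  return '' if branches.empty?

  optional = node[:end] # an item ends here, so whatever follows is optional
  if branches.size == 1 && !optional
    branches.first
  else
    "(?:#{branches.join('|')})#{'?' if optional}"
  end
end

# Hypothetical convenience wrapper: always returns a grouped pattern.
def trie_pattern(words)
  "(?:#{trie_to_pattern(build_trie(words))})"
end

puts trie_pattern(%w( cat camel ))    # (?:ca(?:t|mel))
puts trie_pattern(%w( cat catalog ))  # (?:cat(?:alog)?)
```

Alternatives appear in insertion order here, whereas the patterns in the Usage section are ordered differently, which is one more sign that the gem does additional work beyond this sketch.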
+`List::Matcher` has many options, and initializing a matcher for pattern generation is somewhat complex, so various methods
+are provided to minimize initializations and the number of times you specify options. For one-off patterns, you may as well call
+the class methods: either `pattern`, which generates a string, or `rx`, which returns a `Regexp` object:
+```ruby
+List::Matcher.pattern %w( cat dog )   # "(?:cat|dog)"
+List::Matcher.rx      %w( cat dog )   # /(?:cat|dog)/
+```
+If you plan to generate multiple regexen, or have complicated options which you always use, you should generate a configured
+instance first:
+```ruby
+m = List::Matcher.new normalize_whitespace: true, bound: true, case_insensitive: true, multiline: true, atomic: false, symbols: { num: '\d++' }
+m.pattern method_that_gets_a_long_list
+m.rx      method_that_gets_a_long_list
+...
+```
+If you have a basic set of options and you need to modify these in particular cases, you can:
+```ruby
+m.pattern list, case_insensitive: false
+```
+You can also generate a prototype list matcher with a particular variation and bud off children with their own properties:
+```ruby
+m  = List::Matcher.new normalize_whitespace: true, bound: true, case_insensitive: true, multiline: true, atomic: false, symbols: { num: '\d++' }
+m2 = m.bud case_insensitive: false
+```
+Basically, you can mix in options in whatever way suits you. Constructing configured instances gives you a tiny bit of efficiency, but
+mostly it saves you from specifying these options in multiple places.
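The layering of options described here (stored defaults, per-call overrides, and budded children) can be pictured as plain hash merging. The sketch below is an assumption about the general idea, not the gem's code; `effective_options` and `bud_options` are hypothetical helper names.

```ruby
# Options for one call: the stored defaults with per-call overrides applied.
def effective_options(defaults, overrides = {})
  defaults.merge(overrides)
end

# A child configuration: a frozen copy of the parent's options with some replaced.
# The parent hash is never modified.
def bud_options(parent, overrides = {})
  effective_options(parent, overrides).freeze
end

base  = { bound: true, case_insensitive: true }.freeze
child = bud_options(base, { case_insensitive: false })
p child                                         # the child's own settings
p effective_options(base, { bound: false })     # a one-off override; base is untouched
```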
+## Options
+The options one can provide to `new`, `bud`, `pattern`, or `rx` are all the same. These are:
+### atomic
+```ruby
+default: true
+```
+If true, the returned expression is always wrapped in some grouping expression -- `(?:...)`, `(?>...)`, `(?i:...)`, etc.; whatever
+is appropriate given the other options and defaults -- so it can receive a quantification suffix.
+```ruby
+List::Matcher.pattern %w(cat dog), atomic: false   # "cat|dog"
+List::Matcher.pattern %w(cat dog), atomic: true    # "(?:cat|dog)"
+```
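Why the grouping matters for quantification can be seen with plain Ruby regular expressions. The following sketch uses only the standard `Regexp` class; `quantify` is a hypothetical helper that appends a quantifier to a pattern string and anchors the result.

```ruby
# Append a quantifier to a pattern fragment and anchor the whole expression.
def quantify(fragment, quantifier = '+')
  Regexp.new("\\A#{fragment}#{quantifier}\\z")
end

bare    = quantify('cat|dog')      # /\Acat|dog+\z/ : the + binds only to the final "g"
grouped = quantify('(?:cat|dog)')  # /\A(?:cat|dog)+\z/ : the + binds to the whole group

p grouped.match?('dogcat')   # true: one or more repetitions of cat or dog
p bare.match?('dogcat')      # false: neither "\Acat" nor "dog+\z" matches
p bare.match?('xdogggg')     # true: the second alternative, "dog+\z", matches at the end
```

The ungrouped string is still useful when it will be embedded in a larger expression that supplies its own grouping.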
+### backtracking
+```ruby
+default: true
+```
+If true, the default non-capturing grouping expression is `(?:...)` rather than `(?>...)`, and the optional quantifier is
+`?` rather than `?+`.
+```ruby
+List::Matcher.pattern %w( cat dog )                        # "(?:cat|dog)"
+List::Matcher.pattern %w( cat dog ), backtracking: false   # "(?>cat|dog)"
+```
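The practical difference between `(?:...)` and `(?>...)` shows up when something follows the group. The sketch below uses only Ruby's built-in regex engine; `plural_rx` is a hypothetical helper that requires a literal `s` after the fragment. It also suggests why a non-backtracking pattern needs the trie-style factoring: an atomic group never revisits an alternative it has already accepted.

```ruby
# Anchor a fragment and require a trailing "s" after it.
def plural_rx(fragment)
  Regexp.new("\\A#{fragment}s\\z")
end

# Backtracking group: "cat" is tried first, the "s" fails, then "catalog" is tried.
p plural_rx('(?:cat|catalog)').match?('catalogs')      # true

# Atomic group: once "cat" has matched, the group will not be re-entered.
p plural_rx('(?>cat|catalog)').match?('catalogs')      # false

# Factored form with a possessive optional suffix: no backtracking is needed.
p plural_rx('(?>cat(?:alog)?+)').match?('catalogs')    # true
p plural_rx('(?>cat(?:alog)?+)').match?('cats')        # true
```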
+### bound
+```ruby
+default: false
+```
+Whether boundary expressions should be attached to the margins of every expression in the list. If this value is simply true,
+each item's marginal characters, the first and the last, are tested to see whether they are word characters, and if so the word
+boundary symbol, `\b`, is attached to them where appropriate. There are several variants on this, however:
+```ruby
+bound: :word
+```
+This is the same as `bound: true`.
+```ruby
+List::Matcher.pattern %w(cat), bound: :word   # "(?:\\bcat\\b)"
+List::Matcher.pattern %w(cat), bound: true    # "(?:\\bcat\\b)"
+```
+```ruby
+bound: :line
+```
+Each item should take up an entire line, so the boundary symbols are `^` and `$`.
+```ruby
+List::Matcher.pattern %w(cat), bound: :line   # "(?:^cat$)"
+```
+```ruby
+bound: :string
+```
+Each item should match the entire string compared against, so the boundary symbols are `\A` and `\z`.
+```ruby
+List::Matcher.pattern %w(cat), bound: :string   # "(?:\\Acat\\z)"
+```
+```ruby
+bound: { test: /\d/, left: '(?<!\d)', right: '(?!\d)'}
+```
+If you have an ad hoc boundary definition -- here it is a digit/non-digit boundary -- you may specify it so. The test parameter
+identifies marginal characters that require the boundary tests and the `:left` and `:right` symbols identify the boundary conditions.
+```ruby
+List::Matcher.pattern (1...1000).to_a, bound: { test: /\d/, left: '(?<!\d)', right: '(?!\d)'}
+# "(?:(?<!\\d)[1-9](?:\\d\\d?)?(?!\\d))"
+```
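The boundary variants differ in what they accept around a match. A short sketch with hand-written expressions (not generated by the gem) makes the differences concrete; `boundary_examples` is a hypothetical helper that just returns the four kinds of anchored regex.

```ruby
# Hand-written expressions using the boundary symbols described in this section.
def boundary_examples
  {
    word:   /\bcat\b/,           # word boundaries
    line:   /^cat$/,             # the item fills a whole line
    string: /\Acat\z/,           # the item fills the whole string
    digits: /(?<!\d)42(?!\d)/    # ad hoc digit/non-digit boundary
  }
end

rx = boundary_examples
p rx[:word].match?('a cat sat')          # true
p rx[:word].match?('concatenate')        # false
p rx[:line].match?("dog\ncat\nrat")      # true: "cat" is one full line
p rx[:string].match?("dog\ncat\nrat")    # false: the string holds more than "cat"
p rx[:digits].match?('x42y')             # true: \b would fail here, since x and y are word characters
p rx[:digits].match?('1421')             # false: 42 is embedded in a longer number
```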
+### strip
+```ruby
+default: false
+```
+Strip whitespace off the margins of items in the list.
+```ruby
+List::Matcher.pattern ['     cat     ']                # "(?:(?:\\ ){5}cat(?:\\ ){5})"
+List::Matcher.pattern ['     cat     '], strip: true   # "(?:cat)"
+```
+### case_insensitive
+```ruby
+default: false
+```
+Generate a case-insensitive regular expression.
+```ruby
+List::Matcher.pattern %w( Cat cat CAT )                           # "(?:C(?:AT|at)|cat)"
+List::Matcher.pattern %w( Cat cat CAT ), case_insensitive: true   # "(?i:cat)"
+```
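Because the flag is carried inside the group as `(?i:...)`, it stays scoped to that group when the pattern string is embedded in a larger expression. A small sketch using only the standard `Regexp` class (`scoped_case_rx` is a hypothetical helper):

```ruby
# Case-insensitivity applies only inside the (?i:...) group; "fish" stays case-sensitive.
def scoped_case_rx
  Regexp.new('\A(?i:cat)fish\z')
end

p scoped_case_rx.match?('CATfish')   # true
p scoped_case_rx.match?('catFISH')   # false
```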
+### multiline
+```ruby
+default: false
+```
+Generate a multi-line regex.
+```ruby
+List::Matcher.pattern %w(cat), multiline: true   # "(?m:cat)"
+```
+The special feature of a multi-line regular expression is that `.` can grab newline characters. Because `List::Matcher`
+never produces `.` on its own, this option is only useful in conjunction with the `symbols` option, which lets one
+inject snippets of regex into the one generated.
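The effect of the `m` flag on `.` can be checked directly in Ruby. In this sketch the fragment `cat.dog` stands in for the kind of snippet a `symbols` entry might inject; `dot_rx` is a hypothetical helper.

```ruby
# Wrap a fragment containing "." with or without the multi-line flag.
def dot_rx(multiline)
  fragment = 'cat.dog'
  Regexp.new(multiline ? "(?m:#{fragment})" : "(?:#{fragment})")
end

p dot_rx(false).match?("cat\ndog")   # false: "." does not match a newline
p dot_rx(true).match?("cat\ndog")    # true: with (?m:...) it does
```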
+### normalize_whitespace
+```ruby
+default: false
+```
+This strips whitespace from items in the list and treats all internal whitespace as equivalent.
+```ruby
+List::Matcher.pattern [ ' cat  walker ', '  dog walker', 'camel  walker' ]
+# "(?:\\ (?:\\ dog\\ walker|cat\\ \\ walker\\ )|camel\\ \\ walker)"
+List::Matcher.pattern [ ' cat  walker ', '  dog walker', 'camel  walker' ], normalize_whitespace: true
+# "(?:(?:ca(?:mel|t)|dog)\\s++walker)"
+```
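For a single phrase, the normalization amounts to trimming the margins and replacing each internal run of whitespace with `\s++`. The following is a rough sketch of that idea only, not the gem's implementation; `whitespace_insensitive` is a hypothetical helper.

```ruby
# Trim the phrase, split on whitespace runs, and rejoin with a possessive \s++.
def whitespace_insensitive(phrase)
  phrase.strip.split(/\s+/).map { |word| Regexp.escape(word) }.join('\s++')
end

fragment = whitespace_insensitive(' cat  walker ')
puts fragment                                            # cat\s++walker
p Regexp.new("\\A#{fragment}\\z").match?("cat \t walker")  # true
```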
+### symbols
+You can tell `List::Matcher` that certain character sequences should be regarded as "symbols". It will then leave
+these unmolested, replacing them in the generated regex with whatever you map the symbol sequences to. The keys in
+the symbol hash are expected to be strings, symbols, or `Regexps`. Symbol keys are converted to their
+sequence by stringification. `Regexp` keys convert any sequence they match.
+```ruby
+List::Matcher.pattern [ 'Catch 22', '1984', 'Fahrenheit 451' ], symbols: { /\d+/ => '\d++' }
+# "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
+List::Matcher.pattern [ 'Catch foo', 'foo', 'Fahrenheit foo' ], symbols: { 'foo' => '\d++' }
+# "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
+List::Matcher.pattern [ 'Catch foo', 'foo', 'Fahrenheit foo' ], symbols: { foo: '\d++' }
+# "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
+```
+Because it is possible for symbol sequences to overlap, sequences with string or symbol keys are evaluated before `Regexps`, and longer keys are
+evaluated before shorter ones.
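The "longer keys first" rule can be sketched for string and symbol keys as follows. This is an illustration under simplifying assumptions, not the gem's code: `substitute_symbols` is a hypothetical helper, it ignores `Regexp` keys entirely, and it works on one item at a time rather than on the trie.

```ruby
# Replace symbol sequences in an item with their regex snippets,
# escaping everything else. Longer keys win over shorter overlapping ones.
def substitute_symbols(item, symbols)
  lookup   = symbols.each_with_object({}) { |(key, value), h| h[key.to_s] = value }
  keys     = lookup.keys.sort_by { |key| -key.length }
  splitter = Regexp.union(keys) # string keys are escaped; earlier alternatives are tried first

  out  = String.new
  rest = item
  until rest.empty?
    before, hit, rest = rest.partition(splitter)
    out << Regexp.escape(before)
    out << lookup[hit] unless hit.empty?
  end
  out
end

puts substitute_symbols('Catch foo', { foo: '\d++' })          # Catch\ \d++
puts substitute_symbols('foobar', { foo: 'X', foobar: 'Y' })   # Y
```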
+### name
+If you assign your pattern a name, it will be constructed with a named group such that you can extract
+the substring matched.
+```ruby
+List::Matcher.pattern %w(cat), name: :cat   # "(?<cat>cat)"
+```
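Extracting the named substring works through Ruby's standard `MatchData` interface. A minimal sketch, with `named_capture` as a hypothetical helper and the pattern strings written by hand in the shape shown above:

```ruby
# Match text against a pattern string and return the substring captured under `name`.
# Returns nil when the pattern does not match at all.
def named_capture(fragment, text, name)
  match = Regexp.new(fragment).match(text)
  match && match[name]
end

p named_capture('(?<cat>cat)', 'a cat sat', :cat)               # "cat"
p named_capture('(?<pet>cat|dog) (?<verb>sat|ran)', 'the dog ran', :verb)   # "ran"
```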
+This is mostly useful if you are using `List::Matcher` to compose complex regexen incrementally. E.g., from the examples directory,
+```ruby
+require 'list_matcher'
+m = List::Matcher.new atomic: false, bound: true
+year      = m.pattern( (1901..2000).to_a, name: :year )
+mday      = m.pattern( (1..31).to_a, name: :mday )
+weekdays  = %w( Monday Tuesday Wednesday Thursday Friday Saturday Sunday )
+weekdays += weekdays.map{ |w| w[0...3] }
+wday      = m.pattern weekdays, case_insensitive: true, name: :wday
+months    = %w( January February March April May June July August September October November December )
+months   += months.map{ |w| w[0...3] }
+mo        = m.pattern months, case_insensitive: true, name: :mo
+date_20th_century = m.rx(
+  [
+    'wday, mo mday',
+    'wday, mo mday year',
+    'mo mday, year',
+    'mo year',
+    'mday mo year',
+    'wday',
+    'year',
+    'mday mo',
+    'mo mday',
+    'mo mday year'
+  ],
+  normalize_whitespace: true,
+  atomic: true,
+  symbols: {
+    year: year,
+    mday: mday,
+    wday: wday,
+    mo:   mo
+  }
+)
+[
+  'Friday',
+  'August 27',
+  'May 6, 1969',
+  '1 Jan 2000',
+  'this is not actually a date'
+].each do |candidate|
+  if m = date_20th_century.match(candidate)
+    puts "candidate: #{candidate}; year: #{m[:year]}; month: #{m[:mo]}; weekday: #{m[:wday]}; day of the month: #{m[:mday]}"
+  else
+    puts "#{candidate} does not look like a plausible date in the 20th century"
+  end
+end
+```
+### vet
+```ruby
+default: false
+```
+If true, all patterns associated with symbols will be tested upon initialization to make sure they will
+create legitimate regular expressions. If you are prone to doing this, for example:
+```ruby
+List::Matcher.new symbols: { aw_nuts: '+++' }
+```
+then you may want to vet your symbols. Vetting is not done by default because one assumes you've worked out
+your substitutions on your own time and we need not waste runtime checking them.
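One way to picture vetting is to compile each symbol's snippet on its own and collect the ones that fail. The sketch below is an assumption about the general approach rather than the gem's code; `invalid_symbols` is a hypothetical helper built on the standard `Regexp.new` and `RegexpError`.

```ruby
# Try to compile every snippet; return a hash of symbol name => error message
# for the ones that are not legitimate regular expressions.
def invalid_symbols(symbols)
  symbols.each_with_object({}) do |(name, fragment), bad|
    begin
      Regexp.new(fragment)
    rescue RegexpError => e
      bad[name] = e.message
    end
  end
end

problems = invalid_symbols({ aw_nuts: '+++', num: '\d++' })
p problems.keys   # [:aw_nuts]
```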
+## Benchmarks
+Efficiency isn't the principal purpose of List::Matcher, but in almost all cases List::Matcher
+regular expressions are more efficient than a regular expression generated by simply joining alternates
+with `|`. The following results were extracted from the output of the benchmark script included with this
+distribution. Sets are provided as a baseline for comparison, though there are many things one can do
+with a regular expression that one cannot do with a set.
+```
+RANDOM WORDS, VARIABLE LENGTH
+number of words: 100
+            set good:    53360.1 i/s
+  List::Matcher good:    22211.7 i/s - 2.40x slower
+      simple rx good:    13086.6 i/s - 4.08x slower
+           list good:     4748.0 i/s - 11.24x slower
+             set bad:    57387.1 i/s
+   List::Matcher bad:    14398.7 i/s - 3.99x slower
+       simple rx bad:     7347.1 i/s - 7.81x slower
+            list bad:     2583.1 i/s - 22.22x slower
+number of words: 1000
+            set good:     5380.5 i/s
+  List::Matcher good:     1665.3 i/s - 3.23x slower
+      simple rx good:      166.7 i/s - 32.27x slower
+           list good:       52.8 i/s - 101.98x slower
+             set bad:     5294.8 i/s
+   List::Matcher bad:     1061.1 i/s - 4.99x slower
+       simple rx bad:       81.0 i/s - 65.34x slower
+            list bad:       26.1 i/s - 202.51x slower
+number of words: 10000
+            set good:      361.3 i/s
+  List::Matcher good:      146.4 i/s - 2.47x slower
+      simple rx good:        1.7 i/s - 210.46x slower
+           list good:        0.4 i/s - 1027.74x slower
+             set bad:      370.3 i/s
+   List::Matcher bad:       82.2 i/s - 4.51x slower
+       simple rx bad:        0.8 i/s - 447.85x slower
+            list bad:        0.2 i/s - 1882.35x slower
+FIXED LENGTH, FULL RANGE
+number of words: 10; List::Matcher rx: (?-mix:\A\d\z)
+            set good:   520144.5 i/s
+  List::Matcher good:   382968.0 i/s - 1.36x slower
+           list good:   323052.6 i/s - 1.61x slower
+      simple rx good:   316058.3 i/s - 1.65x slower
+             set bad:   624424.8 i/s
+   List::Matcher bad:   270882.3 i/s - 2.31x slower
+       simple rx bad:   266277.3 i/s - 2.35x slower
+            list bad:   175058.3 i/s - 3.57x slower
+number of words: 100; List::Matcher rx: (?-mix:\A\d\d\z)
+        set creation:       20.3 i/s
+  simple rx creation:       15.9 i/s - 1.28x slower
+List::Matcher creation:       15.9 i/s - 1.28x slower
+            set good:    52058.4 i/s
+  List::Matcher good:    41841.7 i/s - 1.24x slower
+      simple rx good:    15095.6 i/s - 3.45x slower
+           list good:     4350.1 i/s - 11.97x slower
+             set bad:    59315.4 i/s
+       simple rx bad:    28063.6 i/s - 2.11x slower
+   List::Matcher bad:    27823.9 i/s - 2.13x slower
+            list bad:     2083.9 i/s - 28.46x slower
+number of words: 1000; List::Matcher rx: (?-mix:\A\d{3}\z)
+        set creation:        2.1 i/s
+List::Matcher creation:        1.5 i/s - 1.40x slower
+  simple rx creation:        1.5 i/s - 1.41x slower
+            set good:     4664.2 i/s
+  List::Matcher good:     3514.1 i/s - 1.33x slower
+      simple rx good:      225.6 i/s - 20.67x slower
+           list good:       44.2 i/s - 105.57x slower
+             set bad:     5830.5 i/s
+       simple rx bad:     2802.5 i/s - 2.08x slower
+   List::Matcher bad:     2717.0 i/s - 2.15x slower
+            list bad:       20.0 i/s - 291.10x slower
+number of words: 10000; List::Matcher rx: (?-mix:\A\d{4}\z)
+        set creation:        0.2 i/s
+  simple rx creation:        0.1 i/s - 1.21x slower
+List::Matcher creation:        0.1 i/s - 1.31x slower
+            set good:      369.4 i/s
+  List::Matcher good:      326.2 i/s - 1.13x slower
+      simple rx good:        2.3 i/s - 159.07x slower
+           list good:        0.4 i/s - 966.48x slower
+             set bad:      426.6 i/s
+       simple rx bad:      285.6 i/s - 1.49x slower
+   List::Matcher bad:      277.1 i/s - 1.54x slower
+            list bad:        0.2 i/s - 2236.24x slower
+```
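The three baselines in these results (set, simple rx, list) can be reproduced with the standard library alone. The sketch below is not the benchmark script shipped with the distribution: `membership_tests` and `time_tests` are hypothetical helpers, timing uses `Benchmark.realtime` rather than iterations per second, and the `List::Matcher` case is omitted so that the snippet runs without the gem. The "good" and "bad" rows presumably correspond to probes that are in the list and probes that are not.

```ruby
require 'set'
require 'benchmark'

# Three ways of asking "is this word one of the items?"
def membership_tests(words)
  set = Set.new(words)
  rx  = Regexp.new("\\A(?:#{Regexp.union(words).source})\\z") # alternates joined with |
  {
    'set'       => ->(w) { set.include?(w) },
    'simple rx' => ->(w) { rx.match?(w) },
    'list'      => ->(w) { words.include?(w) }
  }
end

# Wall-clock seconds each approach needs to test every probe `rounds` times.
def time_tests(words, probes, rounds = 100)
  membership_tests(words).map do |label, test|
    seconds = Benchmark.realtime { rounds.times { probes.each { |w| test.call(w) } } }
    [label, seconds]
  end.to_h
end

words = (1..1000).map { |i| "word#{i}" }
time_tests(words, words.first(50)).each do |label, seconds|
  puts format('%10s: %.4f s', label, seconds)
end
```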
+## Contributing
+1. Fork it ( https://github.com/[my-github-username]/list_matcher/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create a new Pull Request

data/Rakefile ADDED

@@ -0,0 +1,9 @@
+require "bundler/gem_tasks"
+require "rake/testtask"
+Rake::TestTask.new do |t|
+  t.test_files = FileList['test/*_test.rb']
+end
+task default: :test

data/examples/date_grammar.rb ADDED

@@ -0,0 +1,49 @@
+require 'list_matcher'
+m = List::Matcher.new atomic: false, bound: true
+year      = m.pattern( (1901..2000).to_a, name: :year )
+mday      = m.pattern( (1..31).to_a, name: :mday )
+weekdays  = %w( Monday Tuesday Wednesday Thursday Friday Saturday Sunday )
+weekdays += weekdays.map{ |w| w[0...3] }
+wday      = m.pattern weekdays, case_insensitive: true, name: :wday
+months    = %w( January February March April May June July August September October November December )
+months   += months.map{ |w| w[0...3] }
+mo        = m.pattern months, case_insensitive: true, name: :mo
+date_20th_century = m.rx(
+  [
+    'wday, mo mday',
+    'wday, mo mday year',
+    'mo mday, year',
+    'mo year',
+    'mday mo year',
+    'wday',
+    'year',
+    'mday mo',
+    'mo mday',
+    'mo mday year'
+  ],
+  normalize_whitespace: true,
+  atomic: true,
+  symbols: {
+    year: year,
+    mday: mday,
+    wday: wday,
+    mo:   mo
+  }
+)
+[
+  'Friday',
+  'August 27',
+  'May 6, 1969',
+  '1 Jan 2000',
+  'this is not actually a date'
+].each do |candidate|
+  if m = date_20th_century.match(candidate)
+    puts "candidate: #{candidate}; year: #{m[:year]}; month: #{m[:mo]}; weekday: #{m[:wday]}; day of the month: #{m[:mday]}"
+  else
+    puts "#{candidate} does not look like a plausible date in the 20th century"
+  end
+end

data/lib/list_matcher/version.rb ADDED

@@ -0,0 +1,3 @@
+module ListMatcher
+  VERSION = "1.0.0"
+end