RubyGems - list_matcher - Versions diffs - 1.0.0 - Mend

list_matcher 1.0.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (14) hide show

checksums.yaml ADDED Viewed

@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: a4f9013c830b1d516bbf895732abce4ee0fd327f
+  data.tar.gz: a6f49385a4dc17ffe42fe706bfce7e2cef2e4135
+SHA512:
+  metadata.gz: d97d5937836d59422ec61c8f4daa1e859ea8b62aa8598f74d875b7dbde4426a6d24368cd94ed0936baedb5dde357cfc10492ab173fe6f7e0b67748856aabf7f8
+  data.tar.gz: 343cbde71b7cb693b4dc67965a5334a5cdc05ebcf34d103bfd0be21633abd80aa6950c6718360d0a05fbea7a55ff3e553edcfdaa7d06f0777773a9e623a162a6

data/.gitignore ADDED Viewed

@@ -0,0 +1,14 @@
+/.bundle/
+/.yardoc
+/Gemfile.lock
+/_yardoc/
+/coverage/
+/doc/
+/pkg/
+/spec/reports/
+/tmp/
+*.bundle
+*.so
+*.o
+*.a
+mkmf.log

data/Gemfile ADDED Viewed

@@ -0,0 +1,4 @@
+source 'https://rubygems.org'
+# Specify your gem's dependencies in list_matcher.gemspec
+gemspec

data/LICENSE.txt ADDED Viewed

@@ -0,0 +1,22 @@
+Copyright (c) 2015 dfhoughton
+MIT License
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED Viewed

@@ -0,0 +1,444 @@
+# ListMatcher
+For creating compact, non-backtracking regular expressions from a list of strings.
+## Installation
+Add this line to your application's Gemfile:
+```ruby
+gem 'list_matcher'
+```
+And then execute:
+    $ bundle
+Or install it yourself as:
+    $ gem install list_matcher
+## Usage
+```ruby
+require 'list_matcher'
+m = List::Matcher.new
+puts m.pattern %w( cat dog )                               # (?:cat|dog)
+puts m.pattern %w( cat rat )                               # (?:[cr]at)
+puts m.pattern %w( cat camel )                             # (?:ca(?:mel|t))
+puts m.pattern %w( cat flat sprat )                        # (?:(?:c|fl|spr)at)
+puts m.pattern %w( catttttttttt )                          # (?:cat{10})
+puts m.pattern %w( cat-t-t-t-t-t-t-t-t-t )                 # (?:ca(?:t-){9}t)
+puts m.pattern %w( catttttttttt batttttttttt )             # (?:[bc]at{10})
+puts m.pattern %w( cad bad dad )                           # (?:[b-d]ad)
+puts m.pattern %w( cat catalog )                           # (?:cat(?:alog)?+)
+puts m.pattern (1..31).to_a                                # (?:[4-9]|1\d?+|2\d?+|3[01]?+)
+```
+## Description
+`List::Matcher` facilitates generating efficient regexen programmatically. This is useful, for example, when looking for
+occurrences of particular words or phrases in free-form text. `List::Matcher` will automatically generate regular expressions
+that minimize backtracking, so they tend to be as fast as one could hope a regular expression to be. (The general strategy is
+to represent the items in the list as a trie.)
+`List::Matcher` has many options and the initialization of a matcher for pattern generation is somewhat complex, so various methods
+are provided to minimize initializations and the number of times you specify options. For one-off patterns, you may as well call
+class methods, either `pattern` which generates a string, or `rx`, which returns a `Regexp` object:
+```ruby
+List::Matcher.pattern %( cat dog )   # "(?:cat|dog)"
+List::Matcher.rx      %( cat dog )   # /(?:cat|dog)/
+```
+If you plan to generate multiple regexen, or have complicated options which you always use, you should generate a configured
+instance first:
+```ruby
+m = List::Matcher.new normalize_whitespace: true, bound: true, case_insensitive: true, multiline: true, atomic: false, symbols: { num: '\d++' }
+m.pattern method_that_gets_a_long_list
+m.rx      method_that_gets_a_long_list
+...
+```
+If you have a basic set of options and you need to modify these in particular cases, you can:
+```ruby
+m.pattern list, case_insensitive: false
+```
+You can also generate a prototype list matcher with a particular variation and bud off children with their own properties:
+```ruby
+m  = List::Matcher.new normalize_whitespace: true, bound: true, case_insensitive: true, multiline: true, atomic: false, symbols: { num: '\d++' }
+m2 = m.bud case_insensitive: false
+```
+Basically, you can mix in options in whatever way suits you. Constructing configured instances gives you a tiny bit of efficiency, but
+mostly it saves you from specifying these options in multiple places.
+## Options
+The one can provide to `new`, `bud`, `pattern`, or `rx` are all the same. These are
+### atomic
+```ruby
+default: true
+```
+If true, the returned expression is always wrapped in some grouping expression -- `(?:...)`, `(?>...)`, `(?i:...)`, etc.; whatever
+is appropriate given the other options and defaults -- so it can receive a quantification suffix.
+```ruby
+List::Matcher.pattern %w(cat dog), atomic: false   # "cat|dog"
+List::Matcher.pattern %w(cat dog), atomic: true    # "(?:cat|dog)"
+```
+### backtracking
+```ruby
+default: true
+```
+If true, the default non-capturing grouping expression is `(?:...)` rather than `(?>...)`, and the optional quantifier is
+`?` rather than `?+`.
+```ruby
+List::Matcher.pattern %w( cat dog )                        # "(?:cat|dog)"
+List::Matcher.pattern %w( cat dog ), backtracking: false   # "(?>cat|dog)"
+```
+### bound
+```ruby
+default: false
+```
+Whether boundary expressions should be attached to the margins of every expression in the list. If this value is simply true, this means
+each items marginal characters, the first and the last, are tested to see whether they are word characters and if so the word
+boundary symbol, `\b`, is appended to them where appropriate. There are several variants on this, however:
+```ruby
+bound: :word
+```
+This is the same as `bound: true`.
+```ruby
+List::Matcher.pattern %w(cat), bound: :word   # "(?:\\bcat\\b)"
+List::Matcher.pattern %w(cat), bound: true    # "(?:\\bcat\\b)"
+```
+```ruby
+bound: :line
+```
+Each item should take up an entire line, so the boundary symbols are `^` and `$`.
+```ruby
+List::Matcher.pattern %w(cat), bound: :line   # "(?:^cat$)"
+```
+```ruby
+bound: :string
+```
+Each item should match the entire string compared against, so the boundary symbols are `\A` and `\z`.
+```ruby
+List::Matcher.pattern %w(cat), bound: :string   # "(?:\\Acat\\z)"
+```
+```ruby
+bound: { test: /\d/, left: '(?<!\d)', right: '(?!\d)'}
+```
+If you have an ad hoc boundary definition -- here it is a digit/non-digit boundary -- you may specify it so. The test parameter
+identifies marginal characters that require the boundary tests and the `:left` and `:right` symbols identify the boundary conditions.
+```ruby
+List::Matcher.pattern (1...1000).to_a, bound: { test: /\d/, left: '(?<!\d)', right: '(?!\d)'}
+# "(?:(?<!\\d)[1-9](?:\\d\\d?)?(?!\\d))"
+```
+### strip
+```ruby
+default: false
+```
+Strip whitespace off the margins of items in the list.
+```ruby
+List::Matcher.pattern ['     cat     ']                # "(?:(?:\\ ){5}cat(?:\\ ){5})"
+List::Matcher.pattern ['     cat     '], strip: true   # "(?:cat)"
+```
+### case_insensitive
+```ruby
+default: false
+```
+Generate a case-insensitive regular expression.
+```ruby
+List::Matcher.pattern %w( Cat cat CAT )                           # "(?:C(?:AT|at)|cat)"
+List::Matcher.pattern %w( Cat cat CAT ), case_insensitive: true   # "(?i:cat)"
+```
+### multiline
+```ruby
+default: false
+```
+Generate a multi-line regex.
+```ruby
+List::Matcher.pattern %w(cat), multiline: true   # "(?m:cat)"
+```
+The special feature of a multi-line regular expression is that `.` can grab newline characters. Because `List::Matcher`
+never produces `.` on its own, this option is only useful in conjunction with the `symbols` option, which lets one
+inject snippets of regex into the one generated.
+### normalize_whitespace
+```ruby
+default: false
+```
+This strips whitespace from items in the list and treats all internal whitespace as equivalent.
+```ruby
+List::Matcher.pattern [ ' cat  walker ', '  dog walker', 'camel  walker' ]
+# "(?:\\ (?:\\ dog\\ walker|cat\\ \\ walker\\ )|camel\\ \\ walker)"
+List::Matcher.pattern [ ' cat  walker ', '  dog walker', 'camel  walker' ], normalize_whitespace: true
+# "(?:(?:ca(?:mel|t)|dog)\\s++walker)"
+```
+### symbols
+You can tell `List::Matcher` that certain character sequences should be regarded as "symbols". It will then leave
+these unmolested, replacing them in the generated regex with whatever you map the symbol sequences to. The keys in
+the symbol hash are expected to be strings, symbols, or `Regexps`. Symbol keys are converted to their
+sequence by stringification. `Regexp` keys convert any sequence they match.
+```ruby
+List::Matcher.pattern [ 'Catch 22', '1984', 'Fahrenheit 451' ], symbols: { /\d+/ => '\d++' }
+# "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
+List::Matcher.pattern [ 'Catch foo', 'foo', 'Fahrenheit foo' ], symbols: { 'foo' => '\d++' }
+# "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
+List::Matcher.pattern [ 'Catch foo', 'foo', 'Fahrenheit foo' ], symbols: { foo: '\d++' }
+# "(?:(?:(?:Catch|Fahrenheit)\\ )?\\d++)"
+```
+Because it is possible for symbol sequences to overlap, sequences with string or symbol keys are evaluated before `Regexps`, and longer keys are
+evaluated before shorter ones.
+### name
+If you assign your pattern a name, it will be constructed with a named group such that you can extract
+the substring matched.
+```ruby
+List::Matcher.pattern %w(cat), name: :cat   # "(?<cat>cat)"
+```
+This is mostly useful if you are using `List::Matcher` to compose complex regexen incrementally. E.g., from the examples directory,
+```ruby
+require 'list_matcher'
+m = List::Matcher.new atomic: false, bound: true
+year      = m.pattern( (1901..2000).to_a, name: :year )
+mday      = m.pattern( (1..31).to_a, name: :mday )
+weekdays  = %w( Monday Tuesday Wednesday Thursday Friday Saturday Sunday )
+weekdays += weekdays.map{ |w| w[0...3] }
+wday      = m.pattern weekdays, case_insensitive: true, name: :wday
+months    = %w( January February March April May June July August September October November December )
+months   += months.map{ |w| w[0...3] }
+mo        = m.pattern months, case_insensitive: true, name: :mo
+date_20th_century = m.rx(
+  [
+    'wday, mo mday',
+    'wday, mo mday year',
+    'mo mday, year',
+    'mo year',
+    'mday mo year',
+    'wday',
+    'year',
+    'mday mo',
+    'mo mday',
+    'mo mday year'
+  ],
+  normalize_whitespace: true,
+  atomic: true,
+  symbols: {
+    year: year,
+    mday: mday,
+    wday: wday,
+    mo:   mo
+  }
+)
+[
+  'Friday',
+  'August 27',
+  'May 6, 1969',
+  '1 Jan 2000',
+  'this is not actually a date'
+].each do |candidate|
+  if m = date_20th_century.match(candidate)
+    puts "candidate: #{candidate}; year: #{m[:year]}; month: #{m[:mo]}; weekday: #{m[:wday]}; day of the month: #{m[:mday]}"
+  else
+    puts "#{candidate} does not look like a plausible date in the 20th century"
+  end
+end
+```
+### vet
+```ruby
+default: false
+```
+If true, all patterns associated with symbols will be tested upon initialization to make sure they will
+create legitimate regular expressions. If you are prone to doing this, for example:
+```ruby
+List::Matcher.new symbols: { aw_nuts: '+++' }
+```
+then you may want to vet your symbols. Vetting is not done by default because one assumes you've worked out
+your substitutions on your own time and we need not waste runtime checking them.
+## Benchmarks
+Efficiency isn't the principle purpose of List::Matcher, but in almost all cases List::Matcher
+regular expressions are more efficient than a regular expression generated by simply joining alternates
+with `|`. The following results were extracted from the output of the benchmark script included with this
+distribution. Sets are provided as a baseline for comparison, though there are many things one can do
+with a regular expression that one cannot do with a set.
+```
+RANDOM WORDS, VARIABLE LENGTH
+number of words: 100
+            set good:    53360.1 i/s
+  List::Matcher good:    22211.7 i/s - 2.40x slower
+      simple rx good:    13086.6 i/s - 4.08x slower
+           list good:     4748.0 i/s - 11.24x slower
+             set bad:    57387.1 i/s
+   List::Matcher bad:    14398.7 i/s - 3.99x slower
+       simple rx bad:     7347.1 i/s - 7.81x slower
+            list bad:     2583.1 i/s - 22.22x slower
+number of words: 1000
+            set good:     5380.5 i/s
+  List::Matcher good:     1665.3 i/s - 3.23x slower
+      simple rx good:      166.7 i/s - 32.27x slower
+           list good:       52.8 i/s - 101.98x slower
+             set bad:     5294.8 i/s
+   List::Matcher bad:     1061.1 i/s - 4.99x slower
+       simple rx bad:       81.0 i/s - 65.34x slower
+            list bad:       26.1 i/s - 202.51x slower
+number of words: 10000
+            set good:      361.3 i/s
+  List::Matcher good:      146.4 i/s - 2.47x slower
+      simple rx good:        1.7 i/s - 210.46x slower
+           list good:        0.4 i/s - 1027.74x slower
+             set bad:      370.3 i/s
+   List::Matcher bad:       82.2 i/s - 4.51x slower
+       simple rx bad:        0.8 i/s - 447.85x slower
+            list bad:        0.2 i/s - 1882.35x slower
+FIXED LENGTH, FULL RANGE
+number of words: 10; List::Matcher rx: (?-mix:\A\d\z)
+            set good:   520144.5 i/s
+  List::Matcher good:   382968.0 i/s - 1.36x slower
+           list good:   323052.6 i/s - 1.61x slower
+      simple rx good:   316058.3 i/s - 1.65x slower
+             set bad:   624424.8 i/s
+   List::Matcher bad:   270882.3 i/s - 2.31x slower
+       simple rx bad:   266277.3 i/s - 2.35x slower
+            list bad:   175058.3 i/s - 3.57x slower
+number of words: 100; List::Matcher rx: (?-mix:\A\d\d\z)
+        set creation:       20.3 i/s
+  simple rx creation:       15.9 i/s - 1.28x slower
+List::Matcher creation:       15.9 i/s - 1.28x slower
+            set good:    52058.4 i/s
+  List::Matcher good:    41841.7 i/s - 1.24x slower
+      simple rx good:    15095.6 i/s - 3.45x slower
+           list good:     4350.1 i/s - 11.97x slower
+             set bad:    59315.4 i/s
+       simple rx bad:    28063.6 i/s - 2.11x slower
+   List::Matcher bad:    27823.9 i/s - 2.13x slower
+            list bad:     2083.9 i/s - 28.46x slower
+number of words: 1000; List::Matcher rx: (?-mix:\A\d{3}\z)
+        set creation:        2.1 i/s
+List::Matcher creation:        1.5 i/s - 1.40x slower
+  simple rx creation:        1.5 i/s - 1.41x slower
+            set good:     4664.2 i/s
+  List::Matcher good:     3514.1 i/s - 1.33x slower
+      simple rx good:      225.6 i/s - 20.67x slower
+           list good:       44.2 i/s - 105.57x slower
+             set bad:     5830.5 i/s
+       simple rx bad:     2802.5 i/s - 2.08x slower
+   List::Matcher bad:     2717.0 i/s - 2.15x slower
+            list bad:       20.0 i/s - 291.10x slower
+number of words: 10000; List::Matcher rx: (?-mix:\A\d{4}\z)
+        set creation:        0.2 i/s
+  simple rx creation:        0.1 i/s - 1.21x slower
+List::Matcher creation:        0.1 i/s - 1.31x slower
+            set good:      369.4 i/s
+  List::Matcher good:      326.2 i/s - 1.13x slower
+      simple rx good:        2.3 i/s - 159.07x slower
+           list good:        0.4 i/s - 966.48x slower
+             set bad:      426.6 i/s
+       simple rx bad:      285.6 i/s - 1.49x slower
+   List::Matcher bad:      277.1 i/s - 1.54x slower
+            list bad:        0.2 i/s - 2236.24x slower
+```
+## Contributing
+1. Fork it ( https://github.com/[my-github-username]/list_matcher/fork )
+2. Create your feature branch (`git checkout -b my-new-feature`)
+3. Commit your changes (`git commit -am 'Add some feature'`)
+4. Push to the branch (`git push origin my-new-feature`)
+5. Create a new Pull Request

data/Rakefile ADDED Viewed

@@ -0,0 +1,9 @@
+require "bundler/gem_tasks"
+require "rake/testtask"
+Rake::TestTask.new do |t|
+  t.test_files = FileList['test/*_test.rb']
+end
+task default: :test

data/examples/date_grammar.rb ADDED Viewed

@@ -0,0 +1,49 @@
+require 'list_matcher'
+m = List::Matcher.new atomic: false, bound: true
+year      = m.pattern( (1901..2000).to_a, name: :year )
+mday      = m.pattern( (1..31).to_a, name: :mday )
+weekdays  = %w( Monday Tuesday Wednesday Thursday Friday Saturday Sunday )
+weekdays += weekdays.map{ |w| w[0...3] }
+wday      = m.pattern weekdays, case_insensitive: true, name: :wday
+months    = %w( January February March April May June July August September October November December )
+months   += months.map{ |w| w[0...3] }
+mo        = m.pattern months, case_insensitive: true, name: :mo
+date_20th_century = m.rx(
+  [
+    'wday, mo mday',
+    'wday, mo mday year',
+    'mo mday, year',
+    'mo year',
+    'mday mo year',
+    'wday',
+    'year',
+    'mday mo',
+    'mo mday',
+    'mo mday year'
+  ],
+  normalize_whitespace: true,
+  atomic: true,
+  symbols: {
+    year: year,
+    mday: mday,
+    wday: wday,
+    mo:   mo
+  }
+)
+[
+  'Friday',
+  'August 27',
+  'May 6, 1969',
+  '1 Jan 2000',
+  'this is not actually a date'
+].each do |candidate|
+  if m = date_20th_century.match(candidate)
+    puts "candidate: #{candidate}; year: #{m[:year]}; month: #{m[:mo]}; weekday: #{m[:wday]}; day of the month: #{m[:mday]}"
+  else
+    puts "#{candidate} does not look like a plausible date in the 20th century"
+  end
+end

data/lib/list_matcher/version.rb ADDED Viewed

@@ -0,0 +1,3 @@
+module ListMatcher
+  VERSION = "1.0.0"
+end