RubyGems - ruby-boost-regex - Versions diffs - 1.0.1 → 1.1.0 - Mend

ruby-boost-regex 1.0.1 → 1.1.0

Files changed (5)

data/README.markdown +53 -3
data/VERSION +1 -1
data/benchmark/benchmark19.rb +72 -0
data/ext/ruby-boost-regex/regexp.cpp +143 -58
metadata +4 -3

data/README.markdown CHANGED

@@ -13,7 +13,20 @@ normal `Regexp`s, and then add Boost features. So far...
 * Supports `#match` method as in normal Regexps, returning the same
   type of object (`MatchData`).
 * Spiffy monkey patch (see below)
+* Ruby 1.9 Support
+* Near-perfect drop-in for built-in Regexes.
+## Problems
+* NO Unicode support. Sorry.
+## Global Variables
+If you use Ruby's regexes, you probably use the handy-dandy global variables that
+come along with them, especially for quickly-written scripts.  I'm talking about
+our buddies, `$1, $2, $3`, `$\``, `$&`, `$~`. This gem makes sure to set those variables
+so that these global variables still work.
 ## Cool monkey patch
 So monkey patching is bad, right? Right. And rubyists monkey patch all the time,
@@ -53,7 +66,8 @@ if confronted with a non-standard Regexp.  So I had to write my own, which is le
 free to be scrutinized!  It uses the standard benchmark module because lifting the spiffy one from ruby-benchmark-suite
 proved to be too much of a hassle for now.
-Anyway, here's some results:
+Anyway, here's some results.  2.4Ghz Intel Core 2 Duo, 4GB 1067 Mhz DDR3, Macbook Pro. Ruby 1.8.7 (2008-08-11 patchlevel 72) [universal-darwin10.0]. Oniguruma is the Oniguruma gem for 1.8.x ruby, not the one bundled with Ruby 1.9+. I ran this a few
+times with differing loads, and they always roughly come out to the same result. YMMV.
     DNA-Matching (Computer Language Shootout)
     =========================================
@@ -80,6 +94,33 @@ Anyway, here's some results:
     Normal regex   0.070000   0.000000   0.070000 (  0.071984)
     Oniguruma      0.040000   0.000000   0.040000 (  0.044686)
     Boost regex    0.030000   0.000000   0.030000 (  0.036421)
+Ruby 1.9 version:
+    DNA-Matching (Computer Language Shootout)
+    =========================================
+    Rehearsal --------------------------------------------------
+    1.9 Ruby regex  17.440000   0.030000  17.470000 ( 17.523726)
+    Boost regex      9.750000   0.010000   9.760000 (  9.757300)
+    ---------------------------------------- total: 27.230000sec
+                         user     system      total        real
+    1.9 Ruby regex  17.420000   0.020000  17.440000 ( 17.434514)
+    Boost regex      9.750000   0.010000   9.760000 (  9.755085)
+    Failing to match a phone number in a big string of text
+    =======================================================
+    Rehearsal --------------------------------------------------
+    1.9 Ruby regex   0.020000   0.000000   0.020000 (  0.016165)
+    Boost regex      0.040000   0.000000   0.040000 (  0.035585)
+    ----------------------------------------- total: 0.060000sec
+                         user     system      total        real
+    1.9 Ruby regex   0.010000   0.000000   0.010000 (  0.016185)
+    Boost regex      0.040000   0.000000   0.040000 (  0.035736)
+Very interesting - 1.9 seems to have gotten faster at the "match a phone number in a big block of text" benchmark.  Boost is even faster,
+though, and Oniguruma integrated with 1.9 is back at the old 1.9 speed.  I'm hoping to add some kind of XML/HTML-matching benchmark ([Yes, I know this is a bad idea](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). If you know how to benchmark regexps, feel free to [fork away](http://github.com/michaeledgar/ruby-boost-regex/fork)!
 ## Usage
@@ -99,10 +140,19 @@ Install the gem, use as follows:
     r =~ "i'm Mike. Hello!" #==> 10
 ## Installation
+You're going to need Boost.Regex.  You can find the [Boost Project here](http://www.boost.org/), with a
+[Getting Started page](http://www.boost.org/doc/libs/1_42_0/more/getting_started/index.html) for installation.
+Ruby-boost-regex expects you to have the boost regex library available as "libboost\_regex.so" or "libboost\_regex.dylib"
+or whatever your platform specifies - the key point is the name "boost\_regex".
     gem install ruby-boost-regex
+For the Unicode support I'm trying to get in, you'll need IBM's [ICU project](http://site.icu-project.org/) installed,
+with the recommended download page you start from [being ICU4C 4.4](http://icu-project.org/download/4.4.html#ICU4C). In addition,
+your Boost libraries must be [configured to use ICU](http://www.boost.org/doc/libs/1_42_0/libs/regex/doc/html/boost_regex/install.html#boost_regex.install.building_with_unicode_and_icu_support).
 ## Note on Patches/Pull Requests
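
The "Global Variables" section added to the README says the gem keeps `$1`, `$&`, `` $` `` and `$~` working after a match. As a point of reference, here is a minimal sketch of what those variables hold after a match with Ruby's built-in `Regexp`. It deliberately does not load the gem, so it shows only the behaviour the README says `Boost::Regexp` mirrors. The helper name `describe_match` and the pattern are made up for illustration.

```ruby
# Sketch using Ruby's built-in Regexp only (assumption: Boost::Regexp sets
# the same globals, as the README describes; that is not exercised here).
def describe_match(text)
  # =~ returns the match offset, or nil when nothing matched.
  return nil unless /(\w+)\. (\w+)!/ =~ text

  {
    whole:  $&,          # the entire matched text
    pre:    $`,          # everything before the match
    first:  $1,          # first capture group
    second: $2,          # second capture group
    offset: $~.begin(0)  # $~ is the MatchData for the last match
  }
end

info = describe_match("i'm Mike. Hello!")
puts info.inspect
```

These globals are local to the current method frame, which is why the sketch reads them inside the same method that performed the match.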

data/VERSION CHANGED

@@ -1 +1 @@
-1.0.1
+1.1.0

data/benchmark/benchmark19.rb ADDED

@@ -0,0 +1,72 @@
+#!/usr/bin/env ruby
+# The Computer Language Shootout
+# http://shootout.alioth.debian.org/
+#
+# contributed by jose fco. gonzalez
+# Adapted for the Ruby Benchmark Suite.
+require 'benchmark'
+require 'rubygems'
+require 'ruby-boost-regex'
+require 'lorem'
+fname = File.dirname(__FILE__) + "/fasta.input"
+seq = File.read(fname)
+seq.gsub!(/>.*\n|\n/,"")
+def fair_scan(string, regex)
+  result = []
+  idx = -1
+  while idx
+    match = regex.match(string[(idx + 1)..-1])
+    break unless match
+    idx += match.begin(0) + 1
+    result << match.to_a
+  end
+  result
+end
+regexes = [
+  /agggtaaa|tttaccct/i,
+  /[cgt]gggtaaa|tttaccc[acg]/i,
+  /a[act]ggtaaa|tttacc[agt]t/i,
+  /ag[act]gtaaa|tttac[agt]ct/i,
+  /agg[act]taaa|ttta[agt]cct/i,
+  /aggg[acg]aaa|ttt[cgt]ccct/i,
+  /agggt[cgt]aa|tt[acg]accct/i,
+  /agggta[cgt]a|t[acg]taccct/i,
+  /agggtaa[cgt]|[acg]ttaccct/i
+  ]
+boost_regexes = [
+  Boost::Regexp.new('agggtaaa|tttaccct',         Boost::Regexp::IGNORECASE),
+  Boost::Regexp.new('[cgt]gggtaaa|tttaccc[acg]', Boost::Regexp::IGNORECASE),
+  Boost::Regexp.new('a[act]ggtaaa|tttacc[agt]t', Boost::Regexp::IGNORECASE),
+  Boost::Regexp.new('ag[act]gtaaa|tttac[agt]ct', Boost::Regexp::IGNORECASE),
+  Boost::Regexp.new('agg[act]taaa|ttta[agt]cct', Boost::Regexp::IGNORECASE),
+  Boost::Regexp.new('aggg[acg]aaa|ttt[cgt]ccct', Boost::Regexp::IGNORECASE),
+  Boost::Regexp.new('agggt[cgt]aa|tt[acg]accct', Boost::Regexp::IGNORECASE),
+  Boost::Regexp.new('agggta[cgt]a|t[acg]taccct', Boost::Regexp::IGNORECASE),
+  Boost::Regexp.new('agggtaa[cgt]|[acg]ttaccct', Boost::Regexp::IGNORECASE)
+]
+puts "DNA-Matching (Computer Language Shootout)"
+puts "========================================="
+Benchmark.bmbm do |x|
+    x.report("Boost regex")  { 100.times { boost_regexes.each { |reg| fair_scan(seq, reg)}} }
+    x.report("1.9 Ruby regex") { 100.times { regexes.each { |reg| fair_scan(seq, reg)}} }
+end
+reg = /\d{3}-\d{3}-\d{4}/
+boost_reg = Boost::Regexp.new('\d{3}-\d{3}-\d{4}')
+text = Lorem::Base.new('paragraphs', 2000).output
+puts ""
+puts "Failing to match a phone number in a big string of text"
+puts "======================================================="
+Benchmark.bmbm do |x|
+    x.report("1.9 Ruby regex") { 100.times { fair_scan(text, reg)}}
+    x.report("Boost regex")  { 100.times { fair_scan(text, boost_reg)}}
+end
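
To see what the `fair_scan` helper in this file actually measures, the sketch below copies it unchanged and runs it on tiny inputs with the built-in `Regexp` (the gem, `lorem` and `fasta.input` are not needed for this). The sample strings are invented for illustration. Because the helper restarts the search one character after the start of the previous match, it also reports overlapping matches, which `String#scan` does not.

```ruby
# Copy of the benchmark's helper: repeatedly match against the remainder
# of the string, restarting one character past the previous match start.
def fair_scan(string, regex)
  result = []
  idx = -1
  while idx
    match = regex.match(string[(idx + 1)..-1])
    break unless match
    idx += match.begin(0) + 1
    result << match.to_a
  end
  result
end

# Illustrative inputs (not from the benchmark data).
overlapping = fair_scan("aaa", /aa/)   # finds "aa" at offsets 0 and 1
builtin     = "aaa".scan(/aa/)         # non-overlapping: one match only
no_phone    = fair_scan("lorem ipsum", /\d{3}-\d{3}-\d{4}/)

puts overlapping.inspect
puts builtin.inspect
puts no_phone.inspect
```

Each iteration slices a new substring, so the helper costs more than a single `scan` pass. The benchmark applies the same helper to both engines, which is presumably the point of the name.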

data/ext/ruby-boost-regex/regexp.cpp CHANGED

@@ -6,8 +6,11 @@
 #ifdef RUBY_19
 #include "ruby/re.h"
+#include "ruby/oniguruma.h"
 #else
 #include "re.h"
+// RE_NREGS got renamed to ONIG_NREGION. Why? Why not!
+#define ONIG_NREGION RE_NREGS
 #endif
 static VALUE rb_mBoost;
@@ -18,33 +21,36 @@ static VALUE rb_kRegexpIgnorecase;
 static VALUE rb_kRegexpExtended;
 ///////// imported from re.c
+#define MATCH_BUSY FL_USER2
+static VALUE match_alloc(VALUE klass);
 #define RE_TALLOC(n,t)  ((t*)alloca((n)*sizeof(t)))
 #define TMALLOC(n,t)    ((t*)xmalloc((n)*sizeof(t)))
 #define TREALLOC(s,n,t) (s=((t*)xrealloc(s,(n)*sizeof(t))))
-#define MATCH_BUSY FL_USER2
-static VALUE match_alloc(VALUE klass)
-{
-    NEWOBJ(match, struct RMatch);
-    OBJSETUP(match, klass, T_MATCH);
-    match->str = 0;
-    match->regs = 0;
-    match->regs = ALLOC(struct re_registers);
-    MEMZERO(match->regs, struct re_registers, 1);
-    return (VALUE)match;
+static VALUE get_backref_for_modification() {
+    VALUE match;
+    match = rb_backref_get();
+    if (NIL_P(match) || FL_TEST(match, MATCH_BUSY)) {
+        match = match_alloc(rb_cMatch);
+    }
+    else {
+        if (rb_safe_level() >= 3)
+            OBJ_TAINT(match);
+        else
+            FL_UNSET(match, FL_TAINT);
+    }
+    return match;
 }
 static void
-init_regs(struct re_registers *regs, unsigned int num_regs)
+init_regs(struct re_registers *regs,  int num_regs)
 {
   int i;
   regs->num_regs = num_regs;
-  if (num_regs < RE_NREGS)
-    num_regs = RE_NREGS;
+  if (num_regs < ONIG_NREGION)
+    num_regs = ONIG_NREGION;
   if (regs->allocated == 0) {
     regs->beg = TMALLOC(num_regs, int);
@@ -84,6 +90,115 @@ re_copy_registers(struct re_registers *regs1, struct re_registers *regs2)
   regs1->num_regs = regs2->num_regs;
 }
+#ifndef RUBY_19
+// this is 1.8.x global variable stuff
+static VALUE match_alloc(VALUE klass)
+{
+    NEWOBJ(match, struct RMatch);
+    OBJSETUP(match, klass, T_MATCH);
+    match->str = 0;
+    match->regs = 0;
+    match->regs = ALLOC(struct re_registers);
+    MEMZERO(match->regs, struct re_registers, 1);
+    return (VALUE)match;
+}
+static void
+fill_regs_from_smatch(std::string::const_iterator first,
+                      std::string::const_iterator last,
+                      struct re_registers *regs,
+                      boost::smatch matches)
+{
+    init_regs(regs, matches.size());
+    regs->beg[0] = matches[0].first - first;
+    regs->end[0] = matches[0].second - first;
+    for (int idx = 1; idx <= matches.size(); idx++) {
+        if (!matches[idx].matched) {
+            regs->beg[idx] = regs->end[idx] = -1;
+        } else {
+            regs->beg[idx] = matches[idx].first - first;
+            regs->end[idx] = matches[idx].second - first;
+        }
+    }
+}
+static void save_backref_with_smatch(VALUE str,
+                                     std::string::const_iterator& start,
+                                     std::string::const_iterator& stop,
+                                     boost::smatch& matches,
+                                     VALUE regex_obj)
+{
+    static struct re_registers regs;
+    VALUE match = get_backref_for_modification();
+    RMATCH(match)->str = rb_str_new4(str);
+    fill_regs_from_smatch(start, stop, &regs, matches);
+    re_copy_registers(RMATCH(match)->regs, &regs);
+    rb_backref_set(match);
+}
+#else // Is Ruby 1.9+
+static VALUE
+match_alloc(VALUE klass)
+{
+    NEWOBJ(match, struct RMatch);
+    OBJSETUP(match, klass, T_MATCH);
+    match->str = 0;
+    match->rmatch = 0;
+    match->regexp = 0;
+    match->rmatch = ALLOC(struct rmatch);
+    MEMZERO(match->rmatch, struct rmatch, 1);
+    return (VALUE)match;
+}
+static void
+fill_regs_from_smatch(std::string::const_iterator first,
+                      std::string::const_iterator last,
+                      struct re_registers *regs,
+                      boost::smatch matches)
+{
+    init_regs(regs, matches.size());
+    regs->beg[0] = matches[0].first - first;
+    regs->end[0] = matches[0].second - first;
+    for (unsigned int idx = 1; idx <= matches.size(); idx++) {
+        if (!matches[idx].matched) {
+            regs->beg[idx] = regs->end[idx] = -1;
+        } else {
+            regs->beg[idx] = matches[idx].first - first;
+            regs->end[idx] = matches[idx].second - first;
+        }
+    }
+}
+static void save_backref_with_smatch(VALUE str,
+                                     std::string::const_iterator& start,
+                                     std::string::const_iterator& stop,
+                                     boost::smatch& matches,
+                                     VALUE regex_obj)
+{
+    static struct re_registers regs;
+    VALUE match = get_backref_for_modification();
+    RMATCH(match)->str = rb_str_new4(str);
+    RMATCH(match)->rmatch->char_offset_updated = 0;
+    RMATCH(match)->regexp = regex_obj;
+    fill_regs_from_smatch(start, stop, &regs, matches);
+    re_copy_registers(RMATCH_REGS(match), &regs);
+    rb_backref_set(match);
+}
+#endif RUBY_19
 /////////////////////////////
 // extracts the boost regex using Data_Get_Struct
@@ -134,7 +249,7 @@ VALUE br_init(int argc, VALUE *argv, VALUE self) {
             // calculate the flags to use
             newflags = NUM2UINT(flags);
             VALUE oldflags = rb_funcall(reg_to_convert, rb_intern("options"), 0);
-            int oldflagsint = FIX2INT(oldflags);
+            oldflagsint = FIX2INT(oldflags);
             // convert ruby regexp flags to boost regex flags
             if (oldflagsint & FIX2INT(rb_kRegexpIgnorecase))
                 newflags |= boost::regex_constants::icase;
@@ -152,53 +267,23 @@ VALUE br_init(int argc, VALUE *argv, VALUE self) {
         // C++ exceptions have to be re-raised as ruby
         rb_raise(rb_eArgError, "Invalid regular expression");
     }
-}
-static VALUE get_backref_for_modification() {
-    VALUE match;
-    match = rb_backref_get();
-    if (NIL_P(match) || FL_TEST(match, MATCH_BUSY)) {
-        match = match_alloc(rb_cMatch);
-    }
-    else {
-        if (rb_safe_level() >= 3)
-            OBJ_TAINT(match);
-        else
-            FL_UNSET(match, FL_TAINT);
-    }
-    return match;
-}
-static void
-fill_regs_from_smatch(std::string::const_iterator first, std::string::const_iterator last, struct re_registers *regs, boost::smatch matches) {
-    init_regs(regs, matches.size());
-    regs->beg[0] = matches[0].first - first;
-    regs->end[0] = matches[0].second - first;
-    for (int idx = 1; idx <= matches.size(); idx++) {
-        if (!matches[idx].matched) {
-            regs->beg[idx] = regs->end[idx] = -1;
-        } else {
-            regs->beg[idx] = matches[idx].first - first;
-            regs->end[idx] = matches[idx].second - first;
-        }
-    }
+    return self;
 }
 /**
  * General matcher method that re-raises exception as a Ruby exception.  Gotta use this. sorry.
  */
 static bool
-br_reg_match_iters(VALUE str, std::string::const_iterator start, std::string::const_iterator stop, boost::smatch& matches, boost::regex reg)
+br_reg_match_iters(VALUE str,
+                   std::string::const_iterator start,
+                   std::string::const_iterator stop,
+                   boost::smatch& matches,
+                   VALUE reg_obj)
 {
-    static struct re_registers regs;
+    boost::regex reg = *get_br_from_value(reg_obj);
     try {
         if (boost::regex_search(start, stop, matches, reg)) {
-            VALUE match = get_backref_for_modification();
-            RMATCH(match)->str = rb_str_dup(str);
-            fill_regs_from_smatch(start, stop, &regs, matches);
-            re_copy_registers(RMATCH(match)->regs, &regs);
-            rb_backref_set(match);
+            save_backref_with_smatch(str, start, stop, matches, reg_obj);
             return true;
         } else {
             rb_backref_set(Qnil);
@@ -211,7 +296,6 @@ br_reg_match_iters(VALUE str, std::string::const_iterator start, std::string::co
 static int
 br_reg_search(VALUE self, VALUE str) {
-    boost::regex reg = *get_br_from_value(self);
     std::string input = StringValuePtr(str);
     std::string::const_iterator start, end;
@@ -219,7 +303,7 @@ br_reg_search(VALUE self, VALUE str) {
     end = input.end();
     boost::smatch matches;
-    if (br_reg_match_iters(str, start, end, matches, reg)) {
+    if (br_reg_match_iters(str, start, end, matches, self)) {
         return matches[0].first - start;
     } else {
         return -1;
@@ -228,7 +312,6 @@ br_reg_search(VALUE self, VALUE str) {
 static VALUE
 br_reg_do_match(VALUE self, VALUE str) {
-    boost::regex reg = *get_br_from_value(self);
     std::string input = StringValuePtr(str);
     std::string::const_iterator start, end;
@@ -236,7 +319,7 @@ br_reg_do_match(VALUE self, VALUE str) {
     end = input.end();
     boost::smatch matches;
-    if (br_reg_match_iters(str, start, end, matches, reg)) {
+    if (br_reg_match_iters(str, start, end, matches, self)) {
         return rb_backref_get();
     } else {
         return Qnil;
@@ -341,5 +424,7 @@ extern "C" {
         rb_define_const(rb_cBoostRegexp, "BK_PLUS_QM", UINT2NUM(boost::regex_constants::bk_plus_qm));
         rb_define_const(rb_cBoostRegexp, "BK_VBAR", UINT2NUM(boost::regex_constants::bk_vbar));
         rb_define_const(rb_cBoostRegexp, "LITERAL", UINT2NUM(boost::regex_constants::literal));
+        return Qnil;
     }
 }
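
The `fill_regs_from_smatch` hunks above store `-1` in `beg`/`end` for capture groups that did not participate in a match. At the Ruby level, the built-in `MatchData` reports such groups as `nil`. The sketch below shows that convention with a plain `Regexp`. It is an illustration of the register layout the extension fills in, not a test of the extension itself, and the pattern is made up.

```ruby
# Built-in Regexp only: group 1 does not participate, group 2 does.
m = /(a)|(b)/.match("xb")

# One [begin, end] pair per register: index 0 is the whole match,
# 1..m.size-1 are the capture groups.
offsets = (0...m.size).map { |i| [m.begin(i), m.end(i)] }

puts offsets.inspect
```

The sketch iterates over indices `0` through `m.size - 1`. The C++ loops in the hunk use `idx <= matches.size()`, which appears to visit one index past the last group, so that bound may be worth a second look.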

metadata CHANGED

@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
   prerelease: false
   segments:
   - 1
-  - 0
   - 1
-  version: 1.0.1
+  - 0
+  version: 1.1.0
 platform: ruby
 authors:
 - Michael Edgar
@@ -14,7 +14,7 @@ autorequire:
 bindir: bin
 cert_chain: []
-date: 2010-04-08 00:00:00 -04:00
+date: 2010-04-09 00:00:00 -04:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -62,6 +62,7 @@ files:
 - Rakefile
 - VERSION
 - benchmark/benchmark.rb
+- benchmark/benchmark19.rb
 - benchmark/fasta.input
 - ext/ruby-boost-regex/extconf.rb
 - ext/ruby-boost-regex/regexp.cpp