ruby-boost-regex 1.0.1 → 1.1.0

data/README.markdown CHANGED
@@ -13,7 +13,20 @@ normal `Regexp`s, and then add Boost features. So far...
  * Supports `#match` method as in normal Regexps, returning the same
  type of object (`MatchData`).
  * Spiffy monkey patch (see below)
-
+ * Ruby 1.9 Support
+ * Near-perfect drop-in for built-in Regexes.
+
+ ## Problems
+
+ * NO Unicode support. Sorry.
+
+ ## Global Variables
+
+ If you use Ruby's regexes, you probably use the handy-dandy global variables that
+ come along with them, especially for quickly-written scripts. I'm talking about
+ our buddies, `$1, $2, $3`, `$\``, `$&`, `$~`. This gem makes sure to set those variables
+ so that these global variables still work.
+
  ## Cool monkey patch
 
  So monkey patching is bad, right? Right. And rubyists monkey patch all the time,
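
The "Global Variables" section added in this hunk says the gem keeps Ruby's match globals populated. As a reminder of what that contract covers, here is a minimal sketch that uses only Ruby's built-in `Regexp`, so it runs without Boost installed. The helper name `match_globals` is made up for illustration. Going by the README's claim, a `Boost::Regexp` in place of the literal should leave the same values behind, but that is not verified here.

```ruby
# Built-in Regexp only. match_globals is a hypothetical helper, not part of the gem.
def match_globals(text)
  return nil unless text =~ /(\d{3})-(\d{4})/

  {
    whole:  $&,  # entire matched text
    before: $`,  # text preceding the match
    first:  $1,  # first capture group
    second: $2,  # second capture group
    data:   $~   # MatchData object for the last match
  }
end

info = match_globals("call 555-1234 now")
puts info[:whole]
puts info[:first]
```

Note that these globals are scoped to the current method frame, which is why the sketch reads them inside the helper right after the match.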
@@ -53,7 +66,8 @@ if confronted with a non-standard Regexp. So I had to write my own, which is le
  free to be scrutinized! It uses the standard benchmark module because lifting the spiffy one from ruby-benchmark-suite
  proved to be too much of a hassle for now.
 
- Anyway, here's some results:
+ Anyway, here's some results. 2.4Ghz Intel Core 2 Duo, 4GB 1067 Mhz DDR3, Macbook Pro. Ruby 1.8.7 (2008-08-11 patchlevel 72) [universal-darwin10.0]. Oniguruma is the Oniguruma gem for 1.8.x ruby, not the one bundled with Ruby 1.9+. I ran this a few
+ times with differing loads, and they always roughly come out to the same result. YMMV.
 
  DNA-Matching (Computer Language Shootout)
  =========================================
@@ -80,6 +94,33 @@ Anyway, here's some results:
  Normal regex     0.070000   0.000000   0.070000 (  0.071984)
  Oniguruma        0.040000   0.000000   0.040000 (  0.044686)
  Boost regex      0.030000   0.000000   0.030000 (  0.036421)
+
+ Ruby 1.9 version:
+
+ DNA-Matching (Computer Language Shootout)
+ =========================================
+ Rehearsal --------------------------------------------------
+ 1.9 Ruby regex  17.440000   0.030000  17.470000 ( 17.523726)
+ Boost regex      9.750000   0.010000   9.760000 (  9.757300)
+ ---------------------------------------- total: 27.230000sec
+
+                      user     system      total        real
+ 1.9 Ruby regex  17.420000   0.020000  17.440000 ( 17.434514)
+ Boost regex      9.750000   0.010000   9.760000 (  9.755085)
+
+ Failing to match a phone number in a big string of text
+ =======================================================
+ Rehearsal --------------------------------------------------
+ 1.9 Ruby regex   0.020000   0.000000   0.020000 (  0.016165)
+ Boost regex      0.040000   0.000000   0.040000 (  0.035585)
+ ----------------------------------------- total: 0.060000sec
+
+                      user     system      total        real
+ 1.9 Ruby regex   0.010000   0.000000   0.010000 (  0.016185)
+ Boost regex      0.040000   0.000000   0.040000 (  0.035736)
+
+ Very interesting - 1.9 seems to have gotten faster at the "match a phone number in a big block of text" benchmark. Boost is even faster,
+ though, and Oniguruma integrated with 1.9 is back at the old 1.9 speed. I'm hoping to add some kind of XML/HTML-matching benchmark ([Yes, I know this is a bad idea](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). If you know how to benchmark regexps, feel free to [fork away](http://github.com/michaeledgar/ruby-boost-regex/fork)!
 
  ## Usage
 
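
To put the Ruby 1.9 tables in this hunk into relative terms, the sketch below divides the built-in engine's "real" time by Boost's for each benchmark. The numbers are copied from the tables above; the helper names `ruby19_timings` and `boost_speedup` are invented for this illustration. Going only by these figures, the ratio is well above 1 for DNA matching and below 1 for the phone-number benchmark, so the two workloads appear to favour different engines on 1.9.

```ruby
# "real" wall-clock seconds copied from the Ruby 1.9 tables in the README diff.
def ruby19_timings
  {
    "DNA-Matching" => { "1.9 Ruby regex" => 17.434514, "Boost regex" => 9.755085 },
    "Phone number" => { "1.9 Ruby regex" => 0.016185,  "Boost regex" => 0.035736 }
  }
end

# A ratio above 1 means Boost finished sooner than the built-in engine.
def boost_speedup(row)
  row["1.9 Ruby regex"] / row["Boost regex"]
end

ruby19_timings.each do |name, row|
  printf("%-14s Boost speedup: %.2fx\n", name, boost_speedup(row))
end
```

A single run on one machine is thin evidence either way, which matches the README's own "YMMV".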
@@ -99,10 +140,19 @@ Install the gem, use as follows:
  r =~ "i'm Mike. Hello!" #==> 10
 
 
-
  ## Installation
 
+ You're going to need Boost.Regex. You can find the [Boost Project here](http://www.boost.org/), with a
+ [Getting Started page](http://www.boost.org/doc/libs/1_42_0/more/getting_started/index.html) for installation.
+
+ Ruby-boost-regex expects you to have the boost regex library available as "libboost\_regex.so" or "libboost\_regex.dylib"
+ or whatever your platform specifies - the key point is the name "boost\_regex".
+
  gem install ruby-boost-regex
+
+ For the Unicode support I'm trying to get in, you'll need IBM's [ICU project](http://site.icu-project.org/) installed,
+ with the recommended download page you start from [being ICU4C 4.4](http://icu-project.org/download/4.4.html#ICU4C). In addition,
+ your Boost libraries must be [configured to use ICU](http://www.boost.org/doc/libs/1_42_0/libs/regex/doc/html/boost_regex/install.html#boost_regex.install.building_with_unicode_and_icu_support).
 
  ## Note on Patches/Pull Requests
 
data/VERSION CHANGED
@@ -1 +1 @@
- 1.0.1
+ 1.1.0
data/benchmark/benchmark19.rb ADDED
@@ -0,0 +1,72 @@
+ #!/usr/bin/env ruby
+
+ # The Computer Language Shootout
+ # http://shootout.alioth.debian.org/
+ #
+ # contributed by jose fco. gonzalez
+ # Adapted for the Ruby Benchmark Suite.
+ require 'benchmark'
+ require 'rubygems'
+ require 'ruby-boost-regex'
+ require 'lorem'
+
+ fname = File.dirname(__FILE__) + "/fasta.input"
+ seq = File.read(fname)
+ seq.gsub!(/>.*\n|\n/,"")
+
+ def fair_scan(string, regex)
+   result = []
+   idx = -1
+   while idx
+     match = regex.match(string[(idx + 1)..-1])
+     break unless match
+     idx += match.begin(0) + 1
+     result << match.to_a
+   end
+   result
+ end
+
+
+ regexes = [
+   /agggtaaa|tttaccct/i,
+   /[cgt]gggtaaa|tttaccc[acg]/i,
+   /a[act]ggtaaa|tttacc[agt]t/i,
+   /ag[act]gtaaa|tttac[agt]ct/i,
+   /agg[act]taaa|ttta[agt]cct/i,
+   /aggg[acg]aaa|ttt[cgt]ccct/i,
+   /agggt[cgt]aa|tt[acg]accct/i,
+   /agggta[cgt]a|t[acg]taccct/i,
+   /agggtaa[cgt]|[acg]ttaccct/i
+ ]
+
+ boost_regexes = [
+   Boost::Regexp.new('agggtaaa|tttaccct', Boost::Regexp::IGNORECASE),
+   Boost::Regexp.new('[cgt]gggtaaa|tttaccc[acg]', Boost::Regexp::IGNORECASE),
+   Boost::Regexp.new('a[act]ggtaaa|tttacc[agt]t', Boost::Regexp::IGNORECASE),
+   Boost::Regexp.new('ag[act]gtaaa|tttac[agt]ct', Boost::Regexp::IGNORECASE),
+   Boost::Regexp.new('agg[act]taaa|ttta[agt]cct', Boost::Regexp::IGNORECASE),
+   Boost::Regexp.new('aggg[acg]aaa|ttt[cgt]ccct', Boost::Regexp::IGNORECASE),
+   Boost::Regexp.new('agggt[cgt]aa|tt[acg]accct', Boost::Regexp::IGNORECASE),
+   Boost::Regexp.new('agggta[cgt]a|t[acg]taccct', Boost::Regexp::IGNORECASE),
+   Boost::Regexp.new('agggtaa[cgt]|[acg]ttaccct', Boost::Regexp::IGNORECASE)
+ ]
+
+ puts "DNA-Matching (Computer Language Shootout)"
+ puts "========================================="
+ Benchmark.bmbm do |x|
+   x.report("Boost regex") { 100.times { boost_regexes.each { |reg| fair_scan(seq, reg)}} }
+   x.report("1.9 Ruby regex") { 100.times { regexes.each { |reg| fair_scan(seq, reg)}} }
+ end
+
+
+ reg = /\d{3}-\d{3}-\d{4}/
+ boost_reg = Boost::Regexp.new('\d{3}-\d{3}-\d{4}')
+ text = Lorem::Base.new('paragraphs', 2000).output
+
+ puts ""
+ puts "Failing to match a phone number in a big string of text"
+ puts "======================================================="
+ Benchmark.bmbm do |x|
+   x.report("1.9 Ruby regex") { 100.times { fair_scan(text, reg)}}
+   x.report("Boost regex") { 100.times { fair_scan(text, boost_reg)}}
+ end
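
The `fair_scan` helper in the file above drives both engines through `#match` only, restarting the search one character past the start of the previous match. One consequence worth seeing concretely is that it reports overlapping matches, which `String#scan` does not. The sketch below copies the loop unchanged and runs it with the built-in `Regexp`, so it needs neither the gem nor `fasta.input`; the sample strings are made up.

```ruby
# Same loop as fair_scan in benchmark19.rb, exercised with the built-in Regexp.
def fair_scan(string, regex)
  result = []
  idx = -1
  while idx
    # Search the remainder of the string, starting just after the last match start.
    match = regex.match(string[(idx + 1)..-1])
    break unless match
    # match.begin(0) is relative to the slice, so add it to the running offset.
    idx += match.begin(0) + 1
    result << match.to_a
  end
  result
end

puts fair_scan("aaaa", /aa/).length   # overlapping matches are counted
puts "aaaa".scan(/aa/).length         # String#scan skips past each match
```

Each iteration also slices a fresh substring, so some of the measured time is likely string handling shared by both engines rather than regex work.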
data/ext/ruby-boost-regex/regexp.cpp CHANGED
@@ -6,8 +6,11 @@
 
  #ifdef RUBY_19
  #include "ruby/re.h"
+ #include "ruby/oniguruma.h"
  #else
  #include "re.h"
+ // RE_NREGS got renamed to ONIG_NREGION. Why? Why not!
+ #define ONIG_NREGION RE_NREGS
  #endif
 
  static VALUE rb_mBoost;
@@ -18,33 +21,36 @@ static VALUE rb_kRegexpIgnorecase;
  static VALUE rb_kRegexpExtended;
 
  ///////// imported from re.c
+ #define MATCH_BUSY FL_USER2
+ static VALUE match_alloc(VALUE klass);
 
  #define RE_TALLOC(n,t) ((t*)alloca((n)*sizeof(t)))
  #define TMALLOC(n,t) ((t*)xmalloc((n)*sizeof(t)))
  #define TREALLOC(s,n,t) (s=((t*)xrealloc(s,(n)*sizeof(t))))
- #define MATCH_BUSY FL_USER2
 
- static VALUE match_alloc(VALUE klass)
- {
-     NEWOBJ(match, struct RMatch);
-     OBJSETUP(match, klass, T_MATCH);
-
-     match->str = 0;
-     match->regs = 0;
-     match->regs = ALLOC(struct re_registers);
-     MEMZERO(match->regs, struct re_registers, 1);
-
-     return (VALUE)match;
+ static VALUE get_backref_for_modification() {
+     VALUE match;
+     match = rb_backref_get();
+     if (NIL_P(match) || FL_TEST(match, MATCH_BUSY)) {
+         match = match_alloc(rb_cMatch);
+     }
+     else {
+         if (rb_safe_level() >= 3)
+             OBJ_TAINT(match);
+         else
+             FL_UNSET(match, FL_TAINT);
+     }
+     return match;
  }
 
  static void
- init_regs(struct re_registers *regs, unsigned int num_regs)
+ init_regs(struct re_registers *regs, int num_regs)
  {
      int i;
 
      regs->num_regs = num_regs;
-     if (num_regs < RE_NREGS)
-         num_regs = RE_NREGS;
+     if (num_regs < ONIG_NREGION)
+         num_regs = ONIG_NREGION;
 
      if (regs->allocated == 0) {
          regs->beg = TMALLOC(num_regs, int);
@@ -84,6 +90,115 @@ re_copy_registers(struct re_registers *regs1, struct re_registers *regs2)
      regs1->num_regs = regs2->num_regs;
  }
 
+ #ifndef RUBY_19
+
+ // this is 1.8.x global variable stuff
+
+
+ static VALUE match_alloc(VALUE klass)
+ {
+     NEWOBJ(match, struct RMatch);
+     OBJSETUP(match, klass, T_MATCH);
+
+     match->str = 0;
+     match->regs = 0;
+     match->regs = ALLOC(struct re_registers);
+     MEMZERO(match->regs, struct re_registers, 1);
+
+     return (VALUE)match;
+ }
+
+ static void
+ fill_regs_from_smatch(std::string::const_iterator first,
+                       std::string::const_iterator last,
+                       struct re_registers *regs,
+                       boost::smatch matches)
+ {
+     init_regs(regs, matches.size());
+     regs->beg[0] = matches[0].first - first;
+     regs->end[0] = matches[0].second - first;
+
+     for (int idx = 1; idx <= matches.size(); idx++) {
+         if (!matches[idx].matched) {
+             regs->beg[idx] = regs->end[idx] = -1;
+         } else {
+             regs->beg[idx] = matches[idx].first - first;
+             regs->end[idx] = matches[idx].second - first;
+         }
+     }
+ }
+
+ static void save_backref_with_smatch(VALUE str,
+                                      std::string::const_iterator& start,
+                                      std::string::const_iterator& stop,
+                                      boost::smatch& matches,
+                                      VALUE regex_obj)
+ {
+     static struct re_registers regs;
+     VALUE match = get_backref_for_modification();
+     RMATCH(match)->str = rb_str_new4(str);
+     fill_regs_from_smatch(start, stop, &regs, matches);
+     re_copy_registers(RMATCH(match)->regs, &regs);
+     rb_backref_set(match);
+ }
+
+ #else // Is Ruby 1.9+
+
+ static VALUE
+ match_alloc(VALUE klass)
+ {
+     NEWOBJ(match, struct RMatch);
+     OBJSETUP(match, klass, T_MATCH);
+
+     match->str = 0;
+     match->rmatch = 0;
+     match->regexp = 0;
+     match->rmatch = ALLOC(struct rmatch);
+     MEMZERO(match->rmatch, struct rmatch, 1);
+
+     return (VALUE)match;
+ }
+
+
+ static void
+ fill_regs_from_smatch(std::string::const_iterator first,
+                       std::string::const_iterator last,
+                       struct re_registers *regs,
+                       boost::smatch matches)
+ {
+     init_regs(regs, matches.size());
+     regs->beg[0] = matches[0].first - first;
+     regs->end[0] = matches[0].second - first;
+
+     for (unsigned int idx = 1; idx <= matches.size(); idx++) {
+         if (!matches[idx].matched) {
+             regs->beg[idx] = regs->end[idx] = -1;
+         } else {
+             regs->beg[idx] = matches[idx].first - first;
+             regs->end[idx] = matches[idx].second - first;
+         }
+     }
+ }
+
+ static void save_backref_with_smatch(VALUE str,
+                                      std::string::const_iterator& start,
+                                      std::string::const_iterator& stop,
+                                      boost::smatch& matches,
+                                      VALUE regex_obj)
+ {
+     static struct re_registers regs;
+     VALUE match = get_backref_for_modification();
+
+     RMATCH(match)->str = rb_str_new4(str);
+     RMATCH(match)->rmatch->char_offset_updated = 0;
+     RMATCH(match)->regexp = regex_obj;
+     fill_regs_from_smatch(start, stop, &regs, matches);
+     re_copy_registers(RMATCH_REGS(match), &regs);
+     rb_backref_set(match);
+ }
+
+ #endif RUBY_19
+
  /////////////////////////////
 
  // extracts the boost regex using Data_Get_Struct
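
The `fill_regs_from_smatch` functions added in this hunk store a begin/end offset pair per capture group and write `-1` for groups that did not take part in the match. Seen from the Ruby side, that is the information `MatchData#begin` and `MatchData#end` expose, with `nil` where the registers hold `-1`. The sketch below rebuilds the register pairs from a built-in `MatchData`; `register_pairs` is a hypothetical helper written for this illustration and is not part of the extension.

```ruby
# Rebuild [begin, end] register pairs from a MatchData, using -1 for groups
# that did not participate (the sentinel the C++ code writes into regs).
def register_pairs(match)
  (0...match.size).map do |i|
    match.begin(i).nil? ? [-1, -1] : [match.begin(i), match.end(i)]
  end
end

m = /(a+)|(b+)/.match("xxbbb")
p register_pairs(m)
```

The sketch iterates over indices 0 through `size - 1`. The C++ loops above use `idx <= matches.size()`, which visits one index more; that bound may be worth double-checking against the number of registers allocated by `init_regs`.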
@@ -134,7 +249,7 @@ VALUE br_init(int argc, VALUE *argv, VALUE self) {
      // calculate the flags to use
      newflags = NUM2UINT(flags);
      VALUE oldflags = rb_funcall(reg_to_convert, rb_intern("options"), 0);
-     int oldflagsint = FIX2INT(oldflags);
+     oldflagsint = FIX2INT(oldflags);
      // convert ruby regexp flags to boost regex flags
      if (oldflagsint & FIX2INT(rb_kRegexpIgnorecase))
          newflags |= boost::regex_constants::icase;
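
This hunk touches the code that reads a Ruby `Regexp`'s `options` bitmask and ORs the corresponding Boost flag into `newflags`. The same translation can be sketched in plain Ruby. The helper `translate_flags` and the target bit values `0x10` and `0x20` are placeholders chosen for illustration; the real values of `boost::regex_constants::icase` and friends come from Boost and are not shown in this diff.

```ruby
# Map each Ruby option bit that is set on the regexp to a target flag bit.
# flag_map is { ruby_option_bit => target_bit }; the target bits are placeholders.
def translate_flags(regexp, flag_map)
  flag_map.inject(0) do |acc, (ruby_flag, target_flag)|
    (regexp.options & ruby_flag).zero? ? acc : (acc | target_flag)
  end
end

flag_map = { Regexp::IGNORECASE => 0x10, Regexp::EXTENDED => 0x20 }
p translate_flags(/abc/i, flag_map)
p translate_flags(/abc/ix, flag_map)
```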
@@ -152,53 +267,23 @@ VALUE br_init(int argc, VALUE *argv, VALUE self) {
          // C++ exceptions have to be re-raised as ruby
          rb_raise(rb_eArgError, "Invalid regular expression");
      }
- }
-
- static VALUE get_backref_for_modification() {
-     VALUE match;
-     match = rb_backref_get();
-     if (NIL_P(match) || FL_TEST(match, MATCH_BUSY)) {
-         match = match_alloc(rb_cMatch);
-     }
-     else {
-         if (rb_safe_level() >= 3)
-             OBJ_TAINT(match);
-         else
-             FL_UNSET(match, FL_TAINT);
-     }
-     return match;
- }
-
- static void
- fill_regs_from_smatch(std::string::const_iterator first, std::string::const_iterator last, struct re_registers *regs, boost::smatch matches) {
-     init_regs(regs, matches.size());
-     regs->beg[0] = matches[0].first - first;
-     regs->end[0] = matches[0].second - first;
-
-     for (int idx = 1; idx <= matches.size(); idx++) {
-         if (!matches[idx].matched) {
-             regs->beg[idx] = regs->end[idx] = -1;
-         } else {
-             regs->beg[idx] = matches[idx].first - first;
-             regs->end[idx] = matches[idx].second - first;
-         }
-     }
+     return self;
  }
 
  /**
   * General matcher method that re-raises exception as a Ruby exception. Gotta use this. sorry.
   */
  static bool
- br_reg_match_iters(VALUE str, std::string::const_iterator start, std::string::const_iterator stop, boost::smatch& matches, boost::regex reg)
+ br_reg_match_iters(VALUE str,
+                    std::string::const_iterator start,
+                    std::string::const_iterator stop,
+                    boost::smatch& matches,
+                    VALUE reg_obj)
  {
-     static struct re_registers regs;
+     boost::regex reg = *get_br_from_value(reg_obj);
      try {
          if (boost::regex_search(start, stop, matches, reg)) {
-             VALUE match = get_backref_for_modification();
-             RMATCH(match)->str = rb_str_dup(str);
-             fill_regs_from_smatch(start, stop, &regs, matches);
-             re_copy_registers(RMATCH(match)->regs, &regs);
-             rb_backref_set(match);
+             save_backref_with_smatch(str, start, stop, matches, reg_obj);
              return true;
          } else {
              rb_backref_set(Qnil);
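
The comments in this hunk note that C++ exceptions have to be re-raised as Ruby exceptions, and the visible code raises `ArgumentError` with the message "Invalid regular expression". The same translate-at-the-boundary pattern can be sketched in plain Ruby, with the built-in `RegexpError` standing in for the Boost exception. `compile_or_argument_error` is a hypothetical helper, not a function the gem provides.

```ruby
# Catch the engine-specific error at the boundary and re-raise it as the
# error type callers are told to expect.
def compile_or_argument_error(source)
  Regexp.new(source)
rescue RegexpError
  raise ArgumentError, "Invalid regular expression"
end

begin
  compile_or_argument_error("(")   # unbalanced parenthesis
rescue ArgumentError => e
  puts e.message
end
```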
@@ -211,7 +296,6 @@ br_reg_match_iters(VALUE str, std::string::const_iterator start, std::string::co
 
  static int
  br_reg_search(VALUE self, VALUE str) {
-     boost::regex reg = *get_br_from_value(self);
      std::string input = StringValuePtr(str);
 
      std::string::const_iterator start, end;
@@ -219,7 +303,7 @@ br_reg_search(VALUE self, VALUE str) {
      end = input.end();
 
      boost::smatch matches;
-     if (br_reg_match_iters(str, start, end, matches, reg)) {
+     if (br_reg_match_iters(str, start, end, matches, self)) {
          return matches[0].first - start;
      } else {
          return -1;
@@ -228,7 +312,6 @@ br_reg_search(VALUE self, VALUE str) {
 
  static VALUE
  br_reg_do_match(VALUE self, VALUE str) {
-     boost::regex reg = *get_br_from_value(self);
      std::string input = StringValuePtr(str);
 
      std::string::const_iterator start, end;
@@ -236,7 +319,7 @@ br_reg_do_match(VALUE self, VALUE str) {
      end = input.end();
 
      boost::smatch matches;
-     if (br_reg_match_iters(str, start, end, matches, reg)) {
+     if (br_reg_match_iters(str, start, end, matches, self)) {
          return rb_backref_get();
      } else {
          return Qnil;
@@ -341,5 +424,7 @@ extern "C" {
          rb_define_const(rb_cBoostRegexp, "BK_PLUS_QM", UINT2NUM(boost::regex_constants::bk_plus_qm));
          rb_define_const(rb_cBoostRegexp, "BK_VBAR", UINT2NUM(boost::regex_constants::bk_vbar));
          rb_define_const(rb_cBoostRegexp, "LITERAL", UINT2NUM(boost::regex_constants::literal));
+
+         return Qnil;
      }
  }
metadata CHANGED
@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
  prerelease: false
  segments:
  - 1
- - 0
  - 1
- version: 1.0.1
+ - 0
+ version: 1.1.0
  platform: ruby
  authors:
  - Michael Edgar
11
11
  authors:
12
12
  - Michael Edgar
@@ -14,7 +14,7 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2010-04-08 00:00:00 -04:00
+ date: 2010-04-09 00:00:00 -04:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
@@ -62,6 +62,7 @@ files:
  - Rakefile
  - VERSION
  - benchmark/benchmark.rb
+ - benchmark/benchmark19.rb
  - benchmark/fasta.input
  - ext/ruby-boost-regex/extconf.rb
  - ext/ruby-boost-regex/regexp.cpp