ruby-boost-regex 1.0.1 → 1.1.0

data/README.markdown CHANGED
@@ -13,7 +13,20 @@ normal `Regexp`s, and then add Boost features. So far...
13
13
  * Supports `#match` method as in normal Regexps, returning the same
14
14
  type of object (`MatchData`).
15
15
  * Spiffy monkey patch (see below)
16
-
16
+ * Ruby 1.9 Support
17
+ * Near-perfect drop-in for built-in Regexes.
18
+
19
+ ## Problems
20
+
21
+ * NO Unicode support. Sorry.
22
+
23
+ ## Global Variables
24
+
25
+ If you use Ruby's regexes, you probably use the handy-dandy global variables that
26
+ come along with them, especially for quickly-written scripts. I'm talking about
27
+ our buddies `$1`, `$2`, `$3`, `` $` ``, `$&`, and `$~`. This gem sets them after every
28
+ match, so they keep working with Boost regexes.
29
+
17
30
  ## Cool monkey patch
18
31
 
19
32
  So monkey patching is bad, right? Right. And rubyists monkey patch all the time,
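As a rough illustration of what the new Global Variables section promises, the sketch below uses Ruby's built-in `Regexp` only and does not load this gem. Per the README, a `Boost::Regexp` match is meant to leave the same globals set, but that is the README's claim and is not exercised here. The helper name `match_globals`, the sample string, and the pattern are made up for illustration.

```ruby
# Hypothetical helper: collects the regex-related globals after a match.
# Uses the built-in Regexp; the README says Boost::Regexp sets the same ones.
def match_globals(text, pattern)
  return nil unless text =~ pattern
  {
    :groups     => [$1, $2, $3],  # first three capture groups
    :pre        => $`,            # text before the match
    :whole      => $&,            # the entire matched text
    :match_data => $~             # the MatchData object itself
  }
end

info = match_globals("call 555-867-5309 now", /(\d{3})-(\d{3})-(\d{4})/)
puts info[:groups].inspect
puts info[:pre].inspect
puts info[:whole]
```

These globals are local to the method that ran the match, which is why the helper reads them immediately after `=~`.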
@@ -53,7 +66,8 @@ if confronted with a non-standard Regexp. So I had to write my own, which is le
53
66
  free to be scrutinized! It uses the standard benchmark module because lifting the spiffy one from ruby-benchmark-suite
54
67
  proved to be too much of a hassle for now.
55
68
 
56
- Anyway, here's some results:
69
+ Anyway, here are some results. 2.4 GHz Intel Core 2 Duo, 4 GB 1067 MHz DDR3, MacBook Pro. Ruby 1.8.7 (2008-08-11 patchlevel 72) [universal-darwin10.0]. Oniguruma is the Oniguruma gem for 1.8.x ruby, not the one bundled with Ruby 1.9+. I ran this a few
70
+ times with differing loads, and the results always came out roughly the same. YMMV.
57
71
 
58
72
  DNA-Matching (Computer Language Shootout)
59
73
  =========================================
@@ -80,6 +94,33 @@ Anyway, here's some results:
80
94
  Normal regex 0.070000 0.000000 0.070000 ( 0.071984)
81
95
  Oniguruma 0.040000 0.000000 0.040000 ( 0.044686)
82
96
  Boost regex 0.030000 0.000000 0.030000 ( 0.036421)
97
+
98
+ Ruby 1.9 version:
99
+
100
+ DNA-Matching (Computer Language Shootout)
101
+ =========================================
102
+ Rehearsal --------------------------------------------------
103
+ 1.9 Ruby regex 17.440000 0.030000 17.470000 ( 17.523726)
104
+ Boost regex 9.750000 0.010000 9.760000 ( 9.757300)
105
+ ---------------------------------------- total: 27.230000sec
106
+
107
+ user system total real
108
+ 1.9 Ruby regex 17.420000 0.020000 17.440000 ( 17.434514)
109
+ Boost regex 9.750000 0.010000 9.760000 ( 9.755085)
110
+
111
+ Failing to match a phone number in a big string of text
112
+ =======================================================
113
+ Rehearsal --------------------------------------------------
114
+ 1.9 Ruby regex 0.020000 0.000000 0.020000 ( 0.016165)
115
+ Boost regex 0.040000 0.000000 0.040000 ( 0.035585)
116
+ ----------------------------------------- total: 0.060000sec
117
+
118
+ user system total real
119
+ 1.9 Ruby regex 0.010000 0.000000 0.010000 ( 0.016185)
120
+ Boost regex 0.040000 0.000000 0.040000 ( 0.035736)
121
+
122
+ Very interesting: on these numbers, 1.9's built-in regex engine (Oniguruma) wins the "fail to match a phone number in a big block of text" benchmark (about 0.016s vs. 0.036s for Boost),
123
+ while Boost stays well ahead on DNA matching (about 9.8s vs. 17.4s). I'm hoping to add some kind of XML/HTML-matching benchmark ([Yes, I know this is a bad idea](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)). If you know how to benchmark regexps, feel free to [fork away](http://github.com/michaeledgar/ruby-boost-regex/fork)!
83
124
 
84
125
  ## Usage
85
126
 
@@ -99,10 +140,19 @@ Install the gem, use as follows:
99
140
  r =~ "i'm Mike. Hello!" #==> 10
100
141
 
101
142
 
102
-
103
143
  ## Installation
104
144
 
145
+ You're going to need Boost.Regex. You can find the [Boost Project here](http://www.boost.org/), with a
146
+ [Getting Started page](http://www.boost.org/doc/libs/1_42_0/more/getting_started/index.html) for installation.
147
+
148
+ Ruby-boost-regex expects you to have the boost regex library available as "libboost\_regex.so" or "libboost\_regex.dylib"
149
+ or whatever your platform specifies - the key point is the name "boost\_regex".
150
+
105
151
  gem install ruby-boost-regex
152
+
153
+ For the Unicode support I'm trying to get in, you'll need IBM's [ICU project](http://site.icu-project.org/) installed;
154
+ the recommended starting point is the [ICU4C 4.4 download page](http://icu-project.org/download/4.4.html#ICU4C). In addition,
155
+ your Boost libraries must be [configured to use ICU](http://www.boost.org/doc/libs/1_42_0/libs/regex/doc/html/boost_regex/install.html#boost_regex.install.building_with_unicode_and_icu_support).
106
156
 
107
157
  ## Note on Patches/Pull Requests
108
158
 
data/VERSION CHANGED
@@ -1 +1 @@
1
- 1.0.1
1
+ 1.1.0
@@ -0,0 +1,72 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ # The Computer Language Shootout
4
+ # http://shootout.alioth.debian.org/
5
+ #
6
+ # contributed by jose fco. gonzalez
7
+ # Adapted for the Ruby Benchmark Suite.
8
+ require 'benchmark'
9
+ require 'rubygems'
10
+ require 'ruby-boost-regex'
11
+ require 'lorem'
12
+
13
+ fname = File.dirname(__FILE__) + "/fasta.input"
14
+ seq = File.read(fname)
15
+ seq.gsub!(/>.*\n|\n/,"")
16
+
17
+ def fair_scan(string, regex)
18
+ result = []
19
+ idx = -1
20
+ while idx
21
+ match = regex.match(string[(idx + 1)..-1])
22
+ break unless match
23
+ idx += match.begin(0) + 1
24
+ result << match.to_a
25
+ end
26
+ result
27
+ end
28
+
29
+
30
+ regexes = [
31
+ /agggtaaa|tttaccct/i,
32
+ /[cgt]gggtaaa|tttaccc[acg]/i,
33
+ /a[act]ggtaaa|tttacc[agt]t/i,
34
+ /ag[act]gtaaa|tttac[agt]ct/i,
35
+ /agg[act]taaa|ttta[agt]cct/i,
36
+ /aggg[acg]aaa|ttt[cgt]ccct/i,
37
+ /agggt[cgt]aa|tt[acg]accct/i,
38
+ /agggta[cgt]a|t[acg]taccct/i,
39
+ /agggtaa[cgt]|[acg]ttaccct/i
40
+ ]
41
+
42
+ boost_regexes = [
43
+ Boost::Regexp.new('agggtaaa|tttaccct', Boost::Regexp::IGNORECASE),
44
+ Boost::Regexp.new('[cgt]gggtaaa|tttaccc[acg]', Boost::Regexp::IGNORECASE),
45
+ Boost::Regexp.new('a[act]ggtaaa|tttacc[agt]t', Boost::Regexp::IGNORECASE),
46
+ Boost::Regexp.new('ag[act]gtaaa|tttac[agt]ct', Boost::Regexp::IGNORECASE),
47
+ Boost::Regexp.new('agg[act]taaa|ttta[agt]cct', Boost::Regexp::IGNORECASE),
48
+ Boost::Regexp.new('aggg[acg]aaa|ttt[cgt]ccct', Boost::Regexp::IGNORECASE),
49
+ Boost::Regexp.new('agggt[cgt]aa|tt[acg]accct', Boost::Regexp::IGNORECASE),
50
+ Boost::Regexp.new('agggta[cgt]a|t[acg]taccct', Boost::Regexp::IGNORECASE),
51
+ Boost::Regexp.new('agggtaa[cgt]|[acg]ttaccct', Boost::Regexp::IGNORECASE)
52
+ ]
53
+
54
+ puts "DNA-Matching (Computer Language Shootout)"
55
+ puts "========================================="
56
+ Benchmark.bmbm do |x|
57
+ x.report("Boost regex") { 100.times { boost_regexes.each { |reg| fair_scan(seq, reg)}} }
58
+ x.report("1.9 Ruby regex") { 100.times { regexes.each { |reg| fair_scan(seq, reg)}} }
59
+ end
60
+
61
+
62
+ reg = /\d{3}-\d{3}-\d{4}/
63
+ boost_reg = Boost::Regexp.new('\d{3}-\d{3}-\d{4}')
64
+ text = Lorem::Base.new('paragraphs', 2000).output
65
+
66
+ puts ""
67
+ puts "Failing to match a phone number in a big string of text"
68
+ puts "======================================================="
69
+ Benchmark.bmbm do |x|
70
+ x.report("1.9 Ruby regex") { 100.times { fair_scan(text, reg)}}
71
+ x.report("Boost regex") { 100.times { fair_scan(text, boost_reg)}}
72
+ end
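To make it easier to see what `fair_scan` in the benchmark above actually measures, here is the same loop run against Ruby's built-in `Regexp` on a tiny, made-up input (the gem itself is not loaded). Each search restarts one character after the start of the previous match, so the loop can report overlapping matches that `String#scan` would skip. That is worth keeping in mind when reading the timings.

```ruby
# Same loop as fair_scan in benchmark19.rb, copied here so the sketch runs
# standalone with the built-in Regexp.
def fair_scan(string, regex)
  result = []
  idx = -1
  while idx
    # Search the remainder, starting one character past the last match start.
    match = regex.match(string[(idx + 1)..-1])
    break unless match
    idx += match.begin(0) + 1   # absolute offset where this match began
    result << match.to_a
  end
  result
end

overlapping = fair_scan("a12", /\d+/)   # restarts inside the previous match
plain_scan  = "a12".scan(/\d+/)         # resumes after the previous match
puts overlapping.inspect
puts plain_scan.inspect
```

Because both engines go through the same loop, the comparison stays like for like, even though the match count can differ from what `scan` would give.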
@@ -6,8 +6,11 @@
6
6
 
7
7
  #ifdef RUBY_19
8
8
  #include "ruby/re.h"
9
+ #include "ruby/oniguruma.h"
9
10
  #else
10
11
  #include "re.h"
12
+ // RE_NREGS got renamed to ONIG_NREGION. Why? Why not!
13
+ #define ONIG_NREGION RE_NREGS
11
14
  #endif
12
15
 
13
16
  static VALUE rb_mBoost;
@@ -18,33 +21,36 @@ static VALUE rb_kRegexpIgnorecase;
18
21
  static VALUE rb_kRegexpExtended;
19
22
 
20
23
  ///////// imported from re.c
24
+ #define MATCH_BUSY FL_USER2
25
+ static VALUE match_alloc(VALUE klass);
21
26
 
22
27
  #define RE_TALLOC(n,t) ((t*)alloca((n)*sizeof(t)))
23
28
  #define TMALLOC(n,t) ((t*)xmalloc((n)*sizeof(t)))
24
29
  #define TREALLOC(s,n,t) (s=((t*)xrealloc(s,(n)*sizeof(t))))
25
- #define MATCH_BUSY FL_USER2
26
30
 
27
- static VALUE match_alloc(VALUE klass)
28
- {
29
- NEWOBJ(match, struct RMatch);
30
- OBJSETUP(match, klass, T_MATCH);
31
-
32
- match->str = 0;
33
- match->regs = 0;
34
- match->regs = ALLOC(struct re_registers);
35
- MEMZERO(match->regs, struct re_registers, 1);
36
-
37
- return (VALUE)match;
31
+ static VALUE get_backref_for_modification() {
32
+ VALUE match;
33
+ match = rb_backref_get();
34
+ if (NIL_P(match) || FL_TEST(match, MATCH_BUSY)) {
35
+ match = match_alloc(rb_cMatch);
36
+ }
37
+ else {
38
+ if (rb_safe_level() >= 3)
39
+ OBJ_TAINT(match);
40
+ else
41
+ FL_UNSET(match, FL_TAINT);
42
+ }
43
+ return match;
38
44
  }
39
45
 
40
46
  static void
41
- init_regs(struct re_registers *regs, unsigned int num_regs)
47
+ init_regs(struct re_registers *regs, int num_regs)
42
48
  {
43
49
  int i;
44
50
 
45
51
  regs->num_regs = num_regs;
46
- if (num_regs < RE_NREGS)
47
- num_regs = RE_NREGS;
52
+ if (num_regs < ONIG_NREGION)
53
+ num_regs = ONIG_NREGION;
48
54
 
49
55
  if (regs->allocated == 0) {
50
56
  regs->beg = TMALLOC(num_regs, int);
@@ -84,6 +90,115 @@ re_copy_registers(struct re_registers *regs1, struct re_registers *regs2)
84
90
  regs1->num_regs = regs2->num_regs;
85
91
  }
86
92
 
93
+ #ifndef RUBY_19
94
+
95
+ // this is 1.8.x global variable stuff
96
+
97
+
98
+ static VALUE match_alloc(VALUE klass)
99
+ {
100
+ NEWOBJ(match, struct RMatch);
101
+ OBJSETUP(match, klass, T_MATCH);
102
+
103
+ match->str = 0;
104
+ match->regs = 0;
105
+ match->regs = ALLOC(struct re_registers);
106
+ MEMZERO(match->regs, struct re_registers, 1);
107
+
108
+ return (VALUE)match;
109
+ }
110
+
111
+ static void
112
+ fill_regs_from_smatch(std::string::const_iterator first,
113
+ std::string::const_iterator last,
114
+ struct re_registers *regs,
115
+ boost::smatch matches)
116
+ {
117
+ init_regs(regs, matches.size());
118
+ regs->beg[0] = matches[0].first - first;
119
+ regs->end[0] = matches[0].second - first;
120
+
121
+ for (int idx = 1; idx < (int)matches.size(); idx++) {
122
+ if (!matches[idx].matched) {
123
+ regs->beg[idx] = regs->end[idx] = -1;
124
+ } else {
125
+ regs->beg[idx] = matches[idx].first - first;
126
+ regs->end[idx] = matches[idx].second - first;
127
+ }
128
+ }
129
+ }
130
+
131
+ static void save_backref_with_smatch(VALUE str,
132
+ std::string::const_iterator& start,
133
+ std::string::const_iterator& stop,
134
+ boost::smatch& matches,
135
+ VALUE regex_obj)
136
+ {
137
+ static struct re_registers regs;
138
+ VALUE match = get_backref_for_modification();
139
+ RMATCH(match)->str = rb_str_new4(str);
140
+ fill_regs_from_smatch(start, stop, &regs, matches);
141
+ re_copy_registers(RMATCH(match)->regs, &regs);
142
+ rb_backref_set(match);
143
+ }
144
+
145
+ #else // Is Ruby 1.9+
146
+
147
+ static VALUE
148
+ match_alloc(VALUE klass)
149
+ {
150
+ NEWOBJ(match, struct RMatch);
151
+ OBJSETUP(match, klass, T_MATCH);
152
+
153
+ match->str = 0;
154
+ match->rmatch = 0;
155
+ match->regexp = 0;
156
+ match->rmatch = ALLOC(struct rmatch);
157
+ MEMZERO(match->rmatch, struct rmatch, 1);
158
+
159
+ return (VALUE)match;
160
+ }
161
+
162
+
163
+ static void
164
+ fill_regs_from_smatch(std::string::const_iterator first,
165
+ std::string::const_iterator last,
166
+ struct re_registers *regs,
167
+ boost::smatch matches)
168
+ {
169
+ init_regs(regs, matches.size());
170
+ regs->beg[0] = matches[0].first - first;
171
+ regs->end[0] = matches[0].second - first;
172
+
173
+ for (unsigned int idx = 1; idx < matches.size(); idx++) {
174
+ if (!matches[idx].matched) {
175
+ regs->beg[idx] = regs->end[idx] = -1;
176
+ } else {
177
+ regs->beg[idx] = matches[idx].first - first;
178
+ regs->end[idx] = matches[idx].second - first;
179
+ }
180
+ }
181
+ }
182
+
183
+ static void save_backref_with_smatch(VALUE str,
184
+ std::string::const_iterator& start,
185
+ std::string::const_iterator& stop,
186
+ boost::smatch& matches,
187
+ VALUE regex_obj)
188
+ {
189
+ static struct re_registers regs;
190
+ VALUE match = get_backref_for_modification();
191
+
192
+ RMATCH(match)->str = rb_str_new4(str);
193
+ RMATCH(match)->rmatch->char_offset_updated = 0;
194
+ RMATCH(match)->regexp = regex_obj;
195
+ fill_regs_from_smatch(start, stop, &regs, matches);
196
+ re_copy_registers(RMATCH_REGS(match), &regs);
197
+ rb_backref_set(match);
198
+ }
199
+
200
+ #endif // RUBY_19
201
+
87
202
  /////////////////////////////
88
203
 
89
204
  // extracts the boost regex using Data_Get_Struct
@@ -134,7 +249,7 @@ VALUE br_init(int argc, VALUE *argv, VALUE self) {
134
249
  // calculate the flags to use
135
250
  newflags = NUM2UINT(flags);
136
251
  VALUE oldflags = rb_funcall(reg_to_convert, rb_intern("options"), 0);
137
- int oldflagsint = FIX2INT(oldflags);
252
+ oldflagsint = FIX2INT(oldflags);
138
253
  // convert ruby regexp flags to boost regex flags
139
254
  if (oldflagsint & FIX2INT(rb_kRegexpIgnorecase))
140
255
  newflags |= boost::regex_constants::icase;
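For readers less familiar with what the `options` call in this hunk returns: `Regexp#options` gives an Integer bitmask, and the C++ code tests individual bits against constants such as `Regexp::IGNORECASE` before setting the corresponding Boost flag. A small Ruby-only sketch of that bit test follows. The helper name `ruby_flag_names` is hypothetical and is not part of the gem.

```ruby
# Hypothetical helper: lists which standard Regexp option bits are set,
# using the same "options & CONSTANT" test the extension performs in C++.
def ruby_flag_names(regexp)
  opts = regexp.options   # Integer bitmask
  names = []
  names << :ignorecase if (opts & Regexp::IGNORECASE) != 0
  names << :extended   if (opts & Regexp::EXTENDED)   != 0
  names << :multiline  if (opts & Regexp::MULTILINE)  != 0
  names
end

puts ruby_flag_names(/agggtaaa|tttaccct/i).inspect
```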
@@ -152,53 +267,23 @@ VALUE br_init(int argc, VALUE *argv, VALUE self) {
152
267
  // C++ exceptions have to be re-raised as ruby
153
268
  rb_raise(rb_eArgError, "Invalid regular expression");
154
269
  }
155
- }
156
-
157
- static VALUE get_backref_for_modification() {
158
- VALUE match;
159
- match = rb_backref_get();
160
- if (NIL_P(match) || FL_TEST(match, MATCH_BUSY)) {
161
- match = match_alloc(rb_cMatch);
162
- }
163
- else {
164
- if (rb_safe_level() >= 3)
165
- OBJ_TAINT(match);
166
- else
167
- FL_UNSET(match, FL_TAINT);
168
- }
169
- return match;
170
- }
171
-
172
- static void
173
- fill_regs_from_smatch(std::string::const_iterator first, std::string::const_iterator last, struct re_registers *regs, boost::smatch matches) {
174
- init_regs(regs, matches.size());
175
- regs->beg[0] = matches[0].first - first;
176
- regs->end[0] = matches[0].second - first;
177
-
178
- for (int idx = 1; idx <= matches.size(); idx++) {
179
- if (!matches[idx].matched) {
180
- regs->beg[idx] = regs->end[idx] = -1;
181
- } else {
182
- regs->beg[idx] = matches[idx].first - first;
183
- regs->end[idx] = matches[idx].second - first;
184
- }
185
- }
270
+ return self;
186
271
  }
187
272
 
188
273
  /**
189
274
  * General matcher method that re-raises exception as a Ruby exception. Gotta use this. sorry.
190
275
  */
191
276
  static bool
192
- br_reg_match_iters(VALUE str, std::string::const_iterator start, std::string::const_iterator stop, boost::smatch& matches, boost::regex reg)
277
+ br_reg_match_iters(VALUE str,
278
+ std::string::const_iterator start,
279
+ std::string::const_iterator stop,
280
+ boost::smatch& matches,
281
+ VALUE reg_obj)
193
282
  {
194
- static struct re_registers regs;
283
+ boost::regex reg = *get_br_from_value(reg_obj);
195
284
  try {
196
285
  if (boost::regex_search(start, stop, matches, reg)) {
197
- VALUE match = get_backref_for_modification();
198
- RMATCH(match)->str = rb_str_dup(str);
199
- fill_regs_from_smatch(start, stop, &regs, matches);
200
- re_copy_registers(RMATCH(match)->regs, &regs);
201
- rb_backref_set(match);
286
+ save_backref_with_smatch(str, start, stop, matches, reg_obj);
202
287
  return true;
203
288
  } else {
204
289
  rb_backref_set(Qnil);
@@ -211,7 +296,6 @@ br_reg_match_iters(VALUE str, std::string::const_iterator start, std::string::co
211
296
 
212
297
  static int
213
298
  br_reg_search(VALUE self, VALUE str) {
214
- boost::regex reg = *get_br_from_value(self);
215
299
  std::string input = StringValuePtr(str);
216
300
 
217
301
  std::string::const_iterator start, end;
@@ -219,7 +303,7 @@ br_reg_search(VALUE self, VALUE str) {
219
303
  end = input.end();
220
304
 
221
305
  boost::smatch matches;
222
- if (br_reg_match_iters(str, start, end, matches, reg)) {
306
+ if (br_reg_match_iters(str, start, end, matches, self)) {
223
307
  return matches[0].first - start;
224
308
  } else {
225
309
  return -1;
@@ -228,7 +312,6 @@ br_reg_search(VALUE self, VALUE str) {
228
312
 
229
313
  static VALUE
230
314
  br_reg_do_match(VALUE self, VALUE str) {
231
- boost::regex reg = *get_br_from_value(self);
232
315
  std::string input = StringValuePtr(str);
233
316
 
234
317
  std::string::const_iterator start, end;
@@ -236,7 +319,7 @@ br_reg_do_match(VALUE self, VALUE str) {
236
319
  end = input.end();
237
320
 
238
321
  boost::smatch matches;
239
- if (br_reg_match_iters(str, start, end, matches, reg)) {
322
+ if (br_reg_match_iters(str, start, end, matches, self)) {
240
323
  return rb_backref_get();
241
324
  } else {
242
325
  return Qnil;
@@ -341,5 +424,7 @@ extern "C" {
341
424
  rb_define_const(rb_cBoostRegexp, "BK_PLUS_QM", UINT2NUM(boost::regex_constants::bk_plus_qm));
342
425
  rb_define_const(rb_cBoostRegexp, "BK_VBAR", UINT2NUM(boost::regex_constants::bk_vbar));
343
426
  rb_define_const(rb_cBoostRegexp, "LITERAL", UINT2NUM(boost::regex_constants::literal));
427
+
428
+ return Qnil;
344
429
  }
345
430
  }
metadata CHANGED
@@ -4,9 +4,9 @@ version: !ruby/object:Gem::Version
4
4
  prerelease: false
5
5
  segments:
6
6
  - 1
7
- - 0
8
7
  - 1
9
- version: 1.0.1
8
+ - 0
9
+ version: 1.1.0
10
10
  platform: ruby
11
11
  authors:
12
12
  - Michael Edgar
@@ -14,7 +14,7 @@ autorequire:
14
14
  bindir: bin
15
15
  cert_chain: []
16
16
 
17
- date: 2010-04-08 00:00:00 -04:00
17
+ date: 2010-04-09 00:00:00 -04:00
18
18
  default_executable:
19
19
  dependencies:
20
20
  - !ruby/object:Gem::Dependency
@@ -62,6 +62,7 @@ files:
62
62
  - Rakefile
63
63
  - VERSION
64
64
  - benchmark/benchmark.rb
65
+ - benchmark/benchmark19.rb
65
66
  - benchmark/fasta.input
66
67
  - ext/ruby-boost-regex/extconf.rb
67
68
  - ext/ruby-boost-regex/regexp.cpp