object_regex 1.0.0 → 1.0.1

Files changed (3)
  1. data/README.md +217 -246
  2. data/VERSION +1 -1
  3. metadata +4 -8
data/README.md CHANGED
@@ -1,3 +1,32 @@
+ # Super-Quick Introduction
+
+ Ruby 1.9+ only. Not for a strict technical reason, but because I like 1.9's standard-library features and didn't
+ want to rewrite the gem to avoid them.
+
+     gem install object_regex
+
+     require 'object_regex'
+     class Token < Struct.new(:type, :contents)
+       def reg_desc
+         type.to_s
+       end
+     end
+     input = [Token.new(:str, '"hello"'),
+              Token.new(:str, '"there"'),
+              Token.new(:int, '2'),
+              Token.new(:str, '"worldagain"'),
+              Token.new(:str, '"highfive"'),
+              Token.new(:int, '5'),
+              Token.new(:str, 'jklkjl'),
+              Token.new(:int, '3'),
+              Token.new(:comment, '#lol'),
+              Token.new(:str, ''),
+              Token.new(:comment, '#no pairs'),
+              Token.new(:str, 'jkl'),
+              Token.new(:eof, '')]
+     # all contiguous string tokens, and any number that follows them (if any)
+     ObjectRegex.new('str+ int?').all_matches(input)
+
  ## Introduction
 
 I present a small Ruby class which provides full Ruby Regexp matching on sequences of (potentially) heterogeneous objects, conditioned on those objects implementing a single, no-argument method returning a String. I propose it should be used to implement the desired behavior in the Ruby standard library.
@@ -12,35 +41,29 @@ I decided a while ago I wouldn't use [YARD](http://yardoc.org/)'s Ripper-based p
 
 Since Ripper strips the comments out when you use `Ripper.sexp`, and I'm not going to switch to the SAX model of parsing just for comments, I had to use `Ripper.lex` to grab the comments. I immediately found this would prove annoying:
 
- {{{
- pp Ripper.lex(" # some comment\n # another comment\n def abc; end")
- }}}
+     pp Ripper.lex(" # some comment\n # another comment\n def abc; end")
 
 gives
 
- {{{
- [[[1, 0], :on_sp, " "],
- [[1, 2], :on_comment, "# some comment\n"],
- [[2, 0], :on_sp, " "],
- [[2, 2], :on_comment, "# another comment\n"],
- [[3, 0], :on_sp, " "],
- [[3, 1], :on_kw, "def"],
- [[3, 4], :on_sp, " "],
- [[3, 5], :on_ident, "abc"],
- [[3, 8], :on_semicolon, ";"],
- [[3, 9], :on_sp, " "],
- [[3, 10], :on_kw, "end"]]
- }}}
+     [[[1, 0], :on_sp, " "],
+      [[1, 2], :on_comment, "# some comment\n"],
+      [[2, 0], :on_sp, " "],
+      [[2, 2], :on_comment, "# another comment\n"],
+      [[3, 0], :on_sp, " "],
+      [[3, 1], :on_kw, "def"],
+      [[3, 4], :on_sp, " "],
+      [[3, 5], :on_ident, "abc"],
+      [[3, 8], :on_semicolon, ";"],
+      [[3, 9], :on_sp, " "],
+      [[3, 10], :on_kw, "end"]]
 
 Naturally, Ripper is separating each line comment into its own token, even those that follow on subsequent lines. I'd have to combine those comment tokens to get what a typical programmer considers one logical comment.
 
 I didn't want to write an ugly, imperative algorithm to do this: part of the beauty of writing Ruby is that you don't often have to actually write a `while` loop. I described my frustration to my roommate, and he quickly observed the obvious connection to regular expressions. That's when I remembered [Ripper.slice and Ripper.token_match](http://ruby-doc.org/ruby-1.9/classes/Ripper.html#M001274) (token_match is undocumented), which provide almost exactly what I needed:
 
- {{{
- Ripper.slice(" # some comment\n # another comment\n def abc; end",
- 'comment (sp? comment)*')
- # => "# some comment\n # another comment\n"
- }}}
+     Ripper.slice(" # some comment\n # another comment\n def abc; end",
+                  'comment (sp? comment)*')
+     # => "# some comment\n # another comment\n"
 
 A few problems: `Ripper.slice` lexes its input on each invocation and then searches it from the start for one match. I need *all* matches. `Ripper.slice` also returns the exact string, and not the location in the source text of the match, which I need - how else will I know where the comments are? The lexer output includes line and column locations, so it should be easy to retrieve.
 
@@ -52,10 +75,8 @@ The core of regular expressions - the [actually "regular" kind](http://en.wikipe
 
 We could construct a separate DFA engine for searching sequences of our new alphabet, but we'd much rather piggyback on an existing (and more-featured) implementation. Since the set of token types is countable, one can create a one-to-one mapping from token types to finite strings of an alphabet that Ruby's `Regexp` class can search, namely regular old characters. If we replace each occurrence of a member of our alphabet with a member of the target Regexp alphabet, then we should be able to use Regexp to do regex searching on our token sequence. That transformation on the token sequence is easy: just map each token's type onto some string using a 1-to-1 function. However, one important bit that remains is how the search pattern is specified. As you saw above, we used:
 
- {{{
- 'comment (sp? comment)*'
- }}}
-
+     'comment (sp? comment)*'
+
 to specify a search for "a comment token, followed by zero or more groups, where each group is an optional space token followed by a comment token." This departs from traditional Regexp syntax because our alphabet is no longer composed of individual characters; it is composed of tokens. For this implementation's sake, we can observe that we require whitespace to be insignificant, and that the `?` and `*` operators apply to tokens, not to characters. We could specify this input however we like, as long as we can generate the correct string-searching pattern from it.
 
 One last observation allows us to use Regexp to search our tokens: we must be able to specify a one-to-one function from a token name to the set of tokens that it should match. In other words, no two tokens that we consider "different" can have the same token type. For a normal Regex, this is a trivial condition, as a character matches only that character. However, 'comment' must match the infinite set of all comment tokens. If we satisfy that condition, then there exists a function from a regex on token types to a regex on strings. This is still pretty trivial to show for tokens, but later, when we generalize this approach further, it becomes even more important to do correctly.
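 As a quick illustration of that transformation, the whole trick fits in a couple of lines of plain Ruby. The two-entry `MAP` here is made up for the example, not taken from Ripper:
 
     # Illustrative only: a hand-rolled two-token alphabet.
     MAP = { 'comment' => 'f', 'sp' => 'L' }
     
     pattern = 'comment (sp? comment)*'
     # Swap each token name for its character, then drop the (insignificant) whitespace.
     translated = pattern.gsub(/[A-Za-z]+/) { |tok| MAP[tok] }.gsub(/\s/, '')
     translated  # => "f(L?f)*"

```ruby
# Illustrative only: a hand-rolled two-token alphabet.
MAP = { 'comment' => 'f', 'sp' => 'L' }

pattern = 'comment (sp? comment)*'
# Swap each token name for its character, then drop the (insignificant) whitespace.
translated = pattern.gsub(/[A-Za-z]+/) { |tok| MAP[tok] }.gsub(/\s/, '')
translated  # => "f(L?f)*"
```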
@@ -78,31 +99,27 @@ Let's run through the previous example:
 
 Ripper runs this code at load-time:
 
- {{{
- seed = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a
- SCANNER_EVENT_TABLE.each do |ev, |
- raise CompileError, "[RIPPER FATAL] too many system token" if seed.empty?
- MAP[ev.to_s.sub(/\Aon_/,'')] = seed.shift
- end
- }}}
+     seed = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a
+     SCANNER_EVENT_TABLE.each do |ev, |
+       raise CompileError, "[RIPPER FATAL] too many system token" if seed.empty?
+       MAP[ev.to_s.sub(/\Aon_/,'')] = seed.shift
+     end
 
 I fired up an `irb` instance and checked the result:
 
- {{{
- Ripper::TokenPattern::MAP
- # => {"CHAR"=>"a", "__end__"=>"b", "backref"=>"c", "backtick"=>"d",
- "comma"=>"e", "comment"=>"f", "const"=>"g", "cvar"=>"h", "embdoc"=>"i",
- "embdoc_beg"=>"j", "embdoc_end"=>"k", "embexpr_beg"=>"l",
- "embexpr_end"=>"m", "embvar"=>"n", "float"=>"o", "gvar"=>"p",
- "heredoc_beg"=>"q", "heredoc_end"=>"r", "ident"=>"s", "ignored_nl"=>"t",
- "int"=>"u", "ivar"=>"v", "kw"=>"w", "label"=>"x", "lbrace"=>"y",
- "lbracket"=>"z", "lparen"=>"A", "nl"=>"B", "op"=>"C", "period"=>"D",
- "qwords_beg"=>"E", "rbrace"=>"F", "rbracket"=>"G", "regexp_beg"=>"H",
- "regexp_end"=>"I", "rparen"=>"J", "semicolon"=>"K", "sp"=>"L",
- "symbeg"=>"M", "tlambda"=>"N", "tlambeg"=>"O", "tstring_beg"=>"P",
- "tstring_content"=>"Q", "tstring_end"=>"R", "words_beg"=>"S",
- "words_sep"=>"T"}
- }}}
+     Ripper::TokenPattern::MAP
+     # => {"CHAR"=>"a", "__end__"=>"b", "backref"=>"c", "backtick"=>"d",
+     "comma"=>"e", "comment"=>"f", "const"=>"g", "cvar"=>"h", "embdoc"=>"i",
+     "embdoc_beg"=>"j", "embdoc_end"=>"k", "embexpr_beg"=>"l",
+     "embexpr_end"=>"m", "embvar"=>"n", "float"=>"o", "gvar"=>"p",
+     "heredoc_beg"=>"q", "heredoc_end"=>"r", "ident"=>"s", "ignored_nl"=>"t",
+     "int"=>"u", "ivar"=>"v", "kw"=>"w", "label"=>"x", "lbrace"=>"y",
+     "lbracket"=>"z", "lparen"=>"A", "nl"=>"B", "op"=>"C", "period"=>"D",
+     "qwords_beg"=>"E", "rbrace"=>"F", "rbracket"=>"G", "regexp_beg"=>"H",
+     "regexp_end"=>"I", "rparen"=>"J", "semicolon"=>"K", "sp"=>"L",
+     "symbeg"=>"M", "tlambda"=>"N", "tlambeg"=>"O", "tstring_beg"=>"P",
+     "tstring_content"=>"Q", "tstring_end"=>"R", "words_beg"=>"S",
+     "words_sep"=>"T"}
 
 This is completely implementation-dependent, but these characters are an implementation detail of the algorithm anyway.
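 That load-time loop is easy to replay in isolation; here is a sketch with a stand-in event list in place of `SCANNER_EVENT_TABLE` (the first five events, in the order that yields the table above):

```ruby
seed = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a
events = [:on_CHAR, :on___end__, :on_backref, :on_backtick, :on_comma]

map = {}
events.each do |ev|
  # Ripper raises CompileError here; a plain RuntimeError suffices for the sketch.
  raise "too many system token" if seed.empty?
  map[ev.to_s.sub(/\Aon_/, '')] = seed.shift
end

map  # => {"CHAR"=>"a", "__end__"=>"b", "backref"=>"c", "backtick"=>"d", "comma"=>"e"}
```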
 
@@ -110,27 +127,21 @@ This is completely implementation-dependent, but these characters are an impleme
 
 Ripper implements this as follows:
 
- {{{
- def map_tokens(tokens)
- tokens.map {|pos,type,str| map_token(type.to_s.sub(/\Aon_/,'')) }.join
- end
- }}}
+     def map_tokens(tokens)
+       tokens.map {|pos,type,str| map_token(type.to_s.sub(/\Aon_/,'')) }.join
+     end
 
 Running this on our token stream from before (markdown doesn't support anchors, so scroll up if necessary), we get this:
 
- {{{
- "LfLfLwLsKLw"
- }}}
-
+     "LfLfLwLsKLw"
+
 This is what we will eventually run our modified Regexp against.
 
 ### The search pattern is transformed into a pattern that can search this mapped representation of the token sequence. Each token found in the search pattern is replaced by its corresponding single character, and whitespace is removed.
 
 What we want is `comment (sp? comment)*`. In this mapped representation, a quick look at the table above shows the regex we need is
 
- {{{
- /f(L?f)*/
- }}}
+     /f(L?f)*/
 
 Ripper implements this in a somewhat roundabout fashion, as it seems they wanted to experiment with slightly different syntax. Since my implementation (which I'll present shortly) does not retain these syntax changes, I choose not to list the Ripper version here.
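 The same mapped string can be produced without touching Ripper's private `map_token`; a self-contained sketch using only the subset of the (implementation-dependent) table we need, under a hypothetical `CHAR_FOR` name:

```ruby
# Subset of the character table shown above; the name CHAR_FOR is illustrative.
CHAR_FOR = { 'sp' => 'L', 'comment' => 'f', 'kw' => 'w',
             'ident' => 's', 'semicolon' => 'K' }

def map_tokens(tokens)
  tokens.map { |pos, type, str| CHAR_FOR[type.to_s.sub(/\Aon_/, '')] }.join
end

tokens = [[[1, 0], :on_sp, " "],
          [[1, 2], :on_comment, "# some comment\n"],
          [[2, 0], :on_sp, " "],
          [[2, 2], :on_comment, "# another comment\n"],
          [[3, 0], :on_sp, " "],
          [[3, 1], :on_kw, "def"],
          [[3, 4], :on_sp, " "],
          [[3, 5], :on_ident, "abc"],
          [[3, 8], :on_semicolon, ";"],
          [[3, 9], :on_sp, " "],
          [[3, 10], :on_kw, "end"]]

map_tokens(tokens)  # => "LfLfLwLsKLw"
```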
 
@@ -140,33 +151,27 @@ We run `/f(L?f)*/` on `"LfLfLwLsKLw"`. It matches `fLf` at position 1.
 
 As expected, the implementation is quite simple for Ripper:
 
- {{{
- def match_list(tokens)
- if m = @re.match(map_tokens(tokens))
- then MatchData.new(tokens, m)
- else nil
- end
- end
- }}}
+     def match_list(tokens)
+       if m = @re.match(map_tokens(tokens))
+       then MatchData.new(tokens, m)
+       else nil
+       end
+     end
 
 ### Since each character in the mapped sequence corresponds to a single token, we can index into the original token sequence using the exact boundaries of the match result.
 
 The boundaries returned were `[1..4)` in mathematical notation, or `(1...4)`/`(1..3)` as Ruby ranges. We then use this range on the original sequence, which returns:
 
- {{{
- [[[1, 2], :on_comment, "# some comment\n"],
- [[2, 0], :on_sp, " "],
- [[2, 2], :on_comment, "# another comment\n"]]
- }}}
+     [[[1, 2], :on_comment, "# some comment\n"],
+      [[2, 0], :on_sp, " "],
+      [[2, 2], :on_comment, "# another comment\n"]]
 
 The implementation is again quite simple in Ripper, yet for some reason it immediately extracts the token contents:
 
- {{{
- def match(n = 0)
- return [] unless @match
- @tokens[@match.begin(n)...@match.end(n)].map {|pos,type,str| str }
- end
- }}}
+     def match(n = 0)
+       return [] unless @match
+       @tokens[@match.begin(n)...@match.end(n)].map {|pos,type,str| str }
+     end
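 The boundary arithmetic is easy to check in plain Ruby (token symbols stand in for the full lexer tuples):

```ruby
mapped = "LfLfLwLsKLw"
m = /f(L?f)*/.match(mapped)

# One character per token, so the match offsets index the token sequence directly.
tokens = [:sp, :comment, :sp, :comment, :sp, :kw,
          :sp, :ident, :semicolon, :sp, :kw]

m.begin(0)                     # => 1
m.end(0)                       # => 4
tokens[m.begin(0)...m.end(0)]  # => [:comment, :sp, :comment]
```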
 
  ## Generalization
 
@@ -187,74 +192,72 @@ For lack of a better name, we'll call this an `ObjectRegex`.
 
 The full listing follows. You'll quickly notice that I haven't yet implemented the API that I actually need for Wool. Keeping focused seems incompatible with curiosity in my case, unfortunately.
 
- {{{
- class ObjectRegex
-   def initialize(pattern)
-     @map = generate_map(pattern)
-     @pattern = generate_pattern(pattern)
-   end
-
-   def mapped_value(reg_desc)
-     @map[reg_desc] || @map[:FAILBOAT]
-   end
-
-   MAPPING_CHARS = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a
-   def generate_map(pattern)
-     alphabet = pattern.scan(/[A-Za-z]+/).uniq
-     repr_size = Math.log(alphabet.size + 1, MAPPING_CHARS.size).ceil
-     @item_size = repr_size + 1
-
-     map = Hash[alphabet.map.with_index do |symbol, idx|
-       [symbol, mapping_for_idx(repr_size, idx)]
-     end]
-     map.merge!(FAILBOAT: mapping_for_idx(repr_size, map.size))
-   end
-
-   def mapping_for_idx(repr_size, idx)
-     convert_to_mapping_radix(repr_size, idx).map do |char|
-       MAPPING_CHARS[char]
-     end.join + ';'
-   end
-
-   def convert_to_mapping_radix(repr_size, num)
-     result = []
-     repr_size.times do
-       result.unshift(num % MAPPING_CHARS.size)
-       num /= MAPPING_CHARS.size
-     end
-     result
-   end
-
-   def generate_pattern(pattern)
-     replace_tokens(fix_dots(remove_ranges(pattern)))
-   end
-
-   def remove_ranges(pattern)
-     pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
-       '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
-     end
-   end
-
-   def fix_dots(pattern)
-     pattern.gsub('.', '.' * (@item_size - 1) + ';')
-   end
-
-   def replace_tokens(pattern)
-     pattern.gsub(/[A-Za-z]+/) do |match|
-       '(?:' + mapped_value(match) + ')'
-     end.gsub(/\s/, '')
-   end
-
-   def match(input)
-     new_input = input.map { |object| object.reg_desc }.
-                       map { |desc| mapped_value(desc) }.join
-     if (match = new_input.match(@pattern))
-       start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
-       input[start...stop]
-     end
-   end
- end
- }}}
+     class ObjectRegex
+       def initialize(pattern)
+         @map = generate_map(pattern)
+         @pattern = generate_pattern(pattern)
+       end
+
+       def mapped_value(reg_desc)
+         @map[reg_desc] || @map[:FAILBOAT]
+       end
+
+       MAPPING_CHARS = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a
+       def generate_map(pattern)
+         alphabet = pattern.scan(/[A-Za-z]+/).uniq
+         repr_size = Math.log(alphabet.size + 1, MAPPING_CHARS.size).ceil
+         @item_size = repr_size + 1
+
+         map = Hash[alphabet.map.with_index do |symbol, idx|
+           [symbol, mapping_for_idx(repr_size, idx)]
+         end]
+         map.merge!(FAILBOAT: mapping_for_idx(repr_size, map.size))
+       end
+
+       def mapping_for_idx(repr_size, idx)
+         convert_to_mapping_radix(repr_size, idx).map do |char|
+           MAPPING_CHARS[char]
+         end.join + ';'
+       end
+
+       def convert_to_mapping_radix(repr_size, num)
+         result = []
+         repr_size.times do
+           result.unshift(num % MAPPING_CHARS.size)
+           num /= MAPPING_CHARS.size
+         end
+         result
+       end
+
+       def generate_pattern(pattern)
+         replace_tokens(fix_dots(remove_ranges(pattern)))
+       end
+
+       def remove_ranges(pattern)
+         pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
+           '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
+         end
+       end
+
+       def fix_dots(pattern)
+         pattern.gsub('.', '.' * (@item_size - 1) + ';')
+       end
+
+       def replace_tokens(pattern)
+         pattern.gsub(/[A-Za-z]+/) do |match|
+           '(?:' + mapped_value(match) + ')'
+         end.gsub(/\s/, '')
+       end
+
+       def match(input)
+         new_input = input.map { |object| object.reg_desc }.
+                           map { |desc| mapped_value(desc) }.join
+         if (match = new_input.match(@pattern))
+           start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
+           input[start...stop]
+         end
+       end
+     end
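 The listing is complete enough to exercise end to end. The sketch below repeats the class verbatim (condensed only in block style) so it runs standalone, then drives it with the `Token` struct from the Super-Quick Introduction; `all_matches` is not part of this listing, so only `#match` - which returns the first (leftmost, longest) matching subsequence - is shown:

```ruby
# Copied from the listing above, condensed only in block style.
class ObjectRegex
  def initialize(pattern)
    @map = generate_map(pattern)
    @pattern = generate_pattern(pattern)
  end

  def mapped_value(reg_desc)
    @map[reg_desc] || @map[:FAILBOAT]
  end

  MAPPING_CHARS = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a

  def generate_map(pattern)
    alphabet = pattern.scan(/[A-Za-z]+/).uniq
    repr_size = Math.log(alphabet.size + 1, MAPPING_CHARS.size).ceil
    @item_size = repr_size + 1
    map = Hash[alphabet.map.with_index do |symbol, idx|
      [symbol, mapping_for_idx(repr_size, idx)]
    end]
    map.merge!(FAILBOAT: mapping_for_idx(repr_size, map.size))
  end

  def mapping_for_idx(repr_size, idx)
    convert_to_mapping_radix(repr_size, idx).map { |c| MAPPING_CHARS[c] }.join + ';'
  end

  def convert_to_mapping_radix(repr_size, num)
    result = []
    repr_size.times do
      result.unshift(num % MAPPING_CHARS.size)
      num /= MAPPING_CHARS.size
    end
    result
  end

  def generate_pattern(pattern)
    replace_tokens(fix_dots(remove_ranges(pattern)))
  end

  def remove_ranges(pattern)
    pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
      '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
    end
  end

  def fix_dots(pattern)
    pattern.gsub('.', '.' * (@item_size - 1) + ';')
  end

  def replace_tokens(pattern)
    pattern.gsub(/[A-Za-z]+/) { |m| '(?:' + mapped_value(m) + ')' }.gsub(/\s/, '')
  end

  def match(input)
    new_input = input.map { |obj| obj.reg_desc }.map { |desc| mapped_value(desc) }.join
    if (match = new_input.match(@pattern))
      start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
      input[start...stop]
    end
  end
end

# Drive it with the Token struct from the Super-Quick Introduction.
Token = Struct.new(:type, :contents) do
  def reg_desc
    type.to_s
  end
end

input = [Token.new(:str, '"hello"'),
         Token.new(:str, '"there"'),
         Token.new(:int, '2'),
         Token.new(:comment, '#lol'),
         Token.new(:str, 'jkl')]

result = ObjectRegex.new('str+ int?').match(input)
result.map(&:type)  # => [:str, :str, :int]
```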
 
  ## Generalized Map Generation
 
@@ -262,61 +265,47 @@ Generating the map is the primary interest here, so I'll start there.
 
 First, we discover the alphabet by extracting all matches for `/[A-Za-z]+/` from the input pattern.
 
- {{{
- alphabet = pattern.scan(/[A-Za-z]+/).uniq
- }}}
+     alphabet = pattern.scan(/[A-Za-z]+/).uniq
 
 We figure out how many characters we need to represent that many elements, and save that for later:
 
- {{{
- # alphabet.size + 1 because of the catch-all, "not-in-pattern" mapping
- repr_size = Math.log(alphabet.size + 1, MAPPING_CHARS.size).ceil
- # repr_size + 1 because we will be inserting a terminator in a moment
- @item_size = repr_size + 1
- }}}
+     # alphabet.size + 1 because of the catch-all, "not-in-pattern" mapping
+     repr_size = Math.log(alphabet.size + 1, MAPPING_CHARS.size).ceil
+     # repr_size + 1 because we will be inserting a terminator in a moment
+     @item_size = repr_size + 1
 
 Now, we just calculate the [symbol, mapped\_symbol] pairs for each symbol in the input alphabet:
 
- {{{
- map = Hash[alphabet.map.with_index do |symbol, idx|
- [symbol, mapping_for_idx(repr_size, idx)]
- end]
- }}}
+     map = Hash[alphabet.map.with_index do |symbol, idx|
+       [symbol, mapping_for_idx(repr_size, idx)]
+     end]
 
 We'll come back to how this works, but we must add the catch-all map entry: the entry that is triggered if we see a token in the searched sequence that didn't appear in the search pattern:
 
- {{{
- map.merge!(FAILBOAT: mapping_for_idx(repr_size, map.size))
- }}}
+     map.merge!(FAILBOAT: mapping_for_idx(repr_size, map.size))
 
 Note that we avoid the use of the `inject({})` idiom common for constructing Hashes, since the computation of each tuple is independent of the others. `mapping_for_idx` is responsible for finding the mapped string for the given element. In Ripper, this was just an index into an array. However, if we want more than 62 possible elements in our alphabet, we instead need to convert the index into a base-62 number first. `convert_to_mapping_radix` does this, using the size of the `MAPPING_CHARS` constant as the new radix:
 
- {{{
- # Standard radix conversion.
- def convert_to_mapping_radix(repr_size, num)
- result = []
- repr_size.times do
- result.unshift(num % MAPPING_CHARS.size)
- num /= MAPPING_CHARS.size
- end
- result
- end
- }}}
+     # Standard radix conversion.
+     def convert_to_mapping_radix(repr_size, num)
+       result = []
+       repr_size.times do
+         result.unshift(num % MAPPING_CHARS.size)
+         num /= MAPPING_CHARS.size
+       end
+       result
+     end
 
 If MAPPING\_CHARS.size = 62, then:
 
- {{{
- convert_to_mapping_radix(3, 12498)
- # => [3, 15, 36]
- }}}
+     convert_to_mapping_radix(3, 12498)
+     # => [3, 15, 36]
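 A quick sanity check of that conversion (the method is copied verbatim from the listing; the arithmetic is spelled out in the comment):

```ruby
MAPPING_CHARS = ('a'..'z').to_a + ('A'..'Z').to_a + ('0'..'9').to_a

def convert_to_mapping_radix(repr_size, num)
  result = []
  repr_size.times do
    result.unshift(num % MAPPING_CHARS.size)
    num /= MAPPING_CHARS.size
  end
  result
end

# 12498 = 3 * 62**2 + 15 * 62 + 36, hence the three base-62 digits:
convert_to_mapping_radix(3, 12498)  # => [3, 15, 36]
```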
 
 After we convert each number into the necessary radix, we can then convert that array of place-value integers into a string by mapping each place value to its corresponding character in the MAPPING\_CHARS array:
 
- {{{
- def mapping_for_idx(repr_size, idx)
- convert_to_mapping_radix(repr_size, idx).map { |char| MAPPING_CHARS[char] }.join + ';'
- end
- }}}
+     def mapping_for_idx(repr_size, idx)
+       convert_to_mapping_radix(repr_size, idx).map { |char| MAPPING_CHARS[char] }.join + ';'
+     end
 
 Notice that we added a semicolon at the end there. The choice of semicolon was arbitrary - it could be any valid character that isn't in MAPPING\_CHARS. Why'd I add that?
 
@@ -326,97 +315,79 @@ Imagine we were searching for a long input sequence that needed 2 characters per
 
 After building the new map, constructing the corresponding search pattern is quite simple:
 
- {{{
- def generate_pattern(pattern)
-   replace_tokens(fix_dots(remove_ranges(pattern)))
- end
-
- def remove_ranges(pattern)
-   pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
-     '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
-   end
- end
-
- def fix_dots(pattern)
-   pattern.gsub('.', '.' * (@item_size - 1) + ';')
- end
-
- def replace_tokens(pattern)
-   pattern.gsub(/[A-Za-z]+/) do |match|
-     '(?:' + mapped_value(match) + ')'
-   end.gsub(/\s/, '')
- end
- }}}
+     def generate_pattern(pattern)
+       replace_tokens(fix_dots(remove_ranges(pattern)))
+     end
+
+     def remove_ranges(pattern)
+       pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
+         '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
+       end
+     end
+
+     def fix_dots(pattern)
+       pattern.gsub('.', '.' * (@item_size - 1) + ';')
+     end
+
+     def replace_tokens(pattern)
+       pattern.gsub(/[A-Za-z]+/) do |match|
+         '(?:' + mapped_value(match) + ')'
+       end.gsub(/\s/, '')
+     end
 
 First, we have to account for this regex syntax:
 
- {{{
- [comment embdoc_beg int]
- }}}
+     [comment embdoc_beg int]
 
 which we assume to mean "comment or embdoc_beg or int", much like `[Acf]` means "A or c or f". Since constructs such as `A-Z` don't make sense with an arbitrary alphabet, we don't need to concern ourselves with that syntax. However, if we simply replace "comment" with its mapped string, and do the same with embdoc_beg and int, we get something like this:
 
- {{{
- [f;j;u;]
- }}}
+     [f;j;u;]
 
 which won't work: it'll match any semicolon! So we manually replace all instances of `[tok1 tok2 ... tokn]` with `(?:tok1|tok2|...|tokn)`. A simple gsub does the trick, since nested ranges don't really make much sense. This is implemented in #remove\_ranges:
 
- {{{
- def remove_ranges(pattern)
-   pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
-     '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
-   end
- end
- }}}
+     def remove_ranges(pattern)
+       pattern.gsub(/\[([A-Za-z ]*)\]/) do |match|
+         '(?:' + match[1..-2].split(/\s+/).join('|') + ')'
+       end
+     end
 
 Next, we replace the '.' matcher with a sequence of dots equal to the size of our token mapping, followed by a semicolon: this is how we properly match "any alphabet element" in our mapped form.
 
- {{{
- def fix_dots(pattern)
-   pattern.gsub('.', '.' * (@item_size - 1) + ';')
- end
- }}}
+     def fix_dots(pattern)
+       pattern.gsub('.', '.' * (@item_size - 1) + ';')
+     end
 
 Then, we simply replace each alphabet element with its mapped value. Since those mapped values could be more than one character, we must group them for other Regex features such as `+` or `*` to work properly; since we may want to extract subexpressions, we must make the group we introduce here non-capturing. Then we just strip whitespace.
 
- {{{
- def replace_tokens(pattern)
-   pattern.gsub(/[A-Za-z]+/) do |match|
-     '(?:' + mapped_value(match) + ')'
-   end.gsub(/\s/, '')
- end
- }}}
+     def replace_tokens(pattern)
+       pattern.gsub(/[A-Za-z]+/) do |match|
+         '(?:' + mapped_value(match) + ')'
+       end.gsub(/\s/, '')
+     end
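 In isolation, the dot rewrite looks like this. The sketch passes `@item_size` in as an explicit parameter so it runs standalone; `item_size` is 2 here, i.e. one mapping character plus the `';'` terminator:

```ruby
# Adapted from fix_dots in the listing: item_size is a parameter here,
# rather than the @item_size instance variable, so the sketch is standalone.
def fix_dots(pattern, item_size)
  pattern.gsub('.', '.' * (item_size - 1) + ';')
end

fix_dots('comment . comment', 2)  # => "comment .; comment"
```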
 
  ## Generalized Matching
 
 Lastly, we have a simple #match method:
 
- {{{
- def match(input)
-   new_input = input.map { |object| object.reg_desc }.map { |desc| mapped_value(desc) }.join
-   if (match = new_input.match(@pattern))
-     start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
-     input[start...stop]
-   end
- end
- }}}
+     def match(input)
+       new_input = input.map { |object| object.reg_desc }.map { |desc| mapped_value(desc) }.join
+       if (match = new_input.match(@pattern))
+         start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
+         input[start...stop]
+       end
+     end
 
 While there are many ways of extracting results from a Regex match, here we do the simplest: return the subsequence of the original sequence that matches first (using the usual leftmost, longest rule, of course). Here comes the one part where you have to modify the objects that are in the sequence: in the first line, you'll see:
 
- {{{
- input.map { |object| object.reg_desc }.map { |desc| mapped_value(desc) }
- }}}
+     input.map { |object| object.reg_desc }.map { |desc| mapped_value(desc) }
 
 This interrogates each object for its string representation: the string you typed into your search pattern if you wanted to find it. The method name (`reg_desc` in this case) is arbitrary, and this could also be implemented by providing a `Proc` to the ObjectRegex at initialization, and having the Proc be responsible for determining string representations.
 
 We also see, on the 3rd and 4th lines of the method, why we stored @item\_size earlier: for boundary calculations:
 
- {{{
- start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
- input[start...stop]
- }}}
-
+     start, stop = match.begin(0) / @item_size, match.end(0) / @item_size
+     input[start...stop]
+
 Sometimes I wish `begin` and `end` could be local variable names in Ruby. Alas.
 
  ## Conclusion
data/VERSION CHANGED
@@ -1 +1 @@
- 1.0.0
+ 1.0.1
metadata CHANGED
@@ -5,8 +5,8 @@ version: !ruby/object:Gem::Version
 segments:
 - 1
 - 0
- - 0
- version: 1.0.0
+ - 1
+ version: 1.0.1
 platform: ruby
 authors:
 - Michael Edgar
@@ -14,14 +14,13 @@ autorequire:
 bindir: bin
 cert_chain: []
 
- date: 2011-01-25 00:00:00 -05:00
+ date: 2011-01-31 00:00:00 -05:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
 name: rspec
 prerelease: false
 requirement: &id001 !ruby/object:Gem::Requirement
- none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
@@ -36,7 +35,6 @@ dependencies:
 name: yard
 prerelease: false
 requirement: &id002 !ruby/object:Gem::Requirement
- none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
@@ -78,7 +76,6 @@ rdoc_options:
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
- none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
@@ -86,7 +83,6 @@ required_ruby_version: !ruby/object:Gem::Requirement
 - 0
 version: "0"
 required_rubygems_version: !ruby/object:Gem::Requirement
- none: false
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
@@ -96,7 +92,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements: []
 
 rubyforge_project:
- rubygems_version: 1.3.7
+ rubygems_version: 1.3.6
 signing_key:
 specification_version: 3
 summary: Perform regex searches on arbitrary sequences.