regexp_parser 0.1.6 → 0.2.0

Files changed (84)
  1. checksums.yaml +4 -4
  2. data/ChangeLog +57 -0
  3. data/Gemfile +8 -0
  4. data/LICENSE +1 -1
  5. data/README.md +225 -206
  6. data/Rakefile +9 -3
  7. data/lib/regexp_parser.rb +7 -11
  8. data/lib/regexp_parser/expression.rb +72 -14
  9. data/lib/regexp_parser/expression/classes/alternation.rb +3 -16
  10. data/lib/regexp_parser/expression/classes/conditional.rb +57 -0
  11. data/lib/regexp_parser/expression/classes/free_space.rb +17 -0
  12. data/lib/regexp_parser/expression/classes/keep.rb +7 -0
  13. data/lib/regexp_parser/expression/classes/set.rb +28 -7
  14. data/lib/regexp_parser/expression/methods/strfregexp.rb +113 -0
  15. data/lib/regexp_parser/expression/methods/tests.rb +116 -0
  16. data/lib/regexp_parser/expression/methods/traverse.rb +63 -0
  17. data/lib/regexp_parser/expression/quantifier.rb +10 -0
  18. data/lib/regexp_parser/expression/sequence.rb +45 -0
  19. data/lib/regexp_parser/expression/subexpression.rb +29 -1
  20. data/lib/regexp_parser/lexer.rb +31 -8
  21. data/lib/regexp_parser/parser.rb +118 -45
  22. data/lib/regexp_parser/scanner.rb +1745 -1404
  23. data/lib/regexp_parser/scanner/property.rl +57 -3
  24. data/lib/regexp_parser/scanner/scanner.rl +161 -34
  25. data/lib/regexp_parser/syntax.rb +12 -2
  26. data/lib/regexp_parser/syntax/ruby/1.9.1.rb +3 -3
  27. data/lib/regexp_parser/syntax/ruby/1.9.3.rb +2 -7
  28. data/lib/regexp_parser/syntax/ruby/2.0.0.rb +4 -1
  29. data/lib/regexp_parser/syntax/ruby/2.1.4.rb +13 -0
  30. data/lib/regexp_parser/syntax/ruby/2.1.5.rb +13 -0
  31. data/lib/regexp_parser/syntax/ruby/2.1.rb +2 -2
  32. data/lib/regexp_parser/syntax/ruby/2.2.0.rb +16 -0
  33. data/lib/regexp_parser/syntax/ruby/2.2.rb +8 -0
  34. data/lib/regexp_parser/syntax/tokens.rb +19 -2
  35. data/lib/regexp_parser/syntax/tokens/conditional.rb +22 -0
  36. data/lib/regexp_parser/syntax/tokens/keep.rb +14 -0
  37. data/lib/regexp_parser/syntax/tokens/unicode_property.rb +45 -4
  38. data/lib/regexp_parser/token.rb +23 -8
  39. data/lib/regexp_parser/version.rb +5 -0
  40. data/regexp_parser.gemspec +35 -0
  41. data/test/expression/test_all.rb +6 -1
  42. data/test/expression/test_base.rb +19 -0
  43. data/test/expression/test_conditionals.rb +114 -0
  44. data/test/expression/test_free_space.rb +33 -0
  45. data/test/expression/test_set.rb +61 -0
  46. data/test/expression/test_strfregexp.rb +214 -0
  47. data/test/expression/test_subexpression.rb +24 -0
  48. data/test/expression/test_tests.rb +99 -0
  49. data/test/expression/test_to_h.rb +48 -0
  50. data/test/expression/test_to_s.rb +46 -0
  51. data/test/expression/test_traverse.rb +164 -0
  52. data/test/lexer/test_all.rb +16 -3
  53. data/test/lexer/test_conditionals.rb +101 -0
  54. data/test/lexer/test_keep.rb +24 -0
  55. data/test/lexer/test_literals.rb +51 -51
  56. data/test/lexer/test_nesting.rb +62 -62
  57. data/test/lexer/test_refcalls.rb +18 -20
  58. data/test/parser/test_all.rb +18 -3
  59. data/test/parser/test_alternation.rb +11 -14
  60. data/test/parser/test_conditionals.rb +148 -0
  61. data/test/parser/test_escapes.rb +29 -5
  62. data/test/parser/test_free_space.rb +139 -0
  63. data/test/parser/test_groups.rb +40 -0
  64. data/test/parser/test_keep.rb +21 -0
  65. data/test/scanner/test_all.rb +8 -2
  66. data/test/scanner/test_conditionals.rb +166 -0
  67. data/test/scanner/test_escapes.rb +8 -5
  68. data/test/scanner/test_free_space.rb +133 -0
  69. data/test/scanner/test_groups.rb +28 -0
  70. data/test/scanner/test_keep.rb +33 -0
  71. data/test/scanner/test_properties.rb +4 -0
  72. data/test/scanner/test_scripts.rb +71 -1
  73. data/test/syntax/ruby/test_1.9.3.rb +2 -2
  74. data/test/syntax/ruby/test_2.0.0.rb +38 -0
  75. data/test/syntax/ruby/test_2.2.0.rb +38 -0
  76. data/test/syntax/ruby/test_all.rb +1 -8
  77. data/test/syntax/ruby/test_files.rb +104 -0
  78. data/test/test_all.rb +2 -1
  79. data/test/token/test_all.rb +2 -0
  80. data/test/token/test_token.rb +109 -0
  81. metadata +75 -21
  82. data/VERSION.yml +0 -5
  83. data/lib/regexp_parser/ctype.rb +0 -48
  84. data/test/syntax/ruby/test_2.x.rb +0 -46
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA1:
- metadata.gz: 6ef4ef1e296f8e15fe5316a5603c15e96446cb45
- data.tar.gz: 0430451d4d0fb874dbdcc123d017d2867856d891
+ metadata.gz: 231a27b00daf24a41710b45ef92fef5b6963dc5a
+ data.tar.gz: 1cd1f75da74654cd20a0ac7716aed8519490fef5
  SHA512:
- metadata.gz: 05706f3dbe8f1fe9684ea63abf9a0b0e4ff354146bddd488a72fb9bd5352979123bb6d6adc75624dae33c39f1d41e34b76c33cc3706f1ee6f6a989b8e2e259f1
- data.tar.gz: d427c8ec82b4f955f47b1f27043a23604f654ff25244160fa416b0331c587cca0f2c4f1e9f47ee898934b753391e03cfa1ca6a3eb88b04c34efc255adf6391b8
+ metadata.gz: 620846b89adb5b8d27efe722af58951951ce0e7362f646fcc397c94f5597dc999d79851a507d13338ca7f59b67a25539a205878209f7787f6d4053ba95b2555c
+ data.tar.gz: 2f2400eede6011229f6690c8230ae1c6b3abdc2e83380cf0e7b8b6e1dd72c2c7035e82f87f808f64e0f47442df961ba13dae04164c35c776f41cf92aefa2f515
data/ChangeLog CHANGED
@@ -1,3 +1,60 @@
+ Wed Dec 3 05:21:27 2014 Ammar Ali <ammarabuali@gmail.com>
+
+ * Added expand_members method to CharacterSet, returns traditional
+ or unicode property forms of shothands (\d, \W, \s, etc.)
+
+ Tue Dec 2 02:42:39 2014 Ammar Ali <ammarabuali@gmail.com>
+
+ * Improved meaning and output of %t and %T in strfregexp.
+
+ * Added syntax versions for ruby 2.1.4 and 2.1.5 and updated
+ latest 2.1 version.
+
+ Mon Dec 1 15:52:31 2014 Ammar Ali <ammarabuali@gmail.com>
+
+ * Added to_h methods to Expression, Subexpression, and Quantifier.
+
+ Tue Oct 21 19:14:03 2014 Ammar Ali <ammarabuali@gmail.com>
+
+ * Added traversal methods; traverse, each_expression, and map.
+
+ * Added token/type test methods; type?, is?, and one_of?
+
+ * Added printing method strfregexp, inspired by strftime.
+
+ Mon Oct 20 01:03:46 2014 Ammar Ali <ammarabuali@gmail.com>
+
+ * Added scanning and parsing of free spacing (x mode) expressions.
+
+ * Improved handling of inline options (?mixdau:...)
+
+ Fri Oct 18 14:09:38 2014 Ammar Ali <ammarabuali@gmail.com>
+
+ * Added conditional expressions. Ruby 2.0.
+
+ * Added keep (\K) markers. Ruby 2.0.
+
+ * Added d, a, and u options. Ruby 2.0.
+
+ * Added missing meta sequences to the parser. They were supported
+ by the scanner only.
+
+ * Renamed Lexer's method to lex, added an alias to the old name (scan)
+
+ * Use #map instead of #each to run the block in Lexer.lex.
+
+ * Replaced VERSION.yml file with a constant.
+
+ * Updated README
+
+ Fri Oct 10 11:49:38 2014 Ammar Ali <ammarabuali@gmail.com>
+
+ * Update tokens and scanner with new additions in Unicode 7.0.
+
+ Mon Oct 6 04:30:24 2014 Ammar Ali <ammarabuali@gmail.com>
+
+ * Released version 0.1.6
+
  Sun Oct 5 19:58:17 2014 Ammar Ali <ammarabuali@gmail.com>
 
  * Fixed test and gem building rake tasks and extracted the gem
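
For readers unfamiliar with the Ruby 2.0 constructs named in the changelog (conditional expressions, the `\K` keep marker, and free-spacing mode), the following sketch shows what they do in plain Ruby. It uses only the core `Regexp` class, not regexp_parser, and assumes Ruby 2.0 or later. The constant names and the `matches?` helper are made up for illustration.

```ruby
# Plain-Ruby illustration of the regexp features listed in the changelog.
# Nothing here calls regexp_parser; all names are hypothetical.

# Conditional expression: if group 1 matched "a", require "b", otherwise "c".
CONDITIONAL = /\A(a)?(?(1)b|c)\z/

# Keep marker: text before \K must match but is left out of the reported match.
KEEP = /foo\Kbar/

# Free-spacing (x) mode: whitespace and # comments in the pattern are ignored.
FREE_SPACE = /
  \d{4}  # year
  -
  \d{2}  # month
/x

# Returns true when the regexp matches anywhere in the string.
def matches?(regexp, string)
  !(regexp =~ string).nil?
end

puts matches?(CONDITIONAL, 'ab')      # true
puts matches?(CONDITIONAL, 'c')       # true
puts matches?(CONDITIONAL, 'ac')      # false
puts 'foobar'.sub(KEEP, 'BAZ')        # fooBAZ
puts matches?(FREE_SPACE, '2014-12')  # true
```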
data/Gemfile ADDED
@@ -0,0 +1,8 @@
+ source 'https://rubygems.org'
+
+ gemspec
+
+ group :development, :test do
+   gem 'rake'
+   gem 'test-unit'
+ end
data/LICENSE CHANGED
@@ -1,4 +1,4 @@
- Copyright (c) 2010 Ammar Ali
+ Copyright (c) 2010, 2012-2014, Ammar Ali
 
  Permission is hereby granted, free of charge, to any person
  obtaining a copy of this software and associated documentation
data/README.md CHANGED
@@ -1,57 +1,87 @@
1
- # Regexp::Parser [![Gem Version](https://badge.fury.io/rb/regexp_parser.svg)](http://badge.fury.io/rb/regexp_parser) [![Build Status](https://secure.travis-ci.org/ammar/regexp_parser.png?branch=master)](http://travis-ci.org/ammar/regexp_parser) [![Code Climate](https://codeclimate.com/github/ammar/regexp_parser.png)](https://codeclimate.com/github/ammar/regexp_parser/badges)
1
+ # Regexp::Parser
2
2
 
3
- A ruby library to help with lexing, parsing, and transforming regular expressions.
3
+ [![Gem Version](https://badge.fury.io/rb/regexp_parser.svg)](http://badge.fury.io/rb/regexp_parser) [![Build Status](https://secure.travis-ci.org/ammar/regexp_parser.png?branch=master)](http://travis-ci.org/ammar/regexp_parser) [![Code Climate](https://codeclimate.com/github/ammar/regexp_parser.png)](https://codeclimate.com/github/ammar/regexp_parser/badges)
4
+
5
+ A ruby gem for tokenizing, parsing, and transforming regular expressions.
4
6
 
5
7
  * Multilayered
6
- * A scanner based on [ragel](http://www.complang.org/ragel/)
7
- * A lexer that produces a "stream" of tokens
8
- * A parser that produces a "tree" of Regexp::Expression objects (OO API)
9
- * Supports ruby 1.8, 1.9, and all but one of the 2.x expressions [See Scanner Syntax](#scanner-syntax)
10
- * Supports ruby 1.8, 1.9, 2.0, and 2.1 runtimes.
8
+ * A scanner/tokenizer based on [ragel](http://www.colm.net/open-source/ragel/)
9
+ * A lexer that produces a "stream" of token objects.
10
+ * A parser that produces a "tree" of Expression objects (OO API)
11
+ * Runs on ruby 1.8, 1.9, 2.x, and jruby (1.9 mode) runtimes.
12
+ * Recognizes ruby 1.8, 1.9, and 2.x regular expressions [See Scanner Syntax](#scanner-syntax)
13
+
11
14
 
12
15
  _For an example of regexp_parser in use, see the [meta_re project](https://github.com/ammar/meta_re)_
13
16
 
17
+
14
18
  ---
15
19
  ## Requirements
16
20
 
17
- * ruby '1.8.7'..'2.1.3'
18
- * ragel, but only if you want to build the gem or work on the scanner
21
+ * Ruby >= 1.8.7
22
+ * Ragel >= 6.0, but only if you want to build the gem or work on the scanner.
19
23
 
20
24
 
21
25
  _Note: See the .travis.yml file for covered versions._
22
26
 
27
+
23
28
  ---
24
29
  ## Install
25
30
 
31
+ Install the gem with:
32
+
26
33
  `gem install regexp_parser`
27
34
 
35
+ Or, add it to your project's `Gemfile`:
36
+
37
+ ```gem 'regexp_parser', '~> X.Y.Z'```
38
+
39
+ See rubygems for the the [latest version number](https://rubygems.org/gems/regexp_parser)
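
The `~>` operator in the Gemfile line above is RubyGems' pessimistic version constraint. As a rough sketch of how it behaves, the snippet below checks a few versions against `'~> 0.2.0'` using the `Gem::Requirement` and `Gem::Version` classes that ship with RubyGems. The README leaves the version as the placeholder `X.Y.Z`; `0.2.0` is used here only as an example value, and `satisfies?` is a hypothetical helper.

```ruby
require 'rubygems'

# Hypothetical helper: does `version` satisfy the Gemfile-style `constraint`?
def satisfies?(constraint, version)
  Gem::Requirement.new(constraint).satisfied_by?(Gem::Version.new(version))
end

# '~> 0.2.0' allows patch-level updates but not the next minor version.
puts satisfies?('~> 0.2.0', '0.2.0')  # true
puts satisfies?('~> 0.2.0', '0.2.5')  # true
puts satisfies?('~> 0.2.0', '0.3.0')  # false
puts satisfies?('~> 0.2.0', '0.1.6')  # false
```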
40
+
41
+
28
42
  ---
29
43
  ## Usage
30
44
 
45
+ The three main modules are **Scanner**, **Lexer**, and **Parser**. Each of them
46
+ provides a single method that takes a regular expression (as a RegExp object or
47
+ a string) and returns its results. The **Lexer** and the **Parser** accept an
48
+ optional second argument that specifies the syntax version, like 'ruby/2.0',
49
+ which defaults to the host ruby version (using RUBY_VERSION).
50
+
51
+ Here are the basic usage examples:
52
+
31
53
  ```ruby
32
- # require the gem, then call one of:
33
54
  require 'regexp_parser'
34
55
 
35
- # The Scanner
36
- Regexp::Scanner.scan regexp
56
+ Regexp::Scanner.scan(regexp)
37
57
 
38
- # The Lexer
39
- Regexp::Lexer.scan regexp
58
+ Regexp::Lexer.lex(regexp)
40
59
 
41
- # Or the Parser
42
- Regexp::Parser.parse regexp
60
+ Regexp::Parser.parse(regexp)
43
61
  ```
44
62
 
45
- _All three can either return their results or take a block to perform further handling._
63
+ All three methods accept a block as the last argument, which, if given, gets
64
+ called with the results as follows:
65
+
66
+ * **Scanner**: the block gets passed the results as they are scanned. See the
67
+ example in the next section for details.
68
+
69
+ * **Lexer**: after completion, the block gets passed the tokens one by one.
70
+ _The result of the block is returned._
71
+
72
+ * **Parser**: after completion, the block gets passed the root expression.
73
+ _The result of the block is returned._
74
+
46
75
 
47
76
  ---
48
77
  ## Components
49
78
 
50
79
  ### Scanner
51
- A ragel generated scanner that recognizes the cumulative syntax of both
52
- supported flavors. Breaks the expression's text into tokens, including
53
- their type, token, text, and start/end offsets within the original
54
- pattern.
80
+ A ragel generated scanner that recognizes the cumulative syntax of all
81
+ supported syntax versions. It breaks a given expression's text into the
82
+ smallest parts, and identifies their type, token, text, and start/end
83
+ offsets within the pattern.
84
+
55
85
 
56
86
  #### Example
57
87
  The following scans the given pattern and prints out the type, token, text and
@@ -79,7 +109,8 @@ end
79
109
  # type: group, token: close, text: ')' [15..16]
80
110
  ```
81
111
 
82
- A one-liner that returns an array of the textual parts of the given pattern:
112
+ A one-liner that uses map on the result of the scan to return the textual
113
+ parts of the pattern:
83
114
 
84
115
  ```ruby
85
116
  Regexp::Scanner.scan( /(cat?([bhm]at)){3,5}/ ).map {|token| token[2]}
@@ -90,17 +121,18 @@ Regexp::Scanner.scan( /(cat?([bhm]at)){3,5}/ ).map {|token| token[2]}
90
121
  #### Notes
91
122
  * The scanner performs basic syntax error checking, like detecting missing
92
123
  balancing punctuation and premature end of pattern. Flavor validity checks
93
- are performed in the lexer.
124
+ are performed in the lexer, which uses a syntax object.
94
125
 
95
- * If the input is a ruby Regexp object, the scanner calls #source on it to
126
+ * If the input is a ruby **Regexp** object, the scanner calls #source on it to
96
127
  get its string representation. #source does not include the options of
97
- expression (m, i, and x) To include the options the scan, #to_s should
98
- be called on the Regexp before passing it to the scanner, or any of the
99
- higher layers.
128
+ the expression (m, i, and x) To include the options in the scan, #to_s
129
+ should be called on the **Regexp** before passing it to the scanner or any
130
+ of the other modules.
100
131
 
101
132
  * To keep the scanner simple(r) and fairly reusable for other purposes, it
102
133
  does not perform lexical analysis on the tokens, sticking to the task
103
- of tokenizing and leaving lexical analysis upto to the lexer.
134
+ of identifying the smallest possible tokens and leaving lexical analysis
135
+ to the lexer.
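
The difference between `#source` and `#to_s` described in the notes above can be checked with core Ruby alone. The sketch below is not the gem's code; `scanner_input` is a hypothetical helper that mimics the documented choice between the two string forms.

```ruby
# Hypothetical helper: pick the string form of the input that would be scanned.
# Strings pass through unchanged; Regexp objects use #source by default, or
# #to_s when the options should be part of the scanned text.
def scanner_input(input, with_options = false)
  return input unless input.is_a?(Regexp)
  with_options ? input.to_s : input.source
end

regex = /a?(b)*[c]+/m

puts scanner_input(regex)        # a?(b)*[c]+
puts scanner_input(regex, true)  # (?m-ix:a?(b)*[c]+)
```

With `#to_s`, the pattern is wrapped in an options group, so anything scanning that text will see the group as part of the expression.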
104
136
 
105
137
 
106
138
  ---
@@ -110,28 +142,36 @@ flavor). Syntax classes act as lookup tables, and are layered to create
110
142
  flavor variations. Syntax only comes into play in the lexer.
111
143
 
112
144
  #### Example
113
- The following instantiates the syntax for Ruby 1.9 and checks a couple of its
114
- implementations features, and then does the same for Ruby 1.8:
145
+ The following instantiates syntax objects for Ruby 2.0, 1.9, 1.8, and
146
+ checks a few of their implementation features.
115
147
 
116
148
  ```ruby
117
149
  require 'regexp_parser'
118
150
 
151
+ ruby_20 = Regexp::Syntax.new 'ruby/2.0'
152
+ ruby_20.implements? :quantifier, :zero_or_one # => true
153
+ ruby_20.implements? :quantifier, :zero_or_one_reluctant # => true
154
+ ruby_20.implements? :quantifier, :zero_or_one_possessive # => true
155
+ ruby_20.implements? :conditional, :condition # => true
156
+
119
157
  ruby_19 = Regexp::Syntax.new 'ruby/1.9'
120
- ruby_19.implements? :quantifier, :zero_or_one # => true
121
- ruby_19.implements? :quantifier, :zero_or_one_reluctant # => true
122
- ruby_19.implements? :quantifier, :zero_or_one_possessive # => true
158
+ ruby_19.implements? :quantifier, :zero_or_one # => true
159
+ ruby_19.implements? :quantifier, :zero_or_one_reluctant # => true
160
+ ruby_19.implements? :quantifier, :zero_or_one_possessive # => true
161
+ ruby_19.implements? :conditional, :condition # => false
123
162
 
124
163
  ruby_18 = Regexp::Syntax.new 'ruby/1.8'
125
- ruby_18.implements? :quantifier, :zero_or_one # => true
126
- ruby_18.implements? :quantifier, :zero_or_one_reluctant # => true
127
- ruby_18.implements? :quantifier, :zero_or_one_possessive # => false
164
+ ruby_18.implements? :quantifier, :zero_or_one # => true
165
+ ruby_18.implements? :quantifier, :zero_or_one_reluctant # => true
166
+ ruby_18.implements? :quantifier, :zero_or_one_possessive # => false
167
+ ruby_18.implements? :conditional, :condition # => false
128
168
  ```
129
169
 
130
170
 
131
171
  #### Notes
132
- * Variatiions on a token, for example a named group with < and > vs one with a
133
- pair of single quotes, are specified with an underscore followed by two
134
- characters appended to the base token. In the previous named group example,
172
+ * Variations on a token, for example a named group with angle brackets (< and >)
173
+ vs one with a pair of single quotes, are specified with an underscore followed
174
+ by two characters appended to the base token. In the previous named group example,
135
175
  the tokens would be :named_ab (angle brackets) and :named_sq (single quotes).
136
176
  These variations are normalized by the syntax to :named.
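
As an illustration of the naming convention in the note above, the following sketch strips the two-character variant suffix from a token symbol. It is not the gem's implementation, and it only knows the two suffixes mentioned in the note (`_ab` and `_sq`); `VARIANT_SUFFIX` and `normalize_token` are hypothetical names.

```ruby
# Matches the variant suffixes named in the README note: _ab and _sq.
VARIANT_SUFFIX = /_(?:ab|sq)\z/

# Hypothetical helper: map a variant token to its base token.
def normalize_token(token)
  token.to_s.sub(VARIANT_SUFFIX, '').to_sym
end

p normalize_token(:named_ab)  # :named
p normalize_token(:named_sq)  # :named
p normalize_token(:capture)   # :capture
```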
137
177
 
@@ -139,22 +179,23 @@ ruby_18.implements? :quantifier, :zero_or_one_possessive # => false
139
179
  ---
140
180
  ### Lexer
141
181
  Sits on top of the scanner and performs lexical analysis on the tokens that
142
- it emits. Among its tasks are breaking quantified literal runs, collecting the
143
- emitted token structures into an array of Token objects, calculating their
144
- nesting depth, normalizing tokens for the parser, and checkng if the tokens
145
- are implemented by the given syntax flavor.
182
+ it emits. Among its tasks are; breaking quantified literal runs, collecting the
183
+ emitted token attributes into Token objects, calculating their nesting depth,
184
+ normalizing tokens for the parser, and checkng if the tokens are implemented by
185
+ the given syntax version.
186
+
187
+ See the [Token Objects](https://github.com/ammar/regexp_parser/wiki/Token-Objects)
188
+ wiki page for more information on Token objects.
146
189
 
147
- Tokens are Struct objects, with a few helper methods; #next, #previous, #offsets
148
- and #length.
149
190
 
150
191
  #### Example
151
- The following example scans the given pattern, checks it against the ruby 1.8
152
- syntax, and prints the token objects' text.
192
+ The following example lexes the given pattern, checks it against the ruby 1.9
193
+ syntax, and prints the token objects' text indented to their level.
153
194
 
154
195
  ```ruby
155
196
  require 'regexp_parser'
156
197
 
157
- Regexp::Lexer.scan /a?(b(c))*[d]+/ do |token|
198
+ Regexp::Lexer.scan /a?(b(c))*[d]+/, 'ruby/1.9' do |token|
158
199
  puts "#{' ' * token.level}#{token.text}"
159
200
  end
160
201
 
@@ -175,8 +216,9 @@ end
175
216
  ```
176
217
 
177
218
  A one-liner that returns an array of the textual parts of the given pattern.
178
- Compare the output with that of the one-liner example of the Scanner; notably
179
- how the sequence 'cat' is treated.
219
+ Compare the output with that of the one-liner example of the **Scanner**; notably
220
+ how the sequence 'cat' is treated. The 't' is seperated because it's followed
221
+ by a quantifier that only applies to it.
180
222
 
181
223
  ```ruby
182
224
  Regexp::Lexer.scan( /(cat?([b]at)){3,5}/ ).map {|token| token.text}
@@ -184,50 +226,70 @@ Regexp::Lexer.scan( /(cat?([b]at)){3,5}/ ).map {|token| token.text}
184
226
  ```
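
The "breaking quantified literal runs" step can be pictured with a small stand-alone sketch. This is not the lexer's code; `split_literal_run` is a hypothetical helper that assumes it is told whether a quantifier follows the run.

```ruby
# Hypothetical helper: given the text of a literal run and whether a
# quantifier follows it, return the literal pieces to emit. A quantifier
# applies only to the last character, so that character is split off.
def split_literal_run(text, quantified)
  return [text] unless quantified && text.length > 1
  [text[0..-2], text[-1]]
end

p split_literal_run('cat', true)   # ["ca", "t"]
p split_literal_run('at', false)   # ["at"]
p split_literal_run('a', true)     # ["a"]
```

This is why `cat?` yields `ca` and `t` as separate literals while the unquantified `at` stays whole.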
185
227
 
186
228
  #### Notes
187
- * The default syntax is that of the latest released version of ruby.
229
+ * The syntax argument is optional. It defaults to the version of the ruby
230
+ interpreter in use, as returned by RUBY_VERSION.
188
231
 
189
- * The lexer performs some basic parsing to determine the depth of the
190
- emitted tokens. This responsibility might be relegated to the scanner
191
- in a future release.
232
+ * The lexer normalizes some tokens, as noted in the Syntax section above.
192
233
 
193
234
 
194
235
  ---
195
236
  ### Parser
196
237
  Sits on top of the lexer and transforms the "stream" of Token objects emitted
197
238
  by it into a tree of Expression objects represented by an instance of the
198
- Expression::Root class. See Expression below for more information.
239
+ Expression::Root class.
240
+
241
+ See the [Expression Objects](https://github.com/ammar/regexp_parser/wiki/Expression-Objects)
242
+ wiki page for attributes and methods.
243
+
199
244
 
200
245
  #### Example
201
246
 
202
247
  ```ruby
203
248
  require 'regexp_parser'
204
249
 
205
- regex = /a?(b)*[c]+/m
250
+ regex = /a?(b+(c)d)*(?<name>[0-9]+)/
206
251
 
207
- # using #to_s on the Regexp object to include options. Note that this turns the
208
- # expression into '(?m-ix:a?(b)*[c]+)', thus the Group::Options in the output
209
- root = Regexp::Parser.parse( regex.to_s, 'ruby/2.1')
252
+ tree = Regexp::Parser.parse( regex, 'ruby/2.1' )
210
253
 
211
- root.multiline? # => true (aliased as m?)
212
- root.case_insensitive? # => false (aliased as i?)
254
+ tree.traverse do |event, exp|
255
+ puts "#{event}: #{exp.type} `#{exp.to_s}`"
256
+ end
213
257
 
214
- # simple tree walking method (depth-first, pre-order)
215
- def walk(e, depth = 0)
216
- puts "#{' ' * depth}> #{e.class}"
258
+ # Output
259
+ # visit: literal `a?`
260
+ # enter: group `(b+(c)d)*`
261
+ # visit: literal `b+`
262
+ # enter: group `(c)`
263
+ # visit: literal `c`
264
+ # exit: group `(c)`
265
+ # visit: literal `d`
266
+ # exit: group `(b+(c)d)*`
267
+ # enter: group `(?<name>[0-9]+)`
268
+ # visit: set `[0-9]+`
269
+ # exit: group `(?<name>[0-9]+)`
270
+ ```
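
The enter/visit/exit events in the output above follow a common tree-walking pattern. The sketch below reproduces that pattern on a toy tree built from a plain `Struct`. It is an assumption about the general shape of such a traversal, not the gem's `traverse` implementation, and `Node` is a hypothetical stand-in for Expression objects.

```ruby
# Toy node type standing in for expression objects (hypothetical).
Node = Struct.new(:type, :text, :children) do
  # A node without children is a leaf and is only "visited".
  def terminal?
    children.nil? || children.empty?
  end

  # Depth-first walk: leaves yield :visit, nodes with children yield
  # :enter before their children and :exit after them.
  def traverse(&block)
    children.each do |child|
      if child.terminal?
        block.call(:visit, child)
      else
        block.call(:enter, child)
        child.traverse(&block)
        block.call(:exit, child)
      end
    end
    self
  end
end

tree = Node.new(:root, 'a?(b(c))', [
  Node.new(:literal, 'a?', []),
  Node.new(:group, '(b(c))', [
    Node.new(:literal, 'b', []),
    Node.new(:group, '(c)', [Node.new(:literal, 'c', [])])
  ])
])

tree.traverse { |event, node| puts "#{event}: #{node.type} `#{node.text}`" }
# visit: literal `a?`
# enter: group `(b(c))`
# visit: literal `b`
# enter: group `(c)`
# visit: literal `c`
# exit: group `(c)`
# exit: group `(b(c))`
```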
217
271
 
218
- if e.respond_to?(:expressions)
219
- e.each {|s| walk(s, depth+1) }
220
- end
221
- end
272
+ Another example, using each_expression and strfregexp to print the object tree.
273
+ _See the traverse.rb and strfregexp.rb files under `lib/regexp_parser/expression/methods`
274
+ for more information on these methods._
222
275
 
223
- walk(root)
276
+ ```ruby
277
+ include_root = true
278
+ indent_offset = include_root ? 1 : 0
224
279
 
225
- # output
280
+ tree.each_expression(include_root) do |exp, level_index|
281
+ puts exp.strfregexp("%>> %c", indent_offset)
282
+ end
283
+
284
+ # Output
226
285
  # > Regexp::Expression::Root
227
- # > Regexp::Expression::Group::Options
286
+ # > Regexp::Expression::Literal
287
+ # > Regexp::Expression::Group::Capture
228
288
  # > Regexp::Expression::Literal
229
289
  # > Regexp::Expression::Group::Capture
230
290
  # > Regexp::Expression::Literal
291
+ # > Regexp::Expression::Literal
292
+ # > Regexp::Expression::Group::Named
231
293
  # > Regexp::Expression::CharacterSet
232
294
  ```
233
295
 
@@ -236,122 +298,84 @@ Expression class. See the next section for details._
236
298
 
237
299
 
238
300
  ---
239
- ### Expression
240
- The base class of all objects returned by the parser, implements most of the
241
- functions that are common to all expression classes.
242
-
243
- Each Expression object contains the following members:
244
-
245
- * **quantifier**: an instance of Expression::Quantifier that holds the details
246
- of repetition for the Expression. Has a nil value if the expression is not
247
- quantified.
248
- * **expressions**: an array, holds the sub-expressions for the expression if it
249
- is a group or alternation expression. Empty if the expression doesn't have
250
- sub-expressions.
251
- * **options**: a hash, holds the keys :i, :m, and :x with a boolean value that
252
- indicates if the expression has a given option.
253
-
254
-
255
- Expressions also contain the following members from the scanner/lexer:
256
-
257
- * **type**: a symbol, denoting the expression type, such as :group, :quantifier
258
- * **token**: a symbol, for the object's token, or opening token (in the case of
259
- groups and sets)
260
- * **text**: a string, the text of the expression (same as token for nesting expressions)
261
-
262
-
263
- Every expression also has the following methods:
264
-
265
- * **to_s**: returns the string representation of the expression.
266
- * **<<**: adds sub-expresions to the expression.
267
- * **each**: iterates over the expressions sub-expressions, if any.
268
- * **[]**: access sub-expressions by index.
269
- * **quantified?**: return true if the expression was followed by a quantifier.
270
- * **quantity**: returns an array of the expression's min and max repetitions.
271
- * **greedy?**: returns true if the expression's quantifier is greedy.
272
- * **reluctant?** or **lazy?**: returns true if the expression's quantifier is
273
- reluctant.
274
- * **possessive?**: returns true if the expression's quantifier is possessive.
275
- * **multiline?** or **m?**: returns true if the expression has the m option
276
- * **case_insensitive?** or **ignore_case?** or **i?**: returns true if the expression
277
- has the i option
278
- * **free_spacing?** or **extended?** or **x?**: returns true if the expression has the x
279
- option
280
-
281
-
282
- A special expression class **Expression::Sequence** is used to hold the
283
- expressions of a branch within an **Expression::Alternation** expression. For
284
- example, the expression 'bat|cat|hat' would result in an alternation with 3
285
- sequences, one for each possible alternative.
286
-
287
-
288
- ## Scanner Syntax
289
- The following syntax elements are supported by the scanner.
290
-
291
- - Alternation: a|b|c, etc.
292
- - Anchors: ^, $, \b, etc.
293
- - Character Classes _(aka Sets)_: [abc], [^\]]
294
- - Character Types: \d, \H, \s, etc.
295
- - Escape Sequences: \t, \+, \?, etc.
296
- - Grouped Expressions
297
- - Assertions
298
- - Lookahead: (?=abc)
299
- - Negative Lookahead: (?!abc)
300
- - Lookabehind: (?<=abc)
301
- - Negative Lookbehind: (?<\!abc)
302
- - Atomic: (?>abc)
303
- - Back-references:
304
- - Named: \k<name>
305
- - Nest Level: \k<n-1>
306
- - Numbered: \k<1>
307
- - Relative: \k<-2>
308
- - Capturing: (abc)
309
- - Comment: (?# comment)
310
- - Named: (?<name>abc)
311
- - Options: (?mi-x:abc)
312
- - Passive: (?:abc)
313
- - Sub-expression Calls: \g<name>, \g<1>
314
- - Literals: abc, def?, etc.
315
- - POSIX classes: [:alpha:], [:print:], etc.
316
- - Quantifiers
317
- - Greedy: ?, *, +, {m,M}
318
- - Reluctant: ??, *?, +?, {m,M}?
319
- - Possessive: ?+, *+, ++, {m,M}+
320
- - String Escapes
321
- - Control: \C-C, \cD, etc.
322
- - Hex: \x20, \x{701230}, etc.
323
- - Meta: \M-c, \M-\C-C etc.
324
- - Octal: \0, \01, \012
325
- - Unicode: \uHHHH, \u{H+ H+}
326
- - Traditional Back-references: \1 thru \9
327
- - Unicode Properties:
328
- - Age: \p{Age=2.1}, \P{age=5.2}, etc.
329
- - Classes: \p{Alpha}, \P{Space}, etc.
330
- - Derived Properties: \p{Math}, \P{Lowercase}, etc.
331
- - General Categories: \p{Lu}, \P{Cs}, etc.
332
- - Scripts: \p{Arabic}, \P{Hiragana}, etc.
333
- - Simple Properties: \p{Dash}, \p{Extender}, etc.
334
-
335
-
336
- ### Missing Features
337
-
338
- The following were added by the Onigmo regular expression library used by
339
- ruby 2.x and are not currently recognized by the scanner:
340
-
341
- - Planned for support
342
- - Conditional Expressions: (?(cond)yes-subexp), (?(cond)yes-subexp|no-subexp)
343
- - Negative POSIX Brackets: [:^alpha:], [:^digit:]
344
- - New Character Set Options: d, a, and u _[see](https://github.com/k-takata/Onigmo/blob/master/doc/RE#L234)_
345
- - Not planned for support
346
- - Keep: \K _(not enabled for ruby syntax)_
347
- - Quotes: \Q...\E _(perl and java syntax only) [see](https://github.com/k-takata/Onigmo/blob/master/doc/RE#L452)_
348
- - Capture History: (?@...), (?@<name>...) _(not enabled for ruby syntax) [see](https://github.com/k-takata/Onigmo/blob/master/doc/RE#L499)_
301
+
302
+
303
+ ## Supported Syntax
304
+ The three modules support all the regular expression syntax features of Ruby 1.8
305
+ , 1.9, and 2.x:
306
+
307
+ _Note that not all of these are available in all versions of Ruby_
308
+
309
+
310
+ | Syntax Feature | Examples | &#x22ef; |
311
+ | ------------------------------------- | ------------------------------------------------------- |:--------:|
312
+ | **Alternation** | `a|b|c` | &#x2713; |
313
+ | **Anchors** | `^`, `$`, `\b` | &#x2713; |
314
+ | **Character Classes** | `[abc]`, `[^\\]`, `[a-d&&g-h]`, `[a=e=b]` | &#x2713; |
315
+ | **Character Types** | `\d`, `\H`, `\s` | &#x2713; |
316
+ | **Conditional Exps.** | `(?(cond)yes-subexp)`, `(?(cond)yes-subexp|no-subexp)` | &#x2713; |
317
+ | **Escape Sequences** | `\t`, `\\+`, `\?` | &#x2713; |
318
+ | **Free Space** | whitespace and `# Comments` _(x modifier)_ | &#x2713; |
319
+ | **Grouped Exps.** | | &#x22f1; |
320
+ | &emsp;&nbsp;_**Assertions**_ | | &#x22f1; |
321
+ | &emsp;&emsp;_Lookahead_ | `(?=abc)` | &#x2713; |
322
+ | &emsp;&emsp;_Negative Lookahead_ | `(?!abc)` | &#x2713; |
323
+ | &emsp;&emsp;_Lookbehind_ | `(?<=abc)` | &#x2713; |
324
+ | &emsp;&emsp;_Negative Lookbehind_ | `(?<!abc)` | &#x2713; |
325
+ | &emsp;&nbsp;_**Atomic**_ | `(?>abc)` | &#x2713; |
326
+ | &emsp;&nbsp;_**Back-references**_ | | &#x22f1; |
327
+ | &emsp;&emsp;_Named_ | `\k<name>` | &#x2713; |
328
+ | &emsp;&emsp;_Nest Level_ | `\k<n-1>` | &#x2713; |
329
+ | &emsp;&emsp;_Numbered_ | `\k<1>` | &#x2713; |
330
+ | &emsp;&emsp;_Relative_ | `\k<-2>` | &#x2713; |
331
+ | &emsp;&emsp;_Traditional_ | `\1` thru `\9` | &#x2713; |
332
+ | &emsp;&nbsp;_**Capturing**_ | `(abc)` | &#x2713; |
333
+ | &emsp;&nbsp;_**Comments**_ | `(?# comment text)` | &#x2713; |
334
+ | &emsp;&nbsp;_**Named**_ | `(?<name>abc)`, `(?'name'abc)` | &#x2713; |
335
+ | &emsp;&nbsp;_**Options**_ | `(?mi-x:abc)`, `(?a:\s\w+)` | &#x2713; |
336
+ | &emsp;&nbsp;_**Passive**_ | `(?:abc)` | &#x2713; |
337
+ | &emsp;&nbsp;_**Subexp. Calls**_ | `\g<name>`, `\g<1>` | &#x2713; |
338
+ | **Keep** | `\K`, `(ab\Kc|d\Ke)f` | &#x2713; |
339
+ | **Literals** _(utf-8)_ | `Ruby`, `ルビー`, `روبي` | &#x2713; |
340
+ | **POSIX Classes** | `[:alpha:]`, `[:^digit:]` | &#x2713; |
341
+ | **Quantifiers** | | &#x22f1; |
342
+ | &emsp;&nbsp;_**Greedy**_ | `?`, `*`, `+`, `{m,M}` | &#x2713; |
343
+ | &emsp;&nbsp;_**Reluctant** (Lazy)_ | `??`, `*?`, `+?`, `{m,M}?` | &#x2713; |
344
+ | &emsp;&nbsp;_**Possessive**_ | `?+`, `*+`, `++`, `{m,M}+` | &#x2713; |
345
+ | **String Escapes** | | &#x22f1; |
346
+ | &emsp;&nbsp;_**Control**_ | `\C-C`, `\cD` | &#x2713; |
347
+ | &emsp;&nbsp;_**Hex**_ | `\x20`, `\x{701230}` | &#x2713; |
348
+ | &emsp;&nbsp;_**Meta**_ | `\M-c`, `\M-\C-C` | &#x2713; |
349
+ | &emsp;&nbsp;_**Octal**_ | `\0`, `\01`, `\012` | &#x2713; |
350
+ | &emsp;&nbsp;_**Unicode**_ | `\uHHHH`, `\u{H+ H+}` | &#x2713; |
351
+ | **Unicode Properties** | _<sub>([Unicode 7.0.0](http://www.unicode.org/versions/Unicode7.0.0/))</sub>_ | &#x22f1; |
352
+ | &emsp;&nbsp;_**Age**_ | `\p{Age=5.2}`, `\P{age=7.0}` | &#x2713; |
353
+ | &emsp;&nbsp;_**Classes**_ | `\p{Alpha}`, `\P{Space}` | &#x2713; |
354
+ | &emsp;&nbsp;_**Derived**_ | `\p{Math}`, `\P{Lowercase}` | &#x2713; |
355
+ | &emsp;&nbsp;_**General Categories**_ | `\p{Lu}`, `\P{Cs}` | &#x2713; |
356
+ | &emsp;&nbsp;_**Scripts**_ | `\p{Arabic}`, `\P{Hiragana}` | &#x2713; |
357
+ | &emsp;&nbsp;_**Simple**_ | `\p{Dash}`, `\p{Extender}` | &#x2713; |
358
+
359
+
360
+ <br/>
361
+ ##### Inapplicable Features
362
+
363
+ Some modifiers, like `o` and `s`, apply to the **Regexp** object itself and do not
364
+ appear in its source. Others such modifiers include the encoding modifiers `e` and `n`
365
+ [See](http://www.ruby-doc.org/core-2.1.3/Regexp.html#class-Regexp-label-Encoding).
366
+ These are not seen by the scanner.
367
+
368
+ The following features are not currently enabled for Ruby by its regular
369
+ expressions library (Onigmo). They are not supported by the scanner.
370
+
371
+ - **Quotes**: `\Q...\E` _<a href="https://github.com/k-takata/Onigmo/blob/master/doc/RE#L452/" title="Links to master branch, may change">[See]</a>_
372
+ - **Capture History**: `(?@...)`, `(?@<name>...)` _<a href="https://github.com/k-takata/Onigmo/blob/master/doc/RE#L499" title="Links to master branch, may change">[See]</a>_
349
373
 
350
374
 
351
375
  See something else missing? Please submit an [issue](https://github.com/ammar/regexp_parser/issues)
352
376
 
353
- _**Note**: Attempting to process expressions with any of the missing syntax features will
354
- cause an error._
377
+ _**Note**: Attempting to process expressions with unsupported syntax features can raise an error,
378
+ or incorrectly return tokens/objects as literals._
355
379
 
356
380
 
357
381
  ## Testing
@@ -366,38 +390,44 @@ tasks, which only run the tests for one component at a time. These are:
366
390
  * test:expression
367
391
  * test:syntax
368
392
 
369
- _A special task 'test:full' generatees the scanner's code from the ragel source files and
370
- runs all the tests. This requires ragel to be installed._
393
+ _A special task 'test:full' generates the scanner's code from the ragel source files and
394
+ runs all the tests. This task requires ragel to be installed._
371
395
 
372
396
 
373
- The tests use ruby's test_unit, so they can also be run with:
397
+ The tests use ruby's test/unit, so they can also be run with:
374
398
 
375
399
  ```
376
- ruby test/test_all.rb
400
+ ruby -Ilib test/test_all.rb
377
401
  ```
378
402
 
379
403
  This is useful when there is a need to focus on specific test files, for example:
380
404
 
381
405
  ```
382
- ruby test/scanner/test_properties.rb
406
+ ruby -Ilib test/scanner/test_properties.rb
407
+ ```
408
+
409
+ It is sometimes helpful during development to focus on a specific test case, for example:
410
+
411
+ ```
412
+ ruby -Ilib test/expression/test_base.rb -n test_expression_to_re
383
413
  ```
384
414
 
385
415
 
386
416
  ## Building
387
- Building the scanner and the gem requires [ragel](http://www.complang.org/ragel/) to be
417
+ Building the scanner and the gem requires [ragel](http://www.colm.net/open-source/ragel/) to be
388
418
  installed. The build tasks will automatically invoke the 'ragel:rb' task to generate the
389
419
  ruby scanner code.
390
420
 
391
421
 
392
- The project uses the standard rubygems package tasks:
422
+ The project uses the standard rubygems package tasks, so:
393
423
 
394
424
 
395
- To build, run:
425
+ To build the gem, run:
396
426
  ```
397
427
  rake build
398
428
  ```
399
429
 
400
- To install, run:
430
+ To install the gem from the cloned project, run:
401
431
  ```
402
432
  rake install
403
433
  ```
@@ -408,14 +438,15 @@ Documentation and books used while working on this project.
408
438
 
409
439
 
410
440
  #### Ruby Flavors
411
- * Oniguruma Regular Expressions [link](http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt)
412
- * Read Ruby > Regexps [link](https://github.com/runpaint/read-ruby/blob/master/src/regexps.xml)
441
+ * Oniguruma Regular Expressions (Ruby 1.9.x) [link](http://www.geocities.jp/kosako3/oniguruma/doc/RE.txt)
442
+ * Onigmo Regular Expressions (Ruby >= 2.0) [link](https://github.com/k-takata/Onigmo/blob/master/doc/RE)
413
443
 
414
444
 
415
445
  #### Regular Expressions
416
446
  * Mastering Regular Expressions, By Jeffrey E.F. Friedl (2nd Edition) [book](http://oreilly.com/catalog/9781565922570/)
417
447
  * Regular Expression Flavor Comparison [link](http://www.regular-expressions.info/refflavors.html)
418
448
  * Enumerating the strings of regular languages [link](http://www.cs.dartmouth.edu/~doug/nfa.ps.gz)
449
+ * Stack Overflow Regular Expressions FAQ [link](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075)
419
450
 
420
451
 
421
452
  #### Unicode
@@ -425,18 +456,6 @@ Documentation and books used while working on this project.
425
456
  * Unicode Regular Expressions [link](http://www.unicode.org/reports/tr18/)
426
457
  * Unicode Standard Annex #44 [link](http://www.unicode.org/reports/tr44/)
427
458
 
428
- ## Thanks
429
- This work is based on and inspired by the hard work and ideas of many people,
430
- directly or indirectly. The following are only a few of those that should be
431
- thanked.
432
-
433
- * Adrian Thurston, for developing [ragel](http://www.complang.org/ragel/).
434
- * Caleb Clausen, for feedback, which inspired this, valuable insights on structuring the parser,
435
- and lots of [cool code](http://github.com/coatl).
436
- * Jan Goyvaerts, for his [excellent resource](http://www.regular-expressions.info) on regular expressions.
437
- * Run Paint Run Run, for his work on [Read Ruby](https://github.com/runpaint/read-ruby)
438
- * Yukihiro Matsumoto, of course! For "The Ruby", of course!
439
-
440
459
 
441
460
  ---
442
461
  ##### Copyright