regextest 0.1.2

Files changed (64)
  1. checksums.yaml +7 -0
  2. data/.gitignore +11 -0
  3. data/.rspec +2 -0
  4. data/.travis.yml +3 -0
  5. data/Gemfile +4 -0
  6. data/LICENSE.txt +25 -0
  7. data/README.md +88 -0
  8. data/Rakefile +55 -0
  9. data/bin/console +14 -0
  10. data/bin/regextest +4 -0
  11. data/bin/setup +7 -0
  12. data/contrib/Onigmo/RE.txt +522 -0
  13. data/contrib/Onigmo/UnicodeProps.txt +728 -0
  14. data/contrib/Onigmo/testpy.py +1319 -0
  15. data/contrib/unicode/Blocks.txt +298 -0
  16. data/contrib/unicode/CaseFolding.txt +1414 -0
  17. data/contrib/unicode/DerivedAge.txt +1538 -0
  18. data/contrib/unicode/DerivedCoreProperties.txt +11029 -0
  19. data/contrib/unicode/PropList.txt +1525 -0
  20. data/contrib/unicode/PropertyAliases.txt +193 -0
  21. data/contrib/unicode/PropertyValueAliases.txt +1420 -0
  22. data/contrib/unicode/README.txt +25 -0
  23. data/contrib/unicode/Scripts.txt +2539 -0
  24. data/contrib/unicode/UnicodeData.txt +29215 -0
  25. data/lib/pre-case-folding.rb +101 -0
  26. data/lib/pre-posix-char-class.rb +150 -0
  27. data/lib/pre-unicode.rb +116 -0
  28. data/lib/regextest.rb +268 -0
  29. data/lib/regextest/back.rb +58 -0
  30. data/lib/regextest/back/element.rb +151 -0
  31. data/lib/regextest/back/main.rb +356 -0
  32. data/lib/regextest/back/result.rb +498 -0
  33. data/lib/regextest/back/test-case.rb +268 -0
  34. data/lib/regextest/back/work-thread.rb +119 -0
  35. data/lib/regextest/common.rb +63 -0
  36. data/lib/regextest/front.rb +60 -0
  37. data/lib/regextest/front/anchor.rb +45 -0
  38. data/lib/regextest/front/back-refer.rb +120 -0
  39. data/lib/regextest/front/bracket-parser.rb +400 -0
  40. data/lib/regextest/front/bracket-parser.y +117 -0
  41. data/lib/regextest/front/bracket-scanner.rb +124 -0
  42. data/lib/regextest/front/bracket.rb +64 -0
  43. data/lib/regextest/front/builtin-functions.rb +31 -0
  44. data/lib/regextest/front/case-folding.rb +18 -0
  45. data/lib/regextest/front/char-class.rb +243 -0
  46. data/lib/regextest/front/empty.rb +43 -0
  47. data/lib/regextest/front/letter.rb +327 -0
  48. data/lib/regextest/front/manage-parentheses.rb +74 -0
  49. data/lib/regextest/front/parenthesis.rb +153 -0
  50. data/lib/regextest/front/parser.rb +1366 -0
  51. data/lib/regextest/front/parser.y +271 -0
  52. data/lib/regextest/front/range.rb +60 -0
  53. data/lib/regextest/front/repeat.rb +90 -0
  54. data/lib/regextest/front/repeatable.rb +77 -0
  55. data/lib/regextest/front/scanner.rb +187 -0
  56. data/lib/regextest/front/selectable.rb +65 -0
  57. data/lib/regextest/front/sequence.rb +73 -0
  58. data/lib/regextest/front/unicode.rb +1272 -0
  59. data/lib/regextest/regex-option.rb +144 -0
  60. data/lib/regextest/regexp.rb +44 -0
  61. data/lib/regextest/version.rb +5 -0
  62. data/lib/tst-reg-test.rb +159 -0
  63. data/regextest.gemspec +26 -0
  64. metadata +162 -0
checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 530438fd0ca372511d5dcc1154e0bbabe06a66d3
4
+ data.tar.gz: 90c8e3a8d8786a269f770817fcb864ab63e1536b
5
+ SHA512:
6
+ metadata.gz: 138de62d533d4c0040edf228901cee008f92e1ad50c94a7527c8de57f09c450b4b7397851ccac3af353fc64ddca1a694415b196bbacb3fe7ded4a4c68cdeaaa9
7
+ data.tar.gz: ecee2e7b6994a1feaa151ad9976481721a44a7545a520f7f0f87430dee6f942c56765cda93110011873e198a5c21680ff7b5befe940a0ab2313d068d22163b8a
data/.gitignore ADDED
@@ -0,0 +1,11 @@
1
+ /.bundle/
2
+ /.yardoc
3
+ /Gemfile.lock
4
+ /_yardoc/
5
+ /coverage/
6
+ /doc/
7
+ /pkg/
8
+ /spec/reports/
9
+ /tmp/
10
+ /work/
11
+ /*.gem
data/.rspec ADDED
@@ -0,0 +1,2 @@
1
+ --format documentation
2
+ --color
data/.travis.yml ADDED
@@ -0,0 +1,3 @@
1
+ language: ruby
2
+ rvm:
3
+ - 2.0.0
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in regextest.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,25 @@
1
+ Copyright (c) 2016, Mikio Ikoma. All rights reserved.
2
+
3
+ Redistribution and use in source and binary forms, with or without
4
+ modification, are permitted provided that the following conditions
5
+ are met:
6
+
7
+ 1. Redistributions of source code must retain the above copyright
8
+ notice, this list of conditions and the following disclaimer.
9
+
10
+ 2. Redistributions in binary form must reproduce the above copyright
11
+ notice, this list of conditions and the following disclaimer in the
12
+ documentation and/or other materials provided with the distribution.
13
+
14
+ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
15
+ "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
16
+ LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
17
+ FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
18
+ COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
19
+ INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
20
+ BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
21
+ LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
22
+ CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
23
+ LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
24
+ ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
25
+ POSSIBILITY OF SUCH DAMAGE.
data/README.md ADDED
@@ -0,0 +1,88 @@
1
+ # Regextest
2
+ Regextest generates sample strings that match a given regular expression. Unlike similar tools, it recognizes anchors, character classes, and other advanced notation of Ruby regex. It is aimed at programmers and students who are debugging or learning regular expressions.
3
+
4
+ ## Installation
5
+
6
+ You can try the [sample application](https://regtestweb.herokuapp.com/test_data/home) without installing anything. To use Regextest on your local machine, add this line to your application's Gemfile:
7
+
8
+ ```ruby
9
+ gem 'regextest'
10
+ ```
11
+
12
+ And then execute:
13
+
14
+ $ bundle
15
+
16
+ Or install it yourself as:
17
+
18
+ $ gem install regextest
19
+
20
+
21
+ ## Usage
22
+
23
+ ```ruby
24
+ require "regextest"
25
+
26
+ /\d{5}/.sample #=> "62853"
27
+ 5.times.map{/\w{5}/.sample} #=> ["mCcA5", "1s3Ae", "9HYbe", "x3T0A", "TJHlQ"]
28
+ /(?<=pre)body(?=post)/.sample #=> "prebodypost"
29
+ /(?=[a-z])\w{5}(?<=_\d)/.sample #=> "nCc_0"
30
+
31
+ /(?<=pre)body(?=post)/.match_data #=> #<MatchData "body">
32
+
33
+ palindrome = /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/
34
+ palindrome.sample #=> "a]r\\CC\\r]a"
35
+ palindrome.match_data #=> #<MatchData "z2#2z" a:"z2#2z" b:"2">
36
+ ```
37
+
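The sample strings shown above can be checked against their patterns with plain Ruby, without the gem installed. The following is a minimal sketch that uses only Ruby's built-in Regexp; the constant names and the `sample_ok?` helper are illustrative assumptions and are not part of Regextest.

```ruby
# Patterns copied from the usage examples above.
LOOKAROUND = /(?<=pre)body(?=post)/
MIXED      = /(?=[a-z])\w{5}(?<=_\d)/
PALINDROME = /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/

# Hypothetical helper: true when +sample+ satisfies +pattern+.
def sample_ok?(pattern, sample)
  !pattern.match(sample).nil?
end

# The generated sample includes the look-behind/look-ahead context,
# while the match itself covers only "body".
puts LOOKAROUND.match("prebodypost")[0]

puts sample_ok?(MIXED, "nCc_0")
puts sample_ok?(PALINDROME, "a]r\\CC\\r]a")
puts sample_ok?(PALINDROME, "z2#2z")
```

This is the same kind of check that the verification step performs with Ruby's own regexp engine.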
38
+ ## Parameters (environment variables)
39
+ - **REGEXTEST_DEBUG**
40
+ - Specify "1" to print verbose debugging information
41
+ - **REGEXTEST_MAX_RETRY**
42
+ - Retry count for generation. Default value is 5.
43
+ - **REGEXTEST_MAX_REPEAT**
44
+ - Maximum number of repetitions of an element when * or + is specified. Default value is 32.
45
+ - **REGEXTEST_MAX_RECURSION**
46
+ - Maximum nesting depth of \g<..>. Default value is 32.
47
+ - **REGEXTEST_UNICODE_CHAR_SET**
48
+ - The whole character set used in Unicode mode. Specify Unicode char-set names joined with "|". Default value is 'ascii|katakana|hiragana'.
49
+ - **REGEXTEST_TIMEOUT**
50
+ - Timeout in seconds for verifying the generated string (with Ruby's regexp). Default value is 1 second. Note that no timeout is applied while generating the string. It can be used for fuzzing.
51
+
52
+
53
+ ## Exceptions
54
+ - **Regextest::RegextestError**
55
+ - It is impossible to generate a string. It is a subclass of the standard exception RuntimeError.
56
+ - **Regextest::RegextestFailedToGenerate**
57
+ - Failed to generate a string (in many cases because of a restriction of Regextest). It is a subclass of the standard exception RuntimeError.
58
+ - **RuntimeError**
59
+ - A bug in Regextest
60
+ - **Regextest::RegextestTimeout**
61
+ - A timeout (1 second by default) was detected during verification. It is a subclass of the standard exception RuntimeError. To skip verification, call the sample method with the 'verification: false' option.
62
+
63
+ ```ruby
64
+ require "regextest"
65
+ /(1|11){100}$/.sample #=> raise Regextest::Common::RegextestTimeout: ...
66
+ /(1|11){100}$/.sample(verification: false) #=> '11111111...'
67
+ ```
68
+
69
+ ## Development
70
+
71
+ Visit the [git repository](https://bitbucket.org/ikomamik/regextest/src) for development.
72
+
73
+ ## Contributing
74
+
75
+ 1. Fork it ( https://bitbucket.org/ikomamik/regextest/fork )
76
+ 2. Create your feature branch (`git checkout -b my-new-feature`)
77
+ 3. Commit your changes (`git commit -am 'Add some feature'`)
78
+ 4. Push to the branch (`git push origin my-new-feature`)
79
+ 5. Create a new Pull Request
80
+
81
+ ## Major Bugs/Restrictions
82
+ 1. Insufficient support for Unicode classes
83
+ 2. Too slow when processing a regex that contains the "Han" class
84
+ 3. Limited support for possessive repeat
85
+ 4. Limited support for grapheme clusters (\R or \X)
86
+
87
+ See the [issue tracker](https://bitbucket.org/ikomamik/regextest/issues?status=new&status=open) for more details.
88
+
data/Rakefile ADDED
@@ -0,0 +1,55 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+
6
+ # task :default => :spec
7
+
8
+ task :default => [:make, :spec]
9
+
10
+ # Generating parser
11
+ file 'lib/regextest/front/parser.rb' => 'lib/regextest/front/parser.y' do
12
+ puts 'making regextest/front/parser.rb'
13
+ sh 'racc lib/regextest/front/parser.y -o lib/regextest/front/parser.rb'
14
+ end
15
+
16
+ # Generating bracket parser
17
+ file 'lib/regextest/front/bracket-parser.rb' => 'lib/regextest/front/bracket-parser.y' do
18
+ puts 'making regextest/front/bracket-parser.rb'
19
+ sh 'racc lib/regextest/front/bracket-parser.y -o lib/regextest/front/bracket-parser.rb'
20
+ end
21
+
22
+ # Generating Unicode parser
23
+ file 'lib/regextest/front/unicode.rb' => 'lib/pre-unicode.rb' do
24
+ puts "making regextest/front/unicode.rb"
25
+ sh 'ruby lib/pre-unicode.rb'
26
+ end
27
+
28
+ # Generating case-folding mapping
29
+ file 'lib/regextest/front/case-folding.rb' => 'lib/pre-case-folding.rb' do
30
+ puts "making regextest/front/case-folding.rb"
31
+ sh 'ruby lib/pre-case-folding.rb'
32
+ end
33
+
34
+ # Generating documents
35
+ file 'doc/index.html' => ['lib/regextest.rb', 'lib/regextest/regexp.rb', 'README.md'] do
36
+ puts "making document for Regextest"
37
+ sh 'yardoc lib/regextest.rb lib/regextest/regexp.rb'
38
+ end
39
+
40
+ task :make =>
41
+ ['lib/regextest/front/parser.rb',
42
+ 'lib/regextest/front/bracket-parser.rb',
43
+ 'lib/regextest/front/unicode.rb',
44
+ 'lib/regextest/front/case-folding.rb',
45
+ 'doc/index.html',
46
+ ] do
47
+ puts "Rake it!"
48
+ end
49
+
50
+ task :test => :make do
51
+ puts "Test it!"
52
+ sh 'ruby test.rb'
53
+ end
54
+
55
+
data/bin/console ADDED
@@ -0,0 +1,14 @@
1
+ #!/usr/bin/env ruby
2
+
3
+ require "bundler/setup"
4
+ require "regextest"
5
+
6
+ # You can add fixtures and/or initialization code here to make experimenting
7
+ # with your gem easier. You can also use a different console, if you like.
8
+
9
+ # (If you use this, don't forget to add pry to your Gemfile!)
10
+ # require "pry"
11
+ # Pry.start
12
+
13
+ require "irb"
14
+ IRB.start
data/bin/regextest ADDED
@@ -0,0 +1,4 @@
1
+ #!/usr/bin/ruby
2
+ ENV['RUBYLIB'] = File.expand_path("../lib", File.dirname(__FILE__)) # resolve lib/ relative to this script, not the current directory
3
+ command = 'ruby "${RUBYLIB}/regextest.rb" ' + $*.map{|e| e.inspect}.join(" ")
4
+ system command
data/bin/setup ADDED
@@ -0,0 +1,7 @@
1
+ #!/bin/bash
2
+ set -euo pipefail
3
+ IFS=$'\n\t'
4
+
5
+ bundle install
6
+
7
+ # Do any other automated setup that you need to do here
data/contrib/Onigmo/RE.txt ADDED
@@ -0,0 +1,522 @@
1
+ Onigmo (Oniguruma-mod) Regular Expressions Version 5.13.0 2012/01/19
2
+
3
+ syntax: ONIG_SYNTAX_RUBY (default)
4
+
5
+
6
+ 1. Syntax elements
7
+
8
+ \ escape (enable or disable meta character meaning)
9
+ | alternation
10
+ (...) group
11
+ [...] character class
12
+
13
+
14
+ 2. Characters
15
+
16
+ \t horizontal tab (0x09)
17
+ \v vertical tab (0x0B)
18
+ \n newline (0x0A)
19
+ \r return (0x0D)
20
+ \b back space (0x08)
21
+ \f form feed (0x0C)
22
+ \a bell (0x07)
23
+ \e escape (0x1B)
24
+ \nnn octal char (encoded byte value)
25
+ \xHH hexadecimal char (encoded byte value)
26
+ \x{7HHHHHHH} wide hexadecimal char (character code point value)
27
+ \cx control char (character code point value)
28
+ \C-x control char (character code point value)
29
+ \M-x meta (x|0x80) (character code point value)
30
+ \M-\C-x meta control char (character code point value)
31
+
32
+ (* \b is effective in character class [...] only)
33
+
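A few of the escapes listed in section 2 can be tried directly in Ruby, whose Regexp uses this syntax. This is a small sketch rather than an exhaustive check; `ESCAPE_EXAMPLES` is a name made up for the illustration.

```ruby
# Each pair is [pattern, string that the pattern is expected to match].
ESCAPE_EXAMPLES = [
  [/\t/,   "\t"],    # horizontal tab (0x09)
  [/\e/,   "\x1B"],  # escape (0x1B)
  [/\x41/, "A"],     # hexadecimal byte value
  [/\cA/,  "\x01"],  # control char
  [/[\b]/, "\x08"],  # \b is a backspace only inside [...]
]

ESCAPE_EXAMPLES.each do |pattern, text|
  puts "#{pattern.inspect} matches: #{pattern.match?(text)}"
end

# Outside a character class, \b is the word-boundary anchor,
# so it does not match a lone backspace character.
puts(/\b/.match?("\x08"))
```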
34
+
35
+ 3. Character types
36
+
37
+ . any character (except newline)
38
+
39
+ \w word character
40
+
41
+ Not Unicode:
42
+ alphanumeric and "_".
43
+
44
+ Unicode:
45
+ General_Category -- (Letter|Mark|Number|Connector_Punctuation)
46
+
47
+ Whether non-ASCII chars are included depends on the
48
+ ONIG_OPTION_ASCII_RANGE option.
49
+
50
+ \W non word char
51
+
52
+ \s whitespace char
53
+
54
+ Not Unicode:
55
+ \t, \n, \v, \f, \r, \x20
56
+
57
+ Unicode:
58
+ 0009, 000A, 000B, 000C, 000D, 0085(NEL),
59
+ General_Category -- Line_Separator
60
+ -- Paragraph_Separator
61
+ -- Space_Separator
62
+
63
+ Whether non-ASCII chars are included depends on the
64
+ ONIG_OPTION_ASCII_RANGE option.
65
+
66
+ \S non whitespace char
67
+
68
+ \d decimal digit char
69
+
70
+ Unicode: General_Category -- Decimal_Number
71
+
72
+ Whether non-ASCII chars are included depends on the
73
+ ONIG_OPTION_ASCII_RANGE option.
74
+
75
+ \D non decimal digit char
76
+
77
+ \h hexadecimal digit char [0-9a-fA-F]
78
+
79
+ \H non hexadecimal digit char
80
+
81
+
82
+ Character Property
83
+
84
+ * \p{property-name}
85
+ * \p{^property-name} (negative)
86
+ * \P{property-name} (negative)
87
+
88
+ property-name:
89
+
90
+ + works on all encodings
91
+ Alnum, Alpha, Blank, Cntrl, Digit, Graph, Lower,
92
+ Print, Punct, Space, Upper, XDigit, Word, ASCII,
93
+
94
+ + works on EUC_JP, Shift_JIS, CP932
95
+ Hiragana, Katakana, Han, Latin, Greek, Cyrillic
96
+
97
+ + works on UTF8, UTF16, UTF32
98
+ see UnicodeProps.txt
99
+
100
+
101
+ \R Linebreak
102
+
103
+ Unicode:
104
+ (?>\x0D\x0A|[\x0A-\x0D\x{85}\x{2028}\x{2029}])
105
+
106
+ Not Unicode:
107
+ (?>\x0D\x0A|[\x0A-\x0D])
108
+
109
+ \X eXtended grapheme cluster
110
+
111
+ Unicode:
112
+ (?>\P{M}\p{M}*)
113
+
114
+ Not Unicode:
115
+ (?m:.)
116
+
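The character types in section 3 behave as follows in Ruby when matching UTF-8 strings. This is an illustrative sketch; the constant names are assumptions made for the example.

```ruby
HEX_ONLY     = /\A\h+\z/       # \h is [0-9a-fA-F]
HIRAGANA     = /\p{Hiragana}/  # character property
NOT_HIRAGANA = /\P{Hiragana}/  # negated property (same as \p{^Hiragana})

puts HEX_ONLY.match?("ff0A")
puts HEX_ONLY.match?("xyz")
puts HIRAGANA.match?("\u3042")      # U+3042 HIRAGANA LETTER A
puts NOT_HIRAGANA.match?("\u3042")

# \R treats CR LF as one line break, so no empty field appears between
# "a" and "b".
p "a\r\nb\nc".split(/\R/)

# \X consumes a base character together with its combining marks:
# "e" + U+0301 (combining acute accent) is a single cluster.
p "e\u0301x".scan(/\X/).length
```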
117
+
118
+
119
+ 4. Quantifier
120
+
121
+ greedy
122
+
123
+ ? 1 or 0 times
124
+ * 0 or more times
125
+ + 1 or more times
126
+ {n,m} at least n but not more than m times
127
+ {n,} at least n times
128
+ {,n} at least 0 but not more than n times ({0,n})
129
+ {n} n times
130
+
131
+ reluctant
132
+
133
+ ?? 1 or 0 times
134
+ *? 0 or more times
135
+ +? 1 or more times
136
+ {n,m}? at least n but not more than m times
137
+ {n,}? at least n times
138
+ {,n}? at least 0 but not more than n times (== {0,n}?)
139
+
140
+ possessive (greedy and does not backtrack after repeated)
141
+
142
+ ?+ 1 or 0 times
143
+ *+ 0 or more times
144
+ ++ 1 or more times
145
+
146
+ ({n,m}+, {n,}+, {n}+ are possessive op. in ONIG_SYNTAX_JAVA and
147
+ ONIG_SYNTAX_PERL only)
148
+
149
+ ex. /a*+/ === /(?>a*)/
150
+
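The difference between the three quantifier families in section 4 shows up in how much text is consumed and whether the engine gives characters back. A short Ruby sketch, with constant names chosen only for this example:

```ruby
GREEDY     = /<.+>/    # takes as much as possible, then backtracks
RELUCTANT  = /<.+?>/   # takes as little as possible
POSSESSIVE = /a*+a/    # a*+ keeps every "a", so the final "a" can never match
BACKTRACK  = /a*a/     # a* gives one "a" back, so the final "a" matches

text = "<b>bold</b>"
puts text[GREEDY]      # whole string
puts text[RELUCTANT]   # only the first tag

puts POSSESSIVE.match?("aaa")
puts BACKTRACK.match?("aaa")
```

As the note above says, `/a*+/` is equivalent to the atomic group `/(?>a*)/`.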
151
+
152
+ 5. Anchors
153
+
154
+ ^ beginning of the line
155
+ $ end of the line
156
+ \b word boundary
157
+ \B not word boundary
158
+ \A beginning of string
159
+ \Z end of string, or before newline at the end
160
+ \z end of string
161
+ \G matching start position
162
+
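The line anchors and string anchors in section 5 differ once the subject contains newlines. A minimal Ruby sketch (the `TEXT` constant is just sample data for the illustration):

```ruby
TEXT = "one\ntwo\n"

p TEXT =~ /^two/    # ^ matches at the beginning of any line
p TEXT =~ /\Atwo/   # \A matches only at the beginning of the string
p TEXT =~ /two$/    # $ matches at the end of any line
p TEXT =~ /two\Z/   # \Z also matches before a newline at the end
p TEXT =~ /two\z/   # \z matches only at the very end of the string
```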
163
+
164
+ 6. Character class
165
+
166
+ ^... negative class (lowest precedence operator)
167
+ x-y range from x to y
168
+ [...] set (character class in character class)
169
+ ..&&.. intersection (low precedence at the next of ^)
170
+
171
+ ex. [a-w&&[^c-g]z] ==> ([a-w] AND ([^c-g] OR z)) ==> [abh-w]
172
+
173
+ * If you want to use '[', '-', ']' as a normal character
174
+ in a character class, you should escape these characters by '\'.
175
+
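The intersection example above can be confirmed by listing which lowercase letters the class accepts. This is a sketch in Ruby; `INTERSECTION` and `accepted_letters` are names invented for the illustration.

```ruby
INTERSECTION = /[a-w&&[^c-g]z]/   # ([a-w] AND ([^c-g] OR z))
BRACKETS     = /[\[\-\]]/         # '[', '-' and ']' escaped as plain characters

# Hypothetical helper: the lowercase letters accepted by +pattern+.
def accepted_letters(pattern)
  ("a".."z").select { |c| pattern.match?(c) }.join
end

puts accepted_letters(INTERSECTION)   # the letters of [abh-w]
puts "[-]".scan(BRACKETS).length      # each of the three characters matches
```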
176
+
177
+ POSIX bracket ([:xxxxx:], negate [:^xxxxx:])
178
+
179
+ Not Unicode Case:
180
+
181
+ alnum alphabet or digit char
182
+ alpha alphabet
183
+ ascii code value: [0 - 127]
184
+ blank \t, \x20
185
+ cntrl
186
+ digit 0-9
187
+ graph \x21-\x7E and all of multibyte encoded characters
188
+ lower
189
+ print \x20-\x7E and all of multibyte encoded characters
190
+ punct
191
+ space \t, \n, \v, \f, \r, \x20
192
+ upper
193
+ xdigit 0-9, a-f, A-F
194
+ word alphanumeric, "_" and multibyte characters
195
+
196
+
197
+ Unicode Case:
198
+
199
+ alnum Letter | Mark | Decimal_Number
200
+ alpha Letter | Mark
201
+ ascii 0000 - 007F
202
+ blank Space_Separator | 0009
203
+ cntrl Control | Format | Unassigned | Private_Use | Surrogate
204
+ digit Decimal_Number
205
+ graph [[:^space:]] && ^Control && ^Unassigned && ^Surrogate
206
+ lower Lowercase_Letter
207
+ print [[:graph:]] | Space_Separator
208
+ punct Connector_Punctuation | Dash_Punctuation | Close_Punctuation |
209
+ Final_Punctuation | Initial_Punctuation | Other_Punctuation |
210
+ Open_Punctuation
211
+ space Space_Separator | Line_Separator | Paragraph_Separator |
212
+ 0009 | 000A | 000B | 000C | 000D | 0085
213
+ upper Uppercase_Letter
214
+ xdigit 0030 - 0039 | 0041 - 0046 | 0061 - 0066
215
+ (0-9, a-f, A-F)
216
+ word Letter | Mark | Decimal_Number | Connector_Punctuation
217
+
218
+
219
+ It depends on ONIG_OPTION_ASCII_RANGE option and
220
+ ONIG_OPTION_POSIX_BRACKET_ALL_RANGE option that POSIX brackets
221
+ match non-ASCII char or not.
222
+
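In Ruby's default configuration, POSIX brackets follow the Unicode definitions above when the subject is a UTF-8 string, while `\d` stays ASCII-only. A small sketch, assuming UTF-8 strings; the constant names are illustrative.

```ruby
FULLWIDTH_ONE = "\uFF11"   # FULLWIDTH DIGIT ONE, General_Category Decimal_Number
HIRAGANA_A    = "\u3042"   # HIRAGANA LETTER A, General_Category Letter

puts(/[[:digit:]]/.match?(FULLWIDTH_ONE))  # POSIX bracket uses Decimal_Number
puts(/\d/.match?(FULLWIDTH_ONE))           # \d matches ASCII digits only
puts(/[[:alpha:]]/.match?(HIRAGANA_A))     # Letter | Mark
puts(/[[:^digit:]]/.match?("5"))           # negated POSIX bracket
```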
223
+
224
+
225
+ 7. Extended groups
226
+
227
+ (?#...) comment
228
+
229
+ (?imxdau-imx) option on/off
230
+ i: ignore case
231
+ m: multi-line (dot(.) match newline)
232
+ x: extended form
233
+
234
+ character set option (character range option)
235
+ d: Default (compatible with Ruby 1.9.3)
236
+ \w, \d and \s don't match non-ASCII characters.
237
+ \b, \B and POSIX brackets use each encoding's
238
+ rules.
239
+ a: ASCII
240
+ ONIG_OPTION_ASCII_RANGE option is turned on.
241
+ \w, \d, \s and POSIX brackets don't match
242
+ non-ASCII characters.
243
+ \b and \B use the ASCII rules.
244
+ u: Unicode
245
+ ONIG_OPTION_ASCII_RANGE option is turned off.
246
+ \w (\W), \d (\D), \s (\S), \b (\B) and POSIX
247
+ brackets use each encoding's rules.
248
+
249
+ (?imxdau-imx:subexp)
250
+ option on/off for subexp
251
+
252
+ (?:subexp) not captured group
253
+ (subexp) captured group
254
+
255
+ (?=subexp) look-ahead
256
+ (?!subexp) negative look-ahead
257
+ (?<=subexp) look-behind
258
+ (?<!subexp) negative look-behind
259
+
260
+ Subexp of look-behind must be of fixed character length.
261
+ Different character lengths are allowed only among top-level
262
+ alternatives.
263
+ ex. (?<=a|bc) is OK. (?<=aaa(?:b|cd)) is not allowed.
264
+
265
+ In negative-look-behind, captured group isn't allowed,
266
+ but shy group(?:) is allowed.
267
+
268
+ \K keep
269
+ Another expression of look-behind. Keep the stuff left
270
+ of the \K, don't include it in the result.
271
+
272
+ (?>subexp) atomic group
273
+ don't backtrack in subexp.
274
+
275
+ (?<name>subexp), (?'name'subexp)
276
+ define named group
277
+ (All characters of the name must be a word character.)
278
+
279
+ Not only a name but a number is assigned like a captured
280
+ group.
281
+
282
+ Assigning the same name to two or more subexps is allowed.
283
+ In this case, a subexp call can not be performed although
284
+ the back reference is possible.
285
+ (ONIG_SYNTAX_PERL: a subexp call is allowed in this case.)
286
+
287
+ (?(cond)yes-subexp), (?(cond)yes-subexp|no-subexp)
288
+ conditional expression
289
+ Matches yes-subexp if (cond) yields a true value, matches
290
+ no-subexp otherwise.
291
+ Following (cond) can be used:
292
+
293
+ (n) (n >= 1)
294
+ Checks if the numbered capturing group has matched
295
+ something.
296
+
297
+ (<name>), ('name')
298
+ Checks if a group with the given name has matched
299
+ something.
300
+
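Several of the extended groups in section 7 can be exercised from Ruby. The sketch below is illustrative only; the constant names and sample strings are assumptions for the example.

```ruby
DATE      = /(?<year>\d{4})-(?<month>\d{2})/  # named groups
AMOUNT    = /(?<=\$)\d+/                      # look-behind
KEEP      = /foo\Kbar/                        # \K drops "foo" from the result
ATOMIC    = /(?>a+)ab/                        # atomic group never gives an "a" back
PLAIN     = /a+ab/                            # ordinary group backtracks
BRACKETED = /\A(<)?\w+(?(1)>)\z/              # ">" is required only if "<" matched

m = DATE.match("released 2016-07")
puts m[:year], m[:month]

puts "price: $42"[AMOUNT]
puts "foobar"[KEEP]

puts ATOMIC.match?("aaab")
puts PLAIN.match?("aaab")

puts BRACKETED.match?("<tag>")
puts BRACKETED.match?("tag")
puts BRACKETED.match?("<tag")
```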
301
+
302
+ 8. Back reference
303
+
304
+ \n back reference by group number (n >= 1)
305
+ \k<n> back reference by group number (n >= 1)
306
+ \k'n' back reference by group number (n >= 1)
307
+ \k<-n> back reference by relative group number (n >= 1)
308
+ \k'-n' back reference by relative group number (n >= 1)
309
+ \k<name> back reference by group name
310
+ \k'name' back reference by group name
311
+
312
+ When a back reference uses a name that is defined more than once,
313
+ the subexp with the larger number is referred to preferentially.
314
+ (When it has not matched, the group with the smaller number is referred to.)
315
+
316
+ * Back reference by group number is forbidden if named group is defined
317
+ in the pattern and ONIG_OPTION_CAPTURE_GROUP is not set.
318
+
319
+ * ONIG_SYNTAX_PERL: \g{n}, \g{-n} and \g{name} can also be used.
320
+
321
+
322
+ back reference with nest level
323
+
324
+ level: 0, 1, 2, ...
325
+
326
+ \k<n+level> (n >= 1)
327
+ \k<n-level> (n >= 1)
328
+ \k'n+level' (n >= 1)
329
+ \k'n-level' (n >= 1)
330
+ \k<-n+level> (n >= 1)
331
+ \k<-n-level> (n >= 1)
332
+ \k'-n+level' (n >= 1)
333
+ \k'-n-level' (n >= 1)
334
+
335
+ \k<name+level>
336
+ \k<name-level>
337
+ \k'name+level'
338
+ \k'name-level'
339
+
340
+ Designates the nest level relative to the back reference position.
341
+
342
+ ex 1.
343
+
344
+ /\A(?<a>|.|(?:(?<b>.)\g<a>\k<b+0>))\z/.match("reer")
345
+
346
+ ex 2.
347
+
348
+ r = Regexp.compile(<<'__REGEXP__'.strip, Regexp::EXTENDED)
349
+ (?<element> \g<stag> \g<content>* \g<etag> ){0}
350
+ (?<stag> < \g<name> \s* > ){0}
351
+ (?<name> [a-zA-Z_:]+ ){0}
352
+ (?<content> [^<&]+ (\g<element> | [^<&]+)* ){0}
353
+ (?<etag> </ \k<name+1> >){0}
354
+ \g<element>
355
+ __REGEXP__
356
+
357
+ p r.match('<foo>f<bar>bbb</bar>f</foo>').captures
358
+
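The basic back reference forms from section 8 can be tried in Ruby as follows. This is a minimal sketch with made-up constant names; it does not cover the nest-level forms.

```ruby
QUOTED   = /(?<q>['"])(?<body>.*?)\k<q>/  # by name: closing quote must equal opening quote
NUMBERED = /(\w)\1/                       # by group number
RELATIVE = /(\w)\k<-1>/                   # by relative number: the group just before

puts QUOTED.match(%q{say "hi" now})[:body]
puts QUOTED.match?(%q{say "hi' now})      # mismatched quotes

puts "hello"[NUMBERED]
puts "hello"[RELATIVE]
```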
359
+
360
+
361
+ 9. Subexp call ("Tanaka Akira special")
362
+
363
+ \g<name> call by group name
364
+ \g'name' call by group name
365
+ \g<n> call by group number (n >= 1)
366
+ \g'n' call by group number (n >= 1)
367
+ \g<0> call the whole pattern recursively
368
+ \g'0' call the whole pattern recursively
369
+ \g<-n> call by relative group number (n >= 1)
370
+ \g'-n' call by relative group number (n >= 1)
371
+ \g<+n> call by relative group number (n >= 1)
372
+ \g'+n' call by relative group number (n >= 1)
373
+
374
+ * left-most recursive call is not allowed.
375
+ ex. (?<name>a|\g<name>b) => error
376
+ (?<name>a|b\g<name>c) => OK
377
+
378
+ * Call by group number is forbidden if named group is defined in the pattern
379
+ and ONIG_OPTION_CAPTURE_GROUP is not set.
380
+
381
+ * If the option status of called group is different from calling position
382
+ then the group's option is effective.
383
+
384
+ ex. (?-i:\g<name>)(?i:(?<name>a)){0} match to "A"
385
+
386
+ * ONIG_SYNTAX_PERL: use (?&name), (?n), (?-n), (?+n), (?R) or (?0) instead.
387
+
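A subexp call lets a group invoke itself, which is what makes nested structures matchable. The sketch below uses balanced parentheses as the example; `BALANCED` and `compiles?` are names assumed for the illustration.

```ruby
# A parenthesized group whose content is either non-parenthesis
# characters or another (recursively matched) parenthesized group.
BALANCED = /\A(?<paren>\((?:[^()]|\g<paren>)*\))\z/

# Hypothetical helper: true if +source+ is accepted as a pattern.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

puts BALANCED.match?("(a(b)c)")
puts BALANCED.match?("(a(b)c")

# A left-most recursive call is rejected when the pattern is compiled.
puts compiles?('(?<name>a|\g<name>b)')
puts compiles?('(?<name>a|b\g<name>c)')
```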
388
+
389
+ 10. Captured group
390
+
391
+ Behavior of the unnamed group (...) depends on the following conditions.
392
+ (Named groups are not affected.)
393
+
394
+ case 1. /.../ (named group is not used, no option)
395
+
396
+ (...) is treated as a captured group.
397
+
398
+ case 2. /.../g (named group is not used, 'g' option)
399
+
400
+ (...) is treated as a non-captured group (?:...).
401
+
402
+ case 3. /..(?<name>..)../ (named group is used, no option)
403
+
404
+ (...) is treated as a non-captured group (?:...).
405
+ numbered-backref/call is not allowed.
406
+
407
+ case 4. /..(?<name>..)../G (named group is used, 'G' option)
408
+
409
+ (...) is treated as a captured group.
410
+ numbered-backref/call is allowed.
411
+
412
+ where
413
+ g: ONIG_OPTION_DONT_CAPTURE_GROUP
414
+ G: ONIG_OPTION_CAPTURE_GROUP
415
+
416
+ ('g' and 'G' options are discussed on the ruby-dev ML)
417
+
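Ruby itself follows case 1 and case 3 of section 10, which can be observed through `MatchData#captures`. A short sketch; the constant names and the `compiles?` helper are assumptions for the example.

```ruby
UNNAMED_ONLY = /(a)(b)/        # case 1: both groups capture
WITH_NAMED   = /(?<n>a)(b)/    # case 3: (b) behaves like (?:b)

# Hypothetical helper: true if +source+ is accepted as a pattern.
def compiles?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

p UNNAMED_ONLY.match("ab").captures
p WITH_NAMED.match("ab").captures

# With a named group present, a numbered back reference is rejected.
puts compiles?('(?<n>a)\1')
puts compiles?('(?<n>a)\k<n>')
```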
418
+
419
+
420
+ -----------------------------
421
+ A-1. Syntax-dependent options
422
+
423
+ + ONIG_SYNTAX_RUBY
424
+ (?m): dot(.) match newline
425
+
426
+ + ONIG_SYNTAX_PERL, ONIG_SYNTAX_JAVA and ONIG_SYNTAX_PYTHON
427
+ (?s): dot(.) match newline
428
+ (?m): ^ match after newline, $ match before newline
429
+
430
+ + ONIG_SYNTAX_PERL
431
+ (?d), (?l): same as (?u)
432
+
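Under ONIG_SYNTAX_RUBY the `m` option only changes what the dot matches, whereas `^` and `$` are line-based regardless of options. A minimal Ruby sketch with illustrative constant names:

```ruby
DOT_DEFAULT   = /a.b/        # dot does not match a newline
DOT_MULTILINE = /a.b/m       # (?m): dot matches a newline
DOT_INLINE    = /(?m:a.b)/   # same option, applied to a subexp only

puts DOT_DEFAULT.match?("a\nb")
puts DOT_MULTILINE.match?("a\nb")
puts DOT_INLINE.match?("a\nb")

# ^ matches after a newline without any option in Ruby syntax.
p "x\ny" =~ /^y/
```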
433
+
434
+ A-2. Original extensions
435
+
436
+ + hexadecimal digit char type \h, \H
437
+ + named group (?<name>...), (?'name'...)
438
+ + named backref \k<name>
439
+ + subexp call \g<name>, \g<group-num>
440
+
441
+
442
+ A-3. Features lacking compared with Perl 5.14.0
443
+
444
+ + \N{name}, \N{U+xxxx}, \N
445
+ + \l,\u,\L,\U, \C
446
+ + \v, \V, \h, \H, \o{xxx}
447
+ + (?{code})
448
+ + (??{code})
449
+ + (?|...)
450
+ + (*VERB:ARG)
451
+
452
+ * \Q...\E
453
+ This is effective on ONIG_SYNTAX_PERL and ONIG_SYNTAX_JAVA.
454
+
455
+
456
+ A-4. Differences with Japanized GNU regex(version 0.12) of Ruby 1.8
457
+
458
+ + add character property (\p{property}, \P{property})
459
+ + add hexadecimal digit char type (\h, \H)
460
+ + add look-behind
461
+ (?<=fixed-char-length-pattern), (?<!fixed-char-length-pattern)
462
+ + add possessive quantifier. ?+, *+, ++
463
+ + add operations in character class. [], &&
464
+ ('[' must be escaped as a usual char in character class.)
465
+ + add named group and subexp call.
466
+ + octal or hexadecimal number sequence can be treated as
467
+ a multibyte code char in character class if multibyte encoding
468
+ is specified.
469
+ (ex. [\xa1\xa2], [\xa1\xa7-\xa4\xa1])
470
+ + allow the range of single byte char and multibyte char in character
471
+ class.
472
+ ex. /[a-<<any EUC-JP character>>]/ in EUC-JP encoding.
473
+ the effective range of an isolated option extends to the next ')'.
474
+ ex. (?:(?i)a|b) is interpreted as (?:(?i:a|b)), not (?:(?i:a)|b).
475
+ + isolated option is not transparent to previous pattern.
476
+ ex. a(?i)* is a syntax error pattern.
477
+ allowed incomplete left brace as a usual string.
478
+ ex. /{/, /({)/, /a{2,3/ etc...
479
+ + negative POSIX bracket [:^xxxx:] is supported.
480
+ + POSIX bracket [:ascii:] is added.
481
+ + repeat of look-ahead is not allowed.
482
+ ex. /(?=a)*/, /(?!b){5}/
483
+ The ignore-case option also applies to characters given by number.
484
+ ex. /\x61/i =~ "A"
485
+ In the range quantifier, the minimum may be omitted.
486
+ /a{,n}/ == /a{0,n}/
487
+ Omitting both the minimum and the maximum at the same time
488
+ is not allowed. (/a{,}/)
489
+ + /a{n}?/ is not a non-greedy operator.
490
+ /a{n}?/ == /(?:a{n})?/
491
+ invalid back reference is checked and causes an error.
492
+ /\1/, /(a)\2/
493
+ + Zero-length match in infinite repeat stops the repeat,
494
+ then changes of the capture group status are checked as stop condition.
495
+ /(?:()|())*\1\2/ =~ ""
496
+ /(?:\1a|())*/ =~ "a"
497
+
498
+
499
+ A-5. Disabled functions by default syntax
500
+
501
+ + capture history
502
+
503
+ (?@...) and (?@<name>...)
504
+
505
+ ex. /(?@a)*/.match("aaa") ==> [<0-1>, <1-2>, <2-3>]
506
+
507
+ see sample/listcap.c file.
508
+
509
+
510
+ A-6. Problems
511
+
512
+ + Invalid encoding byte sequence is not checked.
513
+
514
+ ex. UTF-8
515
+
516
+ * Invalid first byte is treated as a character.
517
+ /./u =~ "\xa3"
518
+
519
+ * Incomplete byte sequence is not checked.
520
+ /\w+/ =~ "a\xf3\x8ec"
521
+
522
+ // END