regexp_parser 2.6.0 → 2.9.2
- checksums.yaml +4 -4
- data/Gemfile +5 -5
- data/LICENSE +1 -1
- data/lib/regexp_parser/expression/base.rb +0 -7
- data/lib/regexp_parser/expression/classes/alternation.rb +1 -1
- data/lib/regexp_parser/expression/classes/backreference.rb +17 -3
- data/lib/regexp_parser/expression/classes/character_set/range.rb +2 -7
- data/lib/regexp_parser/expression/classes/character_set.rb +4 -8
- data/lib/regexp_parser/expression/classes/conditional.rb +2 -6
- data/lib/regexp_parser/expression/classes/escape_sequence.rb +3 -1
- data/lib/regexp_parser/expression/classes/free_space.rb +3 -1
- data/lib/regexp_parser/expression/classes/group.rb +0 -22
- data/lib/regexp_parser/expression/classes/keep.rb +1 -1
- data/lib/regexp_parser/expression/classes/posix_class.rb +5 -5
- data/lib/regexp_parser/expression/classes/unicode_property.rb +11 -11
- data/lib/regexp_parser/expression/methods/construct.rb +2 -4
- data/lib/regexp_parser/expression/methods/match_length.rb +8 -4
- data/lib/regexp_parser/expression/methods/negative.rb +20 -0
- data/lib/regexp_parser/expression/methods/parts.rb +23 -0
- data/lib/regexp_parser/expression/methods/printing.rb +26 -0
- data/lib/regexp_parser/expression/methods/tests.rb +40 -3
- data/lib/regexp_parser/expression/methods/traverse.rb +35 -19
- data/lib/regexp_parser/expression/quantifier.rb +30 -17
- data/lib/regexp_parser/expression/sequence.rb +5 -10
- data/lib/regexp_parser/expression/sequence_operation.rb +4 -9
- data/lib/regexp_parser/expression/shared.rb +37 -20
- data/lib/regexp_parser/expression/subexpression.rb +20 -15
- data/lib/regexp_parser/expression.rb +34 -31
- data/lib/regexp_parser/lexer.rb +76 -36
- data/lib/regexp_parser/parser.rb +101 -100
- data/lib/regexp_parser/scanner/errors/premature_end_error.rb +8 -0
- data/lib/regexp_parser/scanner/errors/scanner_error.rb +6 -0
- data/lib/regexp_parser/scanner/errors/validation_error.rb +63 -0
- data/lib/regexp_parser/scanner/properties/long.csv +29 -0
- data/lib/regexp_parser/scanner/properties/short.csv +3 -0
- data/lib/regexp_parser/scanner/property.rl +2 -2
- data/lib/regexp_parser/scanner/scanner.rl +101 -172
- data/lib/regexp_parser/scanner.rb +1132 -1283
- data/lib/regexp_parser/syntax/token/backreference.rb +3 -0
- data/lib/regexp_parser/syntax/token/character_set.rb +3 -0
- data/lib/regexp_parser/syntax/token/escape.rb +3 -1
- data/lib/regexp_parser/syntax/token/meta.rb +9 -2
- data/lib/regexp_parser/syntax/token/unicode_property.rb +35 -1
- data/lib/regexp_parser/syntax/token/virtual.rb +11 -0
- data/lib/regexp_parser/syntax/token.rb +13 -13
- data/lib/regexp_parser/syntax/version_lookup.rb +0 -8
- data/lib/regexp_parser/syntax/versions.rb +3 -1
- data/lib/regexp_parser/syntax.rb +1 -1
- data/lib/regexp_parser/version.rb +1 -1
- data/lib/regexp_parser.rb +6 -6
- data/regexp_parser.gemspec +5 -5
- metadata +14 -8
- data/CHANGELOG.md +0 -601
- data/README.md +0 -503
data/README.md
DELETED
@@ -1,503 +0,0 @@
# Regexp::Parser

[![Gem Version](https://badge.fury.io/rb/regexp_parser.svg)](http://badge.fury.io/rb/regexp_parser)
[![Build Status](https://github.com/ammar/regexp_parser/workflows/tests/badge.svg)](https://github.com/ammar/regexp_parser/actions)
[![Build Status](https://github.com/ammar/regexp_parser/workflows/gouteur/badge.svg)](https://github.com/ammar/regexp_parser/actions)
[![Code Climate](https://codeclimate.com/github/ammar/regexp_parser.svg)](https://codeclimate.com/github/ammar/regexp_parser/badges)

A Ruby gem for tokenizing, parsing, and transforming regular expressions.

* Multilayered
  * A scanner/tokenizer based on [Ragel](http://www.colm.net/open-source/ragel/)
  * A lexer that produces a "stream" of [Token objects](https://github.com/ammar/regexp_parser/wiki/Token-Objects)
  * A parser that produces a "tree" of [Expression objects (OO API)](https://github.com/ammar/regexp_parser/wiki/Expression-Objects)
* Runs on Ruby 2.x, 3.x and JRuby runtimes
* Recognizes Ruby 1.8, 1.9, 2.x and 3.x regular expressions [See Supported Syntax](#supported-syntax)

_For examples of regexp_parser in use, see [Example Projects](#example-projects)._

---
## Requirements

* Ruby >= 2.0
* Ragel >= 6.0, but only if you want to build the gem or work on the scanner.

---
## Install

Install the gem with:

`gem install regexp_parser`

Or, add it to your project's `Gemfile`:

```gem 'regexp_parser', '~> X.Y.Z'```

See the badge at the top of this README or [rubygems](https://rubygems.org/gems/regexp_parser)
for the latest version number.

---
## Usage

The three main modules are **Scanner**, **Lexer**, and **Parser**. Each of them
provides a single method that takes a regular expression (as a Regexp object or
a string) and returns its results. The **Lexer** and the **Parser** accept an
optional second argument that specifies the syntax version, like 'ruby/2.0',
which defaults to the host Ruby version (using RUBY_VERSION).

Here are the basic usage examples:

```ruby
require 'regexp_parser'

Regexp::Scanner.scan(regexp)

Regexp::Lexer.lex(regexp)

Regexp::Parser.parse(regexp)
```

All three methods accept a block as the last argument, which, if given, gets
called with the results as follows:

* **Scanner**: the block gets passed the results as they are scanned. See the
  example in the next section for details.

* **Lexer**: after completion, the block gets passed the tokens one by one.
  _The result of the block is returned._

* **Parser**: after completion, the block gets passed the root expression.
  _The result of the block is returned._

All three methods accept either a `Regexp` or a `String` (containing the pattern).
If a String is passed, `options` can be supplied:

```ruby
require 'regexp_parser'

Regexp::Parser.parse(
  "a+ # Recognizes a and A...",
  options: ::Regexp::EXTENDED | ::Regexp::IGNORECASE
)
```
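
The `options` value is the same integer bit mask that Ruby's own `Regexp.new` accepts. The following sketch uses only core Ruby (not the gem) to show what the combined flags mean for the pattern string above; the variable names are illustrative.

```ruby
# Regexp option constants are integer bit flags combined with bitwise OR.
flags = Regexp::EXTENDED | Regexp::IGNORECASE

# Building a Regexp from the same pattern string shows the effect:
# EXTENDED ignores the whitespace and the trailing comment,
# IGNORECASE lets the pattern match upper-case input.
pattern = "a+ # Recognizes a and A..."
regexp = Regexp.new(pattern, flags)

puts regexp.match?("AAA")    # true
puts regexp.options == flags # true
```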

---
## Components

### Scanner
A Ragel-generated scanner that recognizes the cumulative syntax of all
supported syntax versions. It breaks a given expression's text into the
smallest parts, and identifies their type, token, text, and start/end
offsets within the pattern.

#### Example
The following scans the given pattern and prints out the type, token, text and
start/end offsets for each token found.

```ruby
require 'regexp_parser'

Regexp::Scanner.scan(/(ab?(cd)*[e-h]+)/) do |type, token, text, ts, te|
  puts "type: #{type}, token: #{token}, text: '#{text}' [#{ts}..#{te}]"
end

# output
# type: group, token: capture, text: '(' [0..1]
# type: literal, token: literal, text: 'ab' [1..3]
# type: quantifier, token: zero_or_one, text: '?' [3..4]
# type: group, token: capture, text: '(' [4..5]
# type: literal, token: literal, text: 'cd' [5..7]
# type: group, token: close, text: ')' [7..8]
# type: quantifier, token: zero_or_more, text: '*' [8..9]
# type: set, token: open, text: '[' [9..10]
# type: set, token: range, text: 'e-h' [10..13]
# type: set, token: close, text: ']' [13..14]
# type: quantifier, token: one_or_more, text: '+' [14..15]
# type: group, token: close, text: ')' [15..16]
```

A one-liner that uses map on the result of the scan to return the textual
parts of the pattern:

```ruby
Regexp::Scanner.scan(/(cat?([bhm]at)){3,5}/).map { |token| token[2] }
#=> ["(", "cat", "?", "(", "[", "b", "h", "m", "]", "at", ")", ")", "{3,5}"]
```

#### Notes
* The scanner performs basic syntax error checking, like detecting missing
  balancing punctuation and premature end of pattern. Flavor validity checks
  are performed in the lexer, which uses a syntax object.

* If the input is a Ruby **Regexp** object, the scanner calls #source on it to
  get its string representation. #source does not include the options of
  the expression (m, i, and x). To include the options in the scan, #to_s
  should be called on the **Regexp** before passing it to the scanner or the
  lexer. For the parser, however, this is not necessary. It automatically
  exposes the options of a passed **Regexp** in the returned root expression.

* To keep the scanner simple(r) and fairly reusable for other purposes, it
  does not perform lexical analysis on the tokens, sticking to the task
  of identifying the smallest possible tokens and leaving lexical analysis
  to the lexer.

* The MRI implementation may accept expressions that either conflict with
  the documentation or are undocumented, like `{}` and `]` _(unescaped)_.
  The scanner will try to support as many of these cases as possible.
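
The difference between `#source` and `#to_s` mentioned in the notes above can be seen with core Ruby alone. This short sketch does not use the gem; the pattern is an arbitrary example.

```ruby
regexp = /a+ b/ix

# #source returns the pattern text only; the i and x options are lost.
puts regexp.source   # a+ b

# #to_s embeds the options in an inline options group, so they survive
# when the string is handed to a scanner or lexer.
puts regexp.to_s     # (?ix-m:a+ b)

# #options exposes the same flags as an integer bit mask.
puts regexp.options  # 3
```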

---
### Syntax
Defines the supported tokens for a specific engine implementation (aka a
flavor). Syntax classes act as lookup tables, and are layered to create
flavor variations. Syntax only comes into play in the lexer.

#### Example
The following fetches syntax objects for Ruby 2.0, 1.9, 1.8, and
checks a few of their implementation features.

```ruby
require 'regexp_parser'

ruby_20 = Regexp::Syntax.for 'ruby/2.0'
ruby_20.implements? :quantifier,  :zero_or_one             # => true
ruby_20.implements? :quantifier,  :zero_or_one_reluctant   # => true
ruby_20.implements? :quantifier,  :zero_or_one_possessive  # => true
ruby_20.implements? :conditional, :condition               # => true

ruby_19 = Regexp::Syntax.for 'ruby/1.9'
ruby_19.implements? :quantifier,  :zero_or_one             # => true
ruby_19.implements? :quantifier,  :zero_or_one_reluctant   # => true
ruby_19.implements? :quantifier,  :zero_or_one_possessive  # => true
ruby_19.implements? :conditional, :condition               # => false

ruby_18 = Regexp::Syntax.for 'ruby/1.8'
ruby_18.implements? :quantifier,  :zero_or_one             # => true
ruby_18.implements? :quantifier,  :zero_or_one_reluctant   # => true
ruby_18.implements? :quantifier,  :zero_or_one_possessive  # => false
ruby_18.implements? :conditional, :condition               # => false
```

Syntax objects can also be queried about their complete and relative feature sets.

```ruby
require 'regexp_parser'

ruby_20 = Regexp::Syntax.for 'ruby/2.0' # => Regexp::Syntax::V2_0_0
ruby_20.added_features                  # => { conditional: [...], ... }
ruby_20.removed_features                # => { property: [:newline], ... }
ruby_20.features                        # => { anchor: [...], ... }
```

#### Notes
* Variations on a token, for example a named group with angle brackets (< and >)
  vs one with a pair of single quotes, are specified with an underscore followed
  by two characters appended to the base token. In the previous named group example,
  the tokens would be :named_ab (angle brackets) and :named_sq (single quotes).
  These variations are normalized by the syntax to :named.
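
To make the naming convention in the note above concrete, here is a small sketch in plain Ruby. The helper `normalize_variation` is hypothetical and is not part of the gem; it only illustrates how a two-character variation suffix maps back to its base token.

```ruby
# Hypothetical helper (not the gem's implementation): strips a variation
# suffix such as `_ab` (angle brackets) or `_sq` (single quotes).
def normalize_variation(token)
  token.to_s.sub(/_(ab|sq)\z/, '').to_sym
end

puts normalize_variation(:named_ab).inspect # :named
puts normalize_variation(:named_sq).inspect # :named
puts normalize_variation(:capture).inspect  # :capture
```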

---
### Lexer
Sits on top of the scanner and performs lexical analysis on the tokens that
it emits. Among its tasks are: breaking quantified literal runs, collecting the
emitted token attributes into Token objects, calculating their nesting depth,
normalizing tokens for the parser, and checking if the tokens are implemented by
the given syntax version.

See the [Token Objects](https://github.com/ammar/regexp_parser/wiki/Token-Objects)
wiki page for more information on Token objects.

#### Example
The following example lexes the given pattern, checks it against the Ruby 1.9
syntax, and prints the token objects' text indented to their level.

```ruby
require 'regexp_parser'

Regexp::Lexer.lex(/a?(b(c))*[d]+/, 'ruby/1.9') do |token|
  puts "#{'  ' * token.level}#{token.text}"
end

# output
# a
# ?
# (
#   b
#   (
#     c
#   )
# )
# *
# [
#   d
# ]
# +
```

A one-liner that returns an array of the textual parts of the given pattern.
Compare the output with that of the one-liner example of the **Scanner**; notably
how the sequence 'cat' is treated. The 't' is separated because it's followed
by a quantifier that only applies to it.

```ruby
Regexp::Lexer.scan(/(cat?([b]at)){3,5}/).map { |token| token.text }
#=> ["(", "ca", "t", "?", "(", "[", "b", "]", "at", ")", ")", "{3,5}"]
```
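
The idea of breaking a quantified literal run can be sketched in plain Ruby. The helper `break_literal_run` below is hypothetical and is not the lexer's actual code; it only shows why 'cat' followed by `?` becomes 'ca' and 't'.

```ruby
# Hypothetical helper (not the gem's implementation): splits a literal run
# so that a following quantifier applies to the last character only.
def break_literal_run(text)
  return [text] if text.length < 2

  [text[0...-1], text[-1]]
end

puts break_literal_run("cat").inspect    # ["ca", "t"]
puts break_literal_run("a").inspect      # ["a"]
puts break_literal_run("ルビー").inspect # ["ルビ", "ー"]
```

String indexing in Ruby is character-based, so the split also works for multibyte literals.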

#### Notes
* The syntax argument is optional. It defaults to the version of the Ruby
  interpreter in use, as returned by RUBY_VERSION.

* The lexer normalizes some tokens, as noted in the Syntax section above.

---
### Parser
Sits on top of the lexer and transforms the "stream" of Token objects emitted
by it into a tree of Expression objects represented by an instance of the
Expression::Root class.

See the [Expression Objects](https://github.com/ammar/regexp_parser/wiki/Expression-Objects)
wiki page for attributes and methods.

#### Example

```ruby
require 'regexp_parser'

regex = /a?(b+(c)d)*(?<name>[0-9]+)/

tree = Regexp::Parser.parse(regex, 'ruby/2.1')

tree.traverse do |event, exp|
  puts "#{event}: #{exp.type} `#{exp.to_s}`"
end

# Output
# visit: literal `a?`
# enter: group `(b+(c)d)*`
# visit: literal `b+`
# enter: group `(c)`
# visit: literal `c`
# exit: group `(c)`
# visit: literal `d`
# exit: group `(b+(c)d)*`
# enter: group `(?<name>[0-9]+)`
# visit: set `[0-9]+`
# exit: group `(?<name>[0-9]+)`
```

Another example, using each_expression and strfregexp to print the object tree.
_See the traverse.rb and strfregexp.rb files under `lib/regexp_parser/expression/methods`
for more information on these methods._

```ruby
include_root = true
indent_offset = include_root ? 1 : 0

tree.each_expression(include_root) do |exp, level_index|
  puts exp.strfregexp("%>> %c", indent_offset)
end

# Output
# > Regexp::Expression::Root
#   > Regexp::Expression::Literal
#   > Regexp::Expression::Group::Capture
#     > Regexp::Expression::Literal
#     > Regexp::Expression::Group::Capture
#       > Regexp::Expression::Literal
#     > Regexp::Expression::Literal
#   > Regexp::Expression::Group::Named
#     > Regexp::Expression::CharacterSet
```

_Note: quantifiers do not appear in the output because they are not separate
nodes in the tree; each quantifier is a member of the expression it applies to.
See the [Expression Objects](https://github.com/ammar/regexp_parser/wiki/Expression-Objects)
wiki page for details._

---

## Supported Syntax
The three modules support all the regular expression syntax features of Ruby 1.8,
1.9, 2.x and 3.x:

_Note that not all of these are available in all versions of Ruby_

| Syntax Feature | Examples | ⋯ |
| ------------------------------------- | ------------------------------------------------------- |:--------:|
| **Alternation** | `a\|b\|c` | ✓ |
| **Anchors** | `\A`, `^`, `\b` | ✓ |
| **Character Classes** | `[abc]`, `[^\\]`, `[a-d&&aeiou]`, `[a=e=b]` | ✓ |
| **Character Types** | `\d`, `\H`, `\s` | ✓ |
| **Cluster Types** | `\R`, `\X` | ✓ |
| **Conditional Exps.** | `(?(cond)yes-subexp)`, `(?(cond)yes-subexp\|no-subexp)` | ✓ |
| **Escape Sequences** | `\t`, `\\+`, `\?` | ✓ |
| **Free Space** | whitespace and `# Comments` _(x modifier)_ | ✓ |
| **Grouped Exps.** | | ⋱ |
| &emsp;_**Assertions**_ | | ⋱ |
| &emsp;&emsp;_Lookahead_ | `(?=abc)` | ✓ |
| &emsp;&emsp;_Negative Lookahead_ | `(?!abc)` | ✓ |
| &emsp;&emsp;_Lookbehind_ | `(?<=abc)` | ✓ |
| &emsp;&emsp;_Negative Lookbehind_ | `(?<!abc)` | ✓ |
| &emsp;_**Atomic**_ | `(?>abc)` | ✓ |
| &emsp;_**Absence**_ | `(?~abc)` | ✓ |
| &emsp;_**Back-references**_ | | ⋱ |
| &emsp;&emsp;_Named_ | `\k<name>` | ✓ |
| &emsp;&emsp;_Nest Level_ | `\k<n-1>` | ✓ |
| &emsp;&emsp;_Numbered_ | `\k<1>` | ✓ |
| &emsp;&emsp;_Relative_ | `\k<-2>` | ✓ |
| &emsp;&emsp;_Traditional_ | `\1` through `\9` | ✓ |
| &emsp;_**Capturing**_ | `(abc)` | ✓ |
| &emsp;_**Comments**_ | `(?# comment text)` | ✓ |
| &emsp;_**Named**_ | `(?<name>abc)`, `(?'name'abc)` | ✓ |
| &emsp;_**Options**_ | `(?mi-x:abc)`, `(?a:\s\w+)`, `(?i)` | ✓ |
| &emsp;_**Passive**_ | `(?:abc)` | ✓ |
| &emsp;_**Subexp. Calls**_ | `\g<name>`, `\g<1>` | ✓ |
| **Keep** | `\K`, `(ab\Kc\|d\Ke)f` | ✓ |
| **Literals** _(utf-8)_ | `Ruby`, `ルビー`, `روبي` | ✓ |
| **POSIX Classes** | `[:alpha:]`, `[:^digit:]` | ✓ |
| **Quantifiers** | | ⋱ |
| &emsp;_**Greedy**_ | `?`, `*`, `+`, `{m,M}` | ✓ |
| &emsp;_**Reluctant** (Lazy)_ | `??`, `*?`, `+?` \[1\] | ✓ |
| &emsp;_**Possessive**_ | `?+`, `*+`, `++` \[1\] | ✓ |
| **String Escapes** | | ⋱ |
| &emsp;_**Control** \[2\]_ | `\C-C`, `\cD` | ✓ |
| &emsp;_**Hex**_ | `\x20`, `\x{701230}` | ✓ |
| &emsp;_**Meta** \[2\]_ | `\M-c`, `\M-\C-C`, `\M-\cC`, `\C-\M-C`, `\c\M-C` | ✓ |
| &emsp;_**Octal**_ | `\0`, `\01`, `\012` | ✓ |
| &emsp;_**Unicode**_ | `\uHHHH`, `\u{H+ H+}` | ✓ |
| **Unicode Properties** | _<sub>([Unicode 13.0.0])</sub>_ | ⋱ |
| &emsp;_**Age**_ | `\p{Age=5.2}`, `\P{age=7.0}`, `\p{^age=8.0}` | ✓ |
| &emsp;_**Blocks**_ | `\p{InArmenian}`, `\P{InKhmer}`, `\p{^InThai}` | ✓ |
| &emsp;_**Classes**_ | `\p{Alpha}`, `\P{Space}`, `\p{^Alnum}` | ✓ |
| &emsp;_**Derived**_ | `\p{Math}`, `\P{Lowercase}`, `\p{^Cased}` | ✓ |
| &emsp;_**General Categories**_ | `\p{Lu}`, `\P{Cs}`, `\p{^sc}` | ✓ |
| &emsp;_**Scripts**_ | `\p{Arabic}`, `\P{Hiragana}`, `\p{^Greek}` | ✓ |
| &emsp;_**Simple**_ | `\p{Dash}`, `\p{Extender}`, `\p{^Hyphen}` | ✓ |

[Unicode 13.0.0]: https://www.unicode.org/versions/Unicode13.0.0/

**\[1\]**: Ruby does not support lazy or possessive interval quantifiers.
Any `+` or `?` that follows an interval quantifier will be treated as another,
chained quantifier. See also [#3](https://github.com/ammar/regexp_parser/issue/3),
[#69](https://github.com/ammar/regexp_parser/pull/69).
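
The chaining described in footnote \[1\] can be observed in core Ruby without the gem. In this small sketch the `+` after `{2}` repeats the interval instead of making it possessive; the input string is arbitrary.

```ruby
# `a{2}+` is read as a chained quantifier, i.e. like `(?:a{2})+`,
# so it consumes as many pairs of "a" as it can.
chained  = "aaaaa"[/a{2}+/]
explicit = "aaaaa"[/(?:a{2})+/]

puts chained             # aaaa
puts chained == explicit # true
```

A possessive reading of `a{2}+` would have matched exactly two characters instead.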

**\[2\]**: As of Ruby 3.1, meta and control sequences are [pre-processed to hex
escapes when used in Regexp literals](https://github.com/ruby/ruby/commit/11ae581),
so they will only reach the scanner and will only be emitted if a String or a Regexp
that has been built with the `::new` constructor is scanned.
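
Footnote \[2\] can be checked with core Ruby alone. This sketch builds a Regexp from a single-quoted String so that the control sequence is kept in the source; what a Regexp literal reports depends on the Ruby version, so no output is claimed for it here.

```ruby
# A single-quoted String keeps the backslash, so the control sequence
# itself ends up in the source of the Regexp.
from_string = Regexp.new('\cD')

puts from_string.source           # \cD
puts from_string.match?("\x04")   # true (control-D is byte 0x04)

# For comparison: per the footnote, the source of this literal is
# pre-processed to a hex escape on Ruby >= 3.1.
literal = /\cD/
puts literal.source
```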

##### Inapplicable Features

Some modifiers, like `o` and `s`, apply to the **Regexp** object itself and do not
appear in its source. Other such modifiers include the encoding modifiers `e` and `n`
([see the Ruby documentation](http://www.ruby-doc.org/core-2.5.0/Regexp.html#class-Regexp-label-Encoding)).
These are not seen by the scanner.

The following features are not currently enabled for Ruby by its regular
expressions library (Onigmo). They are not supported by the scanner.

- **Quotes**: `\Q...\E` _[[See]](https://github.com/k-takata/Onigmo/blob/7911409/doc/RE#L499)_
- **Capture History**: `(?@...)`, `(?@<name>...)` _[[See]](https://github.com/k-takata/Onigmo/blob/7911409/doc/RE#L550)_

See something missing? Please submit an [issue](https://github.com/ammar/regexp_parser/issues).

_**Note**: Attempting to process expressions with unsupported syntax features can raise
an error, or incorrectly return tokens/objects as literals._

## Testing
To run the tests, simply run `rake` from the root directory.

The default task generates the scanner's code from the Ragel source files and runs
all the specs, so it requires Ragel to be installed.

Note that changes to Ragel files will not be reflected when running `rspec` on its own,
so to run individual tests you might want to run:

```
rake ragel:rb && rspec spec/scanner/properties_spec.rb
```

## Building
Building the scanner and the gem requires [Ragel](http://www.colm.net/open-source/ragel/)
to be installed. The build tasks will automatically invoke the 'ragel:rb' task to generate
the Ruby scanner code.

The project uses the standard rubygems package tasks, so:

To build the gem, run:
```
rake build
```

To install the gem from the cloned project, run:
```
rake install
```

## Example Projects
Projects using regexp_parser.

- [capybara](https://github.com/teamcapybara/capybara) is an integration testing tool
  that uses regexp_parser to convert Regexps to css/xpath selectors.

- [js_regex](https://github.com/jaynetics/js_regex) converts Ruby regular expressions
  to JavaScript-compatible regular expressions.

- [meta_re](https://github.com/ammar/meta_re) is a regular expression preprocessor
  with alias support.

- [mutant](https://github.com/mbj/mutant) manipulates your regular expressions
  (amongst others) to see if your tests cover their behavior.

- [repper](https://github.com/jaynetics/repper) is a regular expression
  pretty-printer and formatter for Ruby.

- [rubocop](https://github.com/rubocop-hq/rubocop) is a linter for Ruby that
  uses regexp_parser to lint Regexps.

- [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) is a localization helper
  that uses regexp_parser to generate examples of postal codes.

## References
Documentation and books used while working on this project.

#### Ruby Flavors
* Oniguruma Regular Expressions (Ruby 1.9.x) [link](https://github.com/kkos/oniguruma/blob/master/doc/RE)
* Onigmo Regular Expressions (Ruby >= 2.0) [link](https://github.com/k-takata/Onigmo/blob/master/doc/RE)

#### Regular Expressions
* Mastering Regular Expressions, By Jeffrey E.F. Friedl (2nd Edition) [book](http://oreilly.com/catalog/9781565922570/)
* Regular Expression Flavor Comparison [link](http://www.regular-expressions.info/refflavors.html)
* Enumerating the strings of regular languages [link](http://www.cs.dartmouth.edu/~doug/nfa.ps.gz)
* Stack Overflow Regular Expressions FAQ [link](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075)

#### Unicode
* Unicode Explained, By Jukka K. Korpela. [book](http://oreilly.com/catalog/9780596101213)
* Unicode Derived Properties [link](http://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt)
* Unicode Property Aliases [link](http://www.unicode.org/Public/UNIDATA/PropertyAliases.txt)
* Unicode Regular Expressions [link](http://www.unicode.org/reports/tr18/)
* Unicode Standard Annex #44 [link](http://www.unicode.org/reports/tr44/)

---
##### Copyright
_Copyright (c) 2010-2022 Ammar Ali. See LICENSE file for details._