regexp_parser 2.0.3 → 2.1.0

This diff shows the changes between two publicly released versions of the package, as they appear in the public registry. It is provided for informational purposes only.
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: 42283562f90dc131bff21d7988b76867d1bd3bfc828373be9dce75c336300e1e
4
- data.tar.gz: c7d0122495e338d2535ac7569f20257b70bbdce15a50a5ede6677897cfacc736
3
+ metadata.gz: 79c8b7838ef53335c9d0fbd21ffdf6815473ee560380a3687e8fab514d031d53
4
+ data.tar.gz: 2a91f7c7640fc5f2d304c2cbf240886d8e8642994861a9c092f1d4db2ae6b77a
5
5
  SHA512:
6
- metadata.gz: cd29fd59a5bdad5344d19a86c39680d9d22e961c7478d702643e9b2340a0c0c8d62b61ff7fb44b404096a079267e0126c9fd92797306062a1c66711e29af1a24
7
- data.tar.gz: 752b4824e5104a29de6b8582b51f39fe72dc40ddd222a58b61612b0b4cc9e5fc0311b31d2523f7e516b36991f49843b682309178934b6306d6ba094856e9d50c
6
+ metadata.gz: 3559a8c7af9c0087ab7a54862c9913e40a3703ffa23f62e6919eec50042523424c2aa4c99b3de9d28d03fc0edd14af37e0dcd0eab7bf822b9af73113be468b59
7
+ data.tar.gz: 31ed468565bd41fe2d0bd7b82d53d64e213a15e1ade2108ddf813637c228c18f6f7b456725c7e359a08754188ee19c90d06e013be90775ee6a64723b04fa25f0
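Reviewer note: the digests recorded above can be checked locally against the archives inside the `.gem` file. A minimal sketch, assuming Ruby's standard `digest` library; `sha256_matches?` is a hypothetical helper, and a temporary file stands in for `data.tar.gz` so the sketch runs anywhere.

```ruby
require 'digest'
require 'tempfile'

# Hypothetical helper: compares a file's SHA256 hex digest with an expected value.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex
end

# Stand-in for data.tar.gz; a real check would point at the unpacked archive.
archive = Tempfile.new('data.tar.gz')
archive.write('example bytes')
archive.flush

expected = Digest::SHA256.hexdigest('example bytes')

puts sha256_matches?(archive.path, expected)  # same content, so the digests agree
puts sha256_matches?(archive.path, '0' * 64)  # an arbitrary wrong digest
```

For a real check, the expected value would be the `data.tar.gz` entry under `SHA256` in `checksums.yaml`.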
data/CHANGELOG.md CHANGED
@@ -1,14 +1,45 @@
1
1
  ## [Unreleased]
2
2
 
3
+ ## [2.1.0] - 2021-02-22 - [Janosch Müller](mailto:janosch84@gmail.com)
4
+
5
+ ### Added
6
+
7
+ - common ancestor for all scanning/parsing/lexing errors
8
+ * `Regexp::Parser::Error` can now be rescued as a catch-all
9
+ * the following errors (and their many descendants) now inherit from it:
10
+ - `Regexp::Expression::Conditional::TooManyBranches`
11
+ - `Regexp::Parser::ParserError`
12
+ - `Regexp::Scanner::ScannerError`
13
+ - `Regexp::Scanner::ValidationError`
14
+ - `Regexp::Syntax::SyntaxError`
15
+ * it replaces `ArgumentError` in some rare cases (`Regexp::Parser.parse('?')`)
16
+ * thanks to [sandstrom](https://github.com/sandstrom) for the cue
17
+
18
+ ### Fixed
19
+
20
+ - fixed scanning of whole-pattern recursion calls `\g<0>` and `\g'0'`
21
+ * a regression in v2.0.1 had caused them to be scanned as literals
22
+ - fixed scanning of some backreference and subexpression call edge cases
23
+ * e.g. `\k<+1>`, `\g<x-1>`
24
+ - fixed tokenization of some escapes in character sets
25
+ * `.`, `|`, `{`, `}`, `(`, `)`, `^`, `$`, `?`, `+`, `*`
26
+ * all of these correctly emitted `#type` `:literal` and `#token` `:literal` if *not* escaped
27
+ * if escaped, they emitted e.g. `#type` `:escape` and `#token` `:group_open` for `[\(]`
28
+ * the escaped versions now correctly emit `#type` `:escape` and `#token` `:literal`
29
+ - fixed handling of control/metacontrol escapes in character sets
30
+ * e.g. `[\cX]`, `[\M-\C-X]`
31
+ * they were misread as bunch of individual literals, escapes, and ranges
32
+ - fixed some cases where calling `#dup`/`#clone` on expressions led to shared state
33
+
3
34
  ## [2.0.3] - 2020-12-28 - [Janosch Müller](mailto:janosch84@gmail.com)
4
35
 
5
36
  ### Fixed
6
37
 
7
38
  - fixed error when scanning some unlikely and redundant but valid charset patterns
8
- - e.g. `/[[.a-b.]]/`, `/[[=e=]]/`,
39
+ * e.g. `/[[.a-b.]]/`, `/[[=e=]]/`,
9
40
  - fixed ancestry of some error classes related to syntax version lookup
10
- - `NotImplementedError`, `InvalidVersionNameError`, `UnknownSyntaxNameError`
11
- - they now correctly inherit from `Regexp::Syntax::SyntaxError` instead of Rubys `::SyntaxError`
41
+ * `NotImplementedError`, `InvalidVersionNameError`, `UnknownSyntaxNameError`
42
+ * they now correctly inherit from `Regexp::Syntax::SyntaxError` instead of Rubys `::SyntaxError`
12
43
 
13
44
  ## [2.0.2] - 2020-12-25 - [Janosch Müller](mailto:janosch84@gmail.com)
14
45
 
data/Gemfile CHANGED
@@ -6,5 +6,9 @@ group :development, :test do
6
6
  gem 'ice_nine', '~> 0.11.2'
7
7
  gem 'rake', '~> 13.0'
8
8
  gem 'regexp_property_values', '~> 1.0'
9
- gem 'rspec', '~> 3.8'
9
+ gem 'rspec', '~> 3.10'
10
+ if RUBY_VERSION.to_f >= 2.7
11
+ gem 'gouteur'
12
+ gem 'rubocop', '~> 1.7'
13
+ end
10
14
  end
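Reviewer note: the new guard compares `RUBY_VERSION.to_f` with `2.7`. `String#to_f` only reads the leading "major.minor" part of the version string, which is enough for the Ruby versions released so far. The sketch below shows that behaviour next to a segment-wise comparison with `Gem::Version`. The helper `at_least?` is hypothetical and is an alternative, not what the Gemfile does.

```ruby
require 'rubygems' # provides Gem::Version (normally loaded already)

# String#to_f reads the leading "major.minor" part and ignores the patch level.
puts '2.7.2'.to_f   # 2.7
puts '3.0.0'.to_f   # 3.0

# Caveat: a two-digit minor version would be misread as a smaller float.
# No such Ruby release exists; the string is purely illustrative.
puts '2.10.0'.to_f  # 2.1

# Hypothetical helper: Gem::Version compares segment by segment instead.
def at_least?(version_string, minimum)
  Gem::Version.new(version_string) >= Gem::Version.new(minimum)
end

puts at_least?('2.10.0', '2.7')
puts at_least?('2.6.6', '2.7')
puts at_least?(RUBY_VERSION, '2.7')
```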
data/README.md CHANGED
@@ -1,6 +1,6 @@
1
1
  # Regexp::Parser
2
2
 
3
- [![Gem Version](https://badge.fury.io/rb/regexp_parser.svg)](http://badge.fury.io/rb/regexp_parser) [![Build Status](https://github.com/ammar/regexp_parser/workflows/tests/badge.svg)](https://github.com/ammar/regexp_parser/actions) [![Code Climate](https://codeclimate.com/github/ammar/regexp_parser.svg)](https://codeclimate.com/github/ammar/regexp_parser/badges)
3
+ [![Gem Version](https://badge.fury.io/rb/regexp_parser.svg)](http://badge.fury.io/rb/regexp_parser) [![Build Status](https://github.com/ammar/regexp_parser/workflows/tests/badge.svg)](https://github.com/ammar/regexp_parser/actions) [![Build Status](https://github.com/ammar/regexp_parser/workflows/gouteur/badge.svg)](https://github.com/ammar/regexp_parser/actions) [![Code Climate](https://codeclimate.com/github/ammar/regexp_parser.svg)](https://codeclimate.com/github/ammar/regexp_parser/badges)
4
4
 
5
5
  A Ruby gem for tokenizing, parsing, and transforming regular expressions.
6
6
 
data/Rakefile CHANGED
@@ -7,8 +7,8 @@ require 'bundler'
7
7
  require 'rubygems/package_task'
8
8
 
9
9
 
10
- RAGEL_SOURCE_DIR = File.expand_path '../lib/regexp_parser/scanner', __FILE__
11
- RAGEL_OUTPUT_DIR = File.expand_path '../lib/regexp_parser', __FILE__
10
+ RAGEL_SOURCE_DIR = File.join(__dir__, 'lib/regexp_parser/scanner')
11
+ RAGEL_OUTPUT_DIR = File.join(__dir__, 'lib/regexp_parser')
12
12
  RAGEL_SOURCE_FILES = %w{scanner} # scanner.rl includes property.rl
13
13
 
14
14
 
@@ -26,10 +26,10 @@ end
26
26
  namespace :ragel do
27
27
  desc "Process the ragel source files and output ruby code"
28
28
  task :rb do
29
- RAGEL_SOURCE_FILES.each do |file|
30
- output_file = "#{RAGEL_OUTPUT_DIR}/#{file}.rb"
29
+ RAGEL_SOURCE_FILES.each do |source_file|
30
+ output_file = "#{RAGEL_OUTPUT_DIR}/#{source_file}.rb"
31
31
  # using faster flat table driven FSM, about 25% larger code, but about 30% faster
32
- sh "ragel -F1 -R #{RAGEL_SOURCE_DIR}/#{file}.rl -o #{output_file}"
32
+ sh "ragel -F1 -R #{RAGEL_SOURCE_DIR}/#{source_file}.rl -o #{output_file}"
33
33
 
34
34
  contents = File.read(output_file)
35
35
 
@@ -61,7 +61,7 @@ namespace :props do
61
61
  task :update do
62
62
  require 'regexp_property_values'
63
63
  RegexpPropertyValues.update
64
- dir = File.expand_path('../lib/regexp_parser/scanner/properties', __FILE__)
64
+ dir = File.join(__dir__, 'lib/regexp_parser/scanner/properties')
65
65
 
66
66
  require 'psych'
67
67
  write_hash_to_file = ->(hash, path) do
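Reviewer note: the Rakefile swaps `File.expand_path('../…', __FILE__)` for `File.join(__dir__, '…')`. A minimal sketch of why the two forms point at the same directory, using a hypothetical fixed path in place of `__FILE__` and `__dir__` so the result does not depend on where the sketch is run. One difference to keep in mind: `__dir__` resolves symlinks, while `File.expand_path` does not.

```ruby
# Hypothetical location of the Rakefile (Unix-style path, for illustration only).
rakefile = '/project/Rakefile'
rake_dir = File.dirname(rakefile) # roughly what __dir__ returns inside that file

# Old style: step out of the file name with '..', then descend.
old_style = File.expand_path('../lib/regexp_parser/scanner', rakefile)

# New style: start from the directory and join the relative part.
new_style = File.join(rake_dir, 'lib/regexp_parser/scanner')

puts old_style
puts new_style
puts old_style == new_style
```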
data/lib/regexp_parser.rb CHANGED
@@ -1,6 +1,7 @@
1
1
  # encoding: utf-8
2
2
 
3
3
  require 'regexp_parser/version'
4
+ require 'regexp_parser/error'
4
5
  require 'regexp_parser/token'
5
6
  require 'regexp_parser/scanner'
6
7
  require 'regexp_parser/syntax'
@@ -0,0 +1,4 @@
1
+ class Regexp::Parser
2
+ # base class for all gem-specific errors (inherited but never raised itself)
3
+ class Error < StandardError; end
4
+ end
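Reviewer note: with a shared base class, callers can rescue every scanning, parsing, and lexing error in one clause. The sketch below uses a stand-in hierarchy so it runs without the gem. The module `Demo` and the helper `risky` are hypothetical; with the gem loaded, the class to rescue would be `Regexp::Parser::Error`.

```ruby
# Stand-in hierarchy mirroring the shape described in the changelog.
module Demo
  class Error < StandardError; end
  class ParserError < Error; end
  class ScannerError < Error; end
end

# Hypothetical helper that fails in different ways.
def risky(kind)
  raise Demo::ParserError, 'parser failed' if kind == :parser
  raise Demo::ScannerError, 'scanner failed' if kind == :scanner
  :ok
end

results = [:parser, :scanner, :none].map do |kind|
  begin
    risky(kind)
  rescue Demo::Error => e # one rescue clause covers every descendant
    "#{e.class}: #{e.message}"
  end
end

puts results
```

Code that needs finer handling can still rescue a specific subclass before the catch-all.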
@@ -21,7 +21,7 @@ module Regexp::Expression
21
21
  self.options = options
22
22
  end
23
23
 
24
- def initialize_clone(orig)
24
+ def initialize_copy(orig)
25
25
  self.text = (orig.text ? orig.text.dup : nil)
26
26
  self.options = (orig.options ? orig.options.dup : nil)
27
27
  self.quantifier = (orig.quantifier ? orig.quantifier.clone : nil)
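Reviewer note: the rename from `initialize_clone` to `initialize_copy` matters because Ruby calls `initialize_clone` only for `#clone`, while both `#dup` and `#clone` end up in `initialize_copy`. A minimal sketch with two hypothetical classes (not part of the gem) shows the shared state that `#dup` leaves behind when only `initialize_clone` is overridden.

```ruby
# Hypothetical class: deep-copies its text in initialize_clone only.
class CloneOnly
  attr_accessor :text

  def initialize(text)
    @text = text
  end

  # Runs for #clone, but not for #dup.
  def initialize_clone(orig)
    super
    self.text = orig.text.dup
  end
end

# Hypothetical class: deep-copies its text in initialize_copy.
class CopyHook
  attr_accessor :text

  def initialize(text)
    @text = text
  end

  # Runs for both #dup and #clone.
  def initialize_copy(orig)
    super
    self.text = orig.text.dup
  end
end

a = CloneOnly.new('abc')
puts a.dup.text.equal?(a.text)   # true: the dup still shares the string
puts a.clone.text.equal?(a.text) # false

b = CopyHook.new('abc')
puts b.dup.text.equal?(b.text)   # false
puts b.clone.text.equal?(b.text) # false
```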
@@ -2,6 +2,11 @@ module Regexp::Expression
2
2
  module Backreference
3
3
  class Base < Regexp::Expression::Base
4
4
  attr_accessor :referenced_expression
5
+
6
+ def initialize_copy(orig)
7
+ self.referenced_expression = orig.referenced_expression.dup
8
+ super
9
+ end
5
10
  end
6
11
 
7
12
  class Number < Backreference::Base
@@ -1,6 +1,6 @@
1
1
  module Regexp::Expression
2
2
  module Conditional
3
- class TooManyBranches < StandardError
3
+ class TooManyBranches < Regexp::Parser::Error
4
4
  def initialize
5
5
  super('The conditional expression has more than 2 branches')
6
6
  end
@@ -15,6 +15,11 @@ module Regexp::Expression
15
15
  ref = text.tr("'<>()", "")
16
16
  ref =~ /\D/ ? ref : Integer(ref)
17
17
  end
18
+
19
+ def initialize_copy(orig)
20
+ self.referenced_expression = orig.referenced_expression.dup
21
+ super
22
+ end
18
23
  end
19
24
 
20
25
  class Branch < Regexp::Expression::Sequence; end
@@ -53,6 +58,11 @@ module Regexp::Expression
53
58
  def to_s(format = :full)
54
59
  "#{text}#{condition}#{branches.join('|')})#{quantifier_affix(format)}"
55
60
  end
61
+
62
+ def initialize_copy(orig)
63
+ self.referenced_expression = orig.referenced_expression.dup
64
+ super
65
+ end
56
66
  end
57
67
  end
58
68
  end
@@ -2,7 +2,7 @@ module Regexp::Expression
2
2
 
3
3
  class FreeSpace < Regexp::Expression::Base
4
4
  def quantify(_token, _text, _min = nil, _max = nil, _mode = :greedy)
5
- raise "Can not quantify a free space object"
5
+ raise Regexp::Parser::Error, 'Can not quantify a free space object'
6
6
  end
7
7
  end
8
8
 
@@ -35,6 +35,11 @@ module Regexp::Expression
35
35
  class Atomic < Group::Base; end
36
36
  class Options < Group::Base
37
37
  attr_accessor :option_changes
38
+
39
+ def initialize_copy(orig)
40
+ self.option_changes = orig.option_changes.dup
41
+ super
42
+ end
38
43
  end
39
44
 
40
45
  class Capture < Group::Base
@@ -53,7 +58,7 @@ module Regexp::Expression
53
58
  super
54
59
  end
55
60
 
56
- def initialize_clone(orig)
61
+ def initialize_copy(orig)
57
62
  @name = orig.name.dup
58
63
  super
59
64
  end
@@ -7,7 +7,7 @@ module Regexp::Expression
7
7
  end
8
8
 
9
9
  def name
10
- text =~ /\A\\[pP]\{([^}]+)\}\z/; $1
10
+ text[/\A\\[pP]\{([^}]+)\}\z/, 1]
11
11
  end
12
12
 
13
13
  def shortcut
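Reviewer note: `name` now uses `String#[]` with a regexp and a capture index instead of matching and then reading `$1`. A small sketch with the same pattern; the sample strings are illustrative.

```ruby
text = '\p{Alpha}'
pattern = /\A\\[pP]\{([^}]+)\}\z/

# Old style: match first, then read the capture from the global $1.
text =~ pattern
puts $1

# New style: String#[] returns the requested capture group directly,
# or nil when the pattern does not match.
puts text[pattern, 1]
puts 'abc'[pattern, 1].inspect
```

The single expression keeps the method's return value explicit instead of depending on a global that the previous statement happened to set.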
@@ -7,7 +7,8 @@ module Regexp::Expression
7
7
  alias :ts :starts_at
8
8
 
9
9
  def <<(exp)
10
- complete? && raise("Can't add more than 2 expressions to a Range")
10
+ complete? and raise Regexp::Parser::Error,
11
+ "Can't add more than 2 expressions to a Range"
11
12
  super
12
13
  end
13
14
 
@@ -12,7 +12,7 @@ module Regexp::Expression
12
12
  @max = max
13
13
  end
14
14
 
15
- def initialize_clone(orig)
15
+ def initialize_copy(orig)
16
16
  @text = orig.text.dup
17
17
  super
18
18
  end
@@ -41,17 +41,11 @@ module Regexp::Expression
41
41
  alias :ts :starts_at
42
42
 
43
43
  def quantify(token, text, min = nil, max = nil, mode = :greedy)
44
- offset = -1
45
- target = expressions[offset]
46
- while target.is_a?(FreeSpace)
47
- target = expressions[offset -= 1]
48
- end
49
-
50
- target || raise(ArgumentError, "No valid target found for '#{text}' "\
51
- 'quantifier')
44
+ target = expressions.reverse.find { |exp| !exp.is_a?(FreeSpace) }
45
+ target or raise Regexp::Parser::Error,
46
+ "No valid target found for '#{text}' quantifier"
52
47
 
53
48
  target.quantify(token, text, min, max, mode)
54
49
  end
55
50
  end
56
-
57
51
  end
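Reviewer note: the manual index loop is replaced by `reverse.find`, which returns the last expression that is not free space, or `nil` if there is none. A minimal sketch with hypothetical stand-in classes (`FreeSpace`, `Literal`, and `SketchError` here are not the gem's own) and a hypothetical helper `quantifier_target`.

```ruby
# Hypothetical stand-ins; the gem's own classes are not needed for the sketch.
FreeSpace = Class.new
Literal = Struct.new(:text)
SketchError = Class.new(StandardError)

def quantifier_target(expressions)
  # Walk backwards and take the first expression that is not free space.
  target = expressions.reverse.find { |exp| !exp.is_a?(FreeSpace) }
  # `or` binds loosely, so `raise` can take its arguments without parentheses.
  target or raise SketchError, 'No valid target found'
  target
end

expressions = [Literal.new('a'), FreeSpace.new, FreeSpace.new]
puts quantifier_target(expressions).text

begin
  quantifier_target([FreeSpace.new])
rescue SketchError => e
  puts e.message
end
```

The switch from `||` to `or` goes hand in hand with dropping the parentheses: `target || raise SketchError, '…'` would not parse.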
@@ -12,7 +12,7 @@ module Regexp::Expression
12
12
  end
13
13
 
14
14
  # Override base method to clone the expressions as well.
15
- def initialize_clone(orig)
15
+ def initialize_copy(orig)
16
16
  self.expressions = orig.expressions.map(&:clone)
17
17
  super
18
18
  end
@@ -2,9 +2,8 @@ require 'regexp_parser/expression'
2
2
 
3
3
  class Regexp::Parser
4
4
  include Regexp::Expression
5
- include Regexp::Syntax
6
5
 
7
- class ParserError < StandardError; end
6
+ class ParserError < Regexp::Parser::Error; end
8
7
 
9
8
  class UnknownTokenTypeError < ParserError
10
9
  def initialize(type, token)
@@ -70,93 +69,155 @@ class Regexp::Parser
70
69
  enabled_options
71
70
  end
72
71
 
73
- def nest(exp)
74
- nesting.push(exp)
75
- node << exp
76
- update_transplanted_subtree(exp, node)
77
- self.node = exp
78
- end
72
+ def parse_token(token)
73
+ case token.type
74
+ when :anchor; anchor(token)
75
+ when :assertion, :group; group(token)
76
+ when :backref; backref(token)
77
+ when :conditional; conditional(token)
78
+ when :escape; escape(token)
79
+ when :free_space; free_space(token)
80
+ when :keep; keep(token)
81
+ when :literal; literal(token)
82
+ when :meta; meta(token)
83
+ when :posixclass, :nonposixclass; posixclass(token)
84
+ when :property, :nonproperty; property(token)
85
+ when :quantifier; quantifier(token)
86
+ when :set; set(token)
87
+ when :type; type(token)
88
+ else
89
+ raise UnknownTokenTypeError.new(token.type, token)
90
+ end
79
91
 
80
- # subtrees are transplanted to build Alternations, Intersections, Ranges
81
- def update_transplanted_subtree(exp, new_parent)
82
- exp.nesting_level = new_parent.nesting_level + 1
83
- exp.respond_to?(:each) &&
84
- exp.each { |subexp| update_transplanted_subtree(subexp, exp) }
92
+ close_completed_character_set_range
85
93
  end
86
94
 
87
- def decrease_nesting
88
- while nesting.last.is_a?(SequenceOperation)
89
- nesting.pop
90
- self.node = nesting.last
95
+ def anchor(token)
96
+ case token.token
97
+ when :bol; node << Anchor::BeginningOfLine.new(token, active_opts)
98
+ when :bos; node << Anchor::BOS.new(token, active_opts)
99
+ when :eol; node << Anchor::EndOfLine.new(token, active_opts)
100
+ when :eos; node << Anchor::EOS.new(token, active_opts)
101
+ when :eos_ob_eol; node << Anchor::EOSobEOL.new(token, active_opts)
102
+ when :match_start; node << Anchor::MatchStart.new(token, active_opts)
103
+ when :nonword_boundary; node << Anchor::NonWordBoundary.new(token, active_opts)
104
+ when :word_boundary; node << Anchor::WordBoundary.new(token, active_opts)
105
+ else
106
+ raise UnknownTokenError.new('Anchor', token)
91
107
  end
92
- nesting.pop
93
- yield(node) if block_given?
94
- self.node = nesting.last
95
- self.node = node.last if node.last.is_a?(SequenceOperation)
96
108
  end
97
109
 
98
- def nest_conditional(exp)
99
- conditional_nesting.push(exp)
100
- nest(exp)
110
+ def group(token)
111
+ case token.token
112
+ when :options, :options_switch
113
+ options_group(token)
114
+ when :close
115
+ close_group
116
+ when :comment
117
+ node << Group::Comment.new(token, active_opts)
118
+ else
119
+ open_group(token)
120
+ end
101
121
  end
102
122
 
103
- def parse_token(token)
104
- close_completed_character_set_range
123
+ MOD_FLAGS = %w[i m x].map(&:to_sym)
124
+ ENC_FLAGS = %w[a d u].map(&:to_sym)
105
125
 
106
- case token.type
107
- when :meta; meta(token)
108
- when :quantifier; quantifier(token)
109
- when :anchor; anchor(token)
110
- when :escape; escape(token)
111
- when :group; group(token)
112
- when :assertion; group(token)
113
- when :set; set(token)
114
- when :type; type(token)
115
- when :backref; backref(token)
116
- when :conditional; conditional(token)
117
- when :keep; keep(token)
118
-
119
- when :posixclass, :nonposixclass
120
- posixclass(token)
121
- when :property, :nonproperty
122
- property(token)
123
-
124
- when :literal
125
- node << Literal.new(token, active_opts)
126
- when :free_space
127
- free_space(token)
126
+ def options_group(token)
127
+ positive, negative = token.text.split('-', 2)
128
+ negative ||= ''
129
+ self.switching_options = token.token.equal?(:options_switch)
128
130
 
129
- else
130
- raise UnknownTokenTypeError.new(token.type, token)
131
+ opt_changes = {}
132
+ new_active_opts = active_opts.dup
133
+
134
+ MOD_FLAGS.each do |flag|
135
+ if positive.include?(flag.to_s)
136
+ opt_changes[flag] = new_active_opts[flag] = true
137
+ end
138
+ if negative.include?(flag.to_s)
139
+ opt_changes[flag] = false
140
+ new_active_opts.delete(flag)
141
+ end
142
+ end
143
+
144
+ if (enc_flag = positive.reverse[/[adu]/])
145
+ enc_flag = enc_flag.to_sym
146
+ (ENC_FLAGS - [enc_flag]).each do |other|
147
+ opt_changes[other] = false if new_active_opts[other]
148
+ new_active_opts.delete(other)
149
+ end
150
+ opt_changes[enc_flag] = new_active_opts[enc_flag] = true
131
151
  end
152
+
153
+ options_stack << new_active_opts
154
+
155
+ options_group = Group::Options.new(token, active_opts)
156
+ options_group.option_changes = opt_changes
157
+
158
+ nest(options_group)
132
159
  end
133
160
 
134
- def set(token)
135
- case token.token
136
- when :open
137
- open_set(token)
138
- when :close
139
- close_set
140
- when :negate
141
- negate_set
142
- when :range
143
- range(token)
144
- when :intersection
145
- intersection(token)
146
- else
147
- raise UnknownTokenError.new('CharacterSet', token)
161
+ def open_group(token)
162
+ group_class =
163
+ case token.token
164
+ when :absence; Group::Absence
165
+ when :atomic; Group::Atomic
166
+ when :capture; Group::Capture
167
+ when :named; Group::Named
168
+ when :passive; Group::Passive
169
+
170
+ when :lookahead; Assertion::Lookahead
171
+ when :lookbehind; Assertion::Lookbehind
172
+ when :nlookahead; Assertion::NegativeLookahead
173
+ when :nlookbehind; Assertion::NegativeLookbehind
174
+
175
+ else
176
+ raise UnknownTokenError.new('Group type open', token)
177
+ end
178
+
179
+ group = group_class.new(token, active_opts)
180
+
181
+ if group.capturing?
182
+ group.number = total_captured_group_count + 1
183
+ group.number_at_level = captured_group_count_at_level + 1
184
+ count_captured_group
148
185
  end
186
+
187
+ # Push the active options to the stack again. This way we can simply pop the
188
+ # stack for any group we close, no matter if it had its own options or not.
189
+ options_stack << active_opts
190
+
191
+ nest(group)
149
192
  end
150
193
 
151
- def meta(token)
152
- case token.token
153
- when :dot
154
- node << CharacterType::Any.new(token, active_opts)
155
- when :alternation
156
- sequence_operation(Alternation, token)
157
- else
158
- raise UnknownTokenError.new('Meta', token)
194
+ def total_captured_group_count
195
+ captured_group_counts.values.reduce(0, :+)
196
+ end
197
+
198
+ def captured_group_count_at_level
199
+ captured_group_counts[node.level]
200
+ end
201
+
202
+ def count_captured_group
203
+ captured_group_counts[node.level] += 1
204
+ end
205
+
206
+ def close_group
207
+ options_stack.pop unless switching_options
208
+ self.switching_options = false
209
+ decrease_nesting
210
+ end
211
+
212
+ def decrease_nesting
213
+ while nesting.last.is_a?(SequenceOperation)
214
+ nesting.pop
215
+ self.node = nesting.last
159
216
  end
217
+ nesting.pop
218
+ yield(node) if block_given?
219
+ self.node = nesting.last
220
+ self.node = node.last if node.last.is_a?(SequenceOperation)
160
221
  end
161
222
 
162
223
  def backref(token)
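Reviewer note: `options_group` is moved unchanged in the hunk above, but its flag handling is dense. The sketch below is a hypothetical free-standing version of the same logic (`apply_option_flags` is not a method of the gem). It assumes the token text looks like `(?im-x:`, which is an assumption about the scanner's output rather than something this diff shows.

```ruby
MOD_FLAGS = %i[i m x].freeze
ENC_FLAGS = %i[a d u].freeze

# Hypothetical free-standing version of the flag handling in options_group.
# Returns [option_changes, new_active_options].
def apply_option_flags(text, active_opts)
  positive, negative = text.split('-', 2)
  negative ||= ''
  changes = {}
  new_opts = active_opts.dup

  MOD_FLAGS.each do |flag|
    changes[flag] = new_opts[flag] = true if positive.include?(flag.to_s)
    if negative.include?(flag.to_s)
      changes[flag] = false
      new_opts.delete(flag)
    end
  end

  # Only the last encoding flag counts; it switches the other two off.
  if (enc_flag = positive.reverse[/[adu]/])
    enc_flag = enc_flag.to_sym
    (ENC_FLAGS - [enc_flag]).each do |other|
      changes[other] = false if new_opts[other]
      new_opts.delete(other)
    end
    changes[enc_flag] = new_opts[enc_flag] = true
  end

  [changes, new_opts]
end

p apply_option_flags('(?im-x:', { x: true })
p apply_option_flags('(?u:', { a: true })
```

The first call reports `i` and `m` switched on and `x` switched off; the second replaces the active `a` encoding flag with `u`.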
@@ -186,31 +247,9 @@ class Regexp::Parser
186
247
  end
187
248
  end
188
249
 
189
- def type(token)
190
- case token.token
191
- when :digit
192
- node << CharacterType::Digit.new(token, active_opts)
193
- when :nondigit
194
- node << CharacterType::NonDigit.new(token, active_opts)
195
- when :hex
196
- node << CharacterType::Hex.new(token, active_opts)
197
- when :nonhex
198
- node << CharacterType::NonHex.new(token, active_opts)
199
- when :space
200
- node << CharacterType::Space.new(token, active_opts)
201
- when :nonspace
202
- node << CharacterType::NonSpace.new(token, active_opts)
203
- when :word
204
- node << CharacterType::Word.new(token, active_opts)
205
- when :nonword
206
- node << CharacterType::NonWord.new(token, active_opts)
207
- when :linebreak
208
- node << CharacterType::Linebreak.new(token, active_opts)
209
- when :xgrapheme
210
- node << CharacterType::ExtendedGrapheme.new(token, active_opts)
211
- else
212
- raise UnknownTokenError.new('CharacterType', token)
213
- end
250
+ def assign_effective_number(exp)
251
+ exp.effective_number =
252
+ exp.number + total_captured_group_count + (exp.number < 0 ? 1 : 0)
214
253
  end
215
254
 
216
255
  def conditional(token)
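Reviewer note: `assign_effective_number` turns a relative number into an absolute group number. It is presumably applied to relative references such as `\k<-1>`, though the call sites are outside this diff. A worked sketch of the arithmetic, as a hypothetical free-standing function:

```ruby
# Hypothetical free-standing version of the arithmetic in assign_effective_number.
# number: the relative number, negative for "groups back", positive for "groups ahead"
# total_captured_group_count: capture groups opened so far
def effective_number(number, total_captured_group_count)
  number + total_captured_group_count + (number < 0 ? 1 : 0)
end

puts effective_number(-1, 3) # 3: the most recently opened of three groups
puts effective_number(-3, 3) # 1: three groups back
puts effective_number(1, 2)  # 3: the next group after the two opened so far
```

The extra `+ 1` for negative numbers is what makes `-1` refer to the latest group instead of the one before it.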
@@ -238,11 +277,118 @@ class Regexp::Parser
238
277
  end
239
278
  end
240
279
 
280
+ def nest_conditional(exp)
281
+ conditional_nesting.push(exp)
282
+ nest(exp)
283
+ end
284
+
285
+ def nest(exp)
286
+ nesting.push(exp)
287
+ node << exp
288
+ update_transplanted_subtree(exp, node)
289
+ self.node = exp
290
+ end
291
+
292
+ # subtrees are transplanted to build Alternations, Intersections, Ranges
293
+ def update_transplanted_subtree(exp, new_parent)
294
+ exp.nesting_level = new_parent.nesting_level + 1
295
+ exp.respond_to?(:each) &&
296
+ exp.each { |subexp| update_transplanted_subtree(subexp, exp) }
297
+ end
298
+
299
+ def escape(token)
300
+ case token.token
301
+
302
+ when :backspace; node << EscapeSequence::Backspace.new(token, active_opts)
303
+
304
+ when :escape; node << EscapeSequence::AsciiEscape.new(token, active_opts)
305
+ when :bell; node << EscapeSequence::Bell.new(token, active_opts)
306
+ when :form_feed; node << EscapeSequence::FormFeed.new(token, active_opts)
307
+ when :newline; node << EscapeSequence::Newline.new(token, active_opts)
308
+ when :carriage; node << EscapeSequence::Return.new(token, active_opts)
309
+ when :tab; node << EscapeSequence::Tab.new(token, active_opts)
310
+ when :vertical_tab; node << EscapeSequence::VerticalTab.new(token, active_opts)
311
+
312
+ when :codepoint; node << EscapeSequence::Codepoint.new(token, active_opts)
313
+ when :codepoint_list; node << EscapeSequence::CodepointList.new(token, active_opts)
314
+ when :hex; node << EscapeSequence::Hex.new(token, active_opts)
315
+ when :octal; node << EscapeSequence::Octal.new(token, active_opts)
316
+
317
+ when :control
318
+ if token.text =~ /\A(?:\\C-\\M|\\c\\M)/
319
+ node << EscapeSequence::MetaControl.new(token, active_opts)
320
+ else
321
+ node << EscapeSequence::Control.new(token, active_opts)
322
+ end
323
+
324
+ when :meta_sequence
325
+ if token.text =~ /\A\\M-\\[Cc]/
326
+ node << EscapeSequence::MetaControl.new(token, active_opts)
327
+ else
328
+ node << EscapeSequence::Meta.new(token, active_opts)
329
+ end
330
+
331
+ else
332
+ # treating everything else as a literal
333
+ # TODO: maybe split this up a bit more in v3.0.0?
334
+ # E.g. escaped quantifiers or set meta chars are not the same
335
+ # as stuff that would be a literal even without the backslash.
336
+ # Right now, they all end up here.
337
+ node << EscapeSequence::Literal.new(token, active_opts)
338
+ end
339
+ end
340
+
341
+ def free_space(token)
342
+ case token.token
343
+ when :comment
344
+ node << Comment.new(token, active_opts)
345
+ when :whitespace
346
+ if node.last.is_a?(WhiteSpace)
347
+ node.last.merge(WhiteSpace.new(token, active_opts))
348
+ else
349
+ node << WhiteSpace.new(token, active_opts)
350
+ end
351
+ else
352
+ raise UnknownTokenError.new('FreeSpace', token)
353
+ end
354
+ end
355
+
356
+ def keep(token)
357
+ node << Keep::Mark.new(token, active_opts)
358
+ end
359
+
360
+ def literal(token)
361
+ node << Literal.new(token, active_opts)
362
+ end
363
+
364
+ def meta(token)
365
+ case token.token
366
+ when :dot
367
+ node << CharacterType::Any.new(token, active_opts)
368
+ when :alternation
369
+ sequence_operation(Alternation, token)
370
+ else
371
+ raise UnknownTokenError.new('Meta', token)
372
+ end
373
+ end
374
+
375
+ def sequence_operation(klass, token)
376
+ unless node.is_a?(klass)
377
+ operator = klass.new(token, active_opts)
378
+ sequence = operator.add_sequence(active_opts)
379
+ sequence.expressions = node.expressions
380
+ node.expressions = []
381
+ nest(operator)
382
+ end
383
+ node.add_sequence(active_opts)
384
+ end
385
+
241
386
  def posixclass(token)
242
387
  node << PosixClass.new(token, active_opts)
243
388
  end
244
389
 
245
390
  include Regexp::Expression::UnicodeProperty
391
+ UPTokens = Regexp::Syntax::Token::UnicodeProperty
246
392
 
247
393
  def property(token)
248
394
  case token.token
@@ -314,127 +460,20 @@ class Regexp::Parser
314
460
  when :private_use; node << Codepoint::PrivateUse.new(token, active_opts)
315
461
  when :unassigned; node << Codepoint::Unassigned.new(token, active_opts)
316
462
 
317
- when *Token::UnicodeProperty::Age
318
- node << Age.new(token, active_opts)
319
-
320
- when *Token::UnicodeProperty::Derived
321
- node << Derived.new(token, active_opts)
322
-
323
- when *Token::UnicodeProperty::Emoji
324
- node << Emoji.new(token, active_opts)
325
-
326
- when *Token::UnicodeProperty::Script
327
- node << Script.new(token, active_opts)
328
-
329
- when *Token::UnicodeProperty::UnicodeBlock
330
- node << Block.new(token, active_opts)
463
+ when *UPTokens::Age; node << Age.new(token, active_opts)
464
+ when *UPTokens::Derived; node << Derived.new(token, active_opts)
465
+ when *UPTokens::Emoji; node << Emoji.new(token, active_opts)
466
+ when *UPTokens::Script; node << Script.new(token, active_opts)
467
+ when *UPTokens::UnicodeBlock; node << Block.new(token, active_opts)
331
468
 
332
469
  else
333
470
  raise UnknownTokenError.new('UnicodeProperty', token)
334
471
  end
335
472
  end
336
473
 
337
- def anchor(token)
338
- case token.token
339
- when :bol
340
- node << Anchor::BeginningOfLine.new(token, active_opts)
341
- when :eol
342
- node << Anchor::EndOfLine.new(token, active_opts)
343
- when :bos
344
- node << Anchor::BOS.new(token, active_opts)
345
- when :eos
346
- node << Anchor::EOS.new(token, active_opts)
347
- when :eos_ob_eol
348
- node << Anchor::EOSobEOL.new(token, active_opts)
349
- when :word_boundary
350
- node << Anchor::WordBoundary.new(token, active_opts)
351
- when :nonword_boundary
352
- node << Anchor::NonWordBoundary.new(token, active_opts)
353
- when :match_start
354
- node << Anchor::MatchStart.new(token, active_opts)
355
- else
356
- raise UnknownTokenError.new('Anchor', token)
357
- end
358
- end
359
-
360
- def escape(token)
361
- case token.token
362
-
363
- when :backspace
364
- node << EscapeSequence::Backspace.new(token, active_opts)
365
-
366
- when :escape
367
- node << EscapeSequence::AsciiEscape.new(token, active_opts)
368
- when :bell
369
- node << EscapeSequence::Bell.new(token, active_opts)
370
- when :form_feed
371
- node << EscapeSequence::FormFeed.new(token, active_opts)
372
- when :newline
373
- node << EscapeSequence::Newline.new(token, active_opts)
374
- when :carriage
375
- node << EscapeSequence::Return.new(token, active_opts)
376
- when :tab
377
- node << EscapeSequence::Tab.new(token, active_opts)
378
- when :vertical_tab
379
- node << EscapeSequence::VerticalTab.new(token, active_opts)
380
-
381
- when :hex
382
- node << EscapeSequence::Hex.new(token, active_opts)
383
- when :octal
384
- node << EscapeSequence::Octal.new(token, active_opts)
385
- when :codepoint
386
- node << EscapeSequence::Codepoint.new(token, active_opts)
387
- when :codepoint_list
388
- node << EscapeSequence::CodepointList.new(token, active_opts)
389
-
390
- when :control
391
- if token.text =~ /\A(?:\\C-\\M|\\c\\M)/
392
- node << EscapeSequence::MetaControl.new(token, active_opts)
393
- else
394
- node << EscapeSequence::Control.new(token, active_opts)
395
- end
396
-
397
- when :meta_sequence
398
- if token.text =~ /\A\\M-\\[Cc]/
399
- node << EscapeSequence::MetaControl.new(token, active_opts)
400
- else
401
- node << EscapeSequence::Meta.new(token, active_opts)
402
- end
403
-
404
- else
405
- # treating everything else as a literal
406
- node << EscapeSequence::Literal.new(token, active_opts)
407
- end
408
- end
409
-
410
- def keep(token)
411
- node << Keep::Mark.new(token, active_opts)
412
- end
413
-
414
- def free_space(token)
415
- case token.token
416
- when :comment
417
- node << Comment.new(token, active_opts)
418
- when :whitespace
419
- if node.last.is_a?(WhiteSpace)
420
- node.last.merge(WhiteSpace.new(token, active_opts))
421
- else
422
- node << WhiteSpace.new(token, active_opts)
423
- end
424
- else
425
- raise UnknownTokenError.new('FreeSpace', token)
426
- end
427
- end
428
-
429
474
  def quantifier(token)
430
- offset = -1
431
- target_node = node.expressions[offset]
432
- while target_node.is_a?(FreeSpace)
433
- target_node = node.expressions[offset -= 1]
434
- end
435
-
436
- target_node || raise(ArgumentError, 'No valid target found for '\
437
- "'#{token.text}' ")
475
+ target_node = node.expressions.reverse.find { |exp| !exp.is_a?(FreeSpace) }
476
+ target_node or raise ParserError, "No valid target found for '#{token.text}'"
438
477
 
439
478
  # in case of chained quantifiers, wrap target in an implicit passive group
440
479
  # description of the problem: https://github.com/ammar/regexp_parser/issues/3
@@ -454,7 +493,7 @@ class Regexp::Parser
454
493
  new_group.implicit = true
455
494
  new_group << target_node
456
495
  increase_level(target_node)
457
- node.expressions[offset] = new_group
496
+ node.expressions[node.expressions.index(target_node)] = new_group
458
497
  target_node = new_group
459
498
  end
460
499
 
@@ -515,100 +554,16 @@ class Regexp::Parser
515
554
  target_node.quantify(:interval, text, min.to_i, max.to_i, mode)
516
555
  end
517
556
 
518
- def group(token)
519
- case token.token
520
- when :options, :options_switch
521
- options_group(token)
522
- when :close
523
- close_group
524
- when :comment
525
- node << Group::Comment.new(token, active_opts)
526
- else
527
- open_group(token)
528
- end
529
- end
530
-
531
- MOD_FLAGS = %w[i m x].map(&:to_sym)
532
- ENC_FLAGS = %w[a d u].map(&:to_sym)
533
-
534
- def options_group(token)
535
- positive, negative = token.text.split('-', 2)
536
- negative ||= ''
537
- self.switching_options = token.token.equal?(:options_switch)
538
-
539
- opt_changes = {}
540
- new_active_opts = active_opts.dup
541
-
542
- MOD_FLAGS.each do |flag|
543
- if positive.include?(flag.to_s)
544
- opt_changes[flag] = new_active_opts[flag] = true
545
- end
546
- if negative.include?(flag.to_s)
547
- opt_changes[flag] = false
548
- new_active_opts.delete(flag)
549
- end
550
- end
551
-
552
- if (enc_flag = positive.reverse[/[adu]/])
553
- enc_flag = enc_flag.to_sym
554
- (ENC_FLAGS - [enc_flag]).each do |other|
555
- opt_changes[other] = false if new_active_opts[other]
556
- new_active_opts.delete(other)
557
- end
558
- opt_changes[enc_flag] = new_active_opts[enc_flag] = true
559
- end
560
-
561
- options_stack << new_active_opts
562
-
563
- options_group = Group::Options.new(token, active_opts)
564
- options_group.option_changes = opt_changes
565
-
566
- nest(options_group)
567
- end
568
-
569
- def open_group(token)
557
+ def set(token)
570
558
  case token.token
571
- when :passive
572
- exp = Group::Passive.new(token, active_opts)
573
- when :atomic
574
- exp = Group::Atomic.new(token, active_opts)
575
- when :named
576
- exp = Group::Named.new(token, active_opts)
577
- when :capture
578
- exp = Group::Capture.new(token, active_opts)
579
- when :absence
580
- exp = Group::Absence.new(token, active_opts)
581
-
582
- when :lookahead
583
- exp = Assertion::Lookahead.new(token, active_opts)
584
- when :nlookahead
585
- exp = Assertion::NegativeLookahead.new(token, active_opts)
586
- when :lookbehind
587
- exp = Assertion::Lookbehind.new(token, active_opts)
588
- when :nlookbehind
589
- exp = Assertion::NegativeLookbehind.new(token, active_opts)
590
-
559
+ when :open; open_set(token)
560
+ when :close; close_set
561
+ when :negate; negate_set
562
+ when :range; range(token)
563
+ when :intersection; intersection(token)
591
564
  else
592
- raise UnknownTokenError.new('Group type open', token)
593
- end
594
-
595
- if exp.capturing?
596
- exp.number = total_captured_group_count + 1
597
- exp.number_at_level = captured_group_count_at_level + 1
598
- count_captured_group
565
+ raise UnknownTokenError.new('CharacterSet', token)
599
566
  end
600
-
601
- # Push the active options to the stack again. This way we can simply pop the
602
- # stack for any group we close, no matter if it had its own options or not.
603
- options_stack << active_opts
604
-
605
- nest(exp)
606
- end
607
-
608
- def close_group
609
- options_stack.pop unless switching_options
610
- self.switching_options = false
611
- decrease_nesting
612
567
  end
613
568
 
614
569
  def open_set(token)
@@ -631,51 +586,45 @@ class Regexp::Parser
631
586
  nest(exp)
632
587
  end
633
588
 
634
- def close_completed_character_set_range
635
- decrease_nesting if node.is_a?(CharacterSet::Range) && node.complete?
636
- end
637
-
638
589
  def intersection(token)
639
590
  sequence_operation(CharacterSet::Intersection, token)
640
591
  end
641
592
 
642
- def sequence_operation(klass, token)
643
- unless node.is_a?(klass)
644
- operator = klass.new(token, active_opts)
645
- sequence = operator.add_sequence(active_opts)
646
- sequence.expressions = node.expressions
647
- node.expressions = []
648
- nest(operator)
593
+ def type(token)
594
+ case token.token
595
+ when :digit; node << CharacterType::Digit.new(token, active_opts)
596
+ when :hex; node << CharacterType::Hex.new(token, active_opts)
597
+ when :linebreak; node << CharacterType::Linebreak.new(token, active_opts)
598
+ when :nondigit; node << CharacterType::NonDigit.new(token, active_opts)
599
+ when :nonhex; node << CharacterType::NonHex.new(token, active_opts)
600
+ when :nonspace; node << CharacterType::NonSpace.new(token, active_opts)
601
+ when :nonword; node << CharacterType::NonWord.new(token, active_opts)
602
+ when :space; node << CharacterType::Space.new(token, active_opts)
603
+ when :word; node << CharacterType::Word.new(token, active_opts)
604
+ when :xgrapheme; node << CharacterType::ExtendedGrapheme.new(token, active_opts)
605
+ else
606
+ raise UnknownTokenError.new('CharacterType', token)
649
607
  end
650
- node.add_sequence(active_opts)
651
- end
652
-
653
- def active_opts
654
- options_stack.last
655
- end
656
-
657
- def total_captured_group_count
658
- captured_group_counts.values.reduce(0, :+)
659
- end
660
-
661
- def captured_group_count_at_level
662
- captured_group_counts[node.level]
663
608
  end
664
609
 
665
- def count_captured_group
666
- captured_group_counts[node.level] += 1
610
+ def close_completed_character_set_range
611
+ decrease_nesting if node.is_a?(CharacterSet::Range) && node.complete?
667
612
  end
668
613
 
669
- def assign_effective_number(exp)
670
- exp.effective_number =
671
- exp.number + total_captured_group_count + (exp.number < 0 ? 1 : 0)
614
+ def active_opts
615
+ options_stack.last
672
616
  end
673
617
 
618
+ # Assigns referenced expressions to refering expressions, e.g. if there is
619
+ # an instance of Backreference::Number, its #referenced_expression is set to
620
+ # the instance of Group::Capture that it refers to via its number.
674
621
  def assign_referenced_expressions
675
622
  targets = {}
623
+ # find all referencable expressions
676
624
  root.each_expression do |exp|
677
625
  exp.is_a?(Group::Capture) && targets[exp.identifier] = exp
678
626
  end
627
+ # assign them to any refering expressions
679
628
  root.each_expression do |exp|
680
629
  exp.respond_to?(:reference) &&
681
630
  exp.referenced_expression = targets[exp.reference]