regexp_parser 1.7.0 → 2.0.0

Files changed (36)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +80 -1
  3. data/README.md +24 -12
  4. data/lib/regexp_parser/expression.rb +10 -19
  5. data/lib/regexp_parser/expression/classes/group.rb +17 -2
  6. data/lib/regexp_parser/expression/classes/root.rb +4 -16
  7. data/lib/regexp_parser/expression/quantifier.rb +9 -0
  8. data/lib/regexp_parser/expression/sequence.rb +0 -10
  9. data/lib/regexp_parser/lexer.rb +6 -6
  10. data/lib/regexp_parser/parser.rb +45 -12
  11. data/lib/regexp_parser/scanner.rb +1305 -1193
  12. data/lib/regexp_parser/scanner/char_type.rl +11 -11
  13. data/lib/regexp_parser/scanner/property.rl +2 -2
  14. data/lib/regexp_parser/scanner/scanner.rl +194 -171
  15. data/lib/regexp_parser/syntax/version_lookup.rb +2 -2
  16. data/lib/regexp_parser/version.rb +1 -1
  17. data/regexp_parser.gemspec +1 -1
  18. data/spec/expression/base_spec.rb +10 -0
  19. data/spec/expression/to_s_spec.rb +16 -0
  20. data/spec/lexer/delimiters_spec.rb +68 -0
  21. data/spec/lexer/literals_spec.rb +24 -49
  22. data/spec/parser/escapes_spec.rb +1 -1
  23. data/spec/parser/options_spec.rb +28 -0
  24. data/spec/parser/quantifiers_spec.rb +16 -0
  25. data/spec/parser/set/ranges_spec.rb +3 -3
  26. data/spec/scanner/delimiters_spec.rb +52 -0
  27. data/spec/scanner/errors_spec.rb +0 -1
  28. data/spec/scanner/escapes_spec.rb +10 -0
  29. data/spec/scanner/free_space_spec.rb +32 -0
  30. data/spec/scanner/literals_spec.rb +28 -38
  31. data/spec/scanner/options_spec.rb +36 -0
  32. data/spec/scanner/quantifiers_spec.rb +18 -13
  33. data/spec/scanner/sets_spec.rb +8 -2
  34. metadata +65 -61
  35. data/spec/expression/root_spec.rb +0 -9
  36. data/spec/expression/sequence_spec.rb +0 -9
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: d7b93dde993f6fe427ff43755738bf7de50f8613cf6e8097c9d791646d803e4c
-   data.tar.gz: 993a88720a4ee1d8a34f4c95e167089adc6455289bfeb356de8c028a9bbee63d
+   metadata.gz: dcf56dd42e703e7f1f846762c418e83792d46f2c7d9efffc1fb1612b4325e076
+   data.tar.gz: a6197d98af2325a93ed60a102f09bc98e2163dec0d2b57fbede82dbf0479ea8b
  SHA512:
-   metadata.gz: 0bf5c142591b2d5a65023c53f76a64a13106074050042d24614963cc14dabda197ea9140fccd93f26ad06885293369b076bb5e9198967a6e3762654df8033455
-   data.tar.gz: 1311b3dfa90633ef456edc12abf6ace2d7311c7be8450f3768a436f9c8491c3a87987f3d6ac24c6966b6f4de5363e0f6f874bfe9e1b038a6cf5d9c043553b58e
+   metadata.gz: 8b9db6543c87b63c49e24666e06bc012ea9a6e330711c9fbd35961c70d1222988bf2403927fdbe2f8797176d674b83c7c3f2d1215c5f92b0d6f00c6ab7fe37af
+   data.tar.gz: 6cf07796d7c6ab1520a63b0f3b65d2e513caccf0b81306321376473fc438c7573bcf867e74ea8465122e12a47fe2e7e09652473f83e26f1038e112c2d66b3d2c
data/CHANGELOG.md CHANGED
@@ -1,10 +1,89 @@
  ## [Unreleased]

+ ## [2.0.0] - 2020-11-25 - [Janosch Müller](mailto:janosch84@gmail.com)
+
+ ### Changed
+
+ - some methods that used to return byte-based indices now return char-based indices
+   * the returned values have only changed for Regexps that contain multibyte chars
+   * this is only a breaking change if you used such methods directly AND relied on them pointing to bytes
+   * affected methods:
+     * `Regexp::Token` `#length`, `#offset`, `#te`, `#ts`
+     * `Regexp::Expression::Base` `#full_length`, `#offset`, `#starts_at`, `#te`, `#ts`
+   * thanks to [Akinori MUSHA](https://github.com/knu) for the report
+ - removed some deprecated methods/signatures
+   * these are rarely used and have been showing deprecation warnings for a long time
+   * `Regexp::Expression::Subexpression.new` with 3 arguments
+   * `Regexp::Expression::Root.new` without a token argument
+   * `Regexp::Expression.parsed`
+
+ ### Added
+
+ - `Regexp::Expression::Base#base_length`
+   * returns the character count of an expression body, ignoring any quantifier
+ - pragmatic, experimental support for chained quantifiers
+   * e.g.: `/^a{10}{4,6}$/` matches exactly 40, 50 or 60 `a`s
+   * successive quantifiers used to be silently dropped by the parser
+   * they are now wrapped with passive groups as if they were written `(?:a{10}){4,6}`
+   * thanks to [calfeld](https://github.com/calfeld) for reporting this a while back
+
+ ### Fixed
+
+ - incorrect encoding output for non-ascii comments
+   * this led to a crash when calling `#to_s` on parse results containing such comments
+   * thanks to [Michael Glass](https://github.com/michaelglass) for the report
+ - some crashes when scanning contrived patterns such as `'\😋'`
+
+ ### [1.8.2] - 2020-10-11 - [Janosch Müller](mailto:janosch84@gmail.com)
+
+ ### Fixed
+
+ - fix `FrozenError` in `Expression::Base#repetitions` on Ruby 3.0
+   * thanks to [Thomas Walpole](https://github.com/twalpole)
+ - removed "unknown future version" warning on Ruby 3.0
+
+ ### [1.8.1] - 2020-09-28 - [Janosch Müller](mailto:janosch84@gmail.com)
+
+ ### Fixed
+
+ - fixed scanning of comment-like text in normal mode
+   * this was an old bug, but had become more prevalent in v1.8.0
+   * thanks to [Tietew](https://github.com/Tietew) for the report
+ - specified correct minimum Ruby version in gemspec
+   * it said 1.9 but really required 2.0 as of v1.8.0
+
+ ### [1.8.0] - 2020-09-20 - [Janosch Müller](mailto:janosch84@gmail.com)
+
+ ### Changed
+
+ - dropped support for running on Ruby 1.9.x
+
+ ### Added
+
+ - regexp flags can now be passed when parsing a `String` as regexp body
+   * see the [README](/README.md#usage) for details
+   * thanks to [Owen Stephens](https://github.com/owst)
+ - bare occurrences of `\g` and `\k` are now allowed and scanned as literal escapes
+   * matches Onigmo behavior
+   * thanks for the report to [Marc-André Lafortune](https://github.com/marcandre)
+
+ ### Fixed
+
+ - fixed parsing comments without preceding space or trailing newline in x-mode
+   * thanks to [Owen Stephens](https://github.com/owst)
+
+ ### [1.7.1] - 2020-06-07 - [Ammar Ali](mailto:ammarabuali@gmail.com)
+
+ ### Fixed
+
+ - Support for literals that include the unescaped delimiters `{`, `}`, and `]`. These
+   delimiters are informally supported by various regexp engines.
+
  ### [1.7.0] - 2020-02-23 - [Janosch Müller](mailto:janosch84@gmail.com)

  ### Added

- - `Expression#each_expression` and `1.#traverse` can now be called without a block
+ - `Expression#each_expression` and `#traverse` can now be called without a block
    * this returns an `Enumerator` and allows chaining, e.g. `each_expression.select`
    * thanks to [Masataka Kuwabara](https://github.com/pocke)
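Reviewer note on the chained-quantifier entry: the changelog says `/^a{10}{4,6}$/` matches exactly 40, 50 or 60 `a`s, i.e. it behaves like `(?:a{10}){4,6}`. The following is a minimal sketch that checks this reading against Ruby's built-in `Regexp` engine only; it does not exercise regexp_parser itself, and the helper name `matching_lengths` is made up for this illustration.

```ruby
# Compare a chained quantifier with its explicitly grouped equivalent,
# using only Ruby's built-in regexp engine (no regexp_parser involved).
CHAINED = /^a{10}{4,6}$/
GROUPED = /^(?:a{10}){4,6}$/

# Hypothetical helper: which lengths (0..70) of "aaa..." does the pattern accept?
def matching_lengths(pattern)
  (0..70).select { |n| pattern.match?('a' * n) }
end

chained_lengths = matching_lengths(CHAINED)
grouped_lengths = matching_lengths(GROUPED)

# If the changelog's description holds, both lists are identical.
puts chained_lengths == grouped_lengths
```

If the two lists agree, serializing the parsed tree back as `a{10}{4,6}` (rather than dropping the second quantifier, as the parser reportedly did before) preserves the meaning of the original pattern.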
 
data/README.md CHANGED
@@ -8,8 +8,8 @@ A Ruby gem for tokenizing, parsing, and transforming regular expressions.
  * A scanner/tokenizer based on [Ragel](http://www.colm.net/open-source/ragel/)
  * A lexer that produces a "stream" of token objects.
  * A parser that produces a "tree" of Expression objects (OO API)
- * Runs on Ruby 1.9, 2.x, and JRuby (1.9 mode) runtimes.
- * Recognizes Ruby 1.8, 1.9, and 2.x regular expressions [See Supported Syntax](#supported-syntax)
+ * Runs on Ruby 2.x, 3.x and JRuby runtimes
+ * Recognizes Ruby 1.8, 1.9, 2.x and 3.x regular expressions [See Supported Syntax](#supported-syntax)


  _For examples of regexp_parser in use, see [Example Projects](#example-projects)._
@@ -18,7 +18,7 @@ _For examples of regexp_parser in use, see [Example Projects](#example-projects)
  ---
  ## Requirements

- * Ruby >= 1.9
+ * Ruby >= 2.0
  * Ragel >= 6.0, but only if you want to build the gem or work on the scanner.
 
@@ -72,6 +72,17 @@ called with the results as follows:
  * **Parser**: after completion, the block gets passed the root expression.
  _The result of the block is returned._

+ All three methods accept either a `Regexp` or `String` (containing the pattern)
+ - if a String is passed, `options` can be supplied:
+
+ ```ruby
+ require 'regexp_parser'
+
+ Regexp::Parser.parse(
+   "a+ # Recognises a and A...",
+   options: ::Regexp::EXTENDED | ::Regexp::IGNORECASE
+ )
+ ```

  ---
  ## Components
@@ -136,11 +147,8 @@ Regexp::Scanner.scan( /(cat?([bhm]at)){3,5}/ ).map {|token| token[2]}
  to the lexer.

  * The MRI implementation may accept expressions that either conflict with
-   the documentation or are undocumented. The scanner does not support such
-   implementation quirks.
-   _(See issues [#3](https://github.com/ammar/regexp_parser/issues/3) and
-   [#15](https://github.com/ammar/regexp_parser/issues/15) for examples)_
-
+   the documentation or are undocumented, like `{}` and `]` _(unescaped)_.
+   The scanner will try to support as many of these cases as possible.

  ---
  ### Syntax
@@ -309,7 +317,7 @@ Expression class. See the next section for details._
  ## Supported Syntax
  The three modules support all the regular expression syntax features of Ruby 1.8,
- 1.9, and 2.x:
+ 1.9, 2.x and 3.x:

  _Note that not all of these are available in all versions of Ruby_
 
@@ -432,13 +440,17 @@ rake install
  ## Example Projects
  Projects using regexp_parser.

+ - [capybara](https://github.com/teamcapybara/capybara) is an integration testing tool that uses regexp_parser to convert Regexps to css/xpath selectors.
+
+ - [js_regex](https://github.com/janosch-x/js_regex) converts Ruby regular expressions to JavaScript-compatible regular expressions.
+
  - [meta_re](https://github.com/ammar/meta_re) is a regular expression preprocessor with alias support.

  - [mutant](https://github.com/mbj/mutant) (before v0.9.0) manipulates your regular expressions (amongst others) to see if your tests cover their behavior.

- - [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) uses regexp_parser to generate examples of postal codes.
+ - [rubocop](https://github.com/rubocop-hq/rubocop) is a linter for Ruby that uses regexp_parser to lint Regexps.

- - [js_regex](https://github.com/janosch-x/js_regex) converts Ruby regular expressions to JavaScript-compatible regular expressions.
+ - [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) is a localization helper that uses regexp_parser to generate examples of postal codes.


  ## References
@@ -467,4 +479,4 @@ Documentation and books used while working on this project.
  ---
  ##### Copyright
- _Copyright (c) 2010-2019 Ammar Ali. See LICENSE file for details._
+ _Copyright (c) 2010-2020 Ammar Ali. See LICENSE file for details._
data/lib/regexp_parser/expression.rb CHANGED
@@ -34,6 +34,10 @@ module Regexp::Expression
      alias :starts_at :ts

+     def base_length
+       to_s(:base).length
+     end
+
      def full_length
        to_s.length
      end
@@ -80,8 +84,12 @@ module Regexp::Expression
        return 1..1 unless quantified?
        min = quantifier.min
        max = quantifier.max < 0 ? Float::INFINITY : quantifier.max
-       # fix Range#minmax - https://bugs.ruby-lang.org/issues/15807
-       (min..max).tap { |r| r.define_singleton_method(:minmax) { [min, max] } }
+       range = min..max
+       # fix Range#minmax on old Rubies - https://bugs.ruby-lang.org/issues/15807
+       if RUBY_VERSION.to_f < 2.7
+         range.define_singleton_method(:minmax) { [min, max] }
+       end
+       range
      end

      def greedy?
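Reviewer note on the `#repetitions` hunk: the singleton `minmax` override is now applied only on Rubies older than 2.7. The sketch below reproduces that guard in isolation. `repetition_range` is a hypothetical stand-in for the patched method, and the connection to the `FrozenError` mentioned in the 1.8.2 changelog entry is an assumption based on Ruby 3.0 freezing `Range` objects.

```ruby
# Build a repetition range whose upper bound may be infinite, mirroring the
# shape of the patched method. `repetition_range` is a hypothetical helper.
def repetition_range(min, max)
  max = Float::INFINITY if max < 0 # a negative max stands for "unbounded"
  range = min..max
  # On old Rubies (see bug #15807) Range#minmax iterates the range, which
  # never finishes for an infinite upper bound, so answer directly there.
  # Newer Rubies need no override, and their Range objects may be frozen.
  if RUBY_VERSION.to_f < 2.7
    range.define_singleton_method(:minmax) { [min, max] }
  end
  range
end

bounded   = repetition_range(2, 5)  # comparable to a{2,5}
unbounded = repetition_range(1, -1) # comparable to a+

p bounded.minmax
p unbounded.minmax
```

Keeping the override behind a version check avoids mutating the range at all on current Rubies, which is what makes the method safe when ranges are frozen.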
@@ -114,23 +122,6 @@ module Regexp::Expression
      alias :to_h :attributes
    end

-   def self.parsed(exp)
-     warn('WARNING: Regexp::Expression::Base.parsed is buggy and '\
-          'will be removed in 2.0.0. Use Regexp::Parser.parse instead.')
-     case exp
-     when String
-       Regexp::Parser.parse(exp)
-     when Regexp
-       Regexp::Parser.parse(exp.source) # <- causes loss of root options
-     when Regexp::Expression # <- never triggers
-       exp
-     else
-       raise ArgumentError, 'Expression.parsed accepts a String, Regexp, or '\
-                            'a Regexp::Expression as a value for exp, but it '\
-                            "was given #{exp.class.name}."
-     end
-   end
-
  end # module Regexp::Expression

  require 'regexp_parser/expression/quantifier'
data/lib/regexp_parser/expression/classes/group.rb CHANGED
@@ -10,9 +10,24 @@ module Regexp::Expression
        def comment?; false end
      end

-     class Atomic < Group::Base; end
-     class Passive < Group::Base; end
+     class Passive < Group::Base
+       attr_writer :implicit
+
+       def to_s(format = :full)
+         if implicit?
+           "#{expressions.join}#{quantifier_affix(format)}"
+         else
+           super
+         end
+       end
+
+       def implicit?
+         @implicit ||= false
+       end
+     end
+
      class Absence < Group::Base; end
+     class Atomic < Group::Base; end
      class Options < Group::Base
        attr_accessor :option_changes
      end
data/lib/regexp_parser/expression/classes/root.rb CHANGED
@@ -1,24 +1,12 @@
  module Regexp::Expression

    class Root < Regexp::Expression::Subexpression
-     # TODO: this override is here for backwards compatibility, remove in 2.0.0
-     def initialize(*args)
-       unless args.first.is_a?(Regexp::Token)
-         warn('WARNING: Root.new without a Token argument is deprecated and '\
-              'will be removed in 2.0.0. Use Root.build for the old behavior.')
-         return super(self.class.build_token, *args)
-       end
-       super
+     def self.build(options = {})
+       new(build_token, options)
      end

-     class << self
-       def build(options = {})
-         new(build_token, options)
-       end
-
-       def build_token
-         Regexp::Token.new(:expression, :root, '', 0)
-       end
+     def self.build_token
+       Regexp::Token.new(:expression, :root, '', 0)
      end
    end
  end
data/lib/regexp_parser/expression/quantifier.rb CHANGED
@@ -40,5 +40,14 @@ module Regexp::Expression
        RUBY
      end
      alias :lazy? :reluctant?
+
+     def ==(other)
+       other.class == self.class &&
+         other.token == token &&
+         other.mode == mode &&
+         other.min == min &&
+         other.max == max
+     end
+     alias :eq :==
    end
  end
data/lib/regexp_parser/expression/sequence.rb CHANGED
@@ -7,16 +7,6 @@ module Regexp::Expression
    # Used as the base class for the Alternation alternatives, Conditional
    # branches, and CharacterSet::Intersection intersected sequences.
    class Sequence < Regexp::Expression::Subexpression
-     # TODO: this override is here for backwards compatibility, remove in 2.0.0
-     def initialize(*args)
-       if args.count == 3
-         warn('WARNING: Sequence.new without a Regexp::Token argument is '\
-              'deprecated and will be removed in 2.0.0.')
-         return self.class.at_levels(*args)
-       end
-       super
-     end
-
      class << self
        def add_to(subexpression, params = {}, active_opts = {})
          sequence = at_levels(
data/lib/regexp_parser/lexer.rb CHANGED
@@ -11,11 +11,11 @@ class Regexp::Lexer
    CLOSING_TOKENS = [:close].freeze

-   def self.lex(input, syntax = "ruby/#{RUBY_VERSION}", &block)
-     new.lex(input, syntax, &block)
+   def self.lex(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
+     new.lex(input, syntax, options: options, &block)
    end

-   def lex(input, syntax = "ruby/#{RUBY_VERSION}", &block)
+   def lex(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
      syntax = Regexp::Syntax.new(syntax)

      self.tokens = []
@@ -25,7 +25,7 @@ class Regexp::Lexer
      self.shift = 0

      last = nil
-     Regexp::Scanner.scan(input) do |type, token, text, ts, te|
+     Regexp::Scanner.scan(input, options: options) do |type, token, text, ts, te|
        type, token = *syntax.normalize(type, token)
        syntax.check! type, token
 
@@ -96,10 +96,10 @@ class Regexp::Lexer
          tokens.pop
          tokens << Regexp::Token.new(:literal, :literal, lead,
-                                     token.ts, (token.te - last.bytesize),
+                                     token.ts, (token.te - last.length),
                                      nesting, set_nesting, conditional_nesting)
          tokens << Regexp::Token.new(:literal, :literal, last,
-                                     (token.ts + lead.bytesize), token.te,
+                                     (token.ts + lead.length), token.te,
                                      nesting, set_nesting, conditional_nesting)
        end
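Reviewer note on the `bytesize` to `length` change: this hunk is one place where the switch from byte-based to char-based indices (see the 2.0.0 changelog entry) shows up in the code. The sketch below uses plain Ruby strings, not regexp_parser tokens, to show how the two kinds of offset diverge once a pattern contains a multibyte character. The variable names are illustrative only.

```ruby
# A pattern source containing a multibyte character ("ä", U+00E4, in UTF-8).
source = "\u00E4b+"

lead = source[0]     # the first character, "ä"
last = source[1..-1] # the remainder, "b+"

byte_offset_of_last = lead.bytesize # "ä" occupies 2 bytes in UTF-8
char_offset_of_last = lead.length   # but counts as 1 character

# A char-based offset works with ordinary String indexing ...
sliced_by_chars = source[char_offset_of_last..-1]
# ... while a byte-based offset needs byte-level access to land on the same spot.
sliced_by_bytes = source.byteslice(byte_offset_of_last..-1)
# Mixing the two (byte offset, char indexing) skips one character too many.
mismatched = source[byte_offset_of_last..-1]

p [sliced_by_chars, sliced_by_bytes, mismatched]
```

For ASCII-only patterns both offsets coincide, which matches the changelog's remark that returned values only changed for Regexps containing multibyte chars.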
 
data/lib/regexp_parser/parser.rb CHANGED
@@ -18,12 +18,12 @@ class Regexp::Parser
      end
    end

-   def self.parse(input, syntax = "ruby/#{RUBY_VERSION}", &block)
-     new.parse(input, syntax, &block)
+   def self.parse(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
+     new.parse(input, syntax, options: options, &block)
    end

-   def parse(input, syntax = "ruby/#{RUBY_VERSION}", &block)
-     root = Root.build(options_from_input(input))
+   def parse(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
+     root = Root.build(extract_options(input, options))

      self.root = root
      self.node = root
@@ -35,7 +35,7 @@ class Regexp::Parser
      self.captured_group_counts = Hash.new(0)

-     Regexp::Lexer.scan(input, syntax) do |token|
+     Regexp::Lexer.scan(input, syntax, options: options) do |token|
        parse_token(token)
      end
 
@@ -54,14 +54,20 @@ class Regexp::Parser
                  :options_stack, :switching_options, :conditional_nesting,
                  :captured_group_counts

-   def options_from_input(input)
-     return {} unless input.is_a?(::Regexp)
+   def extract_options(input, options)
+     if options && !input.is_a?(String)
+       raise ArgumentError, 'options cannot be supplied unless parsing a String'
+     end
+
+     options = input.options if input.is_a?(::Regexp)

-     options = {}
-     options[:i] = true if input.options & ::Regexp::IGNORECASE != 0
-     options[:m] = true if input.options & ::Regexp::MULTILINE != 0
-     options[:x] = true if input.options & ::Regexp::EXTENDED != 0
-     options
+     return {} unless options
+
+     enabled_options = {}
+     enabled_options[:i] = true if options & ::Regexp::IGNORECASE != 0
+     enabled_options[:m] = true if options & ::Regexp::MULTILINE != 0
+     enabled_options[:x] = true if options & ::Regexp::EXTENDED != 0
+     enabled_options
    end

    def nest(exp)
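Reviewer note on `extract_options`: the new method accepts an options bitmask for `String` input, falls back to `Regexp#options` for `Regexp` input, and rejects explicit options in any other case. The standalone sketch below follows the same steps without depending on the gem. `enabled_flags` and `FLAG_BITS` are hypothetical names introduced here, and the table-driven loop is a stylistic variation, not the code in the diff.

```ruby
# Standalone sketch of turning a Regexp options bitmask into a hash of
# enabled flags. `enabled_flags` and `FLAG_BITS` are hypothetical names.
FLAG_BITS = {
  i: Regexp::IGNORECASE,
  m: Regexp::MULTILINE,
  x: Regexp::EXTENDED
}.freeze

def enabled_flags(input, options = nil)
  # Explicit options only make sense for a String pattern body.
  if options && !input.is_a?(String)
    raise ArgumentError, 'options cannot be supplied unless parsing a String'
  end

  # A Regexp carries its own options bitmask.
  options = input.options if input.is_a?(Regexp)
  return {} unless options

  FLAG_BITS.each_with_object({}) do |(flag, bit), enabled|
    enabled[flag] = true if options & bit != 0
  end
end

from_regexp = enabled_flags(/a+/ix)
from_string = enabled_flags('a+', Regexp::EXTENDED | Regexp::IGNORECASE)
no_options  = enabled_flags('a+')

p from_regexp, from_string, no_options
```

Raising on `Regexp` input with explicit options avoids having two competing sources of truth for the flags.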
@@ -432,6 +438,28 @@ class Regexp::Parser
      target_node || raise(ArgumentError, 'No valid target found for '\
                                          "'#{token.text}' ")

+     # in case of chained quantifiers, wrap target in an implicit passive group
+     # description of the problem: https://github.com/ammar/regexp_parser/issues/3
+     # rationale for this solution: https://github.com/ammar/regexp_parser/pull/69
+     if target_node.quantified?
+       new_token = Regexp::Token.new(
+         :group,
+         :passive,
+         '', # text
+         target_node.ts,
+         nil, # te (unused)
+         target_node.level,
+         target_node.set_level,
+         target_node.conditional_level
+       )
+       new_group = Group::Passive.new(new_token, active_opts)
+       new_group.implicit = true
+       new_group << target_node
+       increase_level(target_node)
+       node.expressions[offset] = new_group
+       target_node = new_group
+     end
+
      case token.token
      when :zero_or_one
        target_node.quantify(:zero_or_one, token.text, 0, 1, :greedy)
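Reviewer note on the implicit passive group: when the target of a quantifier is already quantified, the parser now wraps it in a `Group::Passive` flagged as implicit, so that `#to_s` reproduces the original text without adding `(?:` and `)`. The toy model below illustrates that idea with made-up classes (`Node`, `ImplicitableGroup`) and a made-up `quantify` helper; none of these are the gem's classes, and token handling is omitted.

```ruby
# Toy model (hypothetical classes, not the gem's): a node with an optional
# quantifier, and a group that can be marked implicit so that it contributes
# no "(?:" / ")" text of its own when serialized.
Node = Struct.new(:text, :quantifier, :level) do
  def quantified?
    !quantifier.nil?
  end

  def to_s
    "#{text}#{quantifier}"
  end
end

class ImplicitableGroup
  attr_accessor :quantifier, :level
  attr_writer :implicit
  attr_reader :expressions

  def initialize(level)
    @level = level
    @expressions = []
    @implicit = false
  end

  def implicit?
    @implicit
  end

  def quantified?
    !quantifier.nil?
  end

  def <<(exp)
    expressions << exp
  end

  def to_s
    body = expressions.join
    body = "(?:#{body})" unless implicit?
    "#{body}#{quantifier}"
  end
end

# Apply a quantifier; if the target already has one, wrap it first.
def quantify(target, quantifier_text)
  if target.quantified?
    group = ImplicitableGroup.new(target.level)
    group.implicit = true
    group << target
    target.level += 1 # the wrapped node now sits one level deeper
    target = group
  end
  target.quantifier = quantifier_text
  target
end

once  = quantify(Node.new('a', nil, 0), '{10}')
twice = quantify(once, '{4,6}')

puts twice.to_s
```

The tree gains a group node that carries the second quantifier, while the serialized text stays identical to the input; switching `implicit` off in the toy model yields the explicit `(?:...)` spelling instead.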
@@ -462,6 +490,11 @@ class Regexp::Parser
      end
    end

+   def increase_level(exp)
+     exp.level += 1
+     exp.respond_to?(:each) && exp.each { |subexp| increase_level(subexp) }
+   end
+
    def interval(target_node, token)
      text = token.text
      mchr = text[text.length-1].chr =~ /[?+]/ ? text[text.length-1].chr : nil