regexp_parser 1.7.0 → 2.0.0

Files changed (36)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +80 -1
  3. data/README.md +24 -12
  4. data/lib/regexp_parser/expression.rb +10 -19
  5. data/lib/regexp_parser/expression/classes/group.rb +17 -2
  6. data/lib/regexp_parser/expression/classes/root.rb +4 -16
  7. data/lib/regexp_parser/expression/quantifier.rb +9 -0
  8. data/lib/regexp_parser/expression/sequence.rb +0 -10
  9. data/lib/regexp_parser/lexer.rb +6 -6
  10. data/lib/regexp_parser/parser.rb +45 -12
  11. data/lib/regexp_parser/scanner.rb +1305 -1193
  12. data/lib/regexp_parser/scanner/char_type.rl +11 -11
  13. data/lib/regexp_parser/scanner/property.rl +2 -2
  14. data/lib/regexp_parser/scanner/scanner.rl +194 -171
  15. data/lib/regexp_parser/syntax/version_lookup.rb +2 -2
  16. data/lib/regexp_parser/version.rb +1 -1
  17. data/regexp_parser.gemspec +1 -1
  18. data/spec/expression/base_spec.rb +10 -0
  19. data/spec/expression/to_s_spec.rb +16 -0
  20. data/spec/lexer/delimiters_spec.rb +68 -0
  21. data/spec/lexer/literals_spec.rb +24 -49
  22. data/spec/parser/escapes_spec.rb +1 -1
  23. data/spec/parser/options_spec.rb +28 -0
  24. data/spec/parser/quantifiers_spec.rb +16 -0
  25. data/spec/parser/set/ranges_spec.rb +3 -3
  26. data/spec/scanner/delimiters_spec.rb +52 -0
  27. data/spec/scanner/errors_spec.rb +0 -1
  28. data/spec/scanner/escapes_spec.rb +10 -0
  29. data/spec/scanner/free_space_spec.rb +32 -0
  30. data/spec/scanner/literals_spec.rb +28 -38
  31. data/spec/scanner/options_spec.rb +36 -0
  32. data/spec/scanner/quantifiers_spec.rb +18 -13
  33. data/spec/scanner/sets_spec.rb +8 -2
  34. metadata +65 -61
  35. data/spec/expression/root_spec.rb +0 -9
  36. data/spec/expression/sequence_spec.rb +0 -9
checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: d7b93dde993f6fe427ff43755738bf7de50f8613cf6e8097c9d791646d803e4c
4
- data.tar.gz: 993a88720a4ee1d8a34f4c95e167089adc6455289bfeb356de8c028a9bbee63d
3
+ metadata.gz: dcf56dd42e703e7f1f846762c418e83792d46f2c7d9efffc1fb1612b4325e076
4
+ data.tar.gz: a6197d98af2325a93ed60a102f09bc98e2163dec0d2b57fbede82dbf0479ea8b
5
5
  SHA512:
6
- metadata.gz: 0bf5c142591b2d5a65023c53f76a64a13106074050042d24614963cc14dabda197ea9140fccd93f26ad06885293369b076bb5e9198967a6e3762654df8033455
7
- data.tar.gz: 1311b3dfa90633ef456edc12abf6ace2d7311c7be8450f3768a436f9c8491c3a87987f3d6ac24c6966b6f4de5363e0f6f874bfe9e1b038a6cf5d9c043553b58e
6
+ metadata.gz: 8b9db6543c87b63c49e24666e06bc012ea9a6e330711c9fbd35961c70d1222988bf2403927fdbe2f8797176d674b83c7c3f2d1215c5f92b0d6f00c6ab7fe37af
7
+ data.tar.gz: 6cf07796d7c6ab1520a63b0f3b65d2e513caccf0b81306321376473fc438c7573bcf867e74ea8465122e12a47fe2e7e09652473f83e26f1038e112c2d66b3d2c
data/CHANGELOG.md CHANGED
@@ -1,10 +1,89 @@
1
1
  ## [Unreleased]
2
2
 
3
+ ## [2.0.0] - 2020-11-25 - [Janosch Müller](mailto:janosch84@gmail.com)
4
+
5
+ ### Changed
6
+
7
+ - some methods that used to return byte-based indices now return char-based indices
8
+ * the returned values have only changed for Regexps that contain multibyte chars
9
+ * this is only a breaking change if you used such methods directly AND relied on them pointing to bytes
10
+ * affected methods:
11
+ * `Regexp::Token` `#length`, `#offset`, `#te`, `#ts`
12
+ * `Regexp::Expression::Base` `#full_length`, `#offset`, `#starts_at`, `#te`, `#ts`
13
+ * thanks to [Akinori MUSHA](https://github.com/knu) for the report
14
+ - removed some deprecated methods/signatures
15
+ * these are rarely used and have been showing deprecation warnings for a long time
16
+ * `Regexp::Expression::Subexpression.new` with 3 arguments
17
+ * `Regexp::Expression::Root.new` without a token argument
18
+ * `Regexp::Expression.parsed`
19
+
20
+ ### Added
21
+
22
+ - `Regexp::Expression::Base#base_length`
23
+ * returns the character count of an expression body, ignoring any quantifier
24
+ - pragmatic, experimental support for chained quantifiers
25
+ * e.g.: `/^a{10}{4,6}$/` matches exactly 40, 50 or 60 `a`s
26
+ * successive quantifiers used to be silently dropped by the parser
27
+ * they are now wrapped with passive groups as if they were written `(?:a{10}){4,6}`
28
+ * thanks to [calfeld](https://github.com/calfeld) for reporting this a while back
29
+
30
+ ### Fixed
31
+
32
+ - incorrect encoding output for non-ascii comments
33
+ * this led to a crash when calling `#to_s` on parse results containing such comments
34
+ * thanks to [Michael Glass](https://github.com/michaelglass) for the report
35
+ - some crashes when scanning contrived patterns such as `'\😋'`
36
+
37
+ ### [1.8.2] - 2020-10-11 - [Janosch Müller](mailto:janosch84@gmail.com)
38
+
39
+ ### Fixed
40
+
41
+ - fix `FrozenError` in `Expression::Base#repetitions` on Ruby 3.0
42
+ * thanks to [Thomas Walpole](https://github.com/twalpole)
43
+ - removed "unknown future version" warning on Ruby 3.0
44
+
45
+ ### [1.8.1] - 2020-09-28 - [Janosch Müller](mailto:janosch84@gmail.com)
46
+
47
+ ### Fixed
48
+
49
+ - fixed scanning of comment-like text in normal mode
50
+ * this was an old bug, but had become more prevalent in v1.8.0
51
+ * thanks to [Tietew](https://github.com/Tietew) for the report
52
+ - specified correct minimum Ruby version in gemspec
53
+ * it said 1.9 but really required 2.0 as of v1.8.0
54
+
55
+ ### [1.8.0] - 2020-09-20 - [Janosch Müller](mailto:janosch84@gmail.com)
56
+
57
+ ### Changed
58
+
59
+ - dropped support for running on Ruby 1.9.x
60
+
61
+ ### Added
62
+
63
+ - regexp flags can now be passed when parsing a `String` as regexp body
64
+ * see the [README](/README.md#usage) for details
65
+ * thanks to [Owen Stephens](https://github.com/owst)
66
+ - bare occurrences of `\g` and `\k` are now allowed and scanned as literal escapes
67
+ * matches Onigmo behavior
68
+ * thanks for the report to [Marc-André Lafortune](https://github.com/marcandre)
69
+
70
+ ### Fixed
71
+
72
+ - fixed parsing comments without preceding space or trailing newline in x-mode
73
+ * thanks to [Owen Stephens](https://github.com/owst)
74
+
75
+ ### [1.7.1] - 2020-06-07 - [Ammar Ali](mailto:ammarabuali@gmail.com)
76
+
77
+ ### Fixed
78
+
79
+ - Support for literals that include the unescaped delimiters `{`, `}`, and `]`. These
80
+ delimiters are informally supported by various regexp engines.
81
+
3
82
  ### [1.7.0] - 2020-02-23 - [Janosch Müller](mailto:janosch84@gmail.com)
4
83
 
5
84
  ### Added
6
85
 
7
- - `Expression#each_expression` and `1.#traverse` can now be called without a block
86
+ - `Expression#each_expression` and `#traverse` can now be called without a block
8
87
  * this returns an `Enumerator` and allows chaining, e.g. `each_expression.select`
9
88
  * thanks to [Masataka Kuwabara](https://github.com/pocke)
10
89
 
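The chained-quantifier entry in the 2.0.0 notes can be checked against Ruby's own regexp engine without loading the gem. The following is a minimal sketch that compares the chained form with the explicit passive-group form named in the changelog; the lambda and variable names are illustrative only.

```ruby
# Chained quantifiers vs. the explicit passive-group form the parser now models.
chained  = Regexp.new('^a{10}{4,6}$')
explicit = Regexp.new('^(?:a{10}){4,6}$')

# Collect every run length from 0 to 70 that a pattern accepts.
accepted_by = lambda do |regexp|
  (0..70).select { |n| regexp =~ ('a' * n) }
end

chained_lengths  = accepted_by.call(chained)
explicit_lengths = accepted_by.call(explicit)

p chained_lengths
p explicit_lengths
```

Both lists should contain only 40, 50 and 60, which is why wrapping the first quantified expression in an implicit `(?:...)` preserves the pattern's meaning.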
data/README.md CHANGED
@@ -8,8 +8,8 @@ A Ruby gem for tokenizing, parsing, and transforming regular expressions.
8
8
  * A scanner/tokenizer based on [Ragel](http://www.colm.net/open-source/ragel/)
9
9
  * A lexer that produces a "stream" of token objects.
10
10
  * A parser that produces a "tree" of Expression objects (OO API)
11
- * Runs on Ruby 1.9, 2.x, and JRuby (1.9 mode) runtimes.
12
- * Recognizes Ruby 1.8, 1.9, and 2.x regular expressions [See Supported Syntax](#supported-syntax)
11
+ * Runs on Ruby 2.x, 3.x and JRuby runtimes
12
+ * Recognizes Ruby 1.8, 1.9, 2.x and 3.x regular expressions [See Supported Syntax](#supported-syntax)
13
13
 
14
14
 
15
15
  _For examples of regexp_parser in use, see [Example Projects](#example-projects)._
@@ -18,7 +18,7 @@ _For examples of regexp_parser in use, see [Example Projects](#example-projects)
18
18
  ---
19
19
  ## Requirements
20
20
 
21
- * Ruby >= 1.9
21
+ * Ruby >= 2.0
22
22
  * Ragel >= 6.0, but only if you want to build the gem or work on the scanner.
23
23
 
24
24
 
@@ -72,6 +72,17 @@ called with the results as follows:
72
72
  * **Parser**: after completion, the block gets passed the root expression.
73
73
  _The result of the block is returned._
74
74
 
75
+ All three methods accept either a `Regexp` or `String` (containing the pattern)
76
+ - if a String is passed, `options` can be supplied:
77
+
78
+ ```ruby
79
+ require 'regexp_parser'
80
+
81
+ Regexp::Parser.parse(
82
+ "a+ # Recognises a and A...",
83
+ options: ::Regexp::EXTENDED | ::Regexp::IGNORECASE
84
+ )
85
+ ```
75
86
 
76
87
  ---
77
88
  ## Components
@@ -136,11 +147,8 @@ Regexp::Scanner.scan( /(cat?([bhm]at)){3,5}/ ).map {|token| token[2]}
136
147
  to the lexer.
137
148
 
138
149
  * The MRI implementation may accept expressions that either conflict with
139
- the documentation or are undocumented. The scanner does not support such
140
- implementation quirks.
141
- _(See issues [#3](https://github.com/ammar/regexp_parser/issues/3) and
142
- [#15](https://github.com/ammar/regexp_parser/issues/15) for examples)_
143
-
150
+ the documentation or are undocumented, like `{}` and `]` _(unescaped)_.
151
+ The scanner will try to support as many of these cases as possible.
144
152
 
145
153
  ---
146
154
  ### Syntax
@@ -309,7 +317,7 @@ Expression class. See the next section for details._
309
317
 
310
318
  ## Supported Syntax
311
319
  The three modules support all the regular expression syntax features of Ruby 1.8,
312
- 1.9, and 2.x:
320
+ 1.9, 2.x and 3.x:
313
321
 
314
322
  _Note that not all of these are available in all versions of Ruby_
315
323
 
@@ -432,13 +440,17 @@ rake install
432
440
  ## Example Projects
433
441
  Projects using regexp_parser.
434
442
 
443
+ - [capybara](https://github.com/teamcapybara/capybara) is an integration testing tool that uses regexp_parser to convert Regexps to css/xpath selectors.
444
+
445
+ - [js_regex](https://github.com/janosch-x/js_regex) converts Ruby regular expressions to JavaScript-compatible regular expressions.
446
+
435
447
  - [meta_re](https://github.com/ammar/meta_re) is a regular expression preprocessor with alias support.
436
448
 
437
449
  - [mutant](https://github.com/mbj/mutant) (before v0.9.0) manipulates your regular expressions (amongst others) to see if your tests cover their behavior.
438
450
 
439
- - [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) uses regexp_parser to generate examples of postal codes.
451
+ - [rubocop](https://github.com/rubocop-hq/rubocop) is a linter for Ruby that uses regexp_parser to lint Regexps.
440
452
 
441
- - [js_regex](https://github.com/janosch-x/js_regex) converts Ruby regular expressions to JavaScript-compatible regular expressions.
453
+ - [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) is a localization helper that uses regexp_parser to generate examples of postal codes.
442
454
 
443
455
 
444
456
  ## References
@@ -467,4 +479,4 @@ Documentation and books used while working on this project.
467
479
 
468
480
  ---
469
481
  ##### Copyright
470
- _Copyright (c) 2010-2019 Ammar Ali. See LICENSE file for details._
482
+ _Copyright (c) 2010-2020 Ammar Ali. See LICENSE file for details._
data/lib/regexp_parser/expression.rb CHANGED
@@ -34,6 +34,10 @@ module Regexp::Expression
34
34
 
35
35
  alias :starts_at :ts
36
36
 
37
+ def base_length
38
+ to_s(:base).length
39
+ end
40
+
37
41
  def full_length
38
42
  to_s.length
39
43
  end
@@ -80,8 +84,12 @@ module Regexp::Expression
80
84
  return 1..1 unless quantified?
81
85
  min = quantifier.min
82
86
  max = quantifier.max < 0 ? Float::INFINITY : quantifier.max
83
- # fix Range#minmax - https://bugs.ruby-lang.org/issues/15807
84
- (min..max).tap { |r| r.define_singleton_method(:minmax) { [min, max] } }
87
+ range = min..max
88
+ # fix Range#minmax on old Rubies - https://bugs.ruby-lang.org/issues/15807
89
+ if RUBY_VERSION.to_f < 2.7
90
+ range.define_singleton_method(:minmax) { [min, max] }
91
+ end
92
+ range
85
93
  end
86
94
 
87
95
  def greedy?
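The version guard in the `repetitions` hunk can be reproduced in isolation. This is a sketch under the assumption that a negative maximum means "unbounded", as in the hunk; `repetition_range` is a hypothetical helper, not a method of the gem. Skipping the override on newer Rubies also avoids defining a singleton method on a `Range`, which Ruby 3.0 freezes (compare the `FrozenError` fix listed under 1.8.2).

```ruby
# Build a repetition range from quantifier bounds; a negative max means "unbounded".
def repetition_range(min, max)
  upper = max < 0 ? Float::INFINITY : max
  range = min..upper
  # Before Ruby 2.7, Range#minmax enumerated the range and did not return
  # for an infinite one, so it is overridden on those versions only.
  if RUBY_VERSION.to_f < 2.7
    range.define_singleton_method(:minmax) { [min, upper] }
  end
  range
end

bounded   = repetition_range(2, 5)
unbounded = repetition_range(1, -1)

p bounded.minmax
p unbounded.minmax
p unbounded.include?(1_000_000)
```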
@@ -114,23 +122,6 @@ module Regexp::Expression
114
122
  alias :to_h :attributes
115
123
  end
116
124
 
117
- def self.parsed(exp)
118
- warn('WARNING: Regexp::Expression::Base.parsed is buggy and '\
119
- 'will be removed in 2.0.0. Use Regexp::Parser.parse instead.')
120
- case exp
121
- when String
122
- Regexp::Parser.parse(exp)
123
- when Regexp
124
- Regexp::Parser.parse(exp.source) # <- causes loss of root options
125
- when Regexp::Expression # <- never triggers
126
- exp
127
- else
128
- raise ArgumentError, 'Expression.parsed accepts a String, Regexp, or '\
129
- 'a Regexp::Expression as a value for exp, but it '\
130
- "was given #{exp.class.name}."
131
- end
132
- end
133
-
134
125
  end # module Regexp::Expression
135
126
 
136
127
  require 'regexp_parser/expression/quantifier'
data/lib/regexp_parser/expression/classes/group.rb CHANGED
@@ -10,9 +10,24 @@ module Regexp::Expression
10
10
  def comment?; false end
11
11
  end
12
12
 
13
- class Atomic < Group::Base; end
14
- class Passive < Group::Base; end
13
+ class Passive < Group::Base
14
+ attr_writer :implicit
15
+
16
+ def to_s(format = :full)
17
+ if implicit?
18
+ "#{expressions.join}#{quantifier_affix(format)}"
19
+ else
20
+ super
21
+ end
22
+ end
23
+
24
+ def implicit?
25
+ @implicit ||= false
26
+ end
27
+ end
28
+
15
29
  class Absence < Group::Base; end
30
+ class Atomic < Group::Base; end
16
31
  class Options < Group::Base
17
32
  attr_accessor :option_changes
18
33
  end
data/lib/regexp_parser/expression/classes/root.rb CHANGED
@@ -1,24 +1,12 @@
1
1
  module Regexp::Expression
2
2
 
3
3
  class Root < Regexp::Expression::Subexpression
4
- # TODO: this override is here for backwards compatibility, remove in 2.0.0
5
- def initialize(*args)
6
- unless args.first.is_a?(Regexp::Token)
7
- warn('WARNING: Root.new without a Token argument is deprecated and '\
8
- 'will be removed in 2.0.0. Use Root.build for the old behavior.')
9
- return super(self.class.build_token, *args)
10
- end
11
- super
4
+ def self.build(options = {})
5
+ new(build_token, options)
12
6
  end
13
7
 
14
- class << self
15
- def build(options = {})
16
- new(build_token, options)
17
- end
18
-
19
- def build_token
20
- Regexp::Token.new(:expression, :root, '', 0)
21
- end
8
+ def self.build_token
9
+ Regexp::Token.new(:expression, :root, '', 0)
22
10
  end
23
11
  end
24
12
  end
data/lib/regexp_parser/expression/quantifier.rb CHANGED
@@ -40,5 +40,14 @@ module Regexp::Expression
40
40
  RUBY
41
41
  end
42
42
  alias :lazy? :reluctant?
43
+
44
+ def ==(other)
45
+ other.class == self.class &&
46
+ other.token == token &&
47
+ other.mode == mode &&
48
+ other.min == min &&
49
+ other.max == max
50
+ end
51
+ alias :eq :==
43
52
  end
44
53
  end
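The `==` added to the quantifier class is plain value equality: two quantifiers are equal when they share a class and all compared attributes. A minimal sketch of the same pattern, using a hypothetical `SketchQuantifier` stand-in rather than the gem's class:

```ruby
# Hypothetical stand-in for a quantifier object; not the gem's class.
class SketchQuantifier
  attr_reader :token, :mode, :min, :max

  def initialize(token, mode, min, max)
    @token = token
    @mode  = mode
    @min   = min
    @max   = max
  end

  # Value equality: same class and same attribute values.
  def ==(other)
    other.class == self.class &&
      other.token == token &&
      other.mode == mode &&
      other.min == min &&
      other.max == max
  end
end

a = SketchQuantifier.new(:interval, :greedy, 4, 6)
b = SketchQuantifier.new(:interval, :greedy, 4, 6)
c = SketchQuantifier.new(:interval, :lazy, 4, 6)

p a == b         # same values
p a == c         # different mode
p a == '{4,6}'   # different class
p a.equal?(b)    # object identity, unaffected by ==
```

Checking the class first means a comparison against an unrelated object returns `false` instead of raising `NoMethodError`.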
data/lib/regexp_parser/expression/sequence.rb CHANGED
@@ -7,16 +7,6 @@ module Regexp::Expression
7
7
  # Used as the base class for the Alternation alternatives, Conditional
8
8
  # branches, and CharacterSet::Intersection intersected sequences.
9
9
  class Sequence < Regexp::Expression::Subexpression
10
- # TODO: this override is here for backwards compatibility, remove in 2.0.0
11
- def initialize(*args)
12
- if args.count == 3
13
- warn('WARNING: Sequence.new without a Regexp::Token argument is '\
14
- 'deprecated and will be removed in 2.0.0.')
15
- return self.class.at_levels(*args)
16
- end
17
- super
18
- end
19
-
20
10
  class << self
21
11
  def add_to(subexpression, params = {}, active_opts = {})
22
12
  sequence = at_levels(
data/lib/regexp_parser/lexer.rb CHANGED
@@ -11,11 +11,11 @@ class Regexp::Lexer
11
11
 
12
12
  CLOSING_TOKENS = [:close].freeze
13
13
 
14
- def self.lex(input, syntax = "ruby/#{RUBY_VERSION}", &block)
15
- new.lex(input, syntax, &block)
14
+ def self.lex(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
15
+ new.lex(input, syntax, options: options, &block)
16
16
  end
17
17
 
18
- def lex(input, syntax = "ruby/#{RUBY_VERSION}", &block)
18
+ def lex(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
19
19
  syntax = Regexp::Syntax.new(syntax)
20
20
 
21
21
  self.tokens = []
@@ -25,7 +25,7 @@ class Regexp::Lexer
25
25
  self.shift = 0
26
26
 
27
27
  last = nil
28
- Regexp::Scanner.scan(input) do |type, token, text, ts, te|
28
+ Regexp::Scanner.scan(input, options: options) do |type, token, text, ts, te|
29
29
  type, token = *syntax.normalize(type, token)
30
30
  syntax.check! type, token
31
31
 
@@ -96,10 +96,10 @@ class Regexp::Lexer
96
96
 
97
97
  tokens.pop
98
98
  tokens << Regexp::Token.new(:literal, :literal, lead,
99
- token.ts, (token.te - last.bytesize),
99
+ token.ts, (token.te - last.length),
100
100
  nesting, set_nesting, conditional_nesting)
101
101
  tokens << Regexp::Token.new(:literal, :literal, last,
102
- (token.ts + lead.bytesize), token.te,
102
+ (token.ts + lead.length), token.te,
103
103
  nesting, set_nesting, conditional_nesting)
104
104
  end
105
105
 
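The switch from `bytesize` to `length` in the literal-splitting code matters only when the literal contains multibyte characters. The sketch below uses plain `String` methods to show how the split point differs between the two offset systems; the variable names are illustrative and no gem API is involved.

```ruby
# A literal run is split into a lead and a last character so that a
# following quantifier applies to the last character only.
text = "h\u00e9\u00e9"        # "héé": 3 characters, 5 bytes in UTF-8
lead = text[0...-1]           # "hé"
last = text[-1]               # "é"

char_te = text.length         # end offset counted in characters
byte_te = text.bytesize       # end offset counted in bytes

char_split = char_te - last.length     # boundary in characters
byte_split = byte_te - last.bytesize   # boundary in bytes

p [char_te, byte_te]
p [char_split, byte_split]

# Each boundary is valid only with the matching slicing method.
p text[0...char_split] == lead
p text.byteslice(0, byte_split) == lead
```

Callers that index into the pattern with `String#[]` can use the character-based values directly, while code written against the old values would need `byteslice`.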
data/lib/regexp_parser/parser.rb CHANGED
@@ -18,12 +18,12 @@ class Regexp::Parser
18
18
  end
19
19
  end
20
20
 
21
- def self.parse(input, syntax = "ruby/#{RUBY_VERSION}", &block)
22
- new.parse(input, syntax, &block)
21
+ def self.parse(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
22
+ new.parse(input, syntax, options: options, &block)
23
23
  end
24
24
 
25
- def parse(input, syntax = "ruby/#{RUBY_VERSION}", &block)
26
- root = Root.build(options_from_input(input))
25
+ def parse(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
26
+ root = Root.build(extract_options(input, options))
27
27
 
28
28
  self.root = root
29
29
  self.node = root
@@ -35,7 +35,7 @@ class Regexp::Parser
35
35
 
36
36
  self.captured_group_counts = Hash.new(0)
37
37
 
38
- Regexp::Lexer.scan(input, syntax) do |token|
38
+ Regexp::Lexer.scan(input, syntax, options: options) do |token|
39
39
  parse_token(token)
40
40
  end
41
41
 
@@ -54,14 +54,20 @@ class Regexp::Parser
54
54
  :options_stack, :switching_options, :conditional_nesting,
55
55
  :captured_group_counts
56
56
 
57
- def options_from_input(input)
58
- return {} unless input.is_a?(::Regexp)
57
+ def extract_options(input, options)
58
+ if options && !input.is_a?(String)
59
+ raise ArgumentError, 'options cannot be supplied unless parsing a String'
60
+ end
61
+
62
+ options = input.options if input.is_a?(::Regexp)
59
63
 
60
- options = {}
61
- options[:i] = true if input.options & ::Regexp::IGNORECASE != 0
62
- options[:m] = true if input.options & ::Regexp::MULTILINE != 0
63
- options[:x] = true if input.options & ::Regexp::EXTENDED != 0
64
- options
64
+ return {} unless options
65
+
66
+ enabled_options = {}
67
+ enabled_options[:i] = true if options & ::Regexp::IGNORECASE != 0
68
+ enabled_options[:m] = true if options & ::Regexp::MULTILINE != 0
69
+ enabled_options[:x] = true if options & ::Regexp::EXTENDED != 0
70
+ enabled_options
65
71
  end
66
72
 
67
73
  def nest(exp)
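The flag extraction above is a bitmask test against the `Regexp` option constants. A standalone sketch of the same idea follows; `enabled_flags` is a hypothetical name, and the parentheses around each mask are added here only for readability.

```ruby
# Translate a Regexp options bitmask into a hash of enabled flags.
def enabled_flags(options)
  return {} unless options

  flags = {}
  flags[:i] = true if (options & Regexp::IGNORECASE) != 0
  flags[:m] = true if (options & Regexp::MULTILINE) != 0
  flags[:x] = true if (options & Regexp::EXTENDED) != 0
  flags
end

from_constants = enabled_flags(Regexp::EXTENDED | Regexp::IGNORECASE)
from_regexp    = enabled_flags(/a/m.options)
from_nil       = enabled_flags(nil)

p from_constants
p from_regexp
p from_nil
```

Because only three bits are inspected, any other bits in `Regexp#options` (for example encoding-related ones) are ignored.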
@@ -432,6 +438,28 @@ class Regexp::Parser
432
438
  target_node || raise(ArgumentError, 'No valid target found for '\
433
439
  "'#{token.text}' ")
434
440
 
441
+ # in case of chained quantifiers, wrap target in an implicit passive group
442
+ # description of the problem: https://github.com/ammar/regexp_parser/issues/3
443
+ # rationale for this solution: https://github.com/ammar/regexp_parser/pull/69
444
+ if target_node.quantified?
445
+ new_token = Regexp::Token.new(
446
+ :group,
447
+ :passive,
448
+ '', # text
449
+ target_node.ts,
450
+ nil, # te (unused)
451
+ target_node.level,
452
+ target_node.set_level,
453
+ target_node.conditional_level
454
+ )
455
+ new_group = Group::Passive.new(new_token, active_opts)
456
+ new_group.implicit = true
457
+ new_group << target_node
458
+ increase_level(target_node)
459
+ node.expressions[offset] = new_group
460
+ target_node = new_group
461
+ end
462
+
435
463
  case token.token
436
464
  when :zero_or_one
437
465
  target_node.quantify(:zero_or_one, token.text, 0, 1, :greedy)
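The wrapping step can be modelled with two small stand-in types. This is a sketch only: `Quantified`, `ImplicitGroup` and `quantify` are hypothetical names, and the gem's expression classes carry much more state (tokens, levels, options). It shows why the implicit group serializes without `(?:` and `)`: the original pattern text should round-trip through `to_s`.

```ruby
# Hypothetical stand-in for a parsed expression with an optional quantifier.
Quantified = Struct.new(:body, :quantifier) do
  def quantified?
    !quantifier.nil?
  end

  def to_s
    "#{body}#{quantifier}"
  end
end

# Hypothetical stand-in for a passive group that was not written in the
# source pattern; it prints its child without group delimiters.
ImplicitGroup = Struct.new(:child, :quantifier) do
  def quantified?
    !quantifier.nil?
  end

  def to_s
    "#{child}#{quantifier}"
  end
end

# Apply a quantifier; wrap the target first if it already carries one.
def quantify(target, quantifier_text)
  target = ImplicitGroup.new(target, nil) if target.quantified?
  target.quantifier = quantifier_text
  target
end

node = Quantified.new('a', nil)
node = quantify(node, '{10}')
node = quantify(node, '{4,6}')

p node.class
p node.child.to_s
p node.to_s
```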
@@ -462,6 +490,11 @@ class Regexp::Parser
462
490
  end
463
491
  end
464
492
 
493
+ def increase_level(exp)
494
+ exp.level += 1
495
+ exp.respond_to?(:each) && exp.each { |subexp| increase_level(subexp) }
496
+ end
497
+
465
498
  def interval(target_node, token)
466
499
  text = token.text
467
500
  mchr = text[text.length-1].chr =~ /[?+]/ ? text[text.length-1].chr : nil
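The `increase_level` helper added in this file walks a subtree and bumps the nesting level of every node, since wrapping an expression in a new group pushes it one level deeper. A minimal sketch with hypothetical `Leaf` and `Branch` classes standing in for terminal and nested expressions:

```ruby
# Hypothetical terminal node: only a nesting level.
class Leaf
  attr_accessor :level

  def initialize(level)
    @level = level
  end
end

# Hypothetical nested node: a nesting level plus children it can iterate.
class Branch
  attr_accessor :level
  attr_reader :children

  def initialize(level, children)
    @level = level
    @children = children
  end

  def each(&block)
    children.each(&block)
  end
end

# Increment the level of a node and, recursively, of all its descendants.
def increase_level(node)
  node.level += 1
  node.each { |child| increase_level(child) } if node.respond_to?(:each)
end

inner = Branch.new(2, [Leaf.new(3)])
tree  = Branch.new(1, [Leaf.new(2), inner])
increase_level(tree)

p tree.level
p tree.children.map(&:level)
p inner.children.map(&:level)
```

Plain classes are used instead of `Struct` here because `Struct` defines its own `each` over member values, which would defeat the `respond_to?(:each)` check.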