RubyGems - regexp_parser - Versions diffs - 1.7.1 → 2.0.1 - Mend

regexp_parser 1.7.1 → 2.0.1

Files changed (34)

checksums.yaml +4 -4
data/CHANGELOG.md +83 -0
data/README.md +23 -11
data/lib/regexp_parser/expression.rb +10 -19
data/lib/regexp_parser/expression/classes/group.rb +17 -2
data/lib/regexp_parser/expression/classes/root.rb +4 -16
data/lib/regexp_parser/expression/quantifier.rb +9 -0
data/lib/regexp_parser/expression/sequence.rb +0 -10
data/lib/regexp_parser/lexer.rb +6 -6
data/lib/regexp_parser/parser.rb +45 -12
data/lib/regexp_parser/scanner.rb +1264 -1280
data/lib/regexp_parser/scanner/char_type.rl +11 -11
data/lib/regexp_parser/scanner/property.rl +2 -2
data/lib/regexp_parser/scanner/scanner.rl +195 -194
data/lib/regexp_parser/syntax/version_lookup.rb +2 -2
data/lib/regexp_parser/version.rb +1 -1
data/regexp_parser.gemspec +1 -1
data/spec/expression/base_spec.rb +10 -0
data/spec/expression/to_s_spec.rb +16 -0
data/spec/lexer/literals_spec.rb +24 -49
data/spec/parser/escapes_spec.rb +1 -1
data/spec/parser/options_spec.rb +28 -0
data/spec/parser/quantifiers_spec.rb +15 -0
data/spec/parser/set/ranges_spec.rb +3 -3
data/spec/scanner/escapes_spec.rb +11 -0
data/spec/scanner/free_space_spec.rb +32 -0
data/spec/scanner/groups_spec.rb +10 -1
data/spec/scanner/literals_spec.rb +28 -38
data/spec/scanner/options_spec.rb +36 -0
data/spec/scanner/quantifiers_spec.rb +18 -13
data/spec/scanner/sets_spec.rb +8 -2
metadata +60 -60
data/spec/expression/root_spec.rb +0 -9
data/spec/expression/sequence_spec.rb +0 -9

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: dd872b22bf04a288790ef0f73df9041f14fb88a08c2a03852d9dbbc238b452d6
-  data.tar.gz: 4641097a24b5fa0f7b0c8e5aacc152587fe8b15d30f3f78bbec8157887b8b897
+  metadata.gz: 4d4ee1ebabfe19761461dc33344c1d5928be3d1f47b3064b5bf37206984ec43e
+  data.tar.gz: d4d0fae95d08fecedfe67d60849564fbe8fb971dafe1a8039e8b646eab23d765
 SHA512:
-  metadata.gz: 858570df4a7047a2d8b09555b56de28a66ca4f8022e596c249900f5312f8e7fb9376384ca816bc3c08f3e324930702ad410a28b5be680adea6867e1f8075441e
-  data.tar.gz: 0d70e7b4f18739826bb334fb305e335e44a354ae302214ca3c1884f66ace8680e48a9e4c64b890b220b82056da761084413c8b9b8c5e363382f5cf165b3d3448
+  metadata.gz: a78da1d206611573a47328e7904b0aba69203e00b9d33afb65a0fec1d22498cf1d16c761dbda6cc3af930c3fdb4fcc35932126e0fc048a8c6047c17485ce62ec
+  data.tar.gz: 3bc8081a187746c76fe5cb7d69519638e03f690533fe221c8b8a9285d537c95afcecb1aebc861ceea1252e6af55a117004f063dd319b0a402c503ae95fb5e0c7

data/CHANGELOG.md CHANGED

@@ -1,5 +1,88 @@
 ## [Unreleased]
+## [2.0.1] - 2020-12-20 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Fixed
+- fixed error when scanning some group names
+  * this affected names containing hyphens, digits or multibyte chars, e.g. `/(?<a1>a)/`
+  * thanks to [Daniel Gollahon](https://github.com/dgollahon) for the report
+- fixed error when scanning hex escapes with just one hex digit
+  * e.g. `/\x0A/` was scanned correctly, but the equivalent `/\xA/` was not
+  * thanks to [Daniel Gollahon](https://github.com/dgollahon) for the report
+## [2.0.0] - 2020-11-25 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Changed
+- some methods that used to return byte-based indices now return char-based indices
+  * the returned values have only changed for Regexps that contain multibyte chars
+  * this is only a breaking change if you used such methods directly AND relied on them pointing to bytes
+  * affected methods:
+  * `Regexp::Token` `#length`, `#offset`, `#te`, `#ts`
+  * `Regexp::Expression::Base` `#full_length`, `#offset`, `#starts_at`, `#te`, `#ts`
+  * thanks to [Akinori MUSHA](https://github.com/knu) for the report
+- removed some deprecated methods/signatures
+  * these are rarely used and have been showing deprecation warnings for a long time
+  * `Regexp::Expression::Subexpression.new` with 3 arguments
+  * `Regexp::Expression::Root.new` without a token argument
+  * `Regexp::Expression.parsed`
+### Added
+- `Regexp::Expression::Base#base_length`
+  * returns the character count of an expression body, ignoring any quantifier
+- pragmatic, experimental support for chained quantifiers
+  * e.g.: `/^a{10}{4,6}$/` matches exactly 40, 50 or 60 `a`s
+  * successive quantifiers used to be silently dropped by the parser
+  * they are now wrapped with passive groups as if they were written `(?:a{10}){4,6}`
+  * thanks to [calfeld](https://github.com/calfeld) for reporting this a while back
+### Fixed
+- incorrect encoding output for non-ascii comments
+  * this led to a crash when calling `#to_s` on parse results containing such comments
+  * thanks to [Michael Glass](https://github.com/michaelglass) for the report
+- some crashes when scanning contrived patterns such as `'\😋'`
+### [1.8.2] - 2020-10-11 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Fixed
+- fix `FrozenError` in `Expression::Base#repetitions` on Ruby 3.0
+  * thanks to [Thomas Walpole](https://github.com/twalpole)
+- removed "unknown future version" warning on Ruby 3.0
+### [1.8.1] - 2020-09-28 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Fixed
+- fixed scanning of comment-like text in normal mode
+  * this was an old bug, but had become more prevalent in v1.8.0
+  * thanks to [Tietew](https://github.com/Tietew) for the report
+- specified correct minimum Ruby version in gemspec
+  * it said 1.9 but really required 2.0 as of v1.8.0
+### [1.8.0] - 2020-09-20 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Changed
+- dropped support for running on Ruby 1.9.x
+### Added
+- regexp flags can now be passed when parsing a `String` as regexp body
+  * see the [README](/README.md#usage) for details
+  * thanks to [Owen Stephens](https://github.com/owst)
+- bare occurrences of `\g` and `\k` are now allowed and scanned as literal escapes
+  * matches Onigmo behavior
+  * thanks for the report to [Marc-André Lafortune](https://github.com/marcandre)
+### Fixed
+- fixed parsing comments without preceding space or trailing newline in x-mode
+  * thanks to [Owen Stephens](https://github.com/owst)
 ### [1.7.1] - 2020-06-07 - [Ammar Ali](mailto:ammarabuali@gmail.com)
 ### Fixed

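The patterns quoted in the changelog entries above can be checked against Ruby's own regexp engine without installing the gem. The following is a minimal sketch under that assumption; `matching_lengths` and the constant names are made up for illustration and are not part of regexp_parser.

```ruby
# Plain-Ruby check of patterns quoted in the changelog; no gem required.

# Chained quantifiers: per the 2.0.0 entry, /^a{10}{4,6}$/ is treated as if
# it were written /^(?:a{10}){4,6}$/.
CHAINED  = /^a{10}{4,6}$/
EXPLICIT = /^(?:a{10}){4,6}$/

# Hypothetical helper: every length in 1..max_len for which a run of "a"s
# of that length matches the given regexp.
def matching_lengths(regexp, max_len = 70)
  (1..max_len).select { |n| regexp.match?('a' * n) }
end

# Hex escapes: the one-digit and two-digit forms from the 2.0.1 entry.
ONE_DIGIT = /\xA/
TWO_DIGIT = /\x0A/

# Group names may contain digits, as in the 2.0.1 entry.
NAMED = /(?<a1>a)/

p matching_lengths(CHAINED)
p matching_lengths(EXPLICIT)
p ONE_DIGIT.match?("\n"), TWO_DIGIT.match?("\n")
p NAMED.match('a')[:a1]
```

This only exercises the engine's behavior; it says nothing about how the gem's scanner or parser represents these patterns internally.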
data/README.md CHANGED

@@ -1,6 +1,6 @@
 # Regexp::Parser
-[![Gem Version](https://badge.fury.io/rb/regexp_parser.svg)](http://badge.fury.io/rb/regexp_parser) [![Build Status](https://secure.travis-ci.org/ammar/regexp_parser.svg?branch=master)](http://travis-ci.org/ammar/regexp_parser) [![Code Climate](https://codeclimate.com/github/ammar/regexp_parser.svg)](https://codeclimate.com/github/ammar/regexp_parser/badges)
+[![Gem Version](https://badge.fury.io/rb/regexp_parser.svg)](http://badge.fury.io/rb/regexp_parser) [![Build Status](https://github.com/ammar/regexp_parser/workflows/tests/badge.svg)](https://github.com/ammar/regexp_parser/actions) [![Code Climate](https://codeclimate.com/github/ammar/regexp_parser.svg)](https://codeclimate.com/github/ammar/regexp_parser/badges)
 A Ruby gem for tokenizing, parsing, and transforming regular expressions.
@@ -8,8 +8,8 @@ A Ruby gem for tokenizing, parsing, and transforming regular expressions.
   * A scanner/tokenizer based on [Ragel](http://www.colm.net/open-source/ragel/)
   * A lexer that produces a "stream" of token objects.
   * A parser that produces a "tree" of Expression objects (OO API)
-* Runs on Ruby 1.9, 2.x, and JRuby (1.9 mode) runtimes.
-* Recognizes Ruby 1.8, 1.9, and 2.x regular expressions [See Supported Syntax](#supported-syntax)
+* Runs on Ruby 2.x, 3.x and JRuby runtimes
+* Recognizes Ruby 1.8, 1.9, 2.x and 3.x regular expressions [See Supported Syntax](#supported-syntax)
 _For examples of regexp_parser in use, see [Example Projects](#example-projects)._
@@ -18,13 +18,10 @@ _For examples of regexp_parser in use, see [Example Projects](#example-projects)
 ---
 ## Requirements
-* Ruby >= 1.9
+* Ruby >= 2.0
 * Ragel >= 6.0, but only if you want to build the gem or work on the scanner.
-_Note: See the .travis.yml file for covered versions._
 ---
 ## Install
@@ -72,6 +69,17 @@ called with the results as follows:
 * **Parser**: after completion, the block gets passed the root expression.
   _The result of the block is returned._
+All three methods accept either a `Regexp` or `String` (containing the pattern)
+- if a String is passed, `options` can be supplied:
+```ruby
+require 'regexp_parser'
+Regexp::Parser.parse(
+  "a+ # Recognises a and A...",
+  options: ::Regexp::EXTENDED | ::Regexp::IGNORECASE
+)
+```
 ---
 ## Components
@@ -306,7 +314,7 @@ Expression class. See the next section for details._
 ## Supported Syntax
 The three modules support all the regular expression syntax features of Ruby 1.8,
-1.9, and 2.x:
+1.9, 2.x and 3.x:
 _Note that not all of these are available in all versions of Ruby_
@@ -429,13 +437,17 @@ rake install
 ## Example Projects
 Projects using regexp_parser.
+- [capybara](https://github.com/teamcapybara/capybara) is an integration testing tool that uses regexp_parser to convert Regexps to css/xpath selectors.
+- [js_regex](https://github.com/janosch-x/js_regex) converts Ruby regular expressions to JavaScript-compatible regular expressions.
 - [meta_re](https://github.com/ammar/meta_re) is a regular expression preprocessor with alias support.
 - [mutant](https://github.com/mbj/mutant) (before v0.9.0) manipulates your regular expressions (amongst others) to see if your tests cover their behavior.
-- [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) uses regexp_parser to generate examples of postal codes.
+- [rubocop](https://github.com/rubocop-hq/rubocop) is a linter for Ruby that uses regexp_parser to lint Regexps.
-- [js_regex](https://github.com/janosch-x/js_regex) converts Ruby regular expressions to JavaScript-compatible regular expressions.
+- [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) is a localization helper that uses regexp_parser to generate examples of postal codes.
 ## References
@@ -464,4 +476,4 @@ Documentation and books used while working on this project.
 ---
 ##### Copyright
-_Copyright (c) 2010-2019 Ammar Ali. See LICENSE file for details._
+_Copyright (c) 2010-2020 Ammar Ali. See LICENSE file for details._

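The `options:` value in the README hunk above is an ordinary bitmask built from the `Regexp` flag constants. As a rough sketch of how such a mask maps onto the `:i`, `:m` and `:x` keys, the helper below repeats the bit tests that appear in the `parser.rb` hunk of this diff; `enabled_flags` itself is a hypothetical stand-alone name, not a method of the gem.

```ruby
# Hypothetical stand-alone helper (not part of regexp_parser): decode a
# Regexp flag bitmask into a hash of enabled option keys.
def enabled_flags(mask)
  return {} unless mask

  flags = {}
  # In Ruby, `&` binds tighter than `!=`, so no parentheses are needed.
  flags[:i] = true if mask & Regexp::IGNORECASE != 0
  flags[:m] = true if mask & Regexp::MULTILINE  != 0
  flags[:x] = true if mask & Regexp::EXTENDED   != 0
  flags
end

p enabled_flags(Regexp::EXTENDED | Regexp::IGNORECASE)
p enabled_flags(/a+/m.options)
p enabled_flags(nil)
```

Passing a plain Integer keeps the call compatible with what `Regexp#options` returns, so the same decoding works for a `Regexp` input and for a `String` plus explicit flags.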
data/lib/regexp_parser/expression.rb CHANGED

@@ -34,6 +34,10 @@ module Regexp::Expression
     alias :starts_at :ts
+    def base_length
+      to_s(:base).length
+    end
     def full_length
       to_s.length
     end
@@ -80,8 +84,12 @@ module Regexp::Expression
       return 1..1 unless quantified?
       min = quantifier.min
       max = quantifier.max < 0 ? Float::INFINITY : quantifier.max
-      # fix Range#minmax - https://bugs.ruby-lang.org/issues/15807
-      (min..max).tap { |r| r.define_singleton_method(:minmax) { [min, max] } }
+      range = min..max
+      # fix Range#minmax on old Rubies - https://bugs.ruby-lang.org/issues/15807
+      if RUBY_VERSION.to_f < 2.7
+        range.define_singleton_method(:minmax) { [min, max] }
+      end
+      range
     end
     def greedy?
@@ -114,23 +122,6 @@ module Regexp::Expression
     alias :to_h :attributes
   end
-  def self.parsed(exp)
-    warn('WARNING: Regexp::Expression::Base.parsed is buggy and '\
-         'will be removed in 2.0.0. Use Regexp::Parser.parse instead.')
-    case exp
-    when String
-      Regexp::Parser.parse(exp)
-    when Regexp
-      Regexp::Parser.parse(exp.source) # <- causes loss of root options
-    when Regexp::Expression            # <- never triggers
-      exp
-    else
-      raise ArgumentError, 'Expression.parsed accepts a String, Regexp, or '\
-                           'a Regexp::Expression as a value for exp, but it '\
-                           "was given #{exp.class.name}."
-    end
-  end
 end # module Regexp::Expression
 require 'regexp_parser/expression/quantifier'

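The reworked `repetitions` hunk above only patches `Range#minmax` on Rubies older than 2.7, which avoids defining a singleton method on ranges that newer Rubies freeze. A minimal sketch of the same guard, assuming a hypothetical free-standing helper `repetition_range` that takes the bounds directly instead of reading them from a quantifier:

```ruby
# Hypothetical helper (not the gem's method): build a repetition range from
# a minimum and a maximum, where a negative maximum means "unbounded".
def repetition_range(min, max)
  max = Float::INFINITY if max < 0
  range = min..max

  # Only old Rubies need the workaround for
  # https://bugs.ruby-lang.org/issues/15807; skipping it elsewhere also
  # avoids modifying Range objects that newer Rubies freeze.
  if RUBY_VERSION.to_f < 2.7
    range.define_singleton_method(:minmax) { [min, max] }
  end

  range
end

p repetition_range(2, 5).minmax
p repetition_range(2, -1).minmax
```

Capturing `min` and `max` in the block's closure is what lets the patched `minmax` answer without iterating over an infinite range.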
data/lib/regexp_parser/expression/classes/group.rb CHANGED

@@ -10,9 +10,24 @@ module Regexp::Expression
       def comment?; false end
     end
-    class Atomic  < Group::Base; end
-    class Passive < Group::Base; end
+    class Passive < Group::Base
+      attr_writer :implicit
+      def to_s(format = :full)
+        if implicit?
+          "#{expressions.join}#{quantifier_affix(format)}"
+        else
+          super
+        end
+      end
+      def implicit?
+        @implicit ||= false
+      end
+    end
     class Absence < Group::Base; end
+    class Atomic  < Group::Base; end
     class Options < Group::Base
       attr_accessor :option_changes
     end

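The `implicit?` flag added to `Group::Passive` above lets a group that the parser inserted for a chained quantifier print without its `(?:` and `)` delimiters. The toy model below illustrates that idea only; `ToyLiteral`, `ToyPassiveGroup` and `build_chained` are hypothetical names and deliberately much simpler than the gem's expression classes.

```ruby
# Toy model (hypothetical classes, not the gem's): an expression renders its
# own text followed by its quantifier text, if any.
class ToyLiteral
  attr_accessor :quantifier_text

  def initialize(text)
    @text = text
  end

  def to_s
    "#{@text}#{quantifier_text}"
  end
end

# A passive group that can be flagged as implicit. An implicit group omits
# the "(?:" and ")" delimiters, so the original source text round-trips.
class ToyPassiveGroup
  attr_accessor :quantifier_text
  attr_writer :implicit

  def initialize(*expressions)
    @expressions = expressions
  end

  def implicit?
    @implicit ||= false
  end

  def to_s
    body = @expressions.join
    body = "(?:#{body})" unless implicit?
    "#{body}#{quantifier_text}"
  end
end

# Models a{10}{4,6}: the inner literal carries {10}, the wrapper carries {4,6}.
def build_chained(implicit)
  inner = ToyLiteral.new('a')
  inner.quantifier_text = '{10}'
  group = ToyPassiveGroup.new(inner)
  group.implicit = implicit
  group.quantifier_text = '{4,6}'
  group
end

puts build_chained(true).to_s
puts build_chained(false).to_s
```

With the flag set, the printed text equals the pattern as originally written; without it, the explicit passive group form is produced.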
data/lib/regexp_parser/expression/classes/root.rb CHANGED

@@ -1,24 +1,12 @@
 module Regexp::Expression
   class Root < Regexp::Expression::Subexpression
-    # TODO: this override is here for backwards compatibility, remove in 2.0.0
-    def initialize(*args)
-      unless args.first.is_a?(Regexp::Token)
-        warn('WARNING: Root.new without a Token argument is deprecated and '\
-             'will be removed in 2.0.0. Use Root.build for the old behavior.')
-        return super(self.class.build_token, *args)
-      end
-      super
+    def self.build(options = {})
+      new(build_token, options)
     end
-    class << self
-      def build(options = {})
-        new(build_token, options)
-      end
-      def build_token
-        Regexp::Token.new(:expression, :root, '', 0)
-      end
+    def self.build_token
+      Regexp::Token.new(:expression, :root, '', 0)
     end
   end
 end

data/lib/regexp_parser/expression/quantifier.rb CHANGED

@@ -40,5 +40,14 @@ module Regexp::Expression
       RUBY
     end
     alias :lazy? :reluctant?
+    def ==(other)
+      other.class == self.class &&
+        other.token == token &&
+        other.mode == mode &&
+        other.min == min &&
+        other.max == max
+    end
+    alias :eq :==
   end
 end

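The `==` method added above compares quantifiers field by field rather than by object identity. A small sketch of that pattern, using a hypothetical `ToyQuantifier` value class with the same four attributes (the real class has more behavior than shown here):

```ruby
# Hypothetical value object (not the gem's Quantifier class) illustrating
# field-by-field equality with an alias.
class ToyQuantifier
  attr_reader :token, :mode, :min, :max

  def initialize(token, mode, min, max)
    @token = token
    @mode  = mode
    @min   = min
    @max   = max
  end

  # Equal when the other object has the same class and the same field values.
  def ==(other)
    other.class == self.class &&
      other.token == token &&
      other.mode == mode &&
      other.min == min &&
      other.max == max
  end
  alias :eq :==
end

a = ToyQuantifier.new(:interval, :greedy, 4, 6)
b = ToyQuantifier.new(:interval, :greedy, 4, 6)
c = ToyQuantifier.new(:interval, :reluctant, 4, 6)

p a == b
p a.eq(c)
p a.equal?(b)
```

Checking the class first means a comparison against an unrelated object returns false instead of raising `NoMethodError`.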
data/lib/regexp_parser/expression/sequence.rb CHANGED

@@ -7,16 +7,6 @@ module Regexp::Expression
   # Used as the base class for the Alternation alternatives, Conditional
   # branches, and CharacterSet::Intersection intersected sequences.
   class Sequence < Regexp::Expression::Subexpression
-    # TODO: this override is here for backwards compatibility, remove in 2.0.0
-    def initialize(*args)
-      if args.count == 3
-        warn('WARNING: Sequence.new without a Regexp::Token argument is '\
-             'deprecated and will be removed in 2.0.0.')
-        return self.class.at_levels(*args)
-      end
-      super
-    end
     class << self
       def add_to(subexpression, params = {}, active_opts = {})
         sequence = at_levels(

data/lib/regexp_parser/lexer.rb CHANGED

@@ -11,11 +11,11 @@ class Regexp::Lexer
   CLOSING_TOKENS = [:close].freeze
-  def self.lex(input, syntax = "ruby/#{RUBY_VERSION}", &block)
-    new.lex(input, syntax, &block)
+  def self.lex(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
+    new.lex(input, syntax, options: options, &block)
   end
-  def lex(input, syntax = "ruby/#{RUBY_VERSION}", &block)
+  def lex(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
     syntax = Regexp::Syntax.new(syntax)
     self.tokens = []
@@ -25,7 +25,7 @@ class Regexp::Lexer
     self.shift = 0
     last = nil
-    Regexp::Scanner.scan(input) do |type, token, text, ts, te|
+    Regexp::Scanner.scan(input, options: options) do |type, token, text, ts, te|
       type, token = *syntax.normalize(type, token)
       syntax.check! type, token
@@ -96,10 +96,10 @@ class Regexp::Lexer
     tokens.pop
     tokens << Regexp::Token.new(:literal, :literal, lead,
-              token.ts, (token.te - last.bytesize),
+              token.ts, (token.te - last.length),
               nesting, set_nesting, conditional_nesting)
     tokens << Regexp::Token.new(:literal, :literal, last,
-              (token.ts + lead.bytesize), token.te,
+              (token.ts + lead.length), token.te,
               nesting, set_nesting, conditional_nesting)
   end

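The switch from `bytesize` to `length` in the last hunk above is what moves the lexer's split literals from byte-based to char-based positions. The difference only shows up with multibyte characters, as this sketch illustrates; `split_last_char` is a hypothetical helper that mimics the lead/last arithmetic and is not the gem's code.

```ruby
# Hypothetical helper: split the last character off a literal run starting
# at offset ts and return [text, ts, te] triples for both parts. `unit` is
# :length (characters) or :bytesize (bytes).
def split_last_char(text, ts, unit)
  lead = text[0...-1]
  last = text[-1]
  te   = ts + text.public_send(unit)

  [
    [lead, ts, te - last.public_send(unit)],
    [last, ts + lead.public_send(unit), te]
  ]
end

# "\u00F1" is a single character that takes two bytes in UTF-8.
sample = "\u00F1b"

p split_last_char(sample, 0, :length)
p split_last_char(sample, 0, :bytesize)
```

For ASCII-only input both units give identical offsets, which matches the changelog's note that returned values only changed for Regexps containing multibyte chars.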
data/lib/regexp_parser/parser.rb CHANGED

@@ -18,12 +18,12 @@ class Regexp::Parser
     end
   end
-  def self.parse(input, syntax = "ruby/#{RUBY_VERSION}", &block)
-    new.parse(input, syntax, &block)
+  def self.parse(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
+    new.parse(input, syntax, options: options, &block)
   end
-  def parse(input, syntax = "ruby/#{RUBY_VERSION}", &block)
-    root = Root.build(options_from_input(input))
+  def parse(input, syntax = "ruby/#{RUBY_VERSION}", options: nil, &block)
+    root = Root.build(extract_options(input, options))
     self.root = root
     self.node = root
@@ -35,7 +35,7 @@ class Regexp::Parser
     self.captured_group_counts = Hash.new(0)
-    Regexp::Lexer.scan(input, syntax) do |token|
+    Regexp::Lexer.scan(input, syntax, options: options) do |token|
       parse_token(token)
     end
@@ -54,14 +54,20 @@ class Regexp::Parser
                 :options_stack, :switching_options, :conditional_nesting,
                 :captured_group_counts
-  def options_from_input(input)
-    return {} unless input.is_a?(::Regexp)
+  def extract_options(input, options)
+    if options && !input.is_a?(String)
+      raise ArgumentError, 'options cannot be supplied unless parsing a String'
+    end
+    options = input.options if input.is_a?(::Regexp)
-    options = {}
-    options[:i] = true if input.options & ::Regexp::IGNORECASE != 0
-    options[:m] = true if input.options & ::Regexp::MULTILINE  != 0
-    options[:x] = true if input.options & ::Regexp::EXTENDED   != 0
-    options
+    return {} unless options
+    enabled_options = {}
+    enabled_options[:i] = true if options & ::Regexp::IGNORECASE != 0
+    enabled_options[:m] = true if options & ::Regexp::MULTILINE  != 0
+    enabled_options[:x] = true if options & ::Regexp::EXTENDED   != 0
+    enabled_options
   end
   def nest(exp)
@@ -432,6 +438,28 @@ class Regexp::Parser
     target_node || raise(ArgumentError, 'No valid target found for '\
                                         "'#{token.text}' ")
+    # in case of chained quantifiers, wrap target in an implicit passive group
+    # description of the problem: https://github.com/ammar/regexp_parser/issues/3
+    # rationale for this solution: https://github.com/ammar/regexp_parser/pull/69
+    if target_node.quantified?
+      new_token = Regexp::Token.new(
+        :group,
+        :passive,
+        '', # text
+        target_node.ts,
+        nil, # te (unused)
+        target_node.level,
+        target_node.set_level,
+        target_node.conditional_level
+      )
+      new_group = Group::Passive.new(new_token, active_opts)
+      new_group.implicit = true
+      new_group << target_node
+      increase_level(target_node)
+      node.expressions[offset] = new_group
+      target_node = new_group
+    end
     case token.token
     when :zero_or_one
       target_node.quantify(:zero_or_one, token.text, 0, 1, :greedy)
@@ -462,6 +490,11 @@ class Regexp::Parser
     end
   end
+  def increase_level(exp)
+    exp.level += 1
+    exp.respond_to?(:each) && exp.each { |subexp| increase_level(subexp) }
+  end
   def interval(target_node, token)
     text = token.text
     mchr = text[text.length-1].chr =~ /[?+]/ ? text[text.length-1].chr : nil