RubyGems - regexp_parser - Versions diffs - 2.4.0 → 2.7.0 - Mend

regexp_parser 2.4.0 → 2.7.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (31)

checksums.yaml +4 -4
data/CHANGELOG.md +98 -42
data/README.md +46 -30
data/lib/regexp_parser/expression/base.rb +17 -9
data/lib/regexp_parser/expression/classes/backreference.rb +19 -2
data/lib/regexp_parser/expression/classes/{type.rb → character_type.rb} +0 -0
data/lib/regexp_parser/expression/classes/conditional.rb +8 -0
data/lib/regexp_parser/expression/classes/escape_sequence.rb +1 -1
data/lib/regexp_parser/expression/classes/group.rb +10 -0
data/lib/regexp_parser/expression/classes/keep.rb +2 -0
data/lib/regexp_parser/expression/classes/root.rb +3 -5
data/lib/regexp_parser/expression/classes/{property.rb → unicode_property.rb} +1 -0
data/lib/regexp_parser/expression/methods/construct.rb +43 -0
data/lib/regexp_parser/expression/methods/human_name.rb +43 -0
data/lib/regexp_parser/expression/methods/match_length.rb +9 -5
data/lib/regexp_parser/expression/methods/traverse.rb +6 -3
data/lib/regexp_parser/expression/quantifier.rb +6 -5
data/lib/regexp_parser/expression/sequence.rb +6 -21
data/lib/regexp_parser/expression/shared.rb +20 -3
data/lib/regexp_parser/expression/subexpression.rb +4 -1
data/lib/regexp_parser/expression.rb +4 -2
data/lib/regexp_parser/lexer.rb +61 -29
data/lib/regexp_parser/parser.rb +36 -26
data/lib/regexp_parser/scanner/property.rl +1 -1
data/lib/regexp_parser/scanner/scanner.rl +57 -42
data/lib/regexp_parser/scanner.rb +873 -823
data/lib/regexp_parser/syntax/token/escape.rb +1 -1
data/lib/regexp_parser/syntax/version_lookup.rb +0 -8
data/lib/regexp_parser/syntax/versions.rb +2 -0
data/lib/regexp_parser/version.rb +1 -1
metadata +7 -5

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 8b84a4bb274f31b8608c7dc9d55ff6f1b8d92d0d147976f38079ae7701a6debe
-  data.tar.gz: 41db5f094d0beafade30a1fac2707cbc827831e818c485ad35d7173f18c6a91a
+  metadata.gz: 04af46818e9d560362fea9b3fd24802b557ac145ed95f6e02580dd7cf5e8ddfc
+  data.tar.gz: 75b7d30241f48ddf90c8cd68228fa928904ab6055ea755f4bdcf28361e645a4b
 SHA512:
-  metadata.gz: 5dcde6135ac42db609402e47e04ee3be1da8854de286d2baad15dafee04d451814fd7a3bae7adc5440a1fced811e242b69f5fd14bcfc4f3bd5091f86769d56be
-  data.tar.gz: 2660d0fb28a972a1de53b71b16f8591e573d4214724b5eea8a452549598ff5d0fc5b731149e8332f65bce01c812f4d0d72135bba7e3016064d9f05202a8b5580
+  metadata.gz: 407025a9b14af76463260fca2a48f9fef4ab863e3dddf3f7f54101c1348611afa49d9973e850d9e1c84d6e5faf8f1a9d3d2da5dceaefe8dc4fefe7069ecd9280
+  data.tar.gz: 9f3d2eb4264318511a82e9034c4c4a8a8e73e67e427945f0c9f745fd37b2f2f0ae8e30ba942f0920da3109b59436a5518dfc5e2f7669317de0214a0deb6f0e07

data/CHANGELOG.md CHANGED

@@ -1,33 +1,99 @@
-## [Unreleased]
+# Changelog
+All notable changes to this project will be documented in this file.
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+## [2.7.0] - 2023-02-08 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Added
+- `Regexp::Lexer.lex` now streams tokens when called with a block
+  - it can now take arbitrarily large input, just like `Regexp::Scanner`
+  - this also slightly improves `Regexp::Parser.parse` performance
+  - note: `Regexp::Parser.parse` still does not and will not support streaming
+- improved performance of `Subexpression#each_expression`
+- minor improvements to `Regexp::Scanner` performance
+- overall improvement of parse performance: about 10% for large Regexps
+### Fixed
+- parsing of octal escape sequences in sets, e.g. `[\141]`
+  * thanks to [Randy Stauner](https://github.com/rwstauner) for the report
+## [2.6.2] - 2023-01-19 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Fixed
+- fixed `SystemStackError` when cloning recursive subexpression calls
+  * e.g. `Regexp::Parser.parse(/a|b\g<0>/).dup`
+## [2.6.1] - 2022-11-16 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Fixed
+- fixed scanning of two negative lookbehind edge cases
+  * `(?<!x)y>` used to raise a ScannerError
+  * `(?<!x>)y` used to be misinterpreted as a named group
+  * thanks to [Sergio Medina](https://github.com/serch) for the report
+## [2.6.0] - 2022-09-26 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Fixed
+- fixed `#referenced_expression` for `\g<0>` (was `nil`, is now the `Root` exp)
+- fixed `#reference`, `#referenced_expression` for recursion level backrefs
+  * e.g. `(a)(b)\k<-1+1>`
+  * `#referenced_expression` was `nil`, now it is the correct `Group` exp
+- detect and raise for two more syntax errors when parsing String input
+  * quantification of option switches (e.g. `(?i)+`)
+  * invalid references (e.g. `/\k<1>/`)
+  * these are a `SyntaxError` in Ruby, so could only be passed as a String
+### Added
+- `Regexp::Expression::Base#human_name`
+  * returns a nice, human-readable description of the expression
+- `Regexp::Expression::Base#optional?`
+  * returns `true` if the expression is quantified accordingly (e.g. with `*`, `{,n}`)
+- added a deprecation warning when calling `#to_re` on set members
+## [2.5.0] - 2022-05-27 - [Janosch Müller](mailto:janosch84@gmail.com)
+### Added
+- `Regexp::Expression::Base.construct` and `.token_class` methods
+  * see the [wiki](https://github.com/ammar/regexp_parser/wiki) for details
 ## [2.4.0] - 2022-05-09 - [Janosch Müller](mailto:janosch84@gmail.com)
 ### Fixed
 - fixed interpretation of `+` and `?` after interval quantifiers (`{n,n}`)
-  - they used to be treated as reluctant or possessive mode indicators
-  - however, Ruby does not support these modes for interval quantifiers
-  - they are now treated as chained quantifiers instead, as Ruby does it
-  - c.f. [#3](https://github.com/ammar/regexp_parser/issues/3)
+  * they used to be treated as reluctant or possessive mode indicators
+  * however, Ruby does not support these modes for interval quantifiers
+  * they are now treated as chained quantifiers instead, as Ruby does it
+  * c.f. [#3](https://github.com/ammar/regexp_parser/issues/3)
 - fixed `Expression::Base#nesting_level` for some tree rewrite cases
-  - e.g. the alternatives in `/a|[b]/` had an inconsistent nesting_level
+  * e.g. the alternatives in `/a|[b]/` had an inconsistent nesting_level
 - fixed `Scanner` accepting invalid posix classes, e.g. `[[:foo:]]`
-  - they raise a `SyntaxError` when used in a Regexp, so could only be passed as String
-  - they now raise a `Regexp::Scanner::ValidationError` in the `Scanner`
+  * they raise a `SyntaxError` when used in a Regexp, so could only be passed as String
+  * they now raise a `Regexp::Scanner::ValidationError` in the `Scanner`
 ### Added
 - added `Expression::Base#==` for (deep) comparison of expressions
 - added `Expression::Base#parts`
-  - returns the text elements and subexpressions of an expression
-  - e.g. `parse(/(a)/)[0].parts # => ["(", #<Literal @text="a"...>, ")"]`
+  * returns the text elements and subexpressions of an expression
+  * e.g. `parse(/(a)/)[0].parts # => ["(", #<Literal @text="a"...>, ")"]`
 - added `Expression::Base#te` (a.k.a. token end index)
-  - `Expression::Subexpression` always had `#te`, only terminal nodes lacked it so far
+  * `Expression::Subexpression` always had `#te`, only terminal nodes lacked it so far
 - made some `Expression::Base` methods available on `Quantifier` instances, too
-  - `#type`, `#type?`, `#is?`, `#one_of?`, `#options`, `#terminal?`
-  - `#base_length`, `#full_length`, `#starts_at`, `#te`, `#ts`, `#offset`
-  - `#conditional_level`, `#level`, `#nesting_level` , `#set_level`
-  - this allows a more unified handling with `Expression::Base` instances
+  * `#type`, `#type?`, `#is?`, `#one_of?`, `#options`, `#terminal?`
+  * `#base_length`, `#full_length`, `#starts_at`, `#te`, `#ts`, `#offset`
+  * `#conditional_level`, `#level`, `#nesting_level` , `#set_level`
+  * this allows a more unified handling with `Expression::Base` instances
 - allowed `Quantifier#initialize` to take a token and options Hash like other nodes
 - added a deprecation warning for initializing Quantifiers with 4+ arguments:
@@ -36,10 +102,12 @@
     It will no longer be supported in regexp_parser v3.0.0.
-    Please pass a Regexp::Token instead, e.g. replace `type, text, min, max, mode`
-    with `::Regexp::Token.new(:quantifier, type, text)`. min, max, and mode
+    Please pass a Regexp::Token instead, e.g. replace `token, text, min, max, mode`
+    with `::Regexp::Token.new(:quantifier, token, text)`. min, max, and mode
     will be derived automatically.
+    Or do `exp.quantifier = Quantifier.construct(token: token, text: str)`.
     This is consistent with how Expression::Base instances are created.
@@ -48,18 +116,18 @@
 ### Fixed
 - removed five inexistent unicode properties from `Syntax#features`
-  - these were never supported by Ruby or the `Regexp::Scanner`
-  - thanks to [Markus Schirp](https://github.com/mbj) for the report
+  * these were never supported by Ruby or the `Regexp::Scanner`
+  * thanks to [Markus Schirp](https://github.com/mbj) for the report
 ## [2.3.0] - 2022-04-08 - [Janosch Müller](mailto:janosch84@gmail.com)
 ### Added
 - improved parsing performance through `Syntax` refactoring
-  - instead of fresh `Syntax` instances, pre-loaded constants are now re-used
-  - this approximately doubles the parsing speed for simple regexps
+  * instead of fresh `Syntax` instances, pre-loaded constants are now re-used
+  * this approximately doubles the parsing speed for simple regexps
 - added methods to `Syntax` classes to show relative feature sets
-  - e.g. `Regexp::Syntax::V3_2_0.added_features`
+  * e.g. `Regexp::Syntax::V3_2_0.added_features`
 - support for new unicode properties of Ruby 3.2 / Unicode 14.0
 ## [2.2.1] - 2022-02-11 - [Janosch Müller](mailto:janosch84@gmail.com)
@@ -67,14 +135,14 @@
 ### Fixed
 - fixed Syntax version of absence groups (`(?~...)`)
-  - the lexer accepted them for any Ruby version
-  - now they are only recognized for Ruby >= 2.4.1 in which they were introduced
+  * the lexer accepted them for any Ruby version
+  * now they are only recognized for Ruby >= 2.4.1 in which they were introduced
 - reduced gem size by excluding specs from package
 - removed deprecated `test_files` gemspec setting
 - no longer depend on `yaml`/`psych` (except for Ruby <= 2.4)
 - no longer depend on `set`
-  - `set` was removed from the stdlib and made a standalone gem as of Ruby 3
-  - this made it a hidden/undeclared dependency of `regexp_parser`
+  * `set` was removed from the stdlib and made a standalone gem as of Ruby 3
+  * this made it a hidden/undeclared dependency of `regexp_parser`
 ## [2.2.0] - 2021-12-04 - [Janosch Müller](mailto:janosch84@gmail.com)
@@ -312,8 +380,8 @@
 - Fixed missing quantifier in `Conditional::Expression` methods `#to_s`, `#to_re`
 - `Conditional::Condition` no longer lives outside the recursive `#expressions` tree
-  - it used to be the only expression stored in a custom ivar, complicating traversal
-  - its setter and getter (`#condition=`, `#condition`) still work as before
+  * it used to be the only expression stored in a custom ivar, complicating traversal
+  * its setter and getter (`#condition=`, `#condition`) still work as before
 ## [1.1.0] - 2018-09-17 - [Janosch Müller](mailto:janosch84@gmail.com)
@@ -321,8 +389,8 @@
 - Added `Quantifier` methods `#greedy?`, `#possessive?`, `#reluctant?`/`#lazy?`
 - Added `Group::Options#option_changes`
-  - shows the options enabled or disabled by the given options group
-  - as with all other expressions, `#options` shows the overall active options
+  * shows the options enabled or disabled by the given options group
+  * as with all other expressions, `#options` shows the overall active options
 - Added `Conditional#reference` and `Condition#reference`, indicating the determinative group
 - Added `Subexpression#dig`, acts like [`Array#dig`](http://ruby-doc.org/core-2.5.0/Array.html#method-i-dig)
@@ -506,7 +574,6 @@ This release includes several breaking changes, mostly to character sets, #map a
   * Fixed scanning of zero length comments (PR #12)
   * Fixed missing escape:codepoint_list syntax token (PR #14)
   * Fixed to_s for modified interval quantifiers (PR #17)
-- Added a note about MRI implementation quirks to Scanner section
 ## [0.3.2] - 2016-01-01 - [Ammar Ali](mailto:ammarabuali@gmail.com)
@@ -532,7 +599,6 @@ This release includes several breaking changes, mostly to character sets, #map a
 - Renamed Lexer's method to lex, added an alias to the old name (scan)
 - Use #map instead of #each to run the block in Lexer.lex.
 - Replaced VERSION.yml file with a constant.
-- Updated README
 - Update tokens and scanner with new additions in Unicode 7.0.
 ## [0.1.6] - 2014-10-06 - [Ammar Ali](mailto:ammarabuali@gmail.com)
@@ -542,20 +608,11 @@ This release includes several breaking changes, mostly to character sets, #map a
 - Added syntax files for missing ruby 2.x versions. These do not add
   extra syntax support, they just make the gem work with the newer
   ruby versions.
-- Added .travis.yml to project root.
-- README:
-  - Removed note purporting runtime support for ruby 1.8.6.
-  - Added a section identifying the main unsupported syntax features.
-  - Added sections for Testing and Building
-  - Added badges for gem version, Travis CI, and code climate.
-- Updated README, fixing broken examples, and converting it from a rdoc file to Github's flavor of Markdown.
 - Fixed a parser bug where an alternation sequence that contained nested expressions was incorrectly being appended to the parent expression when the nesting was exited. e.g. in /a|(b)c/, c was appended to the root.
 - Fixed a bug where character types were not being correctly scanned within character sets. e.g. in [\d], two tokens were scanned; one for the backslash '\' and one for the 'd'
 ## [0.1.5] - 2014-01-14 - [Ammar Ali](mailto:ammarabuali@gmail.com)
-- Correct ChangeLog.
 - Added syntax stubs for ruby versions 2.0 and 2.1
 - Added clone methods for deep copying expressions.
 - Added optional format argument for to_s on expressions to return the text of the expression with (:full, the default) or without (:base) its quantifier.
@@ -564,7 +621,6 @@ This release includes several breaking changes, mostly to character sets, #map a
 - Improved EOF handling in general and especially from sequences like hex and control escapes.
 - Fixed a bug where named groups with an empty name would return a blank token [].
 - Fixed a bug where member of a parent set where being added to its last subset.
-- Various code cleanups in scanner.rl
 - Fixed a few mutable string bugs by calling dup on the originals.
 - Made ruby 1.8.6 the base for all 1.8 syntax, and the 1.8 name a pointer to the latest (1.8.7 at this time)
 - Removed look-behind assertions (positive and negative) from 1.8 syntax
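
Reviewer note on the 2.7.0 changelog entry above: it states that `Regexp::Lexer.lex` now streams tokens when called with a block. The following is a minimal, gem-free sketch of that calling convention, not the gem's implementation. `TinyLexer`, its `Token` struct, and the one-character tokenization are hypothetical and exist only for illustration.

```ruby
# Hypothetical mini-lexer (NOT regexp_parser code): shows how one method can
# either stream tokens to a block or collect them into an Array.
module TinyLexer
  Token = Struct.new(:text, :ts, :te)

  def self.lex(input, &block)
    collected = []
    input.each_char.with_index do |char, index|
      token = Token.new(char, index, index + 1)
      # With a block, hand each token over immediately and keep nothing.
      block ? block.call(token) : collected << token
    end
    block ? nil : collected
  end
end

# Streaming use: tokens are processed one at a time.
TinyLexer.lex('ab?') { |token| puts "#{token.ts}..#{token.te}: #{token.text}" }

# Collecting use: the full token list is returned.
all_tokens = TinyLexer.lex('ab?')
```

In the block form nothing is accumulated, which is the property that would allow arbitrarily large input; the return value of the block form in this sketch (`nil`) is an assumption and may differ from what the gem returns.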

data/README.md CHANGED

@@ -9,8 +9,8 @@ A Ruby gem for tokenizing, parsing, and transforming regular expressions.
 * Multilayered
   * A scanner/tokenizer based on [Ragel](http://www.colm.net/open-source/ragel/)
-  * A lexer that produces a "stream" of token objects.
-  * A parser that produces a "tree" of Expression objects (OO API)
+  * A lexer that produces a "stream" of [Token objects](https://github.com/ammar/regexp_parser/wiki/Token-Objects)
+  * A parser that produces a "tree" of [Expression objects (OO API)](https://github.com/ammar/regexp_parser/wiki/Expression-Objects)
 * Runs on Ruby 2.x, 3.x and JRuby runtimes
 * Recognizes Ruby 1.8, 1.9, 2.x and 3.x regular expressions [See Supported Syntax](#supported-syntax)
@@ -36,14 +36,15 @@ Or, add it to your project's `Gemfile`:
 ```gem 'regexp_parser', '~> X.Y.Z'```
-See rubygems for the the [latest version number](https://rubygems.org/gems/regexp_parser)
+See the badge at the top of this README or [rubygems](https://rubygems.org/gems/regexp_parser)
+for the the latest version number.
 ---
 ## Usage
 The three main modules are **Scanner**, **Lexer**, and **Parser**. Each of them
-provides a single method that takes a regular expression (as a RegExp object or
+provides a single method that takes a regular expression (as a Regexp object or
 a string) and returns its results. The **Lexer** and the **Parser** accept an
 optional second argument that specifies the syntax version, like 'ruby/2.0',
 which defaults to the host Ruby version (using RUBY_VERSION).
@@ -79,7 +80,7 @@ All three methods accept either a `Regexp` or `String` (containing the pattern)
 require 'regexp_parser'
 Regexp::Parser.parse(
-  "a+ # Recognises a and A...",
+  "a+ # Recognizes a and A...",
   options: ::Regexp::EXTENDED | ::Regexp::IGNORECASE
 )
 ```
@@ -101,7 +102,7 @@ start/end offsets for each token found.
 ```ruby
 require 'regexp_parser'
-Regexp::Scanner.scan /(ab?(cd)*[e-h]+)/  do |type, token, text, ts, te|
+Regexp::Scanner.scan(/(ab?(cd)*[e-h]+)/) do |type, token, text, ts, te|
   puts "type: #{type}, token: #{token}, text: '#{text}' [#{ts}..#{te}]"
 end
@@ -124,7 +125,7 @@ A one-liner that uses map on the result of the scan to return the textual
 parts of the pattern:
 ```ruby
-Regexp::Scanner.scan( /(cat?([bhm]at)){3,5}/ ).map {|token| token[2]}
+Regexp::Scanner.scan(/(cat?([bhm]at)){3,5}/).map { |token| token[2] }
 #=> ["(", "cat", "?", "(", "[", "b", "h", "m", "]", "at", ")", ")", "{3,5}"]
 ```
@@ -220,7 +221,7 @@ syntax, and prints the token objects' text indented to their level.
 ```ruby
 require 'regexp_parser'
-Regexp::Lexer.lex /a?(b(c))*[d]+/, 'ruby/1.9' do |token|
+Regexp::Lexer.lex(/a?(b(c))*[d]+/, 'ruby/1.9') do |token|
   puts "#{'  ' * token.level}#{token.text}"
 end
@@ -246,7 +247,7 @@ how the sequence 'cat' is treated. The 't' is separated because it's followed
 by a quantifier that only applies to it.
 ```ruby
-Regexp::Lexer.scan( /(cat?([b]at)){3,5}/ ).map {|token| token.text}
+Regexp::Lexer.scan(/(cat?([b]at)){3,5}/).map { |token| token.text }
 #=> ["(", "ca", "t", "?", "(", "[", "b", "]", "at", ")", ")", "{3,5}"]
 ```
@@ -274,7 +275,7 @@ require 'regexp_parser'
 regex = /a?(b+(c)d)*(?<name>[0-9]+)/
-tree = Regexp::Parser.parse( regex, 'ruby/2.1' )
+tree = Regexp::Parser.parse(regex, 'ruby/2.1')
 tree.traverse do |event, exp|
   puts "#{event}: #{exp.type} `#{exp.to_s}`"
@@ -355,7 +356,7 @@ _Note that not all of these are available in all versions of Ruby_
 | &emsp;&emsp;_Nest Level_              | `\k<n-1>`                                               | &#x2713; |
 | &emsp;&emsp;_Numbered_                | `\k<1>`                                                 | &#x2713; |
 | &emsp;&emsp;_Relative_                | `\k<-2>`                                                | &#x2713; |
-| &emsp;&emsp;_Traditional_             | `\1` thru `\9`                                          | &#x2713; |
+| &emsp;&emsp;_Traditional_             | `\1` through `\9`                                       | &#x2713; |
 | &emsp;&nbsp;_**Capturing**_           | `(abc)`                                                 | &#x2713; |
 | &emsp;&nbsp;_**Comments**_            | `(?# comment text)`                                     | &#x2713; |
 | &emsp;&nbsp;_**Named**_               | `(?<name>abc)`, `(?'name'abc)`                          | &#x2713; |
@@ -375,7 +376,7 @@ _Note that not all of these are available in all versions of Ruby_
 | &emsp;&nbsp;_**Meta** \[2\]_          | `\M-c`, `\M-\C-C`, `\M-\cC`, `\C-\M-C`, `\c\M-C`        | &#x2713; |
 | &emsp;&nbsp;_**Octal**_               | `\0`, `\01`, `\012`                                     | &#x2713; |
 | &emsp;&nbsp;_**Unicode**_             | `\uHHHH`, `\u{H+ H+}`                                   | &#x2713; |
-| **Unicode Properties**                | _<sub>([Unicode 13.0.0](https://www.unicode.org/versions/Unicode13.0.0/))</sub>_ | &#x22f1; |
+| **Unicode Properties**                | _<sub>([Unicode 13.0.0])</sub>_                         | &#x22f1; |
 | &emsp;&nbsp;_**Age**_                 | `\p{Age=5.2}`, `\P{age=7.0}`, `\p{^age=8.0}`            | &#x2713; |
 | &emsp;&nbsp;_**Blocks**_              | `\p{InArmenian}`, `\P{InKhmer}`, `\p{^InThai}`          | &#x2713; |
 | &emsp;&nbsp;_**Classes**_             | `\p{Alpha}`, `\P{Space}`, `\p{^Alnum}`                  | &#x2713; |
@@ -384,13 +385,17 @@ _Note that not all of these are available in all versions of Ruby_
 | &emsp;&nbsp;_**Scripts**_             | `\p{Arabic}`, `\P{Hiragana}`, `\p{^Greek}`              | &#x2713; |
 | &emsp;&nbsp;_**Simple**_              | `\p{Dash}`, `\p{Extender}`, `\p{^Hyphen}`               | &#x2713; |
-**\[1\]**: Ruby does not support lazy or possessive interval quantifiers. Any `+` or `?` that follows an interval
-quantifier will be treated as another, chained quantifier. See also [#3](https://github.com/ammar/regexp_parser/issue/3),
+[Unicode 13.0.0]: https://www.unicode.org/versions/Unicode13.0.0/
+**\[1\]**: Ruby does not support lazy or possessive interval quantifiers.
+Any `+` or `?` that follows an interval quantifier will be treated as another,
+chained quantifier. See also [#3](https://github.com/ammar/regexp_parser/issue/3),
 [#69](https://github.com/ammar/regexp_parser/pull/69).
-**\[2\]**: As of Ruby 3.1, meta and control sequences are [pre-processed to hex escapes when used in Regexp literals](
- https://github.com/ruby/ruby/commit/11ae581a4a7f5d5f5ec6378872eab8f25381b1b9 ), so they will only reach the
-scanner and will only be emitted if a String or a Regexp that has been built with the `::new` constructor is scanned.
+**\[2\]**: As of Ruby 3.1, meta and control sequences are [pre-processed to hex
+escapes when used in Regexp literals](https://github.com/ruby/ruby/commit/11ae581),
+so they will only reach the scanner and will only be emitted if a String or a Regexp
+that has been built with the `::new` constructor is scanned.
 ##### Inapplicable Features
@@ -407,25 +412,27 @@ expressions library (Onigmo). They are not supported by the scanner.
 See something missing? Please submit an [issue](https://github.com/ammar/regexp_parser/issues)
-_**Note**: Attempting to process expressions with unsupported syntax features can raise an error,
-or incorrectly return tokens/objects as literals._
+_**Note**: Attempting to process expressions with unsupported syntax features can raise
+an error, or incorrectly return tokens/objects as literals._
 ## Testing
 To run the tests simply run rake from the root directory.
-The default task generates the scanner's code from the Ragel source files and runs all the specs, thus it requires Ragel to be installed.
+The default task generates the scanner's code from the Ragel source files and runs
+all the specs, thus it requires Ragel to be installed.
-Note that changes to Ragel files will not be reflected when running `rspec` on its own, so to run individual tests you might want to run:
+Note that changes to Ragel files will not be reflected when running `rspec` on its own,
+so to run individual tests you might want to run:
 ```
 rake ragel:rb && rspec spec/scanner/properties_spec.rb
 ```
 ## Building
-Building the scanner and the gem requires [Ragel](http://www.colm.net/open-source/ragel/) to be
-installed. The build tasks will automatically invoke the 'ragel:rb' task to generate the
-Ruby scanner code.
+Building the scanner and the gem requires [Ragel](http://www.colm.net/open-source/ragel/)
+to be installed. The build tasks will automatically invoke the 'ragel:rb' task to generate
+the Ruby scanner code.
 The project uses the standard rubygems package tasks, so:
@@ -445,17 +452,26 @@ rake install
 ## Example Projects
 Projects using regexp_parser.
-- [capybara](https://github.com/teamcapybara/capybara) is an integration testing tool that uses regexp_parser to convert Regexps to css/xpath selectors.
+- [capybara](https://github.com/teamcapybara/capybara) is an integration testing tool
+that uses regexp_parser to convert Regexps to css/xpath selectors.
+- [js_regex](https://github.com/jaynetics/js_regex) converts Ruby regular expressions
+to JavaScript-compatible regular expressions.
-- [js_regex](https://github.com/janosch-x/js_regex) converts Ruby regular expressions to JavaScript-compatible regular expressions.
+- [meta_re](https://github.com/ammar/meta_re) is a regular expression preprocessor
+with alias support.
-- [meta_re](https://github.com/ammar/meta_re) is a regular expression preprocessor with alias support.
+- [mutant](https://github.com/mbj/mutant) manipulates your regular expressions
+(amongst others) to see if your tests cover their behavior.
-- [mutant](https://github.com/mbj/mutant) manipulates your regular expressions (amongst others) to see if your tests cover their behavior.
+- [repper](https://github.com/jaynetics/repper) is a regular expression
+pretty-printer and formatter for Ruby.
-- [rubocop](https://github.com/rubocop-hq/rubocop) is a linter for Ruby that uses regexp_parser to lint Regexps.
+- [rubocop](https://github.com/rubocop-hq/rubocop) is a linter for Ruby that
+uses regexp_parser to lint Regexps.
-- [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) is a localization helper that uses regexp_parser to generate examples of postal codes.
+- [twitter-cldr-rb](https://github.com/twitter/twitter-cldr-rb) is a localization helper
+that uses regexp_parser to generate examples of postal codes.
 ## References

data/lib/regexp_parser/expression/base.rb CHANGED

@@ -14,6 +14,10 @@ module Regexp::Expression
     end
     def to_re(format = :full)
+      if set_level > 0
+        warn "Calling #to_re on character set members is deprecated - "\
+             "their behavior might not be equivalent outside of the set."
+      end
       ::Regexp.new(to_s(format))
     end
@@ -32,15 +36,19 @@ module Regexp::Expression
     end
     def repetitions
-      return 1..1 unless quantified?
-      min = quantifier.min
-      max = quantifier.max < 0 ? Float::INFINITY : quantifier.max
-      range = min..max
-      # fix Range#minmax on old Rubies - https://bugs.ruby-lang.org/issues/15807
-      if RUBY_VERSION.to_f < 2.7
-        range.define_singleton_method(:minmax) { [min, max] }
-      end
-      range
+      @repetitions ||=
+        if quantified?
+          min = quantifier.min
+          max = quantifier.max < 0 ? Float::INFINITY : quantifier.max
+          range = min..max
+          # fix Range#minmax on old Rubies - https://bugs.ruby-lang.org/issues/15807
+          if RUBY_VERSION.to_f < 2.7
+            range.define_singleton_method(:minmax) { [min, max] }
+          end
+          range
+        else
+          1..1
+        end
     end
     def greedy?
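
Reviewer note on the `repetitions` hunk above: the method now caches its result with `@repetitions ||=`. A minimal sketch of the same pattern in isolation, assuming a stand-in `QuantifiedNode` class and `Quantifier` struct that are hypothetical and not part of the gem:

```ruby
# Hypothetical stand-in (NOT regexp_parser code) for an expression node that
# memoizes its repetition range.
class QuantifiedNode
  Quantifier = Struct.new(:min, :max)

  def initialize(quantifier = nil)
    @quantifier = quantifier
  end

  def quantified?
    !@quantifier.nil?
  end

  # Computed once; later calls return the same Range object.
  def repetitions
    @repetitions ||=
      if quantified?
        # A negative max stands for "unbounded", as in the hunk above.
        max = @quantifier.max < 0 ? Float::INFINITY : @quantifier.max
        @quantifier.min..max
      else
        1..1
      end
  end
end

plain = QuantifiedNode.new
star  = QuantifiedNode.new(QuantifiedNode::Quantifier.new(0, -1))
puts plain.repetitions.inspect
puts star.repetitions.inspect
```

One consequence of this pattern is that the cached range goes stale if the quantifier is replaced later, unless the cache is reset at that point. Whether and where the gem resets it is not visible in this hunk.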

data/lib/regexp_parser/expression/classes/backreference.rb CHANGED

@@ -1,12 +1,29 @@
 module Regexp::Expression
+  # TODO: unify name with token :backref, one way or the other, in v3.0.0
   module Backreference
     class Base < Regexp::Expression::Base
       attr_accessor :referenced_expression
       def initialize_copy(orig)
-        self.referenced_expression = orig.referenced_expression.dup
+        exp_id = [self.class, self.starts_at]
+        # prevent infinite recursion for recursive subexp calls
+        copied = @@copied ||= {}
+        self.referenced_expression =
+          if copied[exp_id]
+            orig.referenced_expression
+          else
+            copied[exp_id] = true
+            orig.referenced_expression.dup
+          end
+        copied.clear
         super
       end
+      def referential?
+        true
+      end
     end
     class Number < Backreference::Base
@@ -38,7 +55,7 @@ module Regexp::Expression
     class NameCall           < Backreference::Name; end
     class NumberCallRelative < Backreference::NumberRelative; end
-    class NumberRecursionLevel < Backreference::Number
+    class NumberRecursionLevel < Backreference::NumberRelative
       attr_reader :recursion_level
       def initialize(token, options = {})
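
Reviewer note on the `initialize_copy` hunk above: the change guards against infinite recursion when a copied node refers back to an expression that contains it (the changelog's `SystemStackError` on `\g<0>`). The sketch below shows the general idea with a different bookkeeping technique than the hunk uses: a thread-local hash keyed by `object_id` with `ensure` cleanup, instead of a class variable keyed by class and start offset. `RefNode` and the `:ref_node_copying` key are hypothetical names.

```ruby
# Hypothetical node type (NOT regexp_parser code) whose #target may point
# back at the node itself, forming a cycle.
class RefNode
  attr_accessor :name, :target

  def initialize(name)
    @name = name
  end

  def initialize_copy(orig)
    in_progress = (Thread.current[:ref_node_copying] ||= {})
    if in_progress[orig.object_id]
      # Already copying this node further up the call stack:
      # share the reference instead of recursing again.
      self.target = orig.target
    else
      in_progress[orig.object_id] = true
      begin
        self.target = orig.target && orig.target.dup
      ensure
        in_progress.delete(orig.object_id)
      end
    end
    super
  end
end

looped = RefNode.new('loop')
looped.target = looped # self-reference, comparable to a recursive call
copy = looped.dup      # terminates instead of raising SystemStackError
puts copy.target.name
```

The copy made at the point where the cycle is detected keeps a reference to the original target, so the result is not a fully independent deep copy. That trade-off is what makes the recursion finite.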

data/lib/regexp_parser/expression/classes/{type.rb → character_type.rb} RENAMED

File without changes

data/lib/regexp_parser/expression/classes/conditional.rb CHANGED

@@ -20,6 +20,10 @@ module Regexp::Expression
         self.referenced_expression = orig.referenced_expression.dup
         super
       end
+      def referential?
+        true
+      end
     end
     class Branch < Regexp::Expression::Sequence; end
@@ -55,6 +59,10 @@ module Regexp::Expression
         condition.reference
       end
+      def referential?
+        true
+      end
       def parts
         [text.dup, condition, *intersperse(branches, '|'), ')']
       end

data/lib/regexp_parser/expression/classes/escape_sequence.rb CHANGED

@@ -1,5 +1,5 @@
 module Regexp::Expression
-  # TODO: unify naming with Token::Escape, on way or the other, in v3.0.0
+  # TODO: unify naming with Token::Escape, one way or the other, in v3.0.0
   module EscapeSequence
     class Base < Regexp::Expression::Base
       def codepoint

data/lib/regexp_parser/expression/classes/group.rb CHANGED

@@ -33,6 +33,8 @@ module Regexp::Expression
     class Absence < Group::Base; end
     class Atomic  < Group::Base; end
+    # TODO: should split off OptionsSwitch in v3.0.0. Maybe even make it no
+    # longer inherit from Group because it is effectively a terminal expression.
     class Options < Group::Base
       attr_accessor :option_changes
@@ -40,6 +42,14 @@ module Regexp::Expression
         self.option_changes = orig.option_changes.dup
         super
       end
+      def quantify(*args)
+        if token == :options_switch
+          raise Regexp::Parser::Error, 'Can not quantify an option switch'
+        else
+          super
+        end
+      end
     end
     class Capture < Group::Base

data/lib/regexp_parser/expression/classes/keep.rb CHANGED

@@ -1,5 +1,7 @@
 module Regexp::Expression
   module Keep
+    # TOOD: in regexp_parser v3.0.0 this should possibly be a Subexpression
+    #       that contains all expressions to its left.
     class Mark < Regexp::Expression::Base; end
   end
 end

data/lib/regexp_parser/expression/classes/root.rb CHANGED

@@ -1,11 +1,9 @@
 module Regexp::Expression
   class Root < Regexp::Expression::Subexpression
     def self.build(options = {})
-      new(build_token, options)
-    end
-    def self.build_token
-      Regexp::Token.new(:expression, :root, '', 0)
+      warn "`#{self.class}.build(options)` is deprecated and will raise in "\
+           "regexp_parser v3.0.0. Please use `.construct(options: options)`."
+      construct(options: options)
     end
   end
 end

data/lib/regexp_parser/expression/classes/{property.rb → unicode_property.rb} RENAMED

@@ -1,4 +1,5 @@
 module Regexp::Expression
+  # TODO: unify name with token :property, one way or the other, in v3.0.0
   module UnicodeProperty
     class Base < Regexp::Expression::Base
       def negative?

data/lib/regexp_parser/expression/methods/construct.rb ADDED

@@ -0,0 +1,43 @@
+module Regexp::Expression
+  module Shared
+    module ClassMethods
+      # Convenience method to init a valid Expression without a Regexp::Token
+      def construct(params = {})
+        attrs = construct_defaults.merge(params)
+        options = attrs.delete(:options)
+        token_args = Regexp::TOKEN_KEYS.map { |k| attrs.delete(k) }
+        token = Regexp::Token.new(*token_args)
+        raise ArgumentError, "unsupported attribute(s): #{attrs}" if attrs.any?
+        new(token, options)
+      end
+      def construct_defaults
+        if self == Root
+          { type: :expression, token: :root, ts: 0 }
+        elsif self < Sequence
+          { type: :expression, token: :sequence }
+        else
+          { type: token_class::Type }
+        end.merge(level: 0, set_level: 0, conditional_level: 0, text: '')
+      end
+      def token_class
+        if self == Root || self < Sequence
+          nil # no token class because these objects are Parser-generated
+        # TODO: synch exp & token class names for alt., dot, escapes in v3.0.0
+        elsif self == Alternation || self == CharacterType::Any
+          Regexp::Syntax::Token::Meta
+        elsif self <= EscapeSequence::Base
+          Regexp::Syntax::Token::Escape
+        else
+          Regexp::Syntax::Token.const_get(name.split('::')[2])
+        end
+      end
+    end
+    def token_class
+      self.class.token_class
+    end
+  end
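
Reviewer note on the new `construct` method above: it merges caller-supplied attributes over per-class defaults, pulls the known token keys out in a fixed order, and rejects anything left over. A minimal, gem-free sketch of that merge-extract-validate pattern follows. `ConstructSketch`, its four-member `Token` struct, and the default values are hypothetical; the real key list is `Regexp::TOKEN_KEYS`, whose contents are not shown in this diff.

```ruby
# Hypothetical illustration (NOT regexp_parser code) of building a token
# from defaults plus overrides, with validation of unknown keys.
module ConstructSketch
  Token = Struct.new(:type, :token, :text, :ts)
  TOKEN_KEYS = Token.members # assumed key order for this sketch only
  DEFAULTS = { type: :literal, text: '', ts: 0 }.freeze

  def self.build_token(params = {})
    attrs = DEFAULTS.merge(params)
    # Remove each known key in order; missing keys become nil.
    token = Token.new(*TOKEN_KEYS.map { |key| attrs.delete(key) })
    # Whatever remains was not a recognized attribute.
    raise ArgumentError, "unsupported attribute(s): #{attrs}" if attrs.any?
    token
  end
end

token = ConstructSketch.build_token(text: 'a', token: :literal)
puts token.to_a.inspect

begin
  ConstructSketch.build_token(colour: :red)
rescue ArgumentError => error
  puts error.message
end
```

Deleting keys as they are consumed is what lets the final `attrs.any?` check double as validation, so a misspelled attribute fails loudly instead of being silently ignored.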
+end