RubyGems - pcre2 - Versions diffs - 0.1.0 → 0.2.0 - Mend

pcre2 0.1.0 → 0.2.0


Files changed (8)

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 3af91ee4c80035897edf206316fbc0d3db890a04af6e8443ef6c2449f4d2c4ab
-  data.tar.gz: ac7380e81492952a72a5ccd7b20a704f673e11645eb32b258216db40b1fa6cad
+  metadata.gz: d2f4ae20cf3f5adb8a896c4574a54e321b7c01203184d1a818d6a6c54c4019d5
+  data.tar.gz: 9fe7f755c7c1742cacaf8773ad31633dd5d83fda424068410a2633e72ac47341
 SHA512:
-  metadata.gz: 32f765faedfbaeb55e3b63572d13546d0afb7fb69f2a1cc102cf9f2c393c2a3e957be61a187cf5f7744c0e8f2d63655e281b932fecf26058c754c450aa1d8ef5
-  data.tar.gz: 590060108d1f0f68d945a9753372936cf1bc46b4efa4e53544e4b13fde0e7ecd49b0b7a4aae1f9858bc734b636b29e4fd0622ca60cb1648f838ee136e7cfb22a
+  metadata.gz: 252735a0bde32bc8ef4edecc6c616e93860bccd66148d8e75440264b03c6b3093025e38cec4026576de0318aa395ec9da4e135e82cda02f7fa7ba71f36bc9ca7
+  data.tar.gz: 218d7e2c69e668c11d01a56967660552276123e543c8826a746c5b7c080bcfcdfc2b3b88b4d69bed137f37ab07fbb93a482b294eb24f2cea59d4e86ca6903ef1

data/README.md CHANGED

@@ -2,6 +2,17 @@
 This library provides a Ruby interface for the PCRE2 library, which supports more advanced regular expression functionality than the built-in Ruby `Regexp`.
+## Why?
+Ruby's `Regexp` is actually quite fast! For simple patterns that need little or no backtracking (for instance, patterns without wildcards like `.*`), you should probably keep using Ruby's `Regexp`. It needs no extra dependencies, and it will typically be faster than an external library, including PCRE2.
+The main reason I built this was so I could use the [backtracking control verbs](https://www.rexegg.com/backtracking-control-verbs.html#mainverbs) such as `(*SKIP)(*FAIL)` that are not supported by Ruby's `Regexp`. Using these, and other features, `PCRE2` supports some pretty wild and advanced regular expressions which you cannot do with Ruby's `Regexp`.
+`PCRE2` also supports JIT (just-in-time) compilation of the regular expression. From [the manual](https://www.pcre.org/current/doc/html/pcre2jit.html):
+> Just-in-time compiling is a heavyweight optimization that can greatly speed up pattern matching. However, it comes at the cost of extra processing before the match is performed, so it is of most benefit when the same pattern is going to be matched many times. This does not necessarily mean many calls of a matching function; if the pattern is not anchored, matching attempts may take place many times at various positions in the subject, even for a single call. Therefore, if the subject string is very long, it may still pay to use JIT even for one-off matches.
+You can enable JIT by calling `regexp.jit!` on the `PCRE2::Regexp` object. With JIT enabled, `PCRE2` matching can be more than 2x faster than Ruby's built-in `Regexp`.
 ## Installation
 Install the PCRE2 library:
@@ -39,6 +50,27 @@ matchdata[0] # => "hello"
 matchdata = regexp.match(subject, 11) # find next match
 ```
+Some of the utility methods on `String` are also reimplemented on `PCRE2::Regexp`:
+```ruby
+regexp = PCRE2::Regexp.new('\d+')
+subject = "and a 1 and a 2 and a 345"
+regexp.scan(subject)  # => ["1", "2", "345"]
+regexp.split(subject) # => ["and a ", " and a ", " and a "]
+```
+There is one new method not available on `Regexp`: `PCRE2::Regexp#matches`, which loops over all matches in the string and yields the corresponding `MatchData` for each one:
+```ruby
+string = "well hello hello hello there!"
+re = PCRE2::Regexp.new("hello")
+re.matches(string) do |matchdata|
+  puts "Match found between #{matchdata.offset(0)[0]} and #{matchdata.offset(0)[1]}"
+end
+```
 ## Benchmark
 You can run the benchmark that compares `PCRE2::Regexp` with Ruby's built-in `Regexp` as follows:

data/lib/pcre2.rb CHANGED

@@ -1,7 +1,9 @@
 require "pcre2/version"
 require "pcre2/lib"
 require "pcre2/lib/constants"
+require "pcre2/string_utils"
+# Classes
 require "pcre2/error"
 require "pcre2/regexp"
 require "pcre2/matchdata"

data/lib/pcre2/matchdata.rb CHANGED

@@ -26,13 +26,33 @@ class PCRE2::MatchData
   end
   def to_a
-    pairs.map { |pair| string_from_pair(*pair) }
+    @to_a ||= pairs.map { |pair| string_from_pair(*pair) }
   end
   def captures
     to_a[1..-1]
   end
+  def length
+    end_of_match - start_of_match
+  end
+  def pre_match
+    string[0 ... start_of_match]
+  end
+  def post_match
+    string[end_of_match .. -1]
+  end
+  def start_of_match
+    offset(0)[0]
+  end
+  def end_of_match
+    offset(0)[1]
+  end
   private
   def string_from_pair(start, ending)
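
The accessors added in this hunk all derive from the offset pair of the whole match. A minimal sketch of that relationship, assuming only Ruby's built-in `Regexp` and `MatchData#offset` (the `MatchSpan` struct is a hypothetical stand-in, not part of the gem), with `length` computed as end minus start:

```ruby
# Hypothetical stand-in for the new PCRE2::MatchData accessors.
# It holds the subject string plus the start and end offsets of the whole match.
MatchSpan = Struct.new(:string, :start_of_match, :end_of_match) do
  # Number of characters covered by the match (end minus start).
  def length
    end_of_match - start_of_match
  end

  # Everything before the match.
  def pre_match
    string[0...start_of_match]
  end

  # Everything after the match.
  def post_match
    string[end_of_match..-1]
  end
end

subject = "well hello there"
md = /hello/.match(subject)               # built-in Regexp, used only for illustration
span = MatchSpan.new(subject, *md.offset(0))

puts span.length      # 5
puts span.pre_match   # "well "
puts span.post_match  # " there"
```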

data/lib/pcre2/regexp.rb CHANGED

@@ -1,35 +1,69 @@
-class PCRE2::Regexp
-  attr :source, :pattern_ptr
+module PCRE2
+  class Regexp
+    attr :source, :pattern_ptr
-  def initialize(pattern, *options)
-    @source = pattern
-    @pattern_ptr = PCRE2::Lib.compile_pattern(pattern, options)
-  end
+    include StringUtils
-  # Compiles the Regexp into a JIT optimised version. Returns whether it was successful
-  def jit!
-    options = PCRE2::PCRE2_JIT_COMPLETE | PCRE2::PCRE2_JIT_PARTIAL_SOFT | PCRE2::PCRE2_JIT_PARTIAL_HARD
+    # Accepts a String, Regexp or another PCRE2::Regexp
+    def initialize(pattern, *options)
+      case pattern
+      when ::Regexp, PCRE2::Regexp
+        @source = pattern.source
+      else
+        @source = pattern
+      end
-    PCRE2::Lib.pcre2_jit_compile_8(pattern_ptr, options) == 0
-  end
+      @pattern_ptr = Lib.compile_pattern(source, options)
+    end
+    # Compiles the Regexp into a JIT optimised version. Returns whether it was successful
+    def jit!
+      options = PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD
-  def match(str, pos = nil)
-    result_count, match_data_ptr = PCRE2::Lib.match(@pattern_ptr, str, position: pos)
+      Lib.pcre2_jit_compile_8(pattern_ptr, options) == 0
+    end
+    def match(str, pos = nil)
+      result_count, match_data_ptr = Lib.match(@pattern_ptr, str, position: pos)
-    if result_count == 0
-      nil
-    else
-      pairs = PCRE2::Lib.get_ovector_pairs(match_data_ptr, result_count)
+      if result_count == 0
+        nil
+      else
+        pairs = Lib.get_ovector_pairs(match_data_ptr, result_count)
-      PCRE2::MatchData.new(self, str, pairs)
+        MatchData.new(self, str, pairs)
+      end
     end
-  end
-  def named_captures
-    @named_captures ||= PCRE2::Lib.named_captures(pattern_ptr)
-  end
+    def matches(str, pos = nil, &block)
+      return enum_for(:matches, str, pos) if !block_given?
-  def names
-    named_captures.keys
+      pos ||= 0
+      while pos < str.length
+        matchdata = self.match(str, pos)
+        if matchdata
+          yield matchdata
+          beginning, ending = matchdata.offset(0)
+          if pos == ending # Manually increment position if no change to avoid infinite loops
+            pos += 1
+          else
+            pos = ending
+          end
+        else
+          return
+        end
+      end
+    end
+    def named_captures
+      @named_captures ||= Lib.named_captures(pattern_ptr)
+    end
+    def names
+      named_captures.keys
+    end
   end
 end
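
The iteration strategy of the new `matches` method (match from a position, yield, then advance to the end of the match, or by one when the match was empty) can be tried without the gem. The sketch below is an approximation that assumes Ruby's built-in `Regexp#match(str, pos)` behaves like `PCRE2::Regexp#match` for this purpose. `MatchLoopSketch` and `each_match` are hypothetical names.

```ruby
# Hypothetical stand-in for the gem's `matches` loop, using Ruby's ::Regexp.
module MatchLoopSketch
  def self.each_match(regexp, str, pos = 0)
    return enum_for(:each_match, regexp, str, pos) unless block_given?

    while pos < str.length
      matchdata = regexp.match(str, pos)
      return unless matchdata

      yield matchdata
      ending = matchdata.end(0)
      # A zero-length match at pos would otherwise be found again forever,
      # so step forward by one character in that case.
      pos = (pos == ending ? pos + 1 : ending)
    end
  end
end

offsets = MatchLoopSketch.each_match(/hello/, "well hello hello hello there!").map { |m| m.offset(0) }
p offsets # [[5, 10], [11, 16], [17, 22]]

# The loop still terminates when the pattern can match the empty string.
empty_matches = MatchLoopSketch.each_match(/x*/, "ab").to_a
p empty_matches.size # 2
```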

data/lib/pcre2/string_utils.rb ADDED

@@ -0,0 +1,43 @@
+module PCRE2::StringUtils
+  def scan(string, &block)
+    return enum_for(:scan, string).to_a if !block_given?
+    matches(string) do |matchdata|
+      if matchdata.captures.any?
+        yield matchdata.captures
+      else
+        yield matchdata[0]
+      end
+    end
+  end
+  def split(string, &block)
+    return enum_for(:split, string).to_a if !block_given?
+    previous_position = 0
+    matches(string) do |matchdata|
+      beginning, ending = matchdata.offset(0)
+      # If zero-length match and the previous_position is equal to the match position, just skip
+      # it. The next zero-length match will have a different previous_position and generate a split
+      # which results in the appearance of a "per character split" but without empty parts in the
+      # beginning. Note that we're also skipping adding capture groups.
+      if matchdata.length == 0 && previous_position == beginning
+        next
+      end
+      yield string[previous_position ... beginning]
+      matchdata.captures.each do |capture|
+        yield capture
+      end
+      previous_position = ending
+    end
+    # Also return the ending of the string from the last match
+    if previous_position < string.length
+      yield string[previous_position .. -1]
+    end
+  end
+end
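
The `split` logic in this file can be checked against `String#split` with a self-contained sketch. This version assumes Ruby's built-in `Regexp` in place of `PCRE2::Regexp` and collects into an array instead of yielding. `SplitSketch` is a hypothetical name, and the comparison covers only the simple cases shown.

```ruby
# Hypothetical re-creation of the split algorithm above on top of ::Regexp.
module SplitSketch
  def self.split(regexp, string)
    parts = []
    previous_position = 0
    pos = 0

    while pos < string.length
      md = regexp.match(string, pos)
      break unless md

      beginning, ending = md.offset(0)
      # Advance the scan position, stepping by one after an empty match.
      pos = (pos == ending ? pos + 1 : ending)

      # Skip an empty match that sits exactly where the previous piece ended.
      next if beginning == ending && previous_position == beginning

      parts << string[previous_position...beginning]
      parts.concat(md.captures) # capture groups are emitted between pieces
      previous_position = ending
    end

    # Keep whatever follows the last match.
    parts << string[previous_position..-1] if previous_position < string.length
    parts
  end
end

subject = "and a 1 and a 2 and a 345"
p SplitSketch.split(/\d+/, subject) # ["and a ", " and a ", " and a "]
p SplitSketch.split(/(,)/, "a,b")   # ["a", ",", "b"]
p SplitSketch.split(//, "abc")      # ["a", "b", "c"]
```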

data/lib/pcre2/version.rb CHANGED

@@ -1,3 +1,3 @@
 module PCRE2
-  VERSION = "0.1.0"
+  VERSION = "0.2.0"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pcre2
 version: !ruby/object:Gem::Version
-  version: 0.1.0
+  version: 0.2.0
 platform: ruby
 authors:
 - David Verhasselt
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2020-08-05 00:00:00.000000000 Z
+date: 2020-08-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: ffi
@@ -48,6 +48,7 @@ files:
 - lib/pcre2/lib/constants.rb
 - lib/pcre2/matchdata.rb
 - lib/pcre2/regexp.rb
+- lib/pcre2/string_utils.rb
 - lib/pcre2/version.rb
 - pcre2.gemspec
 homepage: https://github.com/dv/pcre2