pcre2 0.1.0 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 3af91ee4c80035897edf206316fbc0d3db890a04af6e8443ef6c2449f4d2c4ab
-   data.tar.gz: ac7380e81492952a72a5ccd7b20a704f673e11645eb32b258216db40b1fa6cad
+   metadata.gz: d2f4ae20cf3f5adb8a896c4574a54e321b7c01203184d1a818d6a6c54c4019d5
+   data.tar.gz: 9fe7f755c7c1742cacaf8773ad31633dd5d83fda424068410a2633e72ac47341
  SHA512:
-   metadata.gz: 32f765faedfbaeb55e3b63572d13546d0afb7fb69f2a1cc102cf9f2c393c2a3e957be61a187cf5f7744c0e8f2d63655e281b932fecf26058c754c450aa1d8ef5
-   data.tar.gz: 590060108d1f0f68d945a9753372936cf1bc46b4efa4e53544e4b13fde0e7ecd49b0b7a4aae1f9858bc734b636b29e4fd0622ca60cb1648f838ee136e7cfb22a
+   metadata.gz: 252735a0bde32bc8ef4edecc6c616e93860bccd66148d8e75440264b03c6b3093025e38cec4026576de0318aa395ec9da4e135e82cda02f7fa7ba71f36bc9ca7
+   data.tar.gz: 218d7e2c69e668c11d01a56967660552276123e543c8826a746c5b7c080bcfcdfc2b3b88b4d69bed137f37ab07fbb93a482b294eb24f2cea59d4e86ca6903ef1
data/README.md CHANGED
@@ -2,6 +2,17 @@

  This library provides a Ruby interface for the PCRE2 library, which supports more advanced regular expression functionality than the built-in Ruby `Regexp`.

+ ## Why?
+
+ Ruby's `Regexp` is actually quite fast! For simple Regexps without backtracking (for instance, patterns that don't use wildcards like `.*`), you should probably keep using the Ruby `Regexp`. It needs no extra dependencies, and it'll be faster than using an external library, including PCRE2.
+
+ The main reason I built this was so I could use the [backtracking control verbs](https://www.rexegg.com/backtracking-control-verbs.html#mainverbs) such as `(*SKIP)(*FAIL)` that are not supported by Ruby's `Regexp`. Using these, and other features, `PCRE2` supports some pretty wild and advanced regular expressions which you cannot do with Ruby's `Regexp`.
+
+ `PCRE2` also supports JIT (just-in-time) compilation of the regular expression. From [the manual](https://www.pcre.org/current/doc/html/pcre2jit.html):
+ > Just-in-time compiling is a heavyweight optimization that can greatly speed up pattern matching. However, it comes at the cost of extra processing before the match is performed, so it is of most benefit when the same pattern is going to be matched many times. This does not necessarily mean many calls of a matching function; if the pattern is not anchored, matching attempts may take place many times at various positions in the subject, even for a single call. Therefore, if the subject string is very long, it may still pay to use JIT even for one-off matches.
+
+ You can enable JIT by calling `regexp.jit!` on the `PCRE2::Regexp` object. With JIT enabled, `PCRE2` matching can be more than 2X faster than Ruby's built-in `Regexp`.
+
  ## Installation

  Install the PCRE2 library:
@@ -39,6 +50,27 @@ matchdata[0] # => "hello"
  matchdata = regexp.match(subject, 11) # find next match
  ```

+ Some of the utility methods on `String` are also reimplemented on `PCRE2::Regexp`:
+
+ ```ruby
+ regexp = PCRE2::Regexp.new('\d+')
+ subject = "and a 1 and a 2 and a 345"
+
+ regexp.scan(subject) # => ["1", "2", "345"]
+ regexp.split(subject) # => ["and a ", " and a ", " and a "]
+ ```
+
+ There is one new method not available on `Regexp`: `PCRE2::Regexp#matches`, which loops over all matches in the string and yields the corresponding `MatchData`:
+
+ ```ruby
+ string = "well hello hello hello there!"
+ re = PCRE2::Regexp.new("hello")
+
+ re.matches(string) do |matchdata|
+   puts "Matchdata found between #{matchdata.offset(0)[0]} and #{matchdata.offset(0)[1]}"
+ end
+ ```
+
  ## Benchmark

  You can run the benchmark that compares `PCRE2::Regexp` with Ruby's built-in `Regexp` as follows:
data/lib/pcre2.rb CHANGED
@@ -1,7 +1,9 @@
  require "pcre2/version"
  require "pcre2/lib"
  require "pcre2/lib/constants"
+ require "pcre2/string_utils"

+ # Classes
  require "pcre2/error"
  require "pcre2/regexp"
  require "pcre2/matchdata"
data/lib/pcre2/matchdata.rb CHANGED
@@ -26,13 +26,33 @@ class PCRE2::MatchData
    end

    def to_a
-     pairs.map { |pair| string_from_pair(*pair) }
+     @to_a ||= pairs.map { |pair| string_from_pair(*pair) }
    end

    def captures
      to_a[1..-1]
    end

+   def length
+     end_of_match - start_of_match
+   end
+
+   def pre_match
+     string[0 ... start_of_match]
+   end
+
+   def post_match
+     string[end_of_match .. -1]
+   end
+
+   def start_of_match
+     offset(0)[0]
+   end
+
+   def end_of_match
+     offset(0)[1]
+   end
+
    private

    def string_from_pair(start, ending)
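
The helpers added to `PCRE2::MatchData` in the hunk above all derive from the `offset(0)` pair. A minimal standalone sketch of that relationship follows. `SpanSketch` is a hypothetical stand-in class, not part of the gem, and it assumes `length` is meant to be the end offset minus the start offset:

```ruby
# Hypothetical stand-in for PCRE2::MatchData: it only stores the subject
# string and the start/end offsets of the whole match (end is exclusive).
class SpanSketch
  attr_reader :string, :start_of_match, :end_of_match

  def initialize(string, start_of_match, end_of_match)
    @string = string
    @start_of_match = start_of_match
    @end_of_match = end_of_match
  end

  # Size of the matched span; zero for an empty match.
  def length
    end_of_match - start_of_match
  end

  # Text before the match (the exclusive range leaves the match itself out).
  def pre_match
    string[0...start_of_match]
  end

  # Text from the end of the match up to the end of the subject.
  def post_match
    string[end_of_match..-1]
  end
end

span = SpanSketch.new("well hello there", 5, 10)
puts span.length             # 5
puts span.pre_match.inspect  # "well "
puts span.post_match.inspect # " there"
```

Because the three values are computed from the same two offsets, `pre_match`, the matched text, and `post_match` always concatenate back to the original subject.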
data/lib/pcre2/regexp.rb CHANGED
@@ -1,35 +1,69 @@
- class PCRE2::Regexp
-   attr :source, :pattern_ptr
+ module PCRE2
+   class Regexp
+     attr :source, :pattern_ptr

-   def initialize(pattern, *options)
-     @source = pattern
-     @pattern_ptr = PCRE2::Lib.compile_pattern(pattern, options)
-   end
+     include StringUtils

-   # Compiles the Regexp into a JIT optimised version. Returns whether it was successful
-   def jit!
-     options = PCRE2::PCRE2_JIT_COMPLETE | PCRE2::PCRE2_JIT_PARTIAL_SOFT | PCRE2::PCRE2_JIT_PARTIAL_HARD
+     # Accepts a String, Regexp or another PCRE2::Regexp
+     def initialize(pattern, *options)
+       case pattern
+       when ::Regexp, PCRE2::Regexp
+         @source = pattern.source
+       else
+         @source = pattern
+       end

-     PCRE2::Lib.pcre2_jit_compile_8(pattern_ptr, options) == 0
-   end
+       @pattern_ptr = Lib.compile_pattern(source, options)
+     end
+
+     # Compiles the Regexp into a JIT optimised version. Returns whether it was successful
+     def jit!
+       options = PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD

-   def match(str, pos = nil)
-     result_count, match_data_ptr = PCRE2::Lib.match(@pattern_ptr, str, position: pos)
+       Lib.pcre2_jit_compile_8(pattern_ptr, options) == 0
+     end
+
+     def match(str, pos = nil)
+       result_count, match_data_ptr = Lib.match(@pattern_ptr, str, position: pos)

-     if result_count == 0
-       nil
-     else
-       pairs = PCRE2::Lib.get_ovector_pairs(match_data_ptr, result_count)
+       if result_count == 0
+         nil
+       else
+         pairs = PCRE2::Lib.get_ovector_pairs(match_data_ptr, result_count)

-       PCRE2::MatchData.new(self, str, pairs)
+         MatchData.new(self, str, pairs)
+       end
      end
-   end

-   def named_captures
-     @named_captures ||= PCRE2::Lib.named_captures(pattern_ptr)
-   end
+     def matches(str, pos = nil, &block)
+       return enum_for(:matches, str, pos) if !block_given?

-   def names
-     named_captures.keys
+       pos ||= 0
+       while pos < str.length
+         matchdata = self.match(str, pos)
+
+         if matchdata
+           yield matchdata
+
+           beginning, ending = matchdata.offset(0)
+
+           if pos == ending # Manually increment position if no change to avoid infinite loops
+             pos += 1
+           else
+             pos = ending
+           end
+         else
+           return
+         end
+       end
+     end
+
+     def named_captures
+       @named_captures ||= Lib.named_captures(pattern_ptr)
+     end
+
+     def names
+       named_captures.keys
+     end
    end
  end
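
The `matches` loop in the hunk above repeatedly calls `match` from an advancing position and steps forward manually when a match consumes nothing. The idea can be sketched without the gem by using Ruby's built-in `Regexp#match(str, pos)` as a stand-in. This is an illustration, not the gem's code: `each_match` is a hypothetical helper, it also tries the position at the very end of the string, and it decides whether to step forward by comparing the match's own start and end offsets.

```ruby
# Sketch only: iterate over successive matches of a built-in Regexp,
# advancing one character past zero-length matches so the loop terminates.
def each_match(regexp, str, pos = 0)
  return enum_for(:each_match, regexp, str, pos) unless block_given?

  while pos <= str.length
    matchdata = regexp.match(str, pos)
    break unless matchdata

    yield matchdata

    beginning, ending = matchdata.offset(0)
    # An empty match would be found again at the same position,
    # so move one character past it; otherwise resume at its end.
    pos = beginning == ending ? ending + 1 : ending
  end
end

each_match(/hello/, "well hello hello hello there!") do |matchdata|
  first, last = matchdata.offset(0)
  puts "Match found between #{first} and #{last}"
end
```

Without a block the helper returns an enumerator, mirroring the `enum_for` guard in the diff, so `each_match(/x*/, "ab").map { |m| m[0] }` collects one empty match per position instead of looping forever.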
data/lib/pcre2/string_utils.rb ADDED
@@ -0,0 +1,43 @@
+ module PCRE2::StringUtils
+   def scan(string, &block)
+     return enum_for(:scan, string).to_a if !block_given?
+
+     matches(string) do |matchdata|
+       if matchdata.captures.any?
+         yield matchdata.captures
+       else
+         yield matchdata[0]
+       end
+     end
+   end
+
+   def split(string, &block)
+     return enum_for(:split, string).to_a if !block_given?
+
+     previous_position = 0
+     matches(string) do |matchdata|
+       beginning, ending = matchdata.offset(0)
+
+       # If zero-length match and the previous_position is equal to the match position, just skip
+       # it. The next zero-length match will have a different previous_position and generate a split
+       # which results in the appearance of a "per character split" but without empty parts in the
+       # beginning. Note that we're also skipping adding capture groups.
+       if matchdata.length == 0 && previous_position == beginning
+         next
+       end
+
+       yield string[previous_position ... beginning]
+
+       matchdata.captures.each do |capture|
+         yield capture
+       end
+
+       previous_position = ending
+     end
+
+     # Also return the ending of the string from the last match
+     if previous_position < string.length
+       yield string[previous_position .. -1]
+     end
+   end
+ end
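
The splitting strategy described in the comments above (emit the text between consecutive matches, skip an empty match that sits exactly at the previous cut, then emit the tail) can be tried out with Ruby's built-in `Regexp`. The sketch below is an approximation under that assumption; `split_sketch` is a hypothetical name and nothing here calls into the gem.

```ruby
# Sketch only: split a string on a built-in Regexp using the same
# "text between matches, then captures, then the tail" strategy.
def split_sketch(regexp, string)
  parts = []
  previous_position = 0
  pos = 0

  while pos <= string.length && (matchdata = regexp.match(string, pos))
    beginning, ending = matchdata.offset(0)
    zero_length = beginning == ending

    # Skip an empty match located right at the previous cut, so that no
    # empty leading piece is produced.
    unless zero_length && beginning == previous_position
      parts << string[previous_position...beginning]
      parts.concat(matchdata.captures) # capture groups are included in the result
      previous_position = ending
    end

    # Step past empty matches to guarantee progress.
    pos = zero_length ? ending + 1 : ending
  end

  # Whatever follows the last match becomes the final piece.
  parts << string[previous_position..-1] if previous_position < string.length
  parts
end

p split_sketch(/\d+/, "and a 1 and a 2 and a 345") # ["and a ", " and a ", " and a "]
p split_sketch(//, "abc")                           # ["a", "b", "c"]
```

With an empty pattern, every position produces a zero-length match, which is what gives the "per character split" effect mentioned in the source comments.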
data/lib/pcre2/version.rb CHANGED
@@ -1,3 +1,3 @@
  module PCRE2
-   VERSION = "0.1.0"
+   VERSION = "0.2.0"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: pcre2
  version: !ruby/object:Gem::Version
-   version: 0.1.0
+   version: 0.2.0
  platform: ruby
  authors:
  - David Verhasselt
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2020-08-05 00:00:00.000000000 Z
+ date: 2020-08-15 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: ffi
@@ -48,6 +48,7 @@ files:
  - lib/pcre2/lib/constants.rb
  - lib/pcre2/matchdata.rb
  - lib/pcre2/regexp.rb
+ - lib/pcre2/string_utils.rb
  - lib/pcre2/version.rb
  - pcre2.gemspec
  homepage: https://github.com/dv/pcre2
  homepage: https://github.com/dv/pcre2