pcre2 0.1.0 → 0.2.0
- checksums.yaml +4 -4
- data/README.md +32 -0
- data/lib/pcre2.rb +2 -0
- data/lib/pcre2/matchdata.rb +21 -1
- data/lib/pcre2/regexp.rb +58 -24
- data/lib/pcre2/string_utils.rb +43 -0
- data/lib/pcre2/version.rb +1 -1
- metadata +3 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: d2f4ae20cf3f5adb8a896c4574a54e321b7c01203184d1a818d6a6c54c4019d5
+  data.tar.gz: 9fe7f755c7c1742cacaf8773ad31633dd5d83fda424068410a2633e72ac47341
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 252735a0bde32bc8ef4edecc6c616e93860bccd66148d8e75440264b03c6b3093025e38cec4026576de0318aa395ec9da4e135e82cda02f7fa7ba71f36bc9ca7
+  data.tar.gz: 218d7e2c69e668c11d01a56967660552276123e543c8826a746c5b7c080bcfcdfc2b3b88b4d69bed137f37ab07fbb93a482b294eb24f2cea59d4e86ca6903ef1
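The digests above cover the `metadata.gz` and `data.tar.gz` members of the packaged gem. A minimal sketch of checking a file against one of them with Ruby's standard `digest` library follows; `sha256_matches?` is a hypothetical helper name and the temporary file merely stands in for an extracted `data.tar.gz`.

```ruby
require "digest"
require "tempfile"

# Hypothetical helper (not part of the gem): true when the file at `path`
# hashes to the expected lowercase hex SHA256 digest.
def sha256_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex
end

# Demonstration on a throwaway file standing in for an extracted archive member.
check_result = Tempfile.create("data.tar.gz") do |f|
  f.write("example payload")
  f.flush
  sha256_matches?(f.path, Digest::SHA256.hexdigest("example payload"))
end

puts check_result
```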
data/README.md
CHANGED
@@ -2,6 +2,17 @@
 
 This library provides a Ruby interface for the PCRE2 library, which supports more advanced regular expression functionality than the built-in Ruby `Regexp`.
 
+## Why?
+
+Ruby's `Regexp` is actually quite fast! For simple Regexps without backtracking (for instance regexp without matches like `.*`), you should probably keep using the Ruby `Regexp`. No extra dependencies and it'll be faster than using an external library, including PCRE2.
+
+The main reason I built this was so I could use the [backtracking control verbs](https://www.rexegg.com/backtracking-control-verbs.html#mainverbs) such as `(*SKIP)(*FAIL)` that are not supported by Ruby's `Regexp`. Using these, and other features, `PCRE2` supports some pretty wild and advanced regular expressions which you cannot do with Ruby's `Regexp`.
+
+`PCRE2` also supports JIT (just-in-time) compilation of the regular expression. From [the manual](https://www.pcre.org/current/doc/html/pcre2jit.html):
+> Just-in-time compiling is a heavyweight optimization that can greatly speed up pattern matching. However, it comes at the cost of extra processing before the match is performed, so it is of most benefit when the same pattern is going to be matched many times. This does not necessarily mean many calls of a matching function; if the pattern is not anchored, matching attempts may take place many times at various positions in the subject, even for a single call. Therefore, if the subject string is very long, it may still pay to use JIT even for one-off matches.
+
+You can enable JIT by calling `regexp.jit!` on the `PCRE2::Regexp` object. Using JIT the `PCRE2` matching can be more than 2X faster than Ruby's built-in.
+
 ## Installation
 
 Install the PCRE2 library:
@@ -39,6 +50,27 @@ matchdata[0] # => "hello"
 matchdata = regexp.match(subject, 11) # find next match
 ```
 
+Also some of the utility methods on `String` are reimplemented on `PCRE2::Regexp`:
+
+```ruby
+regexp = PCRE2::Regexp.new('\d+')
+subject = "and a 1 and a 2 and a 345"
+
+regexp.scan(subject) # => ["1", "2", "345"]
+regexp.split(subject) # => ["and a ", " and a ", " and a "]
+```
+
+There is one new method not available on `Regexp`: `PCRE2::Regexp#matches` which will loop over all matches of the string, and yield the corresponding `Matchdata`:
+
+```ruby
+string = "well hello hello hello there!"
+re = PCRE2::Regexp.new("hello")
+
+re.matches(string) do |matchdata|
+  puts "Matchdata found between #{matchdata.offsets(0)[0]} and #{matchdata.offsets(0)[1]}"
+end
+```
+
 ## Benchmark
 
 You can run the benchmark that compares `PCRE2::Regexp` with Ruby's built-in `Regexp` as follows:
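For comparison, the built-in `String` methods that the README says are reimplemented give the same results for this subject. The sketch below uses only Ruby's own `Regexp`, so it runs without the gem. The capture-group case is an extra illustration (not taken from the README) of the array-per-match shape that `String#scan` returns when the pattern has groups.

```ruby
subject = "and a 1 and a 2 and a 345"

# Built-in counterparts of regexp.scan(subject) and regexp.split(subject).
scanned = subject.scan(/\d+/)
pieces  = subject.split(/\d+/)

# With capture groups, String#scan returns one array of captures per match.
grouped = "a1b2".scan(/([a-z])(\d)/)

p scanned # => ["1", "2", "345"]
p pieces  # => ["and a ", " and a ", " and a "]
p grouped # => [["a", "1"], ["b", "2"]]
```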
data/lib/pcre2.rb
CHANGED
data/lib/pcre2/matchdata.rb
CHANGED
@@ -26,13 +26,33 @@ class PCRE2::MatchData
   end
 
   def to_a
-    pairs.map { |pair| string_from_pair(*pair) }
+    @to_a ||= pairs.map { |pair| string_from_pair(*pair) }
   end
 
   def captures
     to_a[1..-1]
   end
 
+  def length
+    start_of_match - end_of_match
+  end
+
+  def pre_match
+    string[0 ... start_of_match]
+  end
+
+  def post_match
+    string[end_of_match .. -1]
+  end
+
+  def start_of_match
+    offset(0)[0]
+  end
+
+  def end_of_match
+    offset(0)[1]
+  end
+
   private
 
   def string_from_pair(start, ending)
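The accessors added in this hunk are plain arithmetic on the offset pair of the whole match. The sketch below mirrors them with a hypothetical `OffsetMatch` struct (not part of the gem) so it can run without PCRE2. One difference is deliberate: the hunk computes `length` as `start_of_match - end_of_match`, which is zero for empty matches but negative for any other match, whereas the sketch assumes the conventional `end - start`.

```ruby
# Hypothetical stand-in for PCRE2::MatchData; it only models the offset pair
# of the whole match and the subject string.
OffsetMatch = Struct.new(:string, :start_of_match, :end_of_match) do
  # Conventional length: end minus start, so it is never negative.
  def length
    end_of_match - start_of_match
  end

  # Everything before the match.
  def pre_match
    string[0...start_of_match]
  end

  # Everything after the match.
  def post_match
    string[end_of_match..-1]
  end
end

m = OffsetMatch.new("well hello there", 5, 10)
puts m.pre_match.inspect  # "well "
puts m.post_match.inspect # " there"
puts m.length             # 5
```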
data/lib/pcre2/regexp.rb
CHANGED
@@ -1,35 +1,69 @@
-
-
+module PCRE2
+  class Regexp
+    attr :source, :pattern_ptr
 
-
-      @source = pattern
-      @pattern_ptr = PCRE2::Lib.compile_pattern(pattern, options)
-    end
+    include StringUtils
 
-
-
-
+    # Accepts a String, Regexp or another PCRE2::Regexp
+    def initialize(pattern, *options)
+      case pattern
+      when ::Regexp, PCRE2::Regexp
+        @source = pattern.source
+      else
+        @source = pattern
+      end
 
-
-
+      @pattern_ptr = Lib.compile_pattern(source, options)
+    end
+
+    # Compiles the Regexp into a JIT optimised version. Returns whether it was successful
+    def jit!
+      options = PCRE2_JIT_COMPLETE | PCRE2_JIT_PARTIAL_SOFT | PCRE2_JIT_PARTIAL_HARD
 
-
-
+      Lib.pcre2_jit_compile_8(pattern_ptr, options) == 0
+    end
+
+    def match(str, pos = nil)
+      result_count, match_data_ptr = Lib.match(@pattern_ptr, str, position: pos)
 
-
-
-
-
+      if result_count == 0
+        nil
+      else
+        pairs = PCRE2::Lib.get_ovector_pairs(match_data_ptr, result_count)
 
-
+        MatchData.new(self, str, pairs)
+      end
     end
-    end
 
-
-
-    end
+    def matches(str, pos = nil, &block)
+      return enum_for(:matches, str, pos) if !block_given?
 
-
-
+      pos ||= 0
+      while pos < str.length
+        matchdata = self.match(str, pos)
+
+        if matchdata
+          yield matchdata
+
+          beginning, ending = matchdata.offset(0)
+
+          if pos == ending # Manually increment position if no change to avoid infinite loops
+            pos += 1
+          else
+            pos = ending
+          end
+        else
+          return
+        end
+      end
+    end
+
+    def named_captures
+      @named_captures ||= Lib.named_captures(pattern_ptr)
+    end
+
+    def names
+      named_captures.keys
+    end
   end
 end
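The iteration strategy of the new `matches` method can be tried without the gem. The following is a minimal sketch written against Ruby's built-in `Regexp`; `each_match` is a hypothetical helper name, not part of the library. It returns an enumerator when no block is given, restarts the search at the end of the previous match, and advances by one position after a zero-length match at the current position so the loop terminates.

```ruby
# Sketch of the loop in PCRE2::Regexp#matches, using the built-in Regexp.
# each_match is a hypothetical name used only for this illustration.
def each_match(regexp, str, pos = 0)
  return enum_for(:each_match, regexp, str, pos) unless block_given?

  while pos < str.length
    matchdata = regexp.match(str, pos)
    return unless matchdata

    yield matchdata

    ending = matchdata.end(0)
    # A zero-length match at pos leaves ending == pos; step forward by one
    # so the loop cannot spin forever.
    pos = (ending == pos) ? pos + 1 : ending
  end
end

offsets = each_match(/hello/, "well hello hello hello there!").map { |m| m.offset(0) }
p offsets # => [[5, 10], [11, 16], [17, 22]]
```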
data/lib/pcre2/string_utils.rb
ADDED
@@ -0,0 +1,43 @@
+module PCRE2::StringUtils
+  def scan(string, &block)
+    return enum_for(:scan, string).to_a if !block_given?
+
+    matches(string) do |matchdata|
+      if matchdata.captures.any?
+        yield matchdata.captures
+      else
+        yield matchdata[0]
+      end
+    end
+  end
+
+  def split(string, &block)
+    return enum_for(:split, string).to_a if !block_given?
+
+    previous_position = 0
+    matches(string) do |matchdata|
+      beginning, ending = matchdata.offset(0)
+
+      # If zero-length match and the previous_position is equal to the match position, just skip
+      # it. The next zero-length match will have a different previous_position and generate a split
+      # which results in the appearance of a "per character split" but without empty parts in the
+      # beginning. Note that we're also skipping adding capture groups.
+      if matchdata.length == 0 && previous_position == beginning
+        next
+      end
+
+      yield string[previous_position ... beginning]
+
+      matchdata.captures.each do |capture|
+        yield capture
+      end
+
+      previous_position = ending
+    end
+
+    # Also return the ending of the string from the last match
+    if previous_position < string.length
+      yield string[previous_position .. -1]
+    end
+  end
+end
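The splitting strategy in `StringUtils#split` can be checked against `String#split` without the gem. The sketch below is a hedged re-creation using Ruby's built-in `Regexp`; `split_with` is a hypothetical name, and the match loop is inlined rather than delegated to `matches`. It emits the text between matches, then any capture groups, skips an empty match that sits exactly at the previous cut point, and finally emits the remainder of the string.

```ruby
# Sketch of the logic in PCRE2::StringUtils#split, using the built-in Regexp.
# split_with is a hypothetical name used only for this illustration.
def split_with(regexp, string)
  parts = []
  previous_position = 0
  pos = 0

  while pos < string.length && (m = regexp.match(string, pos))
    beginning, ending = m.offset(0)
    # Advance the search position, stepping by one after a zero-length match.
    pos = (ending == pos) ? pos + 1 : ending

    # Skip an empty match sitting right at the previous cut point.
    next if beginning == ending && previous_position == beginning

    parts << string[previous_position...beginning]
    parts.concat(m.captures) # capture groups are emitted as extra parts
    previous_position = ending
  end

  # Keep whatever follows the last match.
  parts << string[previous_position..-1] if previous_position < string.length
  parts
end

p split_with(/\d+/, "and a 1 and a 2 and a 345") # => ["and a ", " and a ", " and a "]
p split_with(//, "abc")                          # => ["a", "b", "c"]
```

For these inputs the results agree with `String#split` called with the same patterns.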
data/lib/pcre2/version.rb
CHANGED
metadata
CHANGED
@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: pcre2
 version: !ruby/object:Gem::Version
-  version: 0.
+  version: 0.2.0
 platform: ruby
 authors:
 - David Verhasselt
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2020-08-
+date: 2020-08-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: ffi
@@ -48,6 +48,7 @@ files:
 - lib/pcre2/lib/constants.rb
 - lib/pcre2/matchdata.rb
 - lib/pcre2/regexp.rb
+- lib/pcre2/string_utils.rb
 - lib/pcre2/version.rb
 - pcre2.gemspec
 homepage: https://github.com/dv/pcre2