rpeg 0.1.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA256:
3
+ metadata.gz: b2451499aff760ff561b11e194eacff8fb9820eef1a6272ec53b54fc6e45f773
4
+ data.tar.gz: 81caf76a3767ffd3f8442ee49fe461f576ecce33d089f7c613360bdb30910cef
5
+ SHA512:
6
+ metadata.gz: b024a52dee62253c3db7b8f07fe911ca0ca7ad606af328331cfc8d8a18f47ce25b14a79b77d4149ca25ffe89701664d7ae91cd3539547fd152d1342de880831a
7
+ data.tar.gz: f1028c2c43e99d23ea414a60fa106cc9e462e0c9161c5c72e21974e7d6d5421370ba7a998d6495dd74739627e1bd2287a22345bd4fb71c8685433098ee741ef0
data/CHANGELOG.md ADDED
@@ -0,0 +1,7 @@
1
+ # Changelog
2
+
3
+ ## [Unreleased]
4
+
5
+ ## [0.1.0]
6
+
7
+ Release in gem form
data/README.md ADDED
@@ -0,0 +1,155 @@
1
+ # RPeg
2
+
3
+ RPeg is a Ruby port of [LPeg](http://www.inf.puc-rio.br/~roberto/lpeg/), Lua's pattern-matching library based on
4
+ [Parsing Expression Grammars](https://en.wikipedia.org/wiki/Parsing_expression_grammar) (PEGs).
5
+
6
+ This project doesn't contain documentation of the library's functionality. For that, see the LPeg page, keeping in mind the
7
+ differences in the Ruby port, described below. For a theoretical justification of the use of PEGs for pattern matching and a lot of
8
+ detail of the internal design of LPeg, see Roberto Ierusalimschy's paper [[Ierusalimschy]](#references).
9
+
10
+ ## Why You Should Use RPeg
11
+
12
+ PEGs are flexible and expressive and, once complexity reaches a certain level, tend to be much more readable than regular
13
+ expressions. PEGs are also more powerful than regular expressions, though the various ad hoc extensions to regexes - such as in
14
+ PCRE - close the gap. The LPeg documentation and the Wikipedia article give some examples of what is possible.
15
+
16
+ Being able to use and combine patterns as Ruby objects allows us to build up complex patterns step by step. This makes the code
17
+ easier to read and maintain.
18
+
19
+ ## Why You Should Not Use RPeg
20
+
21
+ I wrote RPeg as a learning exercise and for my own illumination. I was interested in how regular expressions can be implemented
22
+ efficiently using a virtual machine ([[Cox]](#references)) and stumbled on Ierusalimschy's paper. I found that paper fascinating and
23
+ decided to try to implement the algorithm in Ruby.
24
+
25
+ ### It is slow
26
+
27
+ Very slow.
28
+
29
+ Ruby is an interpreted language. So is Lua, but almost all of LPeg is implemented in C, and this makes LPeg very fast. Ierusalimschy's
30
+ paper, from 2008, states that LPeg can search a large string (the full text of the King James Bible) for "Alpha " in about 40
31
+ milliseconds. RPeg, on more modern hardware[^1], takes 5.4 seconds (!) for the same task. I have profiled my code as best I can and
32
+ don't think it will get any faster.
33
+
34
+ Of course, Ruby can call C code just as well as Lua can, but I am not going to attempt to write RPeg in C. The LPeg code is very
35
+ carefully written to do all of the necessary memory management, and it gets pretty hairy in the implementation of "runtime captures". I
36
+ have no interest in attempting this for RPeg.
37
+
38
+ ### It is not industrial-strength
39
+
40
+ As far as I could, I implemented LPeg as described in the Ierusalimschy paper, but this only got me so far. There is a great deal
41
+ of cleverness in LPeg, performing optimizations when a pattern is compiled for the bespoke VM, when analyzing patterns for errors,
42
+ and for dozens of other things. So, most of RPeg's code was written while carefully reading the LPeg sources. This was mostly
43
+ educational, but in a few cases I simply couldn't understand what the LPeg code was doing, and was reduced to blindly following the
44
+ logic step-by-step, without a clear picture of what was "really" going on. This was unsatisfying, and left me worried about the
45
+ soundness of my code.
46
+
47
+ I have ported most of LPeg's (extensive) test suite and it all passes, but this is not a battle-hardened product.
48
+
49
+ While I have made efforts to follow LPeg's functionality as closely as I can, all bugs in RPeg are my responsibility.
50
+
51
+ ## Using RPeg
52
+
53
+ Patterns in RPeg are much like they are in LPeg.
54
+
55
+ ``` ruby
56
+ require 'rpeg'
57
+
58
+ # Pattern to match strings of balanced parentheses
59
+ patt1 = RPEG.P( [ "(" * ((1 - RPEG.S("()")) + RPEG.V(0))**0 * ")" ] )
60
+ patt2 = patt1 * -1
61
+
62
+ puts patt2.match "(()()(()))" # 10
63
+ puts patt2.match "(()()(())" # nil, i.e., no match
64
+ ```
65
+
66
+ The examples in the LPeg documentation will work once modified for the syntax of RPeg.
67
+
68
+ TODO: add some actual RPeg examples.
69
+
70
+ ## Differences Between RPeg and LPeg
71
+
72
+ Efforts have been made to keep RPeg's syntax as close to LPeg's as possible. But there are necessarily some differences enforced by
73
+ Ruby.
74
+
75
+ ### Indexing
76
+
77
+ Lua indexes strings and arrays (tables) from 1, while Ruby indexes from zero. RPeg follows the Ruby way. This means that
78
+
79
+ - `match` functions return the Ruby-style index of the end of the matched substring
80
+ - "open" rules in grammars using numeric references use 0-indexing
81
+ - other contexts in which an integer is used as index - such as argument captures - are 0-indexed
82
+
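As a rough illustration of the first point, the snippet below uses plain Ruby only (no RPeg calls) to contrast the two conventions for a match that consumes a whole 10-character subject. The variable names are made up for this sketch.

```ruby
# Illustration only: plain Ruby, no RPeg calls.
subject = "(()()(()))"

# A match that consumes the whole subject ends just past the last character.
ruby_style_end = subject.length      # 0-based convention, as RPeg reports it
lua_style_end  = subject.length + 1  # 1-based convention, as LPeg would report it

# The 0-based end index can be used directly to slice out the matched text.
matched = subject[0...ruby_style_end]
```

With the Ruby convention the returned index doubles as the length of the matched prefix when the match starts at the beginning of the string.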
83
+ ### 'And' patterns
84
+
85
+ Given a pattern `p`, RPeg forms its "and" pattern using `+p` where LPeg uses `#p`.
86
+
87
+ Using unary `+` doesn't read very well in practice, even though unary `-` is OK for "not" patterns. I think this is because
88
+ binary `+` is much more common in patterns than binary `-`. But the other unary operators are no good.
89
+
90
+ - `&` looks like a unary operator in Ruby, but it is block-pass syntax (sugar for `#to_proc`) and cannot be redefined as a method.
91
+ - `-` is used for "not" pattern formation.
92
+ - `!` must be left untouched as a logical operator.
93
+ - `~` works, but is too easy to mistake for `-`.
94
+
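For readers unfamiliar with how Ruby exposes unary operators, here is a minimal sketch using a hypothetical `ToyPattern` class. It is not RPeg's implementation; it only shows that unary `+` and `-` are ordinary methods named `+@` and `-@`.

```ruby
# Hypothetical stand-in for a pattern class; not RPeg's implementation.
class ToyPattern
  attr_reader :desc

  def initialize(desc)
    @desc = desc
  end

  # Unary plus (+patt): used here to build an "and" pattern
  def +@
    ToyPattern.new("and(#{desc})")
  end

  # Unary minus (-patt): used here to build a "not" pattern
  def -@
    ToyPattern.new("not(#{desc})")
  end
end

patt = ToyPattern.new("digit")
and_patt = +patt
not_patt = -patt
```

Ruby also lets a class define `~` and `!` in the same way, which is why those two appear in the list above as candidates.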
95
+ ### Grammars
96
+
97
+ Grammars are represented in LPeg with Lua tables, which are sort of a cross between arrays and hash tables. After some
98
+ experimentation, RPeg allows grammars to be specified using an array or a hash table.
99
+
100
+ If an array is given then the nonterminals aren't named and all open calls must use numeric indices. The first element of the
101
+ array is either
102
+
103
+ - a non-negative integer (0, 1, 2, ...), which specifies the (rule of the) initial nonterminal among the remaining elements, with
104
+ indices reckoned _without_ that initial integer
105
+ - something else, which is interpreted as the pattern for the initial nonterminal
106
+
107
+ Otherwise the grammar is defined with a Hash. The keys are the nonterminal symbols and the values are the rule patterns.
108
+
109
+ - the keys must be symbols or strings (which are converted to symbols). No rule can use :initial or "initial" as
110
+ a nonterminal.
111
+ - the open calls can refer either to the nonterminals (as strings or symbols) or to rule indices as they appear in the hash,
112
+ ignoring the :initial key (if present)
113
+ - :initial/"initial" can appear as a key in the hash and its value specifies the initial nonterminal.
114
+   - if it is a non-negative integer it gives the index of the initial nonterminal's rule, reckoned without the presence of the :initial
115
+     key itself.
116
+   - if it is a symbol or a string it specifies the initial nonterminal directly
117
+
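The rules above for locating the initial nonterminal can be sketched in plain Ruby. Both helpers below are hypothetical and are not part of RPeg; strings stand in for rule patterns. The fallback to the first rule when a hash has no `:initial` key is an assumption of this sketch, since the text above does not say what happens in that case.

```ruby
# Hypothetical helper, not part of RPeg: splits an array-form grammar into
# [index of the initial rule, list of rules].
def split_array_grammar(grammar)
  first, *rest = grammar
  if first.is_a?(Integer) && first >= 0
    # Leading integer: it selects the initial rule among the remaining elements
    [first, rest]
  else
    # No leading integer: the first element is itself the initial rule
    [0, grammar]
  end
end

# Hypothetical helper, not part of RPeg: returns the name of the initial
# nonterminal of a hash-form grammar.
def initial_of_hash_grammar(grammar)
  rules = grammar.transform_keys(&:to_sym) # string keys become symbols
  initial = rules.delete(:initial)         # indices ignore the :initial key
  names = rules.keys
  case initial
  when nil then names.first                # assumption: default to the first rule
  when Integer then names[initial]
  else initial.to_sym
  end
end

array_with_index = split_array_grammar([1, "rule_a", "rule_b"])
array_plain      = split_array_grammar(["rule_a", "rule_b"])
hash_by_index    = initial_of_hash_grammar(initial: 1, expr: "e", term: "t")
hash_by_name     = initial_of_hash_grammar("initial" => "expr", "expr" => "e", "term" => "t")
```

In the first call the initial rule is `"rule_b"`, because index 1 is counted after dropping the leading integer.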
118
+ TODO: some examples
119
+
120
+ ### Table captures
121
+
122
+ Table captures - defined with `#Ct` - return instances of a special `TableCapture` class, which mimics a small part of Lua's table
123
+ functionality. Other approaches have been tried and haven't worked well.
124
+
125
+ ### Function captures
126
+
127
+ Various kinds of captures involve calling a function (proc) provided by client code. For example, the construction `patt / fn` takes
128
+ the captures made by patt and passes them as arguments to fn. Then the values returned by fn become the captures of the
129
+ expression.
130
+
131
+ Lua is better than Ruby at distinguishing between a function that returns multiple values and one that returns a single value that
132
+ is an array. In RPeg, the return value of a function in contexts like this is treated as follows:
133
+
134
+ - `[1, 2, 3]`: multiple captures, 1, 2, 3.
135
+   - this is the natural interpretation as it's the standard way that a Ruby function returns multiple values
136
+ - `[[1, 2, 3]]`: a single capture that is the array `[1, 2, 3]`.
137
+ - `nil`: no captures
138
+   - even if the function says something like "return nil", the capture code has no way to distinguish between that and a
139
+     function that returns nothing
140
+ - `[nil]`: a single capture with value nil
141
+   - the weirdest case, but I don't see an alternative
142
+ - otherwise, the single value returned by the function is the single captured value.
143
+
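The convention in the list above can be summarized with a small, hypothetical helper. It is a sketch of the rule as stated here, not RPeg's actual capture code, and the name `captures_from_return` is invented for illustration.

```ruby
# Hypothetical helper, not part of RPeg: turns a function's return value
# into the list of captures described in the list above.
def captures_from_return(value)
  case value
  when nil   then []      # nil: no captures
  when Array then value   # an array: each element is one capture
  else            [value] # anything else: a single capture
  end
end

multiple     = captures_from_return([1, 2, 3])   # three captures
single_array = captures_from_return([[1, 2, 3]]) # one capture, itself an array
none         = captures_from_return(nil)         # no captures
single_nil   = captures_from_return([nil])       # one capture with value nil
single_value = captures_from_return(42)          # one capture
```

A function that wants to hand back an array as one captured value therefore has to wrap it in an outer array.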
144
+ ## TODOs
145
+
146
+ - make this into a useable README
147
+ - turn the code into a gem
148
+
149
+
150
+ ## References
151
+ - [Ierusalimschy] Ierusalimschy, R., _Text Pattern-Matching Tool based on Parsing Expression Grammars_, Software: Practice and Experience, 39(3):221-258, https://doi.org/10.1002/spe.892, http://www.inf.puc-rio.br/~roberto/docs/peg.pdf (retrieved 2022-01-??).
152
+ - [Cox] Cox, R., _Regular Expression Matching: the Virtual Machine Approach_, https://swtch.com/~rsc/regexp/regexp2.html.
153
+
154
+
155
+ [^1]: A 2016 Macbook Pro
data/Rakefile ADDED
@@ -0,0 +1,9 @@
1
+ require 'rubygems'
2
+ require 'rake/testtask'
3
+
4
+ Rake::TestTask.new do |t|
5
+   t.libs << 'test'
6
+ end
7
+
8
+ desc 'Run Tests'
9
+ task default: :test