citrus 1.7.0 → 1.8.0
- data/README +217 -154
- data/doc/{background.rdoc → background.markdown} +35 -32
- data/doc/example.markdown +145 -0
- data/doc/index.markdown +18 -0
- data/doc/{license.rdoc → license.markdown} +2 -1
- data/doc/links.markdown +13 -0
- data/doc/syntax.markdown +129 -0
- data/examples/calc.citrus +55 -49
- data/examples/calc.rb +55 -49
- data/examples/ip.rb +1 -1
- data/lib/citrus.rb +118 -89
- data/lib/citrus/debug.rb +1 -1
- data/lib/citrus/file.rb +75 -154
- data/test/alias_test.rb +2 -4
- data/test/and_predicate_test.rb +1 -1
- data/test/but_predicate_test.rb +36 -0
- data/test/choice_test.rb +5 -5
- data/test/expression_test.rb +1 -1
- data/test/file_test.rb +17 -15
- data/test/fixed_width_test.rb +2 -2
- data/test/grammar_test.rb +8 -8
- data/test/helper.rb +54 -6
- data/test/label_test.rb +3 -3
- data/test/match_test.rb +5 -5
- data/test/not_predicate_test.rb +1 -1
- data/test/repeat_test.rb +17 -17
- data/test/rule_test.rb +5 -9
- data/test/sequence_test.rb +3 -3
- data/test/super_test.rb +2 -2
- metadata +11 -9
- data/doc/example.rdoc +0 -115
- data/doc/index.rdoc +0 -15
- data/doc/links.rdoc +0 -18
- data/doc/syntax.rdoc +0 -96
data/README
CHANGED
@@ -5,26 +5,27 @@
 Parsing Expressions for Ruby
 
 
-Citrus is a compact and powerful parsing library for
-
-
+Citrus is a compact and powerful parsing library for
+[Ruby](http://ruby-lang.org/) that combines the elegance and expressiveness of
+the language with the simplicity and power of
+[parsing expressions](http://en.wikipedia.org/wiki/Parsing_expression_grammar).
 
 
-
+# Installation
 
 
-Via RubyGems:
+Via [RubyGems](http://rubygems.org/):
 
-
+$ sudo gem install citrus
 
 From a local copy:
 
-
-
-
+$ git clone git://github.com/mjijackson/citrus.git
+$ cd citrus
+$ rake package && sudo rake install
 
 
-
+# Background
 
 
 In order to be able to use Citrus effectively, you must first understand the
@@ -36,8 +37,8 @@ sentences should end with a period.
 Semantics are the rules by which meaning may be derived in a language. For
 example, as you read a book you are able to make some sense of the particular
 way in which words on a page are combined to form thoughts and express ideas
-because you understand what the words themselves mean and you
-
+because you understand what the words themselves mean and you understand what
+they mean collectively.
 
 Computers use a similar process when interpreting code. First, the code must be
 parsed into recognizable symbols or tokens. These tokens may then be passed to
@@ -49,14 +50,15 @@ powerful parsers that are simple to understand and easy to create and maintain.
 
 In Citrus, there are three main types of objects: rules, grammars, and matches.
 
-
+## Rules
 
-A
-two types of rules: terminals and non-terminals.
-strings or regular expressions that specify some
-terminal created from the string "end" would
-characters "e", "n", and "d", in that order. A
-expression uses Ruby's regular expression engine
+A [Rule](api/classes/Citrus/Rule.html) is an object that specifies some matching
+behavior on a string. There are two types of rules: terminals and non-terminals.
+Terminals can be either Ruby strings or regular expressions that specify some
+input to match. For example, a terminal created from the string "end" would
+match any sequence of the characters "e", "n", and "d", in that order. A
+terminal created from a regular expression uses Ruby's regular expression engine
+to attempt to create a match.
 
 Non-terminals are rules that may contain other rules but do not themselves match
 directly on the input. For example, a Repeat is a non-terminal that may contain
@@ -64,34 +66,35 @@ one other rule that will try and match a certain number of times. Several other
 types of non-terminals are available that will be discussed later.
 
 Rule objects may also have semantic information associated with them in the form
-of Ruby modules.
-match objects created by the rule with which they are associated.
+of Ruby modules. Rules use these modules to extend the matches they create.
 
-
+## Grammars
 
 A grammar is a container for rules. Usually the rules in a grammar collectively
 form a complete specification for some language, or a well-defined subset
 thereof.
 
-A Citrus grammar is really just a souped-up Ruby
+A Citrus grammar is really just a souped-up Ruby
+[module](http://ruby-doc.org/core/classes/Module.html). These modules may be
 included in other grammar modules in the same way that Ruby modules are normally
-used. This property allows you to divide a complex grammar into
-that may be combined
-as a rule in an included grammar may access that rule with a mechanism
-to Ruby's super keyword.
+used. This property allows you to divide a complex grammar into more manageable,
+reusable pieces that may be combined at runtime. Any grammar rule with the same
+name as a rule in an included grammar may access that rule with a mechanism
+similar to Ruby's super keyword.
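The module-and-super mechanism described in the Grammars section mirrors ordinary Ruby behavior. The following is a plain-Ruby analogy only, not Citrus code: the module, class, and method names are hypothetical and were chosen for illustration. It shows a method in an including module reaching the same-named method of the included module through `super`.

```ruby
# Plain-Ruby analogy; none of these names come from the Citrus API.
module BaseGrammar
  def number
    "digits"
  end
end

module ExtendedGrammar
  include BaseGrammar

  # Same name as BaseGrammar#number; `super` reaches the included definition.
  def number
    "signed " + super
  end
end

class Host
  include ExtendedGrammar
end

puts Host.new.number
```

The lookup order here is Host, then ExtendedGrammar, then BaseGrammar, which is why `super` inside ExtendedGrammar#number finds the BaseGrammar version.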
 
-
+## Matches
 
-Matches are created by rule objects when they match on the input. A
-
-
-
-
-grammar.
+Matches are created by rule objects when they match on the input. A
+[Match](api/classes/Citrus/Match.html) in Citrus is actually a
+[String](http://ruby-doc.org/core/classes/String.html) with some extra
+information attached such as the name(s) of the rule(s) which generated the
+match as well as its offset in the original input string.
 
-
-
-
+During a parse, matches are arranged in a tree structure where any match may
+contain any number of other matches. This structure is determined by the way in
+which the rule that generated each match is used in the grammar. For example, a
+match that is created from a non-terminal rule that contains several other
+terminals will likewise contain several matches, one for each terminal.
 
 Match objects may be extended with semantic information in the form of methods.
 These methods can interpret the text of a match using the wealth of information
@@ -99,139 +102,164 @@ available to them including the text of the match, its position in the input,
 and any submatches.
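To make the idea of a match as "a String with some extra information" more concrete, here is a minimal sketch in plain Ruby. It is an assumption-laden stand-in, not the Citrus::Match class: `SketchMatch`, `NumberValue`, and their attributes are hypothetical names used only to show how a String subclass can carry an offset and be extended with a module of semantic methods.

```ruby
# Hypothetical stand-in for illustration; this is not Citrus::Match.
class SketchMatch < String
  attr_reader :offset, :submatches

  def initialize(text, offset = 0, submatches = [])
    super(text)
    @offset = offset
    @submatches = submatches
  end
end

# Semantic information supplied as a module of methods.
module NumberValue
  def value
    # String#strip is available because the match object is a String.
    strip.to_i
  end
end

m = SketchMatch.new("42 ", 4)
m.extend(NumberValue)

puts m.value
puts m.offset
```

Extending a single object with `extend` keeps the semantic methods off every other string in the program, which is one plausible reason for attaching modules per match.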
 
 
-
+# Syntax
 
 
 The most straightforward way to compose a Citrus grammar is to use Citrus' own
 custom grammar syntax. This syntax borrows heavily from Ruby, so it should
 already be familiar to Ruby programmers.
 
-
+## Terminals
 
 Terminals may be represented by a string or a regular expression. Both follow
 the same rules as Ruby string and regular expression literals.
 
-
-
-
+'abc'
+"abc\n"
+/\xFF/
 
 Character classes and the dot (match anything) symbol are supported as well for
 compatibility with other parsing expression implementations.
 
-
-
-
+[a-z0-9] # match any lowercase letter or digit
+[\x00-\xFF] # match any octet
+. # match anything, even new lines
 
-
+See [FixedWidth](api/classes/Citrus/FixedWidth.html) and
+[Expression](api/classes/Citrus/Expression.html) for more information.
+
+## Repetition
 
 Quantifiers may be used after any expression to specify a number of times it
 must match. The universal form of a quantifier is N*M where N is the minimum and
 M is the maximum number of times the expression may match.
 
-
-
-
-
+'abc'1*2 # match "abc" a minimum of one, maximum
+# of two times
+'abc'1* # match "abc" at least once
+'abc'*2 # match "abc" a maximum of twice
 
 The + and ? operators are supported as well for the common cases of 1* and *1
 respectively.
 
-
-
+'abc'+ # match "abc" at least once
+'abc'? # match "abc" a maximum of once
+
+See [Repeat](api/classes/Citrus/Repeat.html) for more information.
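For readers more familiar with regular expressions, the N*M quantifier can be compared to Ruby's `{n,m}` repetition. The sketch below is an analogy under stated assumptions, not how Citrus implements Repeat: `repeat_pattern` is a hypothetical helper, and it only covers a literal string anchored at the start of the input.

```ruby
# Hypothetical helper: builds a Ruby Regexp roughly equivalent to a
# Citrus-style quantifier applied to a literal string. A nil max means
# "no upper bound"; min defaults to zero.
def repeat_pattern(literal, min = 0, max = nil)
  Regexp.new("\\A(?:#{Regexp.escape(literal)}){#{min},#{max}}")
end

one_to_two   = repeat_pattern("abc", 1, 2) # like 'abc'1*2
at_least_one = repeat_pattern("abc", 1)    # like 'abc'1* or 'abc'+
at_most_two  = repeat_pattern("abc", 0, 2) # like 'abc'*2
optional     = repeat_pattern("abc", 0, 1) # like 'abc'?

puts "abcabcabc"[one_to_two]     # consumes two repetitions
puts "abcabcabc"[at_least_one]   # consumes all three repetitions
puts "xyz"[at_most_two].inspect  # zero repetitions is still a match
puts "xyz"[at_least_one].inspect # no match at all
```

A quantifier with a minimum of zero always succeeds, possibly on empty input, which is worth keeping in mind when it appears inside a sequence.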
 
-
+## Lookahead
 
 Both positive and negative lookahead are supported in Citrus. Use the & and !
 operators to indicate that an expression either should or should not match. In
 neither case is any input consumed.
 
-
-
-
+&'a' 'b' # match a "b" preceded by an "a"
+!'a' 'b' # match a "b" that is not preceded by an "a"
+!'a' . # match any character except for "a"
+
+A special form of lookahead is also supported which will match any character
+that does not match a given expression.
 
-
+~'a' # match all characters until an "a"
+~/xyz/ # match all characters until /xyz/ matches
+
+See [AndPredicate](api/classes/Citrus/AndPredicate.html),
+[NotPredicate](api/classes/Citrus/NotPredicate.html), and
+[ButPredicate](api/classes/Citrus/ButPredicate.html) for more information.
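The key property of lookahead, that no input is consumed, can be observed with Ruby's standard `StringScanner`. This is only an analogy built on the standard library and Regexp lookahead syntax; it makes no claim about how Citrus implements its predicates, and the variable names are illustrative.

```ruby
require 'strscan'

s = StringScanner.new("ba")

# Positive lookahead: confirm "b" is next without consuming it.
ahead = s.check(/b/)
pos_after_check = s.pos

# Negative lookahead followed by a dot: consume one character that is not "a".
not_a = s.scan(/(?!a)./m)

# The next character is "a", so the same pattern now fails and the
# scanner position does not move.
blocked = s.scan(/(?!a)./m)
pos_after_block = s.pos

# Comparable to the "until" form: take everything before the next "a".
t = StringScanner.new("xyzabc")
until_a = t.scan(/(?:(?!a).)*/m)

puts ahead.inspect, pos_after_check, not_a.inspect, blocked.inspect, until_a.inspect
```

`check` reports what would match while leaving the position untouched, whereas `scan` advances past whatever it matched.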
+
+## Sequences
 
 Sequences of expressions may be separated by a space to indicate that the rules
 should match in that order.
 
-
-
+'a' 'b' 'c' # match "a", then "b", then "c"
+'a' [0-9] # match "a", then a numeric digit
+
+See [Sequence](api/classes/Citrus/Sequence.html) for more information.
 
-
+## Choices
 
 Ordered choice is indicated by a vertical bar that separates two expressions.
 Note that any operator binds more tightly than the bar.
 
-
-
+'a' | 'b' # match "a" or "b"
+'a' 'b' | 'c' # match "a" then "b" (in sequence), or "c"
 
-
+See [Choice](api/classes/Citrus/Choice.html) for more information.
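The interaction between sequence and ordered choice may be easier to see with a few tiny parser combinators. This is a minimal sketch in plain Ruby under the assumption that a parser is a lambda taking an input string and a position; `str`, `seq`, and `choice` are hypothetical names and not part of Citrus.

```ruby
# Each parser is a lambda taking (input, pos). It returns the new position
# on success or nil on failure. All names here are hypothetical.
def str(literal)
  ->(input, pos) { input[pos, literal.length] == literal ? pos + literal.length : nil }
end

# Sequence: every parser must succeed, each starting where the last ended.
def seq(*parsers)
  ->(input, pos) { parsers.reduce(pos) { |p, parser| p && parser.call(input, p) } }
end

# Ordered choice: try alternatives left to right, keep the first success.
def choice(*parsers)
  lambda do |input, pos|
    result = nil
    parsers.find { |parser| result = parser.call(input, pos) }
    result
  end
end

# 'a' 'b' | 'c' groups as ('a' 'b') | 'c' because sequence binds tighter.
ab_or_c = choice(seq(str("a"), str("b")), str("c"))

puts ab_or_c.call("ab", 0).inspect
puts ab_or_c.call("c", 0).inspect
puts ab_or_c.call("ac", 0).inspect

# Order matters: the first alternative that succeeds wins, even if a
# later one would consume more input.
first_wins = choice(str("a"), str("ab"))
puts first_wins.call("ab", 0).inspect
```

Unlike alternation in a context-free grammar, ordered choice never reconsiders a later alternative once an earlier one has succeeded.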
+
+## Super
 
 When including a grammar inside another, all rules in the child that have the
 same name as a rule in the parent also have access to the "super" keyword to
 invoke the parent rule.
 
-
+See [Super](api/classes/Citrus/Super.html) for more information.
+
+## Labels
 
 Match objects may be referred to by a different name than the rule that
 originally generated them. Labels are created by placing the label and a colon
 immediately preceding any expression.
 
-
-
-
+chars:/[a-z]+/ # the characters matched by the regular
+# expression may be referred to as "chars"
+# in a block method
+
+See [Label](api/classes/Citrus/Label.html) for more information.
 
-
+## Precedence
 
-The following table contains a list of all operators and their
-
+The following table contains a list of all Citrus operators and their
+precedence. A higher precedence indicates tighter binding.
 
-Operator |
-
-'' | 6
-"" | 6
-[] |
-. |
-// |
-() | 6
-* |
-+ |
-? |
-& |
-! |
-
-
-
-
+| Operator | Name | Precedence |
+| ----------- | ------------------------- | ---------- |
+| '' | String (single quoted) | 6 |
+| "" | String (double quoted) | 6 |
+| [] | Character class | 6 |
+| . | Dot (any character) | 6 |
+| // | Regular expression | 6 |
+| () | Grouping | 6 |
+| * | Repetition (arbitrary) | 5 |
+| + | Repetition (one or more) | 5 |
+| ? | Repetition (zero or one) | 5 |
+| & | And predicate | 4 |
+| ! | Not predicate | 4 |
+| ~ | But predicate | 4 |
+| : | Label | 4 |
+| <> | Extension (module name) | 3 |
+| {} | Extension (literal) | 3 |
+| e1 e2 | Sequence | 2 |
+| e1 | e2 | Ordered choice | 1 |
 
 
-
+# Example
 
 
 Below is an example of a simple grammar that is able to parse strings of
-integers separated by any amount of white space and a
+integers separated by any amount of white space and a `+` symbol.
 
-
-
-
-
+grammar Addition
+rule additive
+number plus (additive | number)
+end
 
-
-
-
+rule number
+[0-9]+ space
+end
 
-
-
-
+rule plus
+'+' space
+end
 
-
-
+rule space
+[ \t]*
+end
 end
-end
 
 Several things to note about the above example:
 
-* Grammar and rule declarations end with the
+* Grammar and rule declarations end with the `end` keyword
 * A Sequence of rules is created by separating expressions with a space
 * Likewise, ordered choice is represented with a vertical bar
 * Parentheses may be used to override the natural binding order
@@ -239,15 +267,16 @@ Several things to note about the above example:
 other rule's name
 * Any expression may be followed by a quantifier
 
-
+## Interpretation
 
 The grammar above is able to parse simple mathematical expressions such as "1+2"
 and "1 + 2+3", but it does not have enough semantic information to be able to
 actually interpret these expressions.
 
-At this point, when the grammar parses a string it generates a tree of
-objects. Each match is created by a rule.
-contains, its offset in the original input, and what
+At this point, when the grammar parses a string it generates a tree of
+[Match](api/classes/Citrus/Match.html) objects. Each match is created by a rule.
+A match knows what text it contains, its offset in the original input, and what
+submatches it contains.
 
 Submatches are created whenever a rule contains another rule. For example, in
 the grammar above the number rule matches a string of digits followed by white
@@ -257,32 +286,32 @@ We can use Ruby's block syntax to create a module that will be attached to these
 matches when they are created and is used to lazily extend them when we want to
 interpret them. The following example shows one way to do this.
 
-
-
-
-
-
-
-
+grammar Addition
+rule additive
+(number plus term:(additive | number)) {
+def value
+number.value + term.value
+end
+}
+end
+
+rule number
+([0-9]+ space) {
+def value
+strip.to_i
+end
+}
+end
+
+rule plus
+'+' space
+end
+
+rule space
+[ \t]*
+end
 end
 
-rule number
-([0-9]+ space) {
-def value
-text.strip.to_i
-end
-}
-end
-
-rule plus
-'+' space
-end
-
-rule space
-[ \t]*
-end
-end
-
 In this version of the grammar we have added two semantic blocks, one each for
 the additive and number rules. These blocks contain methods that will be present
 on all match objects that result from matches of those particular rules. It's
@@ -291,48 +320,82 @@ block, which is defined within the number rule.
 
 The semantic block associated with the number rule defines one method, value.
 Inside this method, we can see that the value of a number match is determined to
-be its text value, stripped of white space and converted to an integer.
+be its text value, stripped of white space and converted to an integer. Remember
+that matches are simply strings, so the `strip` method in this case is actually
+`String#strip`.
 
-The additive rule also extends its matches with a value method. Notice the use
-of the
+The `additive` rule also extends its matches with a value method. Notice the use
+of the `term` label within the rule definition. This label allows the match that
 is created by either the additive or the number rule to be retrieved using the
-
-number and term matches added together using Ruby's addition operator.
+`term` label. The value of an additive is determined to be the values of its
+`number` and `term` matches added together using Ruby's addition operator.
 
 Since additive is the first rule defined in the grammar, any match that results
-from parsing a string with this grammar will have a value method that can be
+from parsing a string with this grammar will have a `value` method that can be
 used to recursively calculate the collective value of the entire match tree.
 
 To give it a try, save the code for the Addition grammar in a file called
 addition.citrus. Next, assuming you have the Citrus gem installed, try the
 following sequence of commands in a terminal.
 
-
-
-
-
-
-
-
-
-
+$ irb
+> require 'citrus'
+=> true
+> Citrus.load 'addition'
+=> [Addition]
+> m = Addition.parse '1 + 2 + 3'
+=> #<Citrus::Match ...
+> m.value
+=> 6
 
 Congratulations! You just ran your first piece of Citrus code.
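To see why `m.value` comes out as 6, it can help to trace the same rules by hand. The sketch below is a hypothetical hand-written evaluator in plain Ruby that mirrors the Addition grammar's rules with the standard `StringScanner`. It does not use Citrus, and `AdditionSketch` and its methods are illustrative names only.

```ruby
require 'strscan'

# Hypothetical hand-written evaluator mirroring the Addition grammar.
class AdditionSketch
  def initialize(text)
    @s = StringScanner.new(text)
  end

  # additive <- number plus (additive | number)
  def additive
    start = @s.pos
    left = number
    if left && plus
      right = additive || number
      return left + right if right
    end
    @s.pos = start # backtrack so another alternative can be tried
    nil
  end

  # number <- [0-9]+ space, interpreted as strip.to_i
  def number
    text = @s.scan(/[0-9]+[ \t]*/)
    text && text.strip.to_i
  end

  # plus <- '+' space
  def plus
    @s.scan(/\+[ \t]*/)
  end

  def eos?
    @s.eos?
  end

  # Evaluate the whole input starting from the first rule, additive.
  def self.value(text)
    parser = new(text)
    result = parser.additive
    result if parser.eos?
  end
end

puts AdditionSketch.value("1 + 2 + 3")
puts AdditionSketch.value("1+2")
puts AdditionSketch.value("7").inspect
```

The recursion nests to the right: "1 + 2 + 3" is evaluated as 1 + (2 + 3), with the innermost `additive` attempt failing and falling back to `number`. A lone "7" yields nil here because the root rule requires at least one `+`.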
 
-Take a look at
-
+Take a look at
+[examples/calc.citrus](http://github.com/mjijackson/citrus/blob/master/examples/calc.citrus)
+for an example of a calculator that is able to parse and evaluate more complex
+mathematical expressions.
+
+## Implicit Value
+
+It is very common for a grammar to only have one interpretation for a given
+symbol. For this reason, you may find yourself writing a `value` method for
+every rule in your grammar. Because this can be tedious, Citrus allows you to
+omit defining such a method if you choose. For example, the `additive` and
+`number` rules from the simple calculator example above could also be written
+as:
+
+rule additive
+(number plus term:(additive | number)) {
+number.value + term.value
+}
+end
+
+rule number
+([0-9]+ space) {
+strip.to_i
+}
+end
+
+Since no method name is explicitly specified in the semantic blocks, they may be
+called using the `value` method.
+
+
+# Links
 
 
-
+The primary resource for all things to do with parsing expressions can be found
+on the original [Packrat and Parsing Expression Grammars page](http://pdos.csail.mit.edu/~baford/packrat) at MIT.
 
+Also, a useful summary of parsing expression grammars can be found on
+[Wikipedia](http://en.wikipedia.org/wiki/Parsing_expression_grammar).
 
-
-
-http://
-
+Citrus draws inspiration from another Ruby library for writing parsing
+expression grammars, Treetop. While Citrus' syntax is similar to that of
+[Treetop](http://treetop.rubyforge.org), it's not identical. The link is
+included here for those who may wish to explore an alternative implementation.
 
 
-
+# License
 
 
 Copyright 2010 Michael Jackson