citrus 1.7.0 → 1.8.0

data/README CHANGED
@@ -5,26 +5,27 @@
5
5
  Parsing Expressions for Ruby
6
6
 
7
7
 
8
- Citrus is a compact and powerful parsing library for Ruby that combines the
9
- elegance and expressiveness of the language with the simplicity and power of
10
- parsing expressions.
8
+ Citrus is a compact and powerful parsing library for
9
+ [Ruby](http://ruby-lang.org/) that combines the elegance and expressiveness of
10
+ the language with the simplicity and power of
11
+ [parsing expressions](http://en.wikipedia.org/wiki/Parsing_expression_grammar).
11
12
 
12
13
 
13
- = Installation
14
+ # Installation
14
15
 
15
16
 
16
- Via RubyGems:
17
+ Via [RubyGems](http://rubygems.org/):
17
18
 
18
- $ sudo gem install citrus
19
+ $ sudo gem install citrus
19
20
 
20
21
  From a local copy:
21
22
 
22
- $ git clone git://github.com/mjijackson/citrus.git
23
- $ cd citrus
24
- $ rake package && sudo rake install
23
+ $ git clone git://github.com/mjijackson/citrus.git
24
+ $ cd citrus
25
+ $ rake package && sudo rake install
25
26
 
26
27
 
27
- = Background
28
+ # Background
28
29
 
29
30
 
30
31
  In order to be able to use Citrus effectively, you must first understand the
@@ -36,8 +37,8 @@ sentences should end with a period.
36
37
  Semantics are the rules by which meaning may be derived in a language. For
37
38
  example, as you read a book you are able to make some sense of the particular
38
39
  way in which words on a page are combined to form thoughts and express ideas
39
- because you understand what the words themselves mean and you can understand
40
- what they mean collectively.
40
+ because you understand what the words themselves mean and you understand what
41
+ they mean collectively.
41
42
 
42
43
  Computers use a similar process when interpreting code. First, the code must be
43
44
  parsed into recognizable symbols or tokens. These tokens may then be passed to
@@ -49,14 +50,15 @@ powerful parsers that are simple to understand and easy to create and maintain.
49
50
 
50
51
  In Citrus, there are three main types of objects: rules, grammars, and matches.
51
52
 
52
- == Rules
53
+ ## Rules
53
54
 
54
- A rule is an object that specifies some matching behavior on a string. There are
55
- two types of rules: terminals and non-terminals. Terminals can be either Ruby
56
- strings or regular expressions that specify some input to match. For example, a
57
- terminal created from the string "end" would match any sequence of the
58
- characters "e", "n", and "d", in that order. A terminal created from a regular
59
- expression uses Ruby's regular expression engine to attempt to create a match.
55
+ A [Rule](api/classes/Citrus/Rule.html) is an object that specifies some matching
56
+ behavior on a string. There are two types of rules: terminals and non-terminals.
57
+ Terminals can be either Ruby strings or regular expressions that specify some
58
+ input to match. For example, a terminal created from the string "end" would
59
+ match any sequence of the characters "e", "n", and "d", in that order. A
60
+ terminal created from a regular expression uses Ruby's regular expression engine
61
+ to attempt to create a match.
60
62
 
61
63
  Non-terminals are rules that may contain other rules but do not themselves match
62
64
  directly on the input. For example, a Repeat is a non-terminal that may contain
@@ -64,34 +66,35 @@ one other rule that will try and match a certain number of times. Several other
64
66
  types of non-terminals are available that will be discussed later.
65
67
 
66
68
  Rule objects may also have semantic information associated with them in the form
67
- of Ruby modules. These modules contain methods that will be used to extend any
68
- match objects created by the rule with which they are associated.
69
+ of Ruby modules. Rules use these modules to extend the matches they create.
69
70
 
70
- == Grammars
71
+ ## Grammars
71
72
 
72
73
  A grammar is a container for rules. Usually the rules in a grammar collectively
73
74
  form a complete specification for some language, or a well-defined subset
74
75
  thereof.
75
76
 
76
- A Citrus grammar is really just a souped-up Ruby module. These modules may be
77
+ A Citrus grammar is really just a souped-up Ruby
78
+ [module](http://ruby-doc.org/core/classes/Module.html). These modules may be
77
79
  included in other grammar modules in the same way that Ruby modules are normally
78
- used. This property allows you to divide a complex grammar into reusable pieces
79
- that may be combined dynamically at runtime. Any grammar rule with the same name
80
- as a rule in an included grammar may access that rule with a mechanism similar
81
- to Ruby's super keyword.
80
+ used. This property allows you to divide a complex grammar into more manageable,
81
+ reusable pieces that may be combined at runtime. Any grammar rule with the same
82
+ name as a rule in an included grammar may access that rule with a mechanism
83
+ similar to Ruby's super keyword.
82
84
 
83
- == Matches
85
+ ## Matches
84
86
 
85
- Matches are created by rule objects when they match on the input. A match
86
- contains the string of text that made up the match as well as its offset in the
87
- original input string. During a parse, matches are arranged in a tree structure
88
- where any match may contain any number of other matches. This structure is
89
- determined by the way in which the rule that generated each match is used in the
90
- grammar.
87
+ Matches are created by rule objects when they match on the input. A
88
+ [Match](api/classes/Citrus/Match.html) in Citrus is actually a
89
+ [String](http://ruby-doc.org/core/classes/String.html) with some extra
90
+ information attached such as the name(s) of the rule(s) which generated the
91
+ match as well as its offset in the original input string.
91
92
 
92
- For example, a match that is created from a non-terminal rule that contains
93
- several other terminals will likewise contain several matches, one for each
94
- terminal.
93
+ During a parse, matches are arranged in a tree structure where any match may
94
+ contain any number of other matches. This structure is determined by the way in
95
+ which the rule that generated each match is used in the grammar. For example, a
96
+ match that is created from a non-terminal rule that contains several other
97
+ terminals will likewise contain several matches, one for each terminal.
95
98
 
96
99
  Match objects may be extended with semantic information in the form of methods.
97
100
  These methods can interpret the text of a match using the wealth of information
@@ -99,139 +102,164 @@ available to them including the text of the match, its position in the input,
99
102
  and any submatches.
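The idea that a match is a string carrying extra information can be sketched in plain Ruby. The class below is a hypothetical stand-in written for illustration, not the library's `Citrus::Match`; the attribute names `names`, `offset`, and `matches` are assumptions.

```ruby
# Hypothetical stand-in for illustration only; this is not Citrus::Match.
class SketchMatch < String
  attr_reader :names, :offset, :matches

  def initialize(text, names: [], offset: 0, matches: [])
    super(text)
    @names = names      # name(s) of the rule(s) that produced this match
    @offset = offset    # position of the match in the original input
    @matches = matches  # submatches, which form the match tree
  end
end

number = SketchMatch.new("12", names: [:number], offset: 0)
plus   = SketchMatch.new("+", names: [:plus], offset: 2)
root   = SketchMatch.new("12+", names: [:additive], offset: 0,
                         matches: [number, plus])

puts root.length                          # ordinary String methods still work
puts root.matches.map(&:offset).inspect   # offsets of the two submatches
```

Because the sketch subclasses `String`, every string method is available on a match without any delegation code.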
100
103
 
101
104
 
102
- = Syntax
105
+ # Syntax
103
106
 
104
107
 
105
108
  The most straightforward way to compose a Citrus grammar is to use Citrus' own
106
109
  custom grammar syntax. This syntax borrows heavily from Ruby, so it should
107
110
  already be familiar to Ruby programmers.
108
111
 
109
- == Terminals
112
+ ## Terminals
110
113
 
111
114
  Terminals may be represented by a string or a regular expression. Both follow
112
115
  the same rules as Ruby string and regular expression literals.
113
116
 
114
- 'abc'
115
- "abc\n"
116
- /\xFF/
117
+ 'abc'
118
+ "abc\n"
119
+ /\xFF/
117
120
 
118
121
  Character classes and the dot (match anything) symbol are supported as well for
119
122
  compatibility with other parsing expression implementations.
120
123
 
121
- [a-z0-9] # match any lowercase letter or digit
122
- [\x00-\xFF] # match any octet
123
- . # match anything, even new lines
124
+ [a-z0-9] # match any lowercase letter or digit
125
+ [\x00-\xFF] # match any octet
126
+ . # match anything, even new lines
124
127
 
125
- == Repetition
128
+ See [FixedWidth](api/classes/Citrus/FixedWidth.html) and
129
+ [Expression](api/classes/Citrus/Expression.html) for more information.
130
+
131
+ ## Repetition
126
132
 
127
133
  Quantifiers may be used after any expression to specify a number of times it
128
134
  must match. The universal form of a quantifier is N*M where N is the minimum and
129
135
  M is the maximum number of times the expression may match.
130
136
 
131
- 'abc'1*2 # match "abc" a minimum of one, maximum
132
- # of two times
133
- 'abc'1* # match "abc" at least once
134
- 'abc'*2 # match "abc" a maximum of twice
137
+ 'abc'1*2 # match "abc" a minimum of one, maximum
138
+ # of two times
139
+ 'abc'1* # match "abc" at least once
140
+ 'abc'*2 # match "abc" a maximum of twice
135
141
 
136
142
  The + and ? operators are supported as well for the common cases of 1* and *1
137
143
  respectively.
138
144
 
139
- 'abc'+ # match "abc" at least once
140
- 'abc'? # match "abc" a maximum of once
145
+ 'abc'+ # match "abc" at least once
146
+ 'abc'? # match "abc" a maximum of once
147
+
148
+ See [Repeat](api/classes/Citrus/Repeat.html) for more information.
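The N*M quantifier can be sketched in plain Ruby as a greedy loop with a lower and an upper bound. The helper `repeat_match` is hypothetical and uses the standard library's `StringScanner`, not the Citrus API.

```ruby
require 'strscan'

# Hypothetical helper: greedily match `pattern` at the start of `input`
# between `min` and `max` times. Returns the matched text, or nil when
# fewer than `min` repetitions are found.
def repeat_match(input, pattern, min: 0, max: Float::INFINITY)
  scanner = StringScanner.new(input)
  count = 0
  while count < max && (piece = scanner.scan(pattern))
    break if piece.empty? # guard against looping forever on empty matches
    count += 1
  end
  count >= min ? input.byteslice(0, scanner.pos) : nil
end

p repeat_match("abcabcabc", /abc/, min: 1, max: 2) # like 'abc'1*2
p repeat_match("abcabc", /abc/, min: 1)            # like 'abc'+
p repeat_match("xyz", /abc/, max: 1)               # like 'abc'?
```

Leaving out `min` or `max` mirrors the open-ended forms `*2` and `1*`.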
141
149
 
142
- == Lookahead
150
+ ## Lookahead
143
151
 
144
152
  Both positive and negative lookahead are supported in Citrus. Use the & and !
145
153
  operators to indicate that an expression either should or should not match. In
146
154
  neither case is any input consumed.
147
155
 
148
- &'a' 'b' # match a "b" preceded by an "a"
149
- !'a' 'b' # match a "b" that is not preceded by an "a"
150
- !'a' . # match any character except for "a"
156
+ &'a' 'b' # match a "b" preceded by an "a"
157
+ !'a' 'b' # match a "b" that is not preceded by an "a"
158
+ !'a' . # match any character except for "a"
159
+
160
+ A special form of lookahead is also supported which will match any character
161
+ that does not match a given expression.
151
162
 
152
- == Sequences
163
+ ~'a' # match all characters until an "a"
164
+ ~/xyz/ # match all characters until /xyz/ matches
165
+
166
+ See [AndPredicate](api/classes/Citrus/AndPredicate.html),
167
+ [NotPredicate](api/classes/Citrus/NotPredicate.html), and
168
+ [ButPredicate](api/classes/Citrus/ButPredicate.html) for more information.
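For readers more familiar with regular expressions, the predicates have rough analogues in Ruby's own lookahead syntax. The sketch below uses only plain `Regexp` and is an analogy, not the Citrus implementation; it assumes that the `~` form consumes at least one character, which the text above does not state.

```ruby
# Plain Ruby regular expressions, used only as an analogy.
A_BEFORE_B     = /\Aa(?=b)/        # "a", only when a "b" follows; "b" is not consumed
NOT_A_THEN_ANY = /\A(?!a)./m       # roughly !'a' .  (any character except "a")
UNTIL_A        = /\A(?:(?!a).)+/m  # roughly ~'a'    (assumes one or more characters)

p "ab"[A_BEFORE_B]      # the lookahead checks "b" without consuming it
p "b"[NOT_A_THEN_ANY]
p "a"[NOT_A_THEN_ANY]
p "xyzabc"[UNTIL_A]
```

In each case the lookahead part contributes nothing to the matched text, which is what "no input is consumed" means.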
169
+
170
+ ## Sequences
153
171
 
154
172
  Sequences of expressions may be separated by a space to indicate that the rules
155
173
  should match in that order.
156
174
 
157
- 'a' 'b' 'c' # match "a", then "b", then "c"
158
- 'a' [0-9] # match "a", then a numeric digit
175
+ 'a' 'b' 'c' # match "a", then "b", then "c"
176
+ 'a' [0-9] # match "a", then a numeric digit
177
+
178
+ See [Sequence](api/classes/Citrus/Sequence.html) for more information.
159
179
 
160
- == Choices
180
+ ## Choices
161
181
 
162
182
  Ordered choice is indicated by a vertical bar that separates two expressions.
163
183
  Note that any operator binds more tightly than the bar.
164
184
 
165
- 'a' | 'b' # match "a" or "b"
166
- 'a' 'b' | 'c' # match "a" then "b" (in sequence), or "c"
185
+ 'a' | 'b' # match "a" or "b"
186
+ 'a' 'b' | 'c' # match "a" then "b" (in sequence), or "c"
167
187
 
168
- == Super
188
+ See [Choice](api/classes/Citrus/Choice.html) for more information.
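Ordered choice means the alternatives are tried from left to right and the first one that matches wins, even if a later alternative would match more input. A minimal sketch in plain Ruby, with a hypothetical helper `ordered_choice` that is not part of Citrus:

```ruby
# Hypothetical helper: try each anchored pattern in order and return the
# text matched by the first one that succeeds, or nil if none do.
def ordered_choice(input, *alternatives)
  alternatives.each do |pattern|
    m = pattern.match(input)
    return m[0] if m
  end
  nil
end

p ordered_choice("c", /\Aab/, /\Ac/)   # like 'a' 'b' | 'c'
p ordered_choice("ab", /\Aa/, /\Aab/)  # the first alternative wins
```

The second call shows why the order of alternatives matters: the longer alternative is never tried once the shorter one succeeds.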
189
+
190
+ ## Super
169
191
 
170
192
  When including a grammar inside another, all rules in the child that have the
171
193
  same name as a rule in the parent also have access to the "super" keyword to
172
194
  invoke the parent rule.
173
195
 
174
- == Labels
196
+ See [Super](api/classes/Citrus/Super.html) for more information.
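The mechanism is modeled on ordinary Ruby module inclusion, which can be shown without Citrus at all. In the sketch below the module and method names are hypothetical, and each "rule" is just a method that returns the matched prefix or nil.

```ruby
# Plain Ruby analogy; these modules are not Citrus grammars.
module BaseNumbers
  # Matches an unsigned integer at the start of the input.
  def number(input)
    input[/\A[0-9]+/]
  end
end

module SignedNumbers
  include BaseNumbers

  # Same name as the included rule: try a signed integer first,
  # then fall back to the included definition via super.
  def number(input)
    input[/\A-[0-9]+/] || super
  end
end

class SketchParser
  include SignedNumbers
end

parser = SketchParser.new
p parser.number("-12")
p parser.number("34")
```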
197
+
198
+ ## Labels
175
199
 
176
200
  Match objects may be referred to by a different name than the rule that
177
201
  originally generated them. Labels are created by placing the label and a colon
178
202
  immediately preceding any expression.
179
203
 
180
- chars:/[a-z]+/ # the characters matched by the regular
181
- # expression may be referred to as "chars"
182
- # in a block method
204
+ chars:/[a-z]+/ # the characters matched by the regular
205
+ # expression may be referred to as "chars"
206
+ # in a block method
207
+
208
+ See [Label](api/classes/Citrus/Label.html) for more information.
183
209
 
184
- == Precedence
210
+ ## Precedence
185
211
 
186
- The following table contains a list of all operators and their precedence. A
187
- higher level of precedence indicates tighter binding.
212
+ The following table contains a list of all Citrus operators and their
213
+ precedence. A higher precedence indicates tighter binding.
188
214
 
189
- Operator | Level of Precedence | Name
190
- ----------------------------------------------------------------
191
- '' | 6 | Literal string
192
- "" | 6 | Literal string
193
- [] | 6 | Character class
194
- . | 6 | Any character (dot)
195
- // | 6 | Regular expression
196
- () | 6 | Grouping
197
- * | 5 | Repetition (arbitrary)
198
- + | 5 | Repetition (one or more)
199
- ? | 5 | Repetition (zero or one)
200
- & | 4 | And predicate
201
- ! | 4 | Not predicate
202
- : | 4 | Label
203
- <>, {} | 3 | Extension
204
- e1 e2 | 2 | Sequence
205
- e1 | e2 | 1 | Ordered choice
215
+ | Operator | Name | Precedence |
216
+ | ----------- | ------------------------- | ---------- |
217
+ | '' | String (single quoted) | 6 |
218
+ | "" | String (double quoted) | 6 |
219
+ | [] | Character class | 6 |
220
+ | . | Dot (any character) | 6 |
221
+ | // | Regular expression | 6 |
222
+ | () | Grouping | 6 |
223
+ | * | Repetition (arbitrary) | 5 |
224
+ | + | Repetition (one or more) | 5 |
225
+ | ? | Repetition (zero or one) | 5 |
226
+ | & | And predicate | 4 |
227
+ | ! | Not predicate | 4 |
228
+ | ~ | But predicate | 4 |
229
+ | : | Label | 4 |
230
+ | <> | Extension (module name) | 3 |
231
+ | {} | Extension (literal) | 3 |
232
+ | e1 e2 | Sequence | 2 |
233
+ | e1 | e2 | Ordered choice | 1 |
206
234
 
207
235
 
208
- = Example
236
+ # Example
209
237
 
210
238
 
211
239
  Below is an example of a simple grammar that is able to parse strings of
212
- integers separated by any amount of white space and a + symbol.
240
+ integers separated by any amount of white space and a `+` symbol.
213
241
 
214
- grammar Addition
215
- rule additive
216
- number plus (additive | number)
217
- end
242
+ grammar Addition
243
+ rule additive
244
+ number plus (additive | number)
245
+ end
218
246
 
219
- rule number
220
- [0-9]+ space
221
- end
247
+ rule number
248
+ [0-9]+ space
249
+ end
222
250
 
223
- rule plus
224
- '+' space
225
- end
251
+ rule plus
252
+ '+' space
253
+ end
226
254
 
227
- rule space
228
- [ \t]*
255
+ rule space
256
+ [ \t]*
257
+ end
229
258
  end
230
- end
231
259
 
232
260
  Several things to note about the above example:
233
261
 
234
- * Grammar and rule declarations end with the "end" keyword
262
+ * Grammar and rule declarations end with the `end` keyword
235
263
  * A Sequence of rules is created by separating expressions with a space
236
264
  * Likewise, ordered choice is represented with a vertical bar
237
265
  * Parentheses may be used to override the natural binding order
@@ -239,15 +267,16 @@ Several things to note about the above example:
239
267
  other rule's name
240
268
  * Any expression may be followed by a quantifier
241
269
 
242
- == Interpretation
270
+ ## Interpretation
243
271
 
244
272
  The grammar above is able to parse simple mathematical expressions such as "1+2"
245
273
  and "1 + 2+3", but it does not have enough semantic information to be able to
246
274
  actually interpret these expressions.
247
275
 
248
- At this point, when the grammar parses a string it generates a tree of Match
249
- objects. Each match is created by a rule. A match will know what text it
250
- contains, its offset in the original input, and what submatches it contains.
276
+ At this point, when the grammar parses a string it generates a tree of
277
+ [Match](api/classes/Citrus/Match.html) objects. Each match is created by a rule.
278
+ A match knows what text it contains, its offset in the original input, and what
279
+ submatches it contains.
251
280
 
252
281
  Submatches are created whenever a rule contains another rule. For example, in
253
282
  the grammar above the number rule matches a string of digits followed by white
@@ -257,32 +286,32 @@ We can use Ruby's block syntax to create a module that will be attached to these
257
286
  matches when they are created and is used to lazily extend them when we want to
258
287
  interpret them. The following example shows one way to do this.
259
288
 
260
- grammar Addition
261
- rule additive
262
- (number plus term:(additive | number)) {
263
- def value
264
- number.value + term.value
265
- end
266
- }
289
+ grammar Addition
290
+ rule additive
291
+ (number plus term:(additive | number)) {
292
+ def value
293
+ number.value + term.value
294
+ end
295
+ }
296
+ end
297
+
298
+ rule number
299
+ ([0-9]+ space) {
300
+ def value
301
+ strip.to_i
302
+ end
303
+ }
304
+ end
305
+
306
+ rule plus
307
+ '+' space
308
+ end
309
+
310
+ rule space
311
+ [ \t]*
312
+ end
267
313
  end
268
314
 
269
- rule number
270
- ([0-9]+ space) {
271
- def value
272
- text.strip.to_i
273
- end
274
- }
275
- end
276
-
277
- rule plus
278
- '+' space
279
- end
280
-
281
- rule space
282
- [ \t]*
283
- end
284
- end
285
-
286
315
  In this version of the grammar we have added two semantic blocks, one each for
287
316
  the additive and number rules. These blocks contain methods that will be present
288
317
  on all match objects that result from matches of those particular rules. It's
@@ -291,48 +320,82 @@ block, which is defined within the number rule.
291
320
 
292
321
  The semantic block associated with the number rule defines one method, value.
293
322
  Inside this method, we can see that the value of a number match is determined to
294
- be its text value, stripped of white space and converted to an integer.
323
+ be its text value, stripped of white space and converted to an integer. Remember
324
+ that matches are simply strings, so the `strip` method in this case is actually
325
+ `String#strip`.
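The same idea can be tried in plain Ruby without Citrus: extend a string with a module that defines `value`, and `strip` resolves to `String#strip`. The module name `NumberValue` is hypothetical.

```ruby
# Hypothetical module standing in for the semantic block of the number rule.
module NumberValue
  def value
    strip.to_i # self is the string, so this is String#strip
  end
end

match = String.new("42 \t") # stands in for a match of the number rule
match.extend(NumberValue)
p match.value
```

Only the extended object gains the `value` method; other strings are unaffected.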
295
326
 
296
- The additive rule also extends its matches with a value method. Notice the use
297
- of the "term" label within the rule definition. This label allows the match that
327
+ The `additive` rule also extends its matches with a value method. Notice the use
328
+ of the `term` label within the rule definition. This label allows the match that
298
329
  is created by either the additive or the number rule to be retrieved using the
299
- "term" label. The value of an additive is determined to be the values of its
300
- number and term matches added together using Ruby's addition operator.
330
+ `term` label. The value of an additive is determined to be the values of its
331
+ `number` and `term` matches added together using Ruby's addition operator.
301
332
 
302
333
  Since additive is the first rule defined in the grammar, any match that results
303
- from parsing a string with this grammar will have a value method that can be
334
+ from parsing a string with this grammar will have a `value` method that can be
304
335
  used to recursively calculate the collective value of the entire match tree.
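To make the recursion concrete, here is a hand-written recursive-descent sketch of the same grammar in plain Ruby. It is not how Citrus works internally; the class `AdditionSketch` and its methods are hypothetical, and each rule method returns its value directly instead of building a match tree.

```ruby
require 'strscan'

# Hand-written stand-in for the Addition grammar, for illustration only.
class AdditionSketch
  def initialize(input)
    @scanner = StringScanner.new(input)
  end

  # Returns the computed value, or nil unless the whole input parses.
  def parse
    value = additive
    value if value && @scanner.eos?
  end

  private

  # additive <- number plus (additive | number)
  def additive
    start = @scanner.pos
    left = number
    if left && plus
      right = additive || number # ordered choice: try additive first
      return left + right if right
    end
    @scanner.pos = start # backtrack on failure
    nil
  end

  # number <- [0-9]+ space
  def number
    text = @scanner.scan(/[0-9]+[ \t]*/)
    text && text.strip.to_i
  end

  # plus <- '+' space
  def plus
    @scanner.scan(/\+[ \t]*/)
  end
end

p AdditionSketch.new("1 + 2 + 3").parse
```

For "1 + 2 + 3" the outer call computes 1 plus the value of the inner additive "2 + 3", which is the recursion that `number.value + term.value` expresses in the grammar.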
305
336
 
306
337
  To give it a try, save the code for the Addition grammar in a file called
307
338
  addition.citrus. Next, assuming you have the Citrus gem installed, try the
308
339
  following sequence of commands in a terminal.
309
340
 
310
- $ irb
311
- > require 'citrus'
312
- => true
313
- > Citrus.load 'addition'
314
- => [Addition]
315
- > m = Addition.parse '1 + 2 + 3'
316
- => #<Citrus::Match ...
317
- > m.value
318
- => 6
341
+ $ irb
342
+ > require 'citrus'
343
+ => true
344
+ > Citrus.load 'addition'
345
+ => [Addition]
346
+ > m = Addition.parse '1 + 2 + 3'
347
+ => #<Citrus::Match ...
348
+ > m.value
349
+ => 6
319
350
 
320
351
  Congratulations! You just ran your first piece of Citrus code.
321
352
 
322
- Take a look at examples/calc.citrus for an example of a calculator that is able
323
- to parse and evaluate more complex mathematical expressions.
353
+ Take a look at
354
+ [examples/calc.citrus](http://github.com/mjijackson/citrus/blob/master/examples/calc.citrus)
355
+ for an example of a calculator that is able to parse and evaluate more complex
356
+ mathematical expressions.
357
+
358
+ ## Implicit Value
359
+
360
+ It is very common for a grammar to only have one interpretation for a given
361
+ symbol. For this reason, you may find yourself writing a `value` method for
362
+ every rule in your grammar. Because this can be tedious, Citrus allows you to
363
+ omit defining such a method if you choose. For example, the `additive` and
364
+ `number` rules from the simple calculator example above could also be written
365
+ as:
366
+
367
+ rule additive
368
+ (number plus term:(additive | number)) {
369
+ number.value + term.value
370
+ }
371
+ end
372
+
373
+ rule number
374
+ ([0-9]+ space) {
375
+ strip.to_i
376
+ }
377
+ end
378
+
379
+ Since no method name is explicitly specified in the semantic blocks, they may be
380
+ called using the `value` method.
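One way such a shorthand could work is to wrap the bare block in an anonymous module that defines `value`. This is a guess at the general technique, not a description of Citrus internals, and the helper name `value_extension` is hypothetical.

```ruby
# Hypothetical helper: turn a bare block into a module whose only
# method, `value`, runs the block with the extended object as self.
def value_extension(&block)
  Module.new { define_method(:value, &block) }
end

number_ext = value_extension { strip.to_i }

match = String.new(" 7 ").extend(number_ext)
p match.value
```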
381
+
382
+
383
+ # Links
324
384
 
325
385
 
326
- = Links
386
+ The primary resource for all things to do with parsing expressions can be found
387
+ on the original [Packrat and Parsing Expression Grammars page](http://pdos.csail.mit.edu/~baford/packrat) at MIT.
327
388
 
389
+ Also, a useful summary of parsing expression grammars can be found on
390
+ [Wikipedia](http://en.wikipedia.org/wiki/Parsing_expression_grammar).
328
391
 
329
- http://mjijackson.com/citrus
330
- http://pdos.csail.mit.edu/~baford/packrat/
331
- http://en.wikipedia.org/wiki/Parsing_expression_grammar
332
- http://treetop.rubyforge.org/index.html
392
+ Citrus draws inspiration from another Ruby library for writing parsing
393
+ expression grammars, Treetop. While Citrus' syntax is similar to that of
394
+ [Treetop](http://treetop.rubyforge.org), it's not identical. The link is
395
+ included here for those who may wish to explore an alternative implementation.
333
396
 
334
397
 
335
- = License
398
+ # License
336
399
 
337
400
 
338
401
  Copyright 2010 Michael Jackson