whittle 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,4 @@
1
+ *.gem
2
+ .bundle
3
+ Gemfile.lock
4
+ pkg/*
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --colour
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in whittle.gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2011 Chris Corbyn
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,468 @@
1
+ # Whittle: A little LALR(1) Parser in Pure Ruby — Not a Generator
2
+
3
+ Whittle is a LALR(1) parser. It's very small, easy to understand, and what's most important,
4
+ it's 100% ruby. You write parsers by specifying sequences of allowable rules (which refer to
5
+ other rules, or even to themselves), and for each rule in your grammar, you provide a block that
6
+ is invoked when the grammar is recognized.
7
+
8
+ If you're not familiar with parsing, you should find Whittle to be a very friendly little
9
+ parser.
10
+
11
+ It is related, somewhat, to yacc and bison, which belong to the class of parsers known as
12
+ LALR(1): Look-Ahead LR, i.e. left-to-right scanning producing a rightmost derivation (using 1 lookahead token). This class of parsers is both easy to
13
+ work with, and powerful.
14
+
15
+ Whittle provides meaningful error reporting and even lets you hook into the error handling logic
16
+ if you need to write some sort of crazy madman forgiving parser.
17
+
18
+ ## The Basics
19
+
20
+ Parsers using Whittle are *not* generated. This may strike users of other LALR(1) parsers as
21
+ odd, but c'mon, we're using Ruby, right?
22
+
23
+ I'll avoid discussing the algorithm until we get into the really advanced stuff, but you will
24
+ need to understand a few fundamental ideas before we begin.
25
+
26
+ 1. There are two types of rule that make up a complete parser: terminal, and nonterminal
27
+ - A terminal rule is quite simply a chunk of the input string, like '42', or 'function'
28
+ - A nonterminal rule is a rule that makes reference to other rules (terminal and nonterminal)
29
+ 2. The input to be parsed *always* conforms to just one rule at the topmost level. This is
30
+ known as the "start rule".
31
+
32
+ The easiest way to understand how the parser works is just to learn by example, so let's see an
33
+ example.
34
+
35
+ ``` ruby
36
+ require 'whittle'
37
+
38
+ class Mathematician < Whittle::Parser
39
+ rule("+")
40
+
41
+ rule(:int) do |r|
42
+ r[/[0-9]+/].as { |num| Integer(num) }
43
+ end
44
+
45
+ rule(:expr) do |r|
46
+ r[:int, "+", :int].as { |a, _, b| a + b }
47
+ end
48
+
49
+ start(:expr)
50
+ end
51
+
52
+ mathematician = Mathematician.new
53
+ mathematician.parse("1+2")
54
+ # => 3
55
+ ```
56
+
57
+ Let's break this down a bit. As you can see, the whole thing is really just `rule` used in
58
+ different ways. We also have to set the rule that we can use to describe an entire program,
59
+ which in this case is the `:expr` rule that can add two numbers together.
60
+
61
+ There are two terminal rules (`"+"` and `:int`) and one nonterminal (`:expr`) in the above
62
+ grammar. Each rule can have a block attached to it. The block is invoked with the result of
63
+ evaluating the blocks that are attached to each input (recursively). A rule with no block
64
+ attached is just a shorthand way of saying "return the input verbatim", so our "+" above receives
65
+ the string "+" and returns the string "+". Since this is such a common use-case, Whittle offers
66
+ the shorthand.
67
+
68
+ As the input string is parsed, it *must* match the start rule `:expr`. Whittle reads the "1",
69
+ which matches `:int` (which casts the String "1" to the Integer 1). Next, the parser looks for the
70
+ expected "+", which it gets. Now it looks for another `:int`, which it gets. Upon having
71
+ read the sequence `:int`, `"+"`, `:int`, Whittle invokes the block for `:expr` with the arguments
72
+ 1, "+", 2, returning the 3 we expect.
73
+
74
+ ## Nonterminal rules can have more than one valid sequence
75
+
76
+ Our mathematician class above is not much of a mathematician. It can only add numbers together.
77
+ Surely subtraction, division and multiplication should be possible too?
78
+
79
+ It turns out that this is really simple to do. Just add multiple possibilities to the same
80
+ rule.
81
+
82
+ ``` ruby
83
+ require 'whittle'
84
+
85
+ class Mathematician < Whittle::Parser
86
+ rule("+")
87
+ rule("-")
88
+ rule("*")
89
+ rule("/")
90
+
91
+ rule(:int) do |r|
92
+ r[/[0-9]+/].as { |num| Integer(num) }
93
+ end
94
+
95
+ rule(:expr) do |r|
96
+ r[:int, "+", :int].as { |a, _, b| a + b }
97
+ r[:int, "-", :int].as { |a, _, b| a - b }
98
+ r[:int, "*", :int].as { |a, _, b| a * b }
99
+ r[:int, "/", :int].as { |a, _, b| a / b }
100
+ end
101
+
102
+ start(:expr)
103
+ end
104
+
105
+ mathematician = Mathematician.new
106
+
107
+ mathematician.parse("1+2")
108
+ # => 3
109
+
110
+ mathematician.parse("1-2")
111
+ # => -1
112
+
113
+ mathematician.parse("2*3")
114
+ # => 6
115
+
116
+ mathematician.parse("4/2")
117
+ # => 2
118
+ ```
119
+
120
+ Now you're probably seeing how matching just one rule for the entire input is not a problem.
121
+
122
+ ## Rules can refer to themselves
123
+
124
+ But our mathematician is still not very bright. It can only work with two operands. What about
125
+ more complex expressions?
126
+
127
+ ``` ruby
128
+ require 'whittle'
129
+
130
+ class Mathematician < Whittle::Parser
131
+ rule("+")
132
+ rule("-")
133
+ rule("*")
134
+ rule("/")
135
+
136
+ rule(:int) do |r|
137
+ r[/[0-9]+/].as { |num| Integer(num) }
138
+ end
139
+
140
+ rule(:expr) do |r|
141
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
142
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
143
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
144
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
145
+ r[:int].as(:value)
146
+ end
147
+
148
+ start(:expr)
149
+ end
150
+
151
+ mathematician = Mathematician.new
152
+ mathematician.parse("1+5-2")
153
+ # => 4
154
+ ```
155
+
156
+ Adding a rule of just `:int` to the `:expr` rule means that any integer is also a valid `:expr`.
157
+ It is now possible to say that any `:expr` can be added to, multiplied by, divided by or
158
+ subtracted from another `:expr`. It is this ability to self-reference that makes LALR(1)
159
+ parsers so powerful and easy to use. Note that because the result of each rule is computed
160
+ *before* being passed as arguments to the block, each `:expr` in the calculations above will
161
+ always be a number, since each `:expr` returns a number.
162
+
163
+ ## Specifying the associativity
164
+
165
+ Our mathematician still isn't very clever, however. It makes some silly mistakes. Let's see
166
+ what happens when we do the following:
167
+
168
+ ``` ruby
169
+ mathematician.parse("6-3-1")
170
+ # => 4
171
+ ```
172
+
173
+ Oops. That's not correct. Shouldn't the answer be 2?
174
+
175
+ Our grammar is ambiguous. The input string could be interpreted as either:
176
+
177
+ 6-(3-1)
178
+
179
+ Or as:
180
+
181
+ (6-3)-1
182
+
183
+ Basic arithmetic takes the latter approach, but the parser's default approach is to go the other
184
+ way. We refer to these two alternatives as being left associative (the second example) and
185
+ right associative (the first example). By default, operators are right associative, which means
186
+ as much input will be read as possible before beginning to compute a result.
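
If the two groupings feel abstract, here's a plain Ruby sketch of them. It doesn't use Whittle at all; `fold_left` and `fold_right` are hypothetical helper names invented for this illustration, and they simply subtract a list of operands in each direction.

``` ruby
# Plain Ruby, no Whittle involved: the two ways of grouping "6-3-1".

# Left associative: ((6-3)-1). Combine operands starting from the left.
def fold_left(operands)
  operands.inject { |acc, n| acc - n }
end

# Right associative: (6-(3-1)). Combine operands starting from the right.
def fold_right(operands)
  operands.reverse.inject { |acc, n| n - acc }
end

puts fold_left([6, 3, 1])   # prints 2
puts fold_right([6, 3, 1])  # prints 4
```

The right-hand fold gives the surprising 4 from the example above, and the left-hand fold gives the 2 that arithmetic expects.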
187
+
188
+ We can correct this by tagging our operators as left associative.
189
+
190
+ ``` ruby
191
+ require 'whittle'
192
+
193
+ class Mathematician < Whittle::Parser
194
+ rule("+") % :left
195
+ rule("-") % :left
196
+ rule("*") % :left
197
+ rule("/") % :left
198
+
199
+ rule(:int) do |r|
200
+ r[/[0-9]+/].as { |num| Integer(num) }
201
+ end
202
+
203
+ rule(:expr) do |r|
204
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
205
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
206
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
207
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
208
+ r[:int].as(:value)
209
+ end
210
+
211
+ start(:expr)
212
+ end
213
+
214
+ mathematician = Mathematician.new
215
+ mathematician.parse("6-3-1")
216
+ # => 2
217
+ ```
218
+
219
+ Attaching a percent sign followed by either `:left` or `:right` changes the associativity of a
220
+ rule. We now get the correct result.
221
+
222
+ ## Specifying the operator precedence
223
+
224
+ Well, despite fixing the associativity, we find we still have a problem:
225
+
226
+ ``` ruby
227
+ mathematician.parse("1+2*3")
228
+ # => 9
229
+ ```
230
+
231
+ Hmm. The expression has been interpreted as (1+2)*3. It turns out arithmetic is not as simple
232
+ as one might think ;) The parser does not (yet) know that the multiplication operator has a
233
+ higher precedence than the addition operator. We need to indicate this in the grammar.
234
+
235
+ ``` ruby
236
+ require 'whittle'
237
+
238
+ class Mathematician < Whittle::Parser
239
+ rule("+") % :left ^ 1
240
+ rule("-") % :left ^ 1
241
+ rule("*") % :left ^ 2
242
+ rule("/") % :left ^ 2
243
+
244
+ rule(:int) do |r|
245
+ r[/[0-9]+/].as { |num| Integer(num) }
246
+ end
247
+
248
+ rule(:expr) do |r|
249
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
250
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
251
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
252
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
253
+ r[:int].as(:value)
254
+ end
255
+
256
+ start(:expr)
257
+ end
258
+
259
+ mathematician = Mathematician.new
260
+ mathematician.parse("1+2*3")
261
+ # => 7
262
+ ```
263
+
264
+ That's better. We can attach a precedence level to a rule by following it with the caret `^`,
265
+ followed by an integer value. The higher the value, the higher the precedence. Note that "+"
266
+ and "-" both have the same precedence, since "1+(2-3)" and "(1+2)-3" are logically equivalent.
267
+ The same applies to "*" and "/", but these both usually have a higher precedence than "+" and
268
+ "-".
269
+
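
To see how precedence levels and associativity combine, here's a small evaluator in plain Ruby. It is only a sketch: it uses precedence climbing, which is a different technique from the LALR(1) algorithm Whittle uses, and the `OPERATORS` table, `tokenize` and `evaluate` are names made up for this example. The table copies the levels from the grammar above.

``` ruby
# Hypothetical table: operator => [precedence, associativity],
# mirroring the levels used in the Mathematician grammar.
OPERATORS = {
  "+" => [1, :left],
  "-" => [1, :left],
  "*" => [2, :left],
  "/" => [2, :left]
}

# Split "1+2*3" into [1, "+", 2, "*", 3].
def tokenize(input)
  input.scan(/\d+|[-+*\/]/).map { |t| t =~ /\A\d+\z/ ? Integer(t) : t }
end

# Precedence climbing: read an operand, then keep consuming operators
# whose precedence is at least min_prec.
def evaluate(tokens, min_prec = 1)
  lhs = tokens.shift
  while (op = tokens.first) && OPERATORS[op] && OPERATORS[op][0] >= min_prec
    prec, assoc = OPERATORS[tokens.shift]
    # A left associative operator must not absorb another operator of
    # the same level on its right, so the right side needs a higher level.
    rhs = evaluate(tokens, assoc == :left ? prec + 1 : prec)
    lhs = lhs.send(op, rhs)
  end
  lhs
end

puts evaluate(tokenize("1+2*3"))  # prints 7
puts evaluate(tokenize("6-3-1"))  # prints 2
```

Changing `"*"` to level 1 in the table makes "1+2*3" evaluate to 9 again, which is the mistake the precedence levels are there to prevent.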
270
+ ## Disambiguating expressions with the use of parentheses
271
+
272
+ Sometimes we really do want "1+2*3" to mean "(1+2)*3", so we should really support this in our
273
+ mathematician. Fortunately adjusting the syntax rules in Whittle is a painless exercise.
274
+
275
+ ``` ruby
276
+ require 'whittle'
277
+
278
+ class Mathematician < Whittle::Parser
279
+ rule("+") % :left ^ 1
280
+ rule("-") % :left ^ 1
281
+ rule("*") % :left ^ 2
282
+ rule("/") % :left ^ 2
283
+
284
+ rule("(")
285
+ rule(")")
286
+
287
+ rule(:int) do |r|
288
+ r[/[0-9]+/].as { |num| Integer(num) }
289
+ end
290
+
291
+ rule(:expr) do |r|
292
+ r["(", :expr, ")"].as { |_, exp, _| exp }
293
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
294
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
295
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
296
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
297
+ r[:int].as(:value)
298
+ end
299
+
300
+ start(:expr)
301
+ end
302
+
303
+ mathematician = Mathematician.new
304
+ mathematician.parse("(1+2)*3")
305
+ # => 9
306
+ ```
307
+
308
+ All we had to do was add the new terminal rules for "(" and ")" then specify that the value of
309
+ an expression enclosed in parentheses is simply the value of the expression itself.
310
+
311
+ ## Skipping whitespace
312
+
313
+ Most languages contain tokens that are ignored when interpreting the input, such as whitespace
314
+ and comments. Accounting for the possibility of these in all rules would be both wasteful and
315
+ tiresome. Instead, we skip them entirely, by declaring a terminal rule without any associated
316
+ action, or if you want to be explicit, with `as(:nothing)`.
317
+
318
+ ``` ruby
319
+ require 'whittle'
320
+
321
+ class Mathematician < Whittle::Parser
322
+ rule(:wsp) do |r|
323
+ r[/\s+/]
324
+ end
325
+
326
+ rule("+") % :left ^ 1
327
+ rule("-") % :left ^ 1
328
+ rule("*") % :left ^ 2
329
+ rule("/") % :left ^ 2
330
+
331
+ rule("(")
332
+ rule(")")
333
+
334
+ rule(:int) do |r|
335
+ r[/[0-9]+/].as { |num| Integer(num) }
336
+ end
337
+
338
+ rule(:expr) do |r|
339
+ r["(", :expr, ")"].as { |_, exp, _| exp }
340
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
341
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
342
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
343
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
344
+ r[:int].as(:value)
345
+ end
346
+
347
+ start(:expr)
348
+ end
349
+
350
+ mathematician = Mathematician.new
351
+ mathematician.parse("( 1 + 2)*3 - 4")
352
+ # => 5
353
+ ```
354
+
355
+ Now the whitespace can either exist between the tokens in the input or not. The parser doesn't
356
+ pay attention to it; it simply discards it as the input string is read.
357
+
358
+ ## Rules can be empty
359
+
360
+ Sometimes you want to describe a structure, such as a list, that may have zero or more items in
361
+ it. The empty rule is extremely useful for this. Imagine the input string:
362
+
363
+ (((())))
364
+
365
+ We can say that this is matched by any pair of parentheses inside any pair of parentheses, any
366
+ number of times. But what's in the middle?
367
+
368
+ ``` ruby
369
+ require 'whittle'
370
+
371
+ class Parser < Whittle::Parser
372
+ rule("(")
373
+ rule(")")
374
+
375
+ rule(:parens) do |r|
376
+ r[]
377
+ r["(", :parens, ")"]
378
+ end
379
+
380
+ start(:parens)
381
+ end
382
+ ```
383
+
384
+ The above parser will happily match our input, because it is possible for the `:parens` rule to
385
+ match nothing at all, which is what we hit in the middle of our nested parentheses.
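
The shape of that grammar can be mirrored in a few lines of plain Ruby. This is just a sketch to show why the empty alternative matters; it is a hand-written recursive check, not how Whittle parses, and `nested_parens?` is a hypothetical name.

``` ruby
# Mirrors the grammar: parens := (nothing) | "(" parens ")"
def nested_parens?(input)
  # The empty alternative: this is what ends the recursion.
  return true if input.empty?

  # The other alternative: an outer pair wrapped around a smaller match.
  input.start_with?("(") && input.end_with?(")") && nested_parens?(input[1..-2])
end

puts nested_parens?("(((())))")  # prints true
puts nested_parens?("(()")       # prints false
```

Without the first line there would be nothing to match once the innermost pair has been peeled away, which is exactly the gap the empty rule fills.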
386
+
387
+ This is most useful in constructs like the following:
388
+
389
+ ``` ruby
390
+ rule(:id) do |r|
391
+ r[/[a-z]+/].as(:value)
392
+ end
393
+
394
+ rule(:list) do |r|
395
+ r[].as { [] }
396
+ r[:list, ",", :id].as { |list, _, id| list << id }
397
+ r[:id].as { |id| [id] }
398
+ end
399
+ ```
400
+
401
+ The above would return the array `["a", "b", "c"]` given the input string "a, b, c", or
402
+ given the input string "" (nothing) it would return the empty array.
403
+
404
+ ## Parse errors
405
+
406
+ ### The default error reporting
407
+
408
+ When the parser encounters an unexpected token in the input, an exception of type
409
+ `Whittle::ParseError` is raised. The exception has a very clear message, indicates the line on
410
+ which the error was encountered, and additionally gives you programmatic access to the same
411
+ information.
412
+
413
+ ``` ruby
414
+ class ListParser < Whittle::Parser
415
+ rule(:wsp) do |r|
416
+ r[/\s+/]
417
+ end
418
+
419
+ rule(:id) do |r|
420
+ r[/[a-z]+/].as(:value)
421
+ end
422
+
423
+ rule(",")
424
+ rule("-")
425
+
426
+ rule(:list) do |r|
427
+ r[:list, ",", :id].as { |list, _, id| list << id }
428
+ r[:id].as { |id| Array(id) }
429
+ end
430
+
431
+ start(:list)
432
+ end
433
+
434
+ ListParser.new.parse("a, \nb, \nc- \nd")
435
+
436
+ # =>
437
+ # Parse error: expected "," but got "-" on line 3
438
+ ```
439
+
440
+ You can also access `#line`, `#expected` and `#received` if you catch the exception.
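
Catching the exception might look like the sketch below. To keep it runnable without the gem installed, it re-declares `Whittle::Error` and `Whittle::ParseError` in the same shape as they appear in this release and raises one by hand. `fake_parse` and `describe_failure` are hypothetical helpers; in real code you would call `ListParser.new.parse(...)` where `fake_parse` is.

``` ruby
# Re-declared here only so the example runs standalone.
module Whittle
  class Error < RuntimeError; end

  class ParseError < Error
    attr_reader :line, :expected, :received

    def initialize(message, line, expected, received)
      super(message)
      @line = line
      @expected = expected
      @received = received
    end
  end
end

# Hypothetical stand-in for a failing ListParser#parse call.
def fake_parse
  raise Whittle::ParseError.new(
    'Parse error: expected "," but got "-" on line 3', 3, [","], "-"
  )
end

# Rescue the error and read the same details programmatically.
def describe_failure
  fake_parse
rescue Whittle::ParseError => e
  "line=#{e.line} expected=#{e.expected.inspect} received=#{e.received.inspect}"
end

puts describe_failure  # prints line=3 expected=[","] received="-"
```

Since every Whittle exception descends from `Whittle::Error`, rescuing that class instead would catch grammar and lexer errors too.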
441
+
442
+ ### Recovering from a parse error
443
+
444
+ It is possible to override the `#error` method in the parser to do something smart if you
445
+ believe there to be easily resolved parse errors (such as switching the input token to
446
+ something else, or rewinding the parse stack to a point where the error would not manifest). I
447
+ need to write some specs on this and explore it fully myself before I document it. 99% of users
448
+ would never need to do such a thing.
449
+
450
+ ## TODO
451
+
452
+ - Provide a more powerful (state based) lexer algorithm, or at least document how users can
453
+ override `#lex`.
454
+ - Allow inspection of the parse table (it is not very human friendly right now).
455
+ - Allow inspection of the AST (maybe).
456
+ - Given an input String, provide a human-readable explanation of the parse.
457
+
458
+ ## License & Copyright
459
+
460
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
461
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
462
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
463
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
464
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
465
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
466
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
467
+
468
+ Copyright (c) Chris Corbyn, 2011
data/Rakefile ADDED
@@ -0,0 +1 @@
1
+ require "bundler/gem_tasks"
@@ -0,0 +1,9 @@
1
+ # Whittle: A little LALR(1) parser in pure ruby, without a generator.
2
+ #
3
+ # Copyright (c) Chris Corbyn, 2011
4
+
5
+ module Whittle
6
+ # All exceptions descend from this one.
7
+ class Error < RuntimeError
8
+ end
9
+ end
@@ -0,0 +1,9 @@
1
+ # Whittle: A little LALR(1) parser in pure ruby, without a generator.
2
+ #
3
+ # Copyright (c) Chris Corbyn, 2011
4
+
5
+ module Whittle
6
+ # GrammarError is raised if the developer defines an incorrect grammar.
7
+ class GrammarError < Error
8
+ end
9
+ end
@@ -0,0 +1,35 @@
1
+ # Whittle: A little LALR(1) parser in pure ruby, without a generator.
2
+ #
3
+ # Copyright (c) Chris Corbyn, 2011
4
+
5
+ module Whittle
6
+ # ParseError is raised if the parse encounters an unexpected token in the input.
7
+ #
8
+ # You can extract the line number, the expected input and the received input.
9
+ class ParseError < Error
10
+ attr_reader :line
11
+ attr_reader :expected
12
+ attr_reader :received
13
+
14
+ # Initialize the ParseError with information about the location
15
+ #
16
+ # @param [String] message
17
+ # the exception message displayed to the user
18
+ #
19
+ # @param [Fixnum] line
20
+ # the line on which the unexpected token was encountered
21
+ #
22
+ # @param [Array] expected
23
+ # an array of all possible tokens in the current parser state
24
+ #
25
+ # @param [String, Symbol] received
26
+ # the name of the actually received token
27
+ def initialize(message, line, expected, received)
28
+ super(message)
29
+
30
+ @line = line
31
+ @expected = expected
32
+ @received = received
33
+ end
34
+ end
35
+ end
@@ -0,0 +1,9 @@
1
+ # Whittle: A little LALR(1) parser in pure ruby, without a generator.
2
+ #
3
+ # Copyright (c) Chris Corbyn, 2011
4
+
5
+ module Whittle
6
+ # UnconsumedInputError is raised if the lexical analyzer itself cannot find any tokens.
7
+ class UnconsumedInputError < Error
8
+ end
9
+ end