whittle 0.0.1

data/.gitignore ADDED
@@ -0,0 +1,4 @@
1
+ *.gem
2
+ .bundle
3
+ Gemfile.lock
4
+ pkg/*
data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --colour
data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source "http://rubygems.org"
2
+
3
+ # Specify your gem's dependencies in whittle.gemspec
4
+ gemspec
data/LICENSE ADDED
@@ -0,0 +1,20 @@
1
+ Copyright (c) 2011 Chris Corbyn
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining
4
+ a copy of this software and associated documentation files (the
5
+ "Software"), to deal in the Software without restriction, including
6
+ without limitation the rights to use, copy, modify, merge, publish,
7
+ distribute, sublicense, and/or sell copies of the Software, and to
8
+ permit persons to whom the Software is furnished to do so, subject to
9
+ the following conditions:
10
+
11
+ The above copyright notice and this permission notice shall be
12
+ included in all copies or substantial portions of the Software.
13
+
14
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,468 @@
1
+ # Whittle: A little LALR(1) Parser in Pure Ruby — Not a Generator
2
+
3
+ Whittle is a LALR(1) parser. It's very small, easy to understand, and what's most important,
4
+ it's 100% ruby. You write parsers by specifying sequences of allowable rules (which refer to
5
+ other rules, or even to themselves), and for each rule in your grammar, you provide a block that
6
+ is invoked when the grammar is recognized.
7
+
8
+ If you're not familiar with parsing, you should find Whittle to be a very friendly little
9
+ parser.
10
+
11
+ It is related, somewhat, to yacc and bison, which belong to the class of parsers known as
12
+ LALR(1): Look-Ahead LR (left-to-right scan, using 1 lookahead token). This class of parsers is both easy to
13
+ work with, and powerful.
14
+
15
+ Whittle provides meaningful error reporting and even lets you hook into the error handling logic
16
+ if you need to write some sort of crazy madman forgiving parser.
17
+
18
+ ## The Basics
19
+
20
+ Parsers using Whittle are *not* generated. This may strike users of other LALR(1) parsers as
21
+ odd, but c'mon, we're using Ruby, right?
22
+
23
+ I'll avoid discussing the algorithm until we get into the really advanced stuff, but you will
24
+ need to understand a few fundamental ideas before we begin.
25
+
26
+ 1. There are two types of rule that make up a complete parser: terminal, and nonterminal
27
+ - A terminal rule is quite simply a chunk of the input string, like '42', or 'function'
28
+ - A nonterminal rule is a rule that makes reference to other rules (terminal and nonterminal)
29
+ 2. The input to be parsed *always* conforms to just one rule at the topmost level. This is
30
+ known as the "start rule".
31
+
32
+ The easiest way to understand how the parser works is just to learn by example, so let's see an
33
+ example.
34
+
35
+ ``` ruby
36
+ require 'whittle'
37
+
38
+ class Mathematician < Whittle::Parser
39
+ rule("+")
40
+
41
+ rule(:int) do |r|
42
+ r[/[0-9]+/].as { |num| Integer(num) }
43
+ end
44
+
45
+ rule(:expr) do |r|
46
+ r[:int, "+", :int].as { |a, _, b| a + b }
47
+ end
48
+
49
+ start(:expr)
50
+ end
51
+
52
+ mathematician = Mathematician.new
53
+ mathematician.parse("1+2")
54
+ # => 3
55
+ ```
56
+
57
+ Let's break this down a bit. As you can see, the whole thing is really just `rule` used in
58
+ different ways. We also have to set the rule that we can use to describe an entire program,
59
+ which in this case is the `:expr` rule that can add two numbers together.
60
+
61
+ There are two terminal rules (`"+"` and `:int`) and one nonterminal (`:expr`) in the above
62
+ grammar. Each rule can have a block attached to it. The block is invoked with the result
63
+ of evaluating the blocks that are attached to each input (recursively). A rule with no block
64
+ attached is just a shorthand way of saying "return the input verbatim", so our "+" above receives
65
+ the string "+" and returns the string "+". Since this is such a common use-case, Whittle offers
66
+ the shorthand.
67
+
68
+ As the input string is parsed, it *must* match the start rule `:expr`. Whittle reads the "1",
69
+ which matches `:int` (which casts the String "1" to the Integer 1). Next, the parser looks for the
70
+ expected "+", which it gets. Now it looks for another `:int`, which it gets. Upon having
71
+ read the sequence `:int`, `"+"`, `:int`, Whittle invokes the block for `:expr` with the arguments
72
+ 1, "+", 2, returning the 3 we expect.
73
+
74
+ ## Nonterminal rules can have more than one valid sequence
75
+
76
+ Our mathematician class above is not much of a mathematician. It can only add numbers together.
77
+ Surely subtraction, division and multiplication should be possible too?
78
+
79
+ It turns out that this is really simple to do. Just add multiple possibilities to the same
80
+ rule.
81
+
82
+ ``` ruby
83
+ require 'whittle'
84
+
85
+ class Mathematician < Whittle::Parser
86
+ rule("+")
87
+ rule("-")
88
+ rule("*")
89
+ rule("/")
90
+
91
+ rule(:int) do |r|
92
+ r[/[0-9]+/].as { |num| Integer(num) }
93
+ end
94
+
95
+ rule(:expr) do |r|
96
+ r[:int, "+", :int].as { |a, _, b| a + b }
97
+ r[:int, "-", :int].as { |a, _, b| a - b }
98
+ r[:int, "*", :int].as { |a, _, b| a * b }
99
+ r[:int, "/", :int].as { |a, _, b| a / b }
100
+ end
101
+
102
+ start(:expr)
103
+ end
104
+
105
+ mathematician = Mathematician.new
106
+
107
+ mathematician.parse("1+2")
108
+ # => 3
109
+
110
+ mathematician.parse("1-2")
111
+ # => -1
112
+
113
+ mathematician.parse("2*3")
114
+ # => 6
115
+
116
+ mathematician.parse("4/2")
117
+ # => 2
118
+ ```
119
+
120
+ Now you're probably seeing how matching just one rule for the entire input is not a problem.
121
+
122
+ ## Rules can refer to themselves
123
+
124
+ But our mathematician is still not very bright. It can only work with two operands. What about
125
+ more complex expressions?
126
+
127
+ ``` ruby
128
+ require 'whittle'
129
+
130
+ class Mathematician < Whittle::Parser
131
+ rule("+")
132
+ rule("-")
133
+ rule("*")
134
+ rule("/")
135
+
136
+ rule(:int) do |r|
137
+ r[/[0-9]+/].as { |num| Integer(num) }
138
+ end
139
+
140
+ rule(:expr) do |r|
141
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
142
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
143
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
144
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
145
+ r[:int].as(:value)
146
+ end
147
+
148
+ start(:expr)
149
+ end
150
+
151
+ mathematician = Mathematician.new
152
+ mathematician.parse("1+5-2")
153
+ # => 4
154
+ ```
155
+
156
+ Adding a rule of just `:int` to the `:expr` rule means that any integer is also a valid `:expr`.
157
+ It is now possible to say that any `:expr` can be added to, multiplied by, divided by or
158
+ subtracted from another `:expr`. It is this ability to self-reference that makes LALR(1)
159
+ parsers so powerful and easy to use. Note that because the result of each rule is computed
160
+ *before* being passed as arguments to the block, each `:expr` in the calculations above will
161
+ always be a number, since each `:expr` returns a number.
162
+
163
+ ## Specifying the associativity
164
+
165
+ Our mathematician still isn't very clever, however. It makes some silly mistakes. Let's see
166
+ what happens when we do the following:
167
+
168
+ ``` ruby
169
+ mathematician.parse("6-3-1")
170
+ # => 4
171
+ ```
172
+
173
+ Oops. That's not correct. Shouldn't the answer be 2?
174
+
175
+ Our grammar is ambiguous. The input string could be interpreted as either:
176
+
177
+ 6-(3-1)
178
+
179
+ Or as:
180
+
181
+ (6-3)-1
182
+
183
+ Basic arithmetic takes the latter approach, but the parser's default approach is to go the other
184
+ way. We refer to these two alternatives as being left associative (the second example) and
185
+ right associative (the first example). By default, operators are right associative, which means
186
+ as much input will be read as possible before beginning to compute a result.
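
To make the two groupings concrete, here is a small plain-Ruby sketch that evaluates 6-3-1 both ways. It does not use Whittle at all, and the method names `left_assoc_subtract` and `right_assoc_subtract` are made up for illustration.

```ruby
# Left associative: ((6 - 3) - 1), folding the operands from the left.
def left_assoc_subtract(operands)
  operands.inject { |acc, n| acc - n }
end

# Right associative: (6 - (3 - 1)), folding the operands from the right.
def right_assoc_subtract(operands)
  operands.reverse.inject { |acc, n| n - acc }
end

left_assoc_subtract([6, 3, 1])  # => 2
right_assoc_subtract([6, 3, 1]) # => 4
```

The right-hand fold reproduces the 4 that the untagged grammar returned.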
187
+
188
+ We can correct this by tagging our operators as left associative.
189
+
190
+ ``` ruby
191
+ require 'whittle'
192
+
193
+ class Mathematician < Whittle::Parser
194
+ rule("+") % :left
195
+ rule("-") % :left
196
+ rule("*") % :left
197
+ rule("/") % :left
198
+
199
+ rule(:int) do |r|
200
+ r[/[0-9]+/].as { |num| Integer(num) }
201
+ end
202
+
203
+ rule(:expr) do |r|
204
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
205
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
206
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
207
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
208
+ r[:int].as(:value)
209
+ end
210
+
211
+ start(:expr)
212
+ end
213
+
214
+ mathematician = Mathematician.new
215
+ mathematician.parse("6-3-1")
216
+ # => 2
217
+ ```
218
+
219
+ Attaching a percent sign followed by either `:left` or `:right` changes the associativity of a
220
+ rule. We now get the correct result.
221
+
222
+ ## Specifying the operator precedence
223
+
224
+ Well, despite fixing the associativity, we find we still have a problem:
225
+
226
+ ``` ruby
227
+ mathematician.parse("1+2*3")
228
+ # => 9
229
+ ```
230
+
231
+ Hmm. The expression has been interpreted as (1+2)*3. It turns out arithmetic is not as simple
232
+ as one might think ;) The parser does not (yet) know that the multiplication operator has a
233
+ higher precedence than the addition operator. We need to indicate this in the grammar.
234
+
235
+ ``` ruby
236
+ require 'whittle'
237
+
238
+ class Mathematician < Whittle::Parser
239
+ rule("+") % :left ^ 1
240
+ rule("-") % :left ^ 1
241
+ rule("*") % :left ^ 2
242
+ rule("/") % :left ^ 2
243
+
244
+ rule(:int) do |r|
245
+ r[/[0-9]+/].as { |num| Integer(num) }
246
+ end
247
+
248
+ rule(:expr) do |r|
249
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
250
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
251
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
252
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
253
+ r[:int].as(:value)
254
+ end
255
+
256
+ start(:expr)
257
+ end
258
+
259
+ mathematician = Mathematician.new
260
+ mathematician.parse("1+2*3")
261
+ # => 7
262
+ ```
263
+
264
+ That's better. We can attach a precedence level to a rule by following it with the caret `^`,
265
+ followed by an integer value. The higher the value, the higher the precedence. Note that "+"
266
+ and "-" both have the same precedence, since "1+(2-3)" and "(1+2)-3" are logically equivalent.
267
+ The same applies to "*" and "/", but these both usually have a higher precedence than "+" and
268
+ "-".
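
If it helps to see precedence levels at work outside of Whittle, the following is a minimal plain-Ruby sketch. It uses precedence climbing, which is a different technique from the LALR(1) algorithm Whittle uses, and the names `PRECEDENCE`, `tokenize`, `parse_expr` and `evaluate` are hypothetical. The table simply mirrors the `^ 1` and `^ 2` levels from the grammar above.

```ruby
# Hypothetical precedence table mirroring the `^ 1` / `^ 2` levels above.
PRECEDENCE = { "+" => 1, "-" => 1, "*" => 2, "/" => 2 }.freeze

# Split the input into Integer operands and operator strings.
def tokenize(input)
  input.scan(/\d+|[-+*\/]/).map { |t| t =~ /\A\d+\z/ ? Integer(t) : t }
end

# Precedence climbing: read one operand, then keep folding in operators
# whose precedence is at least min_prec. Recursing with prec + 1 for the
# right-hand side makes every operator left associative.
def parse_expr(tokens, min_prec = 1)
  left = tokens.shift
  while (op = tokens.first) && PRECEDENCE.fetch(op) >= min_prec
    tokens.shift
    right = parse_expr(tokens, PRECEDENCE[op] + 1)
    left = left.public_send(op, right)
  end
  left
end

def evaluate(input)
  parse_expr(tokenize(input))
end

evaluate("1+2*3") # => 7
evaluate("6-3-1") # => 2
```

Because "*" sits at a higher level than "+", the right-hand side of the addition swallows `2*3` before the addition is applied.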
269
+
270
+ ## Disambiguating expressions with the use of parentheses
271
+
272
+ Sometimes we really do want "1+2*3" to mean "(1+2)*3", so we should really support this in our
273
+ mathematician. Fortunately adjusting the syntax rules in Whittle is a painless exercise.
274
+
275
+ ``` ruby
276
+ require 'whittle'
277
+
278
+ class Mathematician < Whittle::Parser
279
+ rule("+") % :left ^ 1
280
+ rule("-") % :left ^ 1
281
+ rule("*") % :left ^ 2
282
+ rule("/") % :left ^ 2
283
+
284
+ rule("(")
285
+ rule(")")
286
+
287
+ rule(:int) do |r|
288
+ r[/[0-9]+/].as { |num| Integer(num) }
289
+ end
290
+
291
+ rule(:expr) do |r|
292
+ r["(", :expr, ")"].as { |_, exp, _| exp }
293
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
294
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
295
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
296
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
297
+ r[:int].as(:value)
298
+ end
299
+
300
+ start(:expr)
301
+ end
302
+
303
+ mathematician = Mathematician.new
304
+ mathematician.parse("(1+2)*3")
305
+ # => 9
306
+ ```
307
+
308
+ All we had to do was add the new terminal rules for "(" and ")" then specify that the value of
309
+ an expression enclosed in parentheses is simply the value of the expression itself.
310
+
311
+ ## Skipping whitespace
312
+
313
+ Most languages contain tokens that are ignored when interpreting the input, such as whitespace
314
+ and comments. Accounting for the possibility of these in all rules would be both wasteful and
315
+ tiresome. Instead, we skip them entirely, by declaring a terminal rule without any associated
316
+ action, or if you want to be explicit, with `as(:nothing)`.
317
+
318
+ ``` ruby
319
+ require 'whittle'
320
+
321
+ class Mathematician < Whittle::Parser
322
+ rule(:wsp) do |r|
323
+ r[/\s+/]
324
+ end
325
+
326
+ rule("+") % :left ^ 1
327
+ rule("-") % :left ^ 1
328
+ rule("*") % :left ^ 2
329
+ rule("/") % :left ^ 2
330
+
331
+ rule("(")
332
+ rule(")")
333
+
334
+ rule(:int) do |r|
335
+ r[/[0-9]+/].as { |num| Integer(num) }
336
+ end
337
+
338
+ rule(:expr) do |r|
339
+ r["(", :expr, ")"].as { |_, exp, _| exp }
340
+ r[:expr, "+", :expr].as { |a, _, b| a + b }
341
+ r[:expr, "-", :expr].as { |a, _, b| a - b }
342
+ r[:expr, "*", :expr].as { |a, _, b| a * b }
343
+ r[:expr, "/", :expr].as { |a, _, b| a / b }
344
+ r[:int].as(:value)
345
+ end
346
+
347
+ start(:expr)
348
+ end
349
+
350
+ mathematician = Mathematician.new
351
+ mathematician.parse("( 1 + 2)*3 - 4")
352
+ # => 5
353
+ ```
354
+
355
+ Now the whitespace can either exist between the tokens in the input or not. The parser doesn't
356
+ pay attention to it; it simply discards it as the input string is read.
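
As a rough picture of what discarding whitespace means, here is a plain-Ruby sketch of a tiny lexer built on the standard library's `StringScanner`. This is not Whittle's lexer; the `lex` method and its token format are assumptions made up for this example.

```ruby
require 'strscan'

# Hypothetical lexer: returns [name, text] pairs and drops whitespace.
def lex(input)
  scanner = StringScanner.new(input)
  tokens = []

  until scanner.eos?
    if scanner.skip(/\s+/)
      # Whitespace matched: consume it without emitting a token.
      next
    elsif (num = scanner.scan(/[0-9]+/))
      tokens << [:int, num]
    elsif (op = scanner.scan(/[-+*\/()]/))
      tokens << [op, op]
    else
      raise ArgumentError, "unexpected input at offset #{scanner.pos}"
    end
  end

  tokens
end

lex("( 1 + 2)*3").map(&:last)
# => ["(", "1", "+", "2", ")", "*", "3"]
```

The grammar rules never see the spaces, so none of them need to mention whitespace.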
357
+
358
+ ## Rules can be empty
359
+
360
+ Sometimes you want to describe a structure, such as a list, that may have zero or more items in
361
+ it. In order to do this, the empty rule comes in extremely useful. Imagine the input string:
362
+
363
+ (((())))
364
+
365
+ We can say that this is matched by any pair of parentheses inside any pair of parentheses, any
366
+ number of times. But what's in the middle?
367
+
368
+ ``` ruby
369
+ require 'whittle'
370
+
371
+ class Parser < Whittle::Parser
372
+ rule("(")
373
+ rule(")")
374
+
375
+ rule(:parens) do |r|
376
+ r[]
377
+ r["(", :parens, ")"]
378
+ end
379
+
380
+ start(:parens)
381
+ end
382
+ ```
383
+
384
+ The above parser will happily match our input, because it is possible for the `:parens` rule to
385
+ match nothing at all, which is what we hit in the middle of our nested parentheses.
386
+
387
+ This is most useful in constructs like the following:
388
+
389
+ ``` ruby
390
+ rule(:id) do |r|
391
+ r[/[a-z]+/].as(:value)
392
+ end
393
+
394
+ rule(:list) do |r|
395
+ r[].as { [] }
396
+ r[:list, ",", :id].as { |list, _, id| list << id }
397
+ r[:id].as { |id| [id] }
398
+ end
399
+ ```
400
+
401
+ The rule above would return the array `["a", "b", "c"]` given the input string "a, b, c", or
402
+ given the input string "" (nothing) it would return the empty array.
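
To see how the array is built up, the sketch below replays the blocks by hand in plain Ruby, in the order a bottom-up parser would reduce a left-recursive list. The lambda names `first_item` and `append_item` are made up; they stand in for the blocks attached to `r[:id]` and `r[:list, ",", :id]`.

```ruby
# Stand-ins for the blocks attached to the two non-empty :list sequences.
first_item  = ->(id) { [id] }
append_item = ->(list, _, id) { list << id }

# Reductions for the ids a, b, c: the first :id becomes a one-element
# list, then each following ", id" pair is appended to that list.
list = first_item.call("a")
list = append_item.call(list, ",", "b")
list = append_item.call(list, ",", "c")

list # => ["a", "b", "c"]
```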
403
+
404
+ ## Parse errors
405
+
406
+ ### The default error reporting
407
+
408
+ When the parser encounters an unexpected token in the input, an exception of type
409
+ `Whittle::ParseError` is raised. The exception has a very clear message, indicates the line on
410
+ which the error was encountered, and additionally gives you programmatic access to the same
411
+ information.
412
+
413
+ ``` ruby
414
+ class ListParser < Whittle::Parser
415
+ rule(:wsp) do |r|
416
+ r[/\s+/]
417
+ end
418
+
419
+ rule(:id) do |r|
420
+ r[/[a-z]+/].as(:value)
421
+ end
422
+
423
+ rule(",")
424
+ rule("-")
425
+
426
+ rule(:list) do |r|
427
+ r[:list, ",", :id].as { |list, _, id| list << id }
428
+ r[:id].as { |id| Array(id) }
429
+ end
430
+
431
+ start(:list)
432
+ end
433
+
434
+ ListParser.new.parse("a, \nb, \nc- \nd")
435
+
436
+ # =>
437
+ # Parse error: expected "," but got "-" on line 3
438
+ ```
439
+
440
+ You can also access `#line`, `#expected` and `#received` if you catch the exception.
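
A minimal sketch of reading those attributes follows. So that it runs without the gem installed, it defines stand-in copies of `Whittle::Error` and `Whittle::ParseError` and raises the error by hand; the values passed in are assumed from the example message above, and `describe_parse_error` is a hypothetical helper.

```ruby
# Stand-in copies of the gem's error classes so the snippet runs on its
# own; in real code, `require 'whittle'` provides them instead.
module Whittle
  class Error < RuntimeError
  end

  class ParseError < Error
    attr_reader :line, :expected, :received

    def initialize(message, line, expected, received)
      super(message)
      @line     = line
      @expected = expected
      @received = received
    end
  end
end

# Hypothetical helper: turn a ParseError into a one-line report.
def describe_parse_error(error)
  "line #{error.line}: wanted #{error.expected.inspect}, got #{error.received.inspect}"
end

begin
  # Raised by hand here; a real parser would raise this from #parse.
  raise Whittle::ParseError.new('Parse error: expected "," but got "-" on line 3', 3, [","], "-")
rescue Whittle::ParseError => e
  report = describe_parse_error(e)
end

report # => 'line 3: wanted [","], got "-"'
```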
441
+
442
+ ### Recovering from a parse error
443
+
444
+ It is possible to override the `#error` method in the parser to do something smart if you
445
+ believe there to be easily resolved parse errors (such as switching the input token to
446
+ something else, or rewinding the parse stack to a point where the error would not manifest). I
447
+ need to write some specs on this and explore it fully myself before I document it. 99% of users
448
+ would never need to do such a thing.
449
+
450
+ ## TODO
451
+
452
+ - Provide a more powerful (state based) lexer algorithm, or at least document how users can
453
+ override `#lex`.
454
+ - Allow inspection of the parse table (it is not very human friendly right now).
455
+ - Allow inspection of the AST (maybe).
456
+ - Given an input String, provide a human-readable explanation of the parse.
457
+
458
+ ## License & Copyright
459
+
460
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
461
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
462
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
463
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
464
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
465
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
466
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
467
+
468
+ Copyright (c) Chris Corbyn, 2011
data/Rakefile ADDED
@@ -0,0 +1 @@
1
+ require "bundler/gem_tasks"
@@ -0,0 +1,9 @@
1
+ # Whittle: A little LALR(1) parser in pure ruby, without a generator.
2
+ #
3
+ # Copyright (c) Chris Corbyn, 2011
4
+
5
+ module Whittle
6
+ # All exceptions descend from this one.
7
+ class Error < RuntimeError
8
+ end
9
+ end
@@ -0,0 +1,9 @@
1
+ # Whittle: A little LALR(1) parser in pure ruby, without a generator.
2
+ #
3
+ # Copyright (c) Chris Corbyn, 2011
4
+
5
+ module Whittle
6
+ # GrammarError is raised if the developer defines an incorrect grammar.
7
+ class GrammarError < Error
8
+ end
9
+ end
@@ -0,0 +1,35 @@
1
+ # Whittle: A little LALR(1) parser in pure ruby, without a generator.
2
+ #
3
+ # Copyright (c) Chris Corbyn, 2011
4
+
5
+ module Whittle
6
+ # ParseError is raised if the parser encounters an unexpected token in the input.
7
+ #
8
+ # You can extract the line number, the expected input and the received input.
9
+ class ParseError < Error
10
+ attr_reader :line
11
+ attr_reader :expected
12
+ attr_reader :received
13
+
14
+ # Initialize the ParseError with information about the location
15
+ #
16
+ # @param [String] message
17
+ # the exception message displayed to the user
18
+ #
19
+ # @param [Fixnum] line
20
+ # the line on which the unexpected token was encountered
21
+ #
22
+ # @param [Array] expected
23
+ # an array of all possible tokens in the current parser state
24
+ #
25
+ # @param [String, Symbol] received
26
+ # the name of the actually received token
27
+ def initialize(message, line, expected, received)
28
+ super(message)
29
+
30
+ @line = line
31
+ @expected = expected
32
+ @received = received
33
+ end
34
+ end
35
+ end
@@ -0,0 +1,9 @@
1
+ # Whittle: A little LALR(1) parser in pure ruby, without a generator.
2
+ #
3
+ # Copyright (c) Chris Corbyn, 2011
4
+
5
+ module Whittle
6
+ # UnconsumedInputError is raised if the lexical analyzer itself cannot find any tokens.
7
+ class UnconsumedInputError < Error
8
+ end
9
+ end