RubyGems - whittle - Versions diffs - 0.0.1 - Mend

whittle 0.0.1

Files changed (30)

data/.gitignore +4 -0
data/.rspec +1 -0
data/Gemfile +4 -0
data/LICENSE +20 -0
data/README.md +468 -0
data/Rakefile +1 -0
data/lib/whittle/error.rb +9 -0
data/lib/whittle/errors/grammar_error.rb +9 -0
data/lib/whittle/errors/parse_error.rb +35 -0
data/lib/whittle/errors/unconsumed_input_error.rb +9 -0
data/lib/whittle/parser.rb +343 -0
data/lib/whittle/rule.rb +239 -0
data/lib/whittle/rule_set.rb +118 -0
data/lib/whittle/version.rb +3 -0
data/lib/whittle.rb +8 -0
data/spec/spec_helper.rb +4 -0
data/spec/unit/parser/empty_rule_spec.rb +21 -0
data/spec/unit/parser/empty_string_spec.rb +17 -0
data/spec/unit/parser/error_reporting_spec.rb +55 -0
data/spec/unit/parser/grouped_expr_spec.rb +27 -0
data/spec/unit/parser/multiple_precedence_spec.rb +33 -0
data/spec/unit/parser/noop_spec.rb +23 -0
data/spec/unit/parser/pass_through_parser_spec.rb +17 -0
data/spec/unit/parser/precedence_spec.rb +26 -0
data/spec/unit/parser/self_referential_expr_spec.rb +26 -0
data/spec/unit/parser/skipped_tokens_spec.rb +28 -0
data/spec/unit/parser/sum_parser_spec.rb +23 -0
data/spec/unit/parser/typecast_parser_spec.rb +17 -0
data/whittle.gemspec +27 -0
metadata +104 -0

data/.gitignore ADDED

@@ -0,0 +1,4 @@
+*.gem
+.bundle
+Gemfile.lock
+pkg/*

data/.rspec ADDED

@@ -0,0 +1 @@
+--colour

data/Gemfile ADDED

@@ -0,0 +1,4 @@
+source "http://rubygems.org"
+# Specify your gem's dependencies in whittle.gemspec
+gemspec

data/LICENSE ADDED

@@ -0,0 +1,20 @@
+Copyright (c) 2011 Chris Corbyn
+Permission is hereby granted, free of charge, to any person obtaining
+a copy of this software and associated documentation files (the
+"Software"), to deal in the Software without restriction, including
+without limitation the rights to use, copy, modify, merge, publish,
+distribute, sublicense, and/or sell copies of the Software, and to
+permit persons to whom the Software is furnished to do so, subject to
+the following conditions:
+The above copyright notice and this permission notice shall be
+included in all copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

data/README.md ADDED

@@ -0,0 +1,468 @@
+# Whittle: A little LALR(1) Parser in Pure Ruby — Not a Generator
+Whittle is a LALR(1) parser.  It's very small, easy to understand, and what's most important,
+it's 100% ruby.  You write parsers by specifying sequences of allowable rules (which refer to
+other rules, or even to themselves), and for each rule in your grammar, you provide a block that
+is invoked when the grammar is recognized.
+If you're not familiar with parsing, you should find Whittle to be a very friendly little
+parser.
+It is related, somewhat, to yacc and bison, which belong to the class of parsers known as
+LALR(1): Look-Ahead LR (a left-to-right scan using 1 lookahead token).  This class of parsers is
+both easy to work with, and powerful.
+Whittle provides meaningful error reporting and even lets you hook into the error handling logic
+if you need to write some sort of crazy madman forgiving parser.
+## The Basics
+Parsers using Whittle are *not* generated.  This may strike users of other LALR(1) parsers as
+odd, but c'mon, we're using Ruby, right?
+I'll avoid discussing the algorithm until we get into the really advanced stuff, but you will
+need to understand a few fundamental ideas before we begin.
+  1. There are two types of rule that make up a complete parser: terminal, and nonterminal
+    - A terminal rule is quite simply a chunk of the input string, like '42', or 'function'
+    - A nonterminal rule is a rule that makes reference to other rules (terminal and nonterminal)
+  2. The input to be parsed *always* conforms to just one rule at the topmost level.  This is
+     known as the "start rule".
+The easiest way to understand how the parser works is just to learn by example, so let's see an
+example.
+``` ruby
+require 'whittle'
+class Mathematician < Whittle::Parser
+  rule("+")
+  rule(:int) do |r|
+    r[/[0-9]+/].as { |num| Integer(num) }
+  end
+  rule(:expr) do |r|
+    r[:int, "+", :int].as { |a, _, b| a + b }
+  end
+  start(:expr)
+end
+mathematician = Mathematician.new
+mathematician.parse("1+2")
+# => 3
+```
+Let's break this down a bit.  As you can see, the whole thing is really just `rule` used in
+different ways.  We also have to set the rule that we can use to describe an entire program,
+which in this case is the `:expr` rule that can add two numbers together.
+There are two terminal rules (`"+"` and `:int`) and one nonterminal (`:expr`) in the above
+grammar.  Each rule can have a block attached to it.  The block is invoked with the result of
+evaluating the blocks that are attached to each input (recursively).  A rule with no block
+attached is just a shorthand way of saying "return the input verbatim", so our "+" above receives
+the string "+" and returns the string "+".  Since this is such a common use-case, Whittle offers
+the shorthand.
+As the input string is parsed, it *must* match the start rule `:expr`.  Whittle reads the "1",
+which matches `:int` (which casts the String "1" to the Integer 1).  Next the parser looks for the
+expected "+", which it gets.  Now it looks for another `:int`, which it gets.  Upon having
+read the sequence `:int`, `"+"`, `:int`, Whittle invokes the block for `:expr` with the arguments
+1, "+", 2, returning the 3 we expect.
+## Nonterminal rules can have more than one valid sequence
+Our mathematician class above is not much of a mathematician.  It can only add numbers together.
+Surely subtraction, division and multiplication should be possible too?
+It turns out that this is really simple to do.  Just add multiple possibilities to the same
+rule.
+``` ruby
+require 'whittle'
+class Mathematician < Whittle::Parser
+  rule("+")
+  rule("-")
+  rule("*")
+  rule("/")
+  rule(:int) do |r|
+    r[/[0-9]+/].as { |num| Integer(num) }
+  end
+  rule(:expr) do |r|
+    r[:int, "+", :int].as { |a, _, b| a + b }
+    r[:int, "-", :int].as { |a, _, b| a - b }
+    r[:int, "*", :int].as { |a, _, b| a * b }
+    r[:int, "/", :int].as { |a, _, b| a / b }
+  end
+  start(:expr)
+end
+mathematician = Mathematician.new
+mathematician.parse("1+2")
+# => 3
+mathematician.parse("1-2")
+# => -1
+mathematician.parse("2*3")
+# => 6
+mathematician.parse("4/2")
+# => 2
+```
+Now you're probably seeing how matching just one rule for the entire input is not a problem.
+## Rules can refer to themselves
+But our mathematician is still not very bright.  It can only work with two operands.  What about
+more complex expressions?
+``` ruby
+require 'whittle'
+class Mathematician < Whittle::Parser
+  rule("+")
+  rule("-")
+  rule("*")
+  rule("/")
+  rule(:int) do |r|
+    r[/[0-9]+/].as { |num| Integer(num) }
+  end
+  rule(:expr) do |r|
+    r[:expr, "+", :expr].as { |a, _, b| a + b }
+    r[:expr, "-", :expr].as { |a, _, b| a - b }
+    r[:expr, "*", :expr].as { |a, _, b| a * b }
+    r[:expr, "/", :expr].as { |a, _, b| a / b }
+    r[:int].as(:value)
+  end
+  start(:expr)
+end
+mathematician = Mathematician.new
+mathematician.parse("1+5-2")
+# => 4
+```
+Adding a rule of just `:int` to the `:expr` rule means that any integer is also a valid `:expr`.
+It is now possible to say that any `:expr` can be added to, multiplied by, divided by or
+subtracted from another `:expr`.  It is this ability to self-reference that makes LALR(1)
+parsers so powerful and easy to use.  Note that because the result of each rule is computed
+*before* being passed as arguments to the block, each `:expr` in the calculations above will
+always be a number, since each `:expr` returns a number.
+## Specifying the associativity
+Our mathematician still isn't very clever however.  It makes some silly mistakes.  Let's see
+what happens when we do the following:
+``` ruby
+mathematician.parse("6-3-1")
+# => 4
+```
+Oops.  That's not correct.  Shouldn't the answer be 2?
+Our grammar is ambiguous.  The input string could be interpreted as either:
+    6-(3-1)
+Or as:
+    (6-3)-1
+Basic arithmetic takes the latter approach, but the parser's default approach is to go the other
+way.  We refer to these two alternatives as being left associative (the second example) and
+right associative (the first example).  By default, operators are right associative, which means
+as much input will be read as possible before beginning to compute a result.
+We can correct this by tagging our operators as left associative.
+``` ruby
+require 'whittle'
+class Mathematician < Whittle::Parser
+  rule("+") % :left
+  rule("-") % :left
+  rule("*") % :left
+  rule("/") % :left
+  rule(:int) do |r|
+    r[/[0-9]+/].as { |num| Integer(num) }
+  end
+  rule(:expr) do |r|
+    r[:expr, "+", :expr].as { |a, _, b| a + b }
+    r[:expr, "-", :expr].as { |a, _, b| a - b }
+    r[:expr, "*", :expr].as { |a, _, b| a * b }
+    r[:expr, "/", :expr].as { |a, _, b| a / b }
+    r[:int].as(:value)
+  end
+  start(:expr)
+end
+mathematician = Mathematician.new
+mathematician.parse("6-3-1")
+# => 2
+```
+Attaching a percent sign followed by either `:left` or `:right` changes the associativity of a
+rule.  We now get the correct result.
+## Specifying the operator precedence
+Well, despite fixing the associativity, we find we still have a problem:
+``` ruby
+mathematician.parse("1+2*3")
+# => 9
+```
+Hmm.  The expression has been interpreted as (1+2)*3.  It turns out arithmetic is not as simple
+as one might think ;)  The parser does not (yet) know that the multiplication operator has a
+higher precedence than the addition operator.  We need to indicate this in the grammar.
+``` ruby
+require 'whittle'
+class Mathematician < Whittle::Parser
+  rule("+") % :left ^ 1
+  rule("-") % :left ^ 1
+  rule("*") % :left ^ 2
+  rule("/") % :left ^ 2
+  rule(:int) do |r|
+    r[/[0-9]+/].as { |num| Integer(num) }
+  end
+  rule(:expr) do |r|
+    r[:expr, "+", :expr].as { |a, _, b| a + b }
+    r[:expr, "-", :expr].as { |a, _, b| a - b }
+    r[:expr, "*", :expr].as { |a, _, b| a * b }
+    r[:expr, "/", :expr].as { |a, _, b| a / b }
+    r[:int].as(:value)
+  end
+  start(:expr)
+end
+mathematician = Mathematician.new
+mathematician.parse("1+2*3")
+# => 7
+```
+That's better.  We can attach a precedence level to a rule by following it with the caret `^`,
+followed by an integer value.  The higher the value, the higher the precedence.  Note that "+"
+and "-" both have the same precedence, since "1+(2-3)" and "(1+2)-3" are logically equivalent.
+The same applies to "*" and "/", but these both usually have a higher precedence than "+" and
+"-".
+## Disambiguating expressions with the use of parentheses
+Sometimes we really do want "1+2*3" to mean "(1+2)*3", so we should really support this in our
+mathematician.  Fortunately adjusting the syntax rules in Whittle is a painless exercise.
+``` ruby
+require 'whittle'
+class Mathematician < Whittle::Parser
+  rule("+") % :left ^ 1
+  rule("-") % :left ^ 1
+  rule("*") % :left ^ 2
+  rule("/") % :left ^ 2
+  rule("(")
+  rule(")")
+  rule(:int) do |r|
+    r[/[0-9]+/].as { |num| Integer(num) }
+  end
+  rule(:expr) do |r|
+    r["(", :expr, ")"].as   { |_, exp, _| exp }
+    r[:expr, "+", :expr].as { |a, _, b| a + b }
+    r[:expr, "-", :expr].as { |a, _, b| a - b }
+    r[:expr, "*", :expr].as { |a, _, b| a * b }
+    r[:expr, "/", :expr].as { |a, _, b| a / b }
+    r[:int].as(:value)
+  end
+  start(:expr)
+end
+mathematician = Mathematician.new
+mathematician.parse("(1+2)*3")
+# => 9
+```
+All we had to do was add the new terminal rules for "(" and ")" then specify that the value of
+an expression enclosed in parentheses is simply the value of the expression itself.
+## Skipping whitespace
+Most languages contain tokens that are ignored when interpreting the input, such as whitespace
+and comments.  Accounting for the possibility of these in all rules would be both wasteful and
+tiresome.  Instead, we skip them entirely, by declaring a terminal rule without any associated
+action, or if you want to be explicit, with `as(:nothing)`.
+``` ruby
+require 'whittle'
+class Mathematician < Whittle::Parser
+  rule(:wsp) do |r|
+    r[/\s+/]
+  end
+  rule("+") % :left ^ 1
+  rule("-") % :left ^ 1
+  rule("*") % :left ^ 2
+  rule("/") % :left ^ 2
+  rule("(")
+  rule(")")
+  rule(:int) do |r|
+    r[/[0-9]+/].as { |num| Integer(num) }
+  end
+  rule(:expr) do |r|
+    r["(", :expr, ")"].as   { |_, exp, _| exp }
+    r[:expr, "+", :expr].as { |a, _, b| a + b }
+    r[:expr, "-", :expr].as { |a, _, b| a - b }
+    r[:expr, "*", :expr].as { |a, _, b| a * b }
+    r[:expr, "/", :expr].as { |a, _, b| a / b }
+    r[:int].as(:value)
+  end
+  start(:expr)
+end
+mathematician = Mathematician.new
+mathematician.parse("( 1 + 2)*3 - 4")
+# => 5
+```
+Now the whitespace can either exist between the tokens in the input or not.  The parser doesn't
+pay attention to it; it simply discards it as the input string is read.
+## Rules can be empty
+Sometimes you want to describe a structure, such as a list, that may have zero or more items in
+it.  For this, the empty rule is extremely useful.  Imagine the input string:
+    (((())))
+We can say that this is matched by any pair of parentheses inside any pair of parentheses, any
+number of times. But what's in the middle?
+``` ruby
+require 'whittle'
+class Parser < Whittle::Parser
+  rule("(")
+  rule(")")
+  rule(:parens) do |r|
+    r[]
+    r["(", :parens, ")"]
+  end
+  start(:parens)
+end
+```
+The above parser will happily match our input, because it is possible for the `:parens` rule to
+match nothing at all, which is what we hit in the middle of our nested parentheses.
+This is most useful in constructs like the following:
+``` ruby
+rule(:id) do |r|
+  r[/[a-z]+/].as(:value)
+end
+rule(:list) do |r|
+  r[].as                { [] }
+  r[:list, ",", :id].as { |list, _, id| list << id }
+  r[:id].as             { |id| [id] }
+end
+```
+The above would return the array `["a", "b", "c"]` given the input string "a, b, c", or
+given the input string "" (nothing) it would return the empty array.
+## Parse errors
+### The default error reporting
+When the parser encounters an unexpected token in the input, an exception of type
+`Whittle::ParseError` is raised.  The exception has a very clear message, indicates the line on
+which the error was encountered, and additionally gives you programmatic access to the same
+information.
+``` ruby
+class ListParser < Whittle::Parser
+  rule(:wsp) do |r|
+    r[/\s+/]
+  end
+  rule(:id) do |r|
+    r[/[a-z]+/].as(:value)
+  end
+  rule(",")
+  rule("-")
+  rule(:list) do |r|
+    r[:list, ",", :id].as { |list, _, id| list << id }
+    r[:id].as             { |id| Array(id) }
+  end
+  start(:list)
+end
+ListParser.new.parse("a, \nb, \nc- \nd")
+# =>
+# Parse error: expected "," but got "-" on line 3
+```
+You can also access `#line`, `#expected` and `#received` if you catch the exception.
+### Recovering from a parse error
+It is possible to override the `#error` method in the parser to do something smart if you
+believe there to be easily resolved parse errors (such as switching the input token to
+something else, or rewinding the parse stack to a point where the error would not manifest).  I
+need to write some specs on this and explore it fully myself before I document it.  99% of users
+would never need to do such a thing.
+## TODO
+  - Provide a more powerful (state based) lexer algorithm, or at least document how users can
+override `#lex`.
+  - Allow inspection of the parse table (it is not very human friendly right now).
+  - Allow inspection of the AST (maybe).
+  - Given an input String, provide a human readable explanation of the parse.
+## License & Copyright
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
+LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
+OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
+WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+Copyright (c) Chris Corbyn, 2011

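The README's associativity section can be checked with plain arithmetic, without the gem installed. The sketch below is not Whittle's parsing algorithm, and the helper names `subtract_left` and `subtract_right` are made up for illustration. It only evaluates the two groupings of `6-3-1` that the README describes.

```ruby
# Hypothetical helpers (not part of Whittle): evaluate a chain of
# subtractions under each grouping described in the README.

# Left associative grouping: ((6 - 3) - 1)
def subtract_left(operands)
  operands.reduce { |acc, n| acc - n }
end

# Right associative grouping: (6 - (3 - 1))
# Walk the operands from the right, subtracting the running result
# from each operand to its left.
def subtract_right(operands)
  operands.reverse.reduce { |acc, n| n - acc }
end

puts subtract_left([6, 3, 1])   # prints 2
puts subtract_right([6, 3, 1])  # prints 4
```

The right-grouped result matches the `# => 4` the README reports before the operators are tagged `% :left`, and the left-grouped result matches the `# => 2` reported afterwards.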
data/Rakefile ADDED

@@ -0,0 +1 @@
+require "bundler/gem_tasks"

data/lib/whittle/error.rb ADDED

@@ -0,0 +1,9 @@
+# Whittle: A little LALR(1) parser in pure ruby, without a generator.
+#
+# Copyright (c) Chris Corbyn, 2011
+module Whittle
+  # All exceptions descend from this one.
+  class Error < RuntimeError
+  end
+end

data/lib/whittle/errors/grammar_error.rb ADDED

@@ -0,0 +1,9 @@
+# Whittle: A little LALR(1) parser in pure ruby, without a generator.
+#
+# Copyright (c) Chris Corbyn, 2011
+module Whittle
+  # GrammarError is raised if the developer defines an incorrect grammar.
+  class GrammarError < Error
+  end
+end

data/lib/whittle/errors/parse_error.rb ADDED

@@ -0,0 +1,35 @@
+# Whittle: A little LALR(1) parser in pure ruby, without a generator.
+#
+# Copyright (c) Chris Corbyn, 2011
+module Whittle
+  # ParseError is raised if the parser encounters an unexpected token in the input.
+  #
+  # You can extract the line number, the expected input and the received input.
+  class ParseError < Error
+    attr_reader :line
+    attr_reader :expected
+    attr_reader :received
+    # Initialize the ParseError with information about the location
+    #
+    # @param [String] message
+    #   the exception message displayed to the user
+    #
+    # @param [Fixnum] line
+    #   the line on which the unexpected token was encountered
+    #
+    # @param [Array] expected
+    #   an array of all possible tokens in the current parser state
+    #
+    # @param [String, Symbol] received
+    #   the name of the actually received token
+    def initialize(message, line, expected, received)
+      super(message)
+      @line     = line
+      @expected = expected
+      @received = received
+    end
+  end
+end

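A minimal sketch of how a caller might read `#line`, `#expected` and `#received` from a rescued `ParseError`. To run without the gem, it repeats the two class definitions from the hunks above. The error is raised by hand with the message from the README example, and the helper `describe_parse_failure` is hypothetical. In real use the parser would raise the exception itself.

```ruby
# Minimal copies of the classes from lib/whittle/error.rb and
# lib/whittle/errors/parse_error.rb, so the sketch runs without the gem.
module Whittle
  class Error < RuntimeError; end

  class ParseError < Error
    attr_reader :line, :expected, :received

    def initialize(message, line, expected, received)
      super(message)
      @line     = line
      @expected = expected
      @received = received
    end
  end
end

# Hypothetical helper: raise by hand what a parser would raise, then
# collect the details a caller could report.
def describe_parse_failure
  raise Whittle::ParseError.new(
    'Parse error: expected "," but got "-" on line 3', 3, [","], "-"
  )
rescue Whittle::ParseError => e
  { message: e.message, line: e.line, expected: e.expected, received: e.received }
end

p describe_parse_failure
```

Since `expected` is documented as an array of all possible tokens in the current parser state, a caller can join it into a message of its own rather than relying on the default one.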
data/lib/whittle/errors/unconsumed_input_error.rb ADDED

@@ -0,0 +1,9 @@
+# Whittle: A little LALR(1) parser in pure ruby, without a generator.
+#
+# Copyright (c) Chris Corbyn, 2011
+module Whittle
+  # UnconsumedInputError is raised if the lexical analyzer itself cannot find any tokens.
+  class UnconsumedInputError < Error
+  end
+end
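
Because every exception in these files descends from `Whittle::Error`, a single `rescue` clause should be enough to catch any of them. The sketch below repeats the class declarations from the hunks above so that it runs without the gem. The helper `rescued_as_whittle_error?` is hypothetical, and the errors are raised by hand for illustration only.

```ruby
# Minimal copies of the hierarchy declared in lib/whittle/error.rb and
# lib/whittle/errors/*.rb, so the sketch runs without the gem.
module Whittle
  class Error < RuntimeError; end
  class GrammarError < Error; end
  class UnconsumedInputError < Error; end
end

# Hypothetical helper: raise the given class by hand and report whether
# a rescue of the base class Whittle::Error catches it.
def rescued_as_whittle_error?(error_class)
  raise error_class, "illustration only"
rescue Whittle::Error
  true
rescue StandardError
  false
end

puts rescued_as_whittle_error?(Whittle::GrammarError)          # prints true
puts rescued_as_whittle_error?(Whittle::UnconsumedInputError)  # prints true
puts rescued_as_whittle_error?(ArgumentError)                  # prints false
```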