treetop 1.1.1 → 1.1.2

data/Rakefile CHANGED
@@ -15,13 +15,13 @@ end
 
  gemspec = Gem::Specification.new do |s|
  s.name = "treetop"
- s.version = "1.1.1"
+ s.version = "1.1.2"
  s.author = "Nathan Sobo"
  s.email = "nathansobo@gmail.com"
  s.homepage = "http://functionalform.blogspot.com"
  s.platform = Gem::Platform::RUBY
  s.summary = "A Ruby-based text parsing and interpretation DSL"
- s.files = FileList["README", "Rakefile", "{test,lib,bin,examples}/**/*"].to_a
+ s.files = FileList["README", "Rakefile", "{test,lib,bin,doc,examples}/**/*"].to_a
  s.bindir = "bin"
  s.executables = ["tt"]
  s.require_path = "lib"
@@ -0,0 +1,106 @@
+ #Contributing
+ I'd like to try Rubinius's policy regarding commit rights. If you submit one patch worth integrating, I'll give you commit rights. We'll see how this goes, but I think it's a good policy.
+
+ ##Getting Started with the Code
+ The Treetop compiler is interesting in that it is implemented in itself. Its functionality revolves around `metagrammar.treetop`, which specifies the grammar for Treetop grammars. I took a hybrid approach to the definition of methods on syntax nodes in the metagrammar. Methods that are more syntactic in nature, like those that provide access to elements of the syntax tree, are often defined inline, directly in the grammar. More semantic methods are defined in custom node classes.
+
+ Iterating on the metagrammar is tricky. The current testing strategy uses the last stable version of the metagrammar to parse the version under test. Then the version under test is used to parse and functionally test the various pieces of syntax it should recognize and translate to Ruby. As you change `metagrammar.treetop` and its associated node classes, note that the node classes you are changing are also used to support the previous stable version of the metagrammar, so they must be kept backward compatible until a new stable version can be produced to replace them. This became an issue fairly recently when I closed the loop on the bootstrap. Serious iteration on the metagrammar will probably necessitate a more robust testing strategy, perhaps one that relies on the Treetop gem for compiling the metagrammar under test. I haven't done this because my changes since closing the metacircular loop have been minor enough to work around the issue, but let me know if you need help on this front.
+
+ ##Tests
+ Most of the compiler's tests are functional in nature. The grammar under test is used to parse and compile a piece of sample code. Then I attempt to parse input with the compiled output and test its results.
+
+ Due to shortcomings in Ruby's semantics, which scope constant definitions in a block's lexical environment rather than the environment in which the block is module-evaluated, I was unable to use RSpec without polluting a global namespace with constant definitions. RSpec has recently improved to allow specs to reside within standard Ruby classes, but I have not yet migrated the tests back. Instead, they are built on a modified version of Test::Unit that allows tests to be defined as strings. It's not ideal, but it worked at the time.
+
+ #What Needs to be Done
+ ##Small Stuff
+ * Migrate the tests back to RSpec.
+ * Improve the `tt` command line tool to allow `.treetop` extensions to be elided in its arguments.
+ * Generate and load temp files with `load_grammar` rather than evaluating strings, to improve stack trace readability.
+ * Allow `do/end` style blocks as well as curly brace blocks. This was originally omitted because I thought it would be confusing. It probably isn't.
+ * Allow the root of a grammar to be dynamically set for testing purposes.
+
+ ##Big Stuff
+ ###Avoiding Excessive Object Instantiation
+ Based on some preliminary profiling work, it is pretty apparent that a large percentage of a typical parse's time is spent instantiating objects. This needs to be avoided if parsing is to become faster.
+
+ ####Avoiding Failure Result Instantiation
+ Currently, every parse failure instantiates a failure object. Both success and failure objects propagate an array of the furthest-advanced terminal failures encountered during the parse. These are used to give the user feedback about the most likely location of the error when a parse fails. Rather than propagating them upward in the failure objects, it would be faster to just return false in the event of failure and instead write terminal failures to a mutable data structure that is global to the parse. Even this write need only happen when the index of the failure is greater than or equal to the current maximal failure index. In addition to minimizing failure object instantiation, this will probably reduce the time spent sorting propagated failures.
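A minimal Ruby sketch of the proposed failure-recording scheme (this is not the current Treetop implementation; the `FailureRecorder` class and its method names are invented here for illustration):

    # One recorder per parse; terminal failures are written into it instead of
    # being wrapped in failure objects and propagated up through the results.
    class FailureRecorder
      attr_reader :max_failure_index, :expected

      def initialize
        @max_failure_index = 0
        @expected = []  # terminals expected at the furthest-advanced failure
      end

      # Returns false so a generated rule body can simply return this value
      # instead of instantiating a failure node.
      def record_terminal_failure(index, expected_string)
        if index > @max_failure_index
          @max_failure_index = index
          @expected = [expected_string]
        elsif index == @max_failure_index
          @expected << expected_string
        end
        false
      end
    end

A generated rule body could then report a failed terminal match with something like `return recorder.record_terminal_failure(index, "'a'")`, and the feedback shown to the user would be built from the single recorder rather than from failures sorted out of every result object.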
+
+ ####Transient Expressions
+ Currently, every parsing expression instantiates a syntax node. This includes even very simple parsing expressions, like single characters. It is probably unnecessary for every single expression in the parse to correspond to its own syntax node, so a good deal could be saved with a transient declaration that instructs the parser only to attempt a match without instantiating nodes.
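As a purely hypothetical illustration of what such a declaration might look like in grammar syntax (the `transient` keyword below does not exist in Treetop today; it is invented here to make the idea concrete):

    rule whitespace
      # hypothetical: match the repetition, but build no syntax nodes for it
      transient (' ' / "\t")+
    end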
+
+ ###Generate Rule Implementations in C
+ Parsing expressions are currently compiled into simple Ruby source code that comprises the body of parsing rules, which are translated into Ruby methods. The generator could produce C instead of Ruby in the body of these method implementations.
+
+ ###Global Parsing State and Semantic Backtrack Triggering
+ Some programming language grammars are not entirely context-free, requiring that global state dictate the behavior of the parser in certain circumstances. Treetop does not currently expose explicit parser control to the grammar writer, and instead automatically constructs the syntax tree for them. A means of semantic parser control compatible with this approach would involve callback methods defined on parsing nodes. Each time a node is successfully parsed it will be given an opportunity to set global state and optionally trigger a parse failure on _extrasyntactic_ grounds. Nodes will probably need to define an additional method that undoes their changes to global state when there is a parse failure and they are backtracked.
+
+ Here is a sketch of the potential utility of such mechanisms. Consider the structure of YAML, which uses indentation to indicate block structure.
+
+     level_1:
+       level_2a:
+       level_2b:
+         level_3a:
+       level_2c:
+
+ Imagine a grammar like the following:
+
+     rule yaml_element
+       name ':' block
+       /
+       name ':' value
+     end
+
+     rule block
+       indent yaml_elements outdent
+     end
+
+     rule yaml_elements
+       yaml_element (samedent yaml_element)*
+     end
+
+     rule samedent
+       newline spaces {
+         def after_success(parser_state)
+           spaces.length == parser_state.indent_level
+         end
+       }
+     end
+
+     rule indent
+       newline spaces {
+         def after_success(parser_state)
+           if spaces.length == parser_state.indent_level + 2
+             parser_state.indent_level += 2
+             true
+           else
+             false # fail the parse on extrasyntactic grounds
+           end
+         end
+
+         def undo_success(parser_state)
+           parser_state.indent_level -= 2
+         end
+       }
+     end
+
+     rule outdent
+       newline spaces {
+         def after_success(parser_state)
+           if spaces.length == parser_state.indent_level - 2
+             parser_state.indent_level -= 2
+             true
+           else
+             false # fail the parse on extrasyntactic grounds
+           end
+         end
+
+         def undo_success(parser_state)
+           parser_state.indent_level += 2
+         end
+       }
+     end
+
+ In this case a block will be detected only if a change in indentation warrants it. Note that this change in the state of indentation must be undone if a subsequent failure causes this node not to ultimately be incorporated into a successful result.
+
+ I am by no means sure that the above sketch is free of problems, or even that this overall strategy is sound, but it seems like a promising path.
@@ -0,0 +1,65 @@
+ #Grammar Composition
+ A unique property of parsing expression grammars is that they are _closed under composition_: when you compose two grammars, they yield another grammar that can be composed yet again. This is a radical departure from parsing frameworks that rely on lexical scanning, which makes composition impossible. Treetop's facilities for composition are built upon those of Ruby.
+
+ ##The Mapping of Treetop Constructs to Ruby Constructs
+ When Treetop compiles a grammar definition, it produces a module and a class. The module contains methods implementing all of the rules defined in the grammar. The generated class is a subclass of Treetop::Runtime::CompiledParser and includes the module. For example:
+
+     grammar Foo
+       ...
+     end
+
+ results in a Ruby module named `Foo` and a Ruby class named `FooParser` that `include`s the `Foo` module.
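As a quick usage sketch (the file name `foo.treetop` is assumed here), the generated class is what you instantiate to parse input:

    require 'rubygems'
    require 'treetop'

    Treetop.load 'foo'  # compiles foo.treetop and defines Foo and FooParser

    parser = FooParser.new
    result = parser.parse('some input')  # a syntax node on success, nil on failure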
+
+ ##Using Mixin Semantics to Compose Grammars
+ Because grammars are just modules, they can be mixed into one another. This enables grammars to share rules.
+
+     grammar A
+       rule a
+         'a'
+       end
+     end
+
+     grammar B
+       include A
+
+       rule ab
+         a 'b'
+       end
+     end
+
+ Grammar `B` above references rule `a`, which is defined in a separate grammar that it includes. Because module inclusion places modules in the ancestor chain, rules may also be overridden, with the `super` keyword providing access to the overridden rule.
+
+     grammar A
+       rule a
+         'a'
+       end
+     end
+
+     grammar B
+       include A
+
+       rule a
+         super / 'b'
+       end
+     end
+
+ Now rule `a` in grammar `B` matches either `'a'` or `'b'`.
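A brief sketch of how this behaves at runtime, assuming the two grammars above have been saved and compiled (the file names are assumed):

    require 'rubygems'
    require 'treetop'

    Treetop.load 'a'   # defines module A and class AParser
    Treetop.load 'b'   # defines module B and class BParser; B includes A

    parser = BParser.new
    parser.parse('a')  # matches via super, i.e. the rule inherited from A
    parser.parse('b')  # matches via the new alternative added in B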
47
+
48
+ ##Motivation
49
+ Imagine a grammar for Ruby that took account of SQL queries embedded in strings within the language. That could be achieved by combining two existing grammars.
50
+
51
+ grammar RubyPlusSQL
52
+ include Ruby
53
+ include SQL
54
+
55
+ rule expression
56
+ ruby_expression
57
+ end
58
+
59
+ rule ruby_string
60
+ ruby_quote sql_expression ruby_quote / ruby_string
61
+ end
62
+ end
63
+
64
+ ##Work to be Done
65
+ It has become clear that the include facility in grammars would be more useful if it had the ability to name prefix all rules from the included grammar to avoid collision. This is a planned but currently unimplemented feature.
@@ -0,0 +1,24 @@
+ #Overview
+ Treetop blends cutting-edge parser research with the elegance of Ruby to deliver the power of grammar-based syntactic analysis without its traditional conceptual and technical overhead. It is based on parsing expression grammars, and compiles clean, intuitive, and _composable_ language descriptions into _packrat parsers_ written in pure, readable Ruby.
+
+ ##Intuitive Grammar Specifications
+ Treetop's packrat parsers use _memoization_ to make the time complexity of backtracking a non-issue. This cuts the Gordian knot of grammar design. There's no need to look ahead and no need to lex. Worry about the structure of the language, not the idiosyncrasies of the parser.
+
+ ##Syntax-Oriented Programming
+ Rather than implementing semantic actions that construct parse trees, define methods on the trees that Treetop automatically constructs, and write this code directly inside the grammar.
+
+ ##Reusable, Composable Language Descriptions
+ Break grammars into modules and compose them via Ruby's mixin semantics. Or combine grammars written by others in novel ways. Or extend existing grammars with your own syntactic constructs by overriding rules with access to a `super` keyword. Composability means your investment of time in grammar writing is secure: you can always extend and reuse your code.
+
+ ##Acknowledgements
+ First, thank you to my employer Rob Mee of Pivotal Labs for funding a substantial portion of Treetop's development. He gets it.
+
+ I'd also like to thank:
+
+ * Damon McCormick for several hours of pair programming.
+ * Nick Kallen for constant, well-considered feedback, and a few hours of programming to boot.
+ * Eliot Miranda for urging me to rewrite Treetop as a compiler right away rather than putting it off.
+ * Ryan Davis and Eric Hodel for hurting my code.
+ * Dav Yaginuma for kicking me into action on my idea.
+ * Bryan Ford for his seminal work on Packrat Parsers.
+ * The editors of Lambda the Ultimate, where I discovered parsing expression grammars.
@@ -0,0 +1,51 @@
+ #Pitfalls
+ ##Left Recursion
+ A weakness shared by all recursive descent parsers is the inability to parse left-recursive rules. Consider the following rule:
+
+     rule left_recursive
+       left_recursive 'a' / 'a'
+     end
+
+ Logically it should match a list of 'a' characters. But it never consumes anything, because attempting to recognize `left_recursive` begins by attempting to recognize `left_recursive`, and so on, in an infinite recursion. There's always a way to eliminate these types of structures from your grammar. There's a mechanistic transformation called _left factorization_ that can eliminate it, but it isn't always pretty, especially in combination with automatically constructed syntax trees. So far, I have found more thoughtful ways around the problem. For instance, in the interpreter example I interpret inherently left-recursive function application right-recursively in the syntax, then correct the directionality in my semantic interpretation. You may have to be clever.
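For example, the rule above can be rewritten right-recursively; this sketch matches the same strings (one or more 'a' characters) without recursing infinitely:

    rule not_left_recursive
      'a' not_left_recursive / 'a'
    end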
+
+ #Advanced Techniques
+ Here are a few interesting problems I've encountered. I figure sharing them may give you insight into how these types of issues are addressed with the tools of parsing expressions.
+
+ ##Matching a String
+
+     rule string
+       '"' ('\"' / !'"' .)* '"'
+     end
+
+ This expression says: match a quote, then zero or more occurrences of either an escaped quote or any character that is not a quote, then a closing quote. Lookahead assertions are essential for these types of problems.
+
+ ##Matching Nested Structures With Non-Unique Delimiters
+ Say I want to parse a diabolical wiki syntax in which the following interpretations apply.
+
+     ** *hello* ** --> <strong><em>hello</em></strong>
+     * **hello** * --> <em><strong>hello</strong></em>
+
+     rule strong
+       '**' (em / !'*' . / '\*')+ '**'
+     end
+
+     rule em
+       '*' (strong / !'*' . / '\*')+ '*'
+     end
+
+ Emphasized text is allowed within strong text by virtue of `em` being the first alternative. Since `em` will only successfully parse if a matching `*` is found, it is permitted; but other than that, no `*` characters are allowed unless they are escaped.
+
+ ##Matching a Keyword But Not Words Prefixed Therewith
+ Say I want to consider a given string of characters a keyword only when it occurs in isolation. Let's use the `end` keyword as an example. We don't want the prefix of `'enders_game'` to be considered a keyword. A naive implementation might be the following.
+
+     rule end_keyword
+       'end' &space
+     end
+
+ This says that `'end'` must be followed by a space, but this space is not consumed as part of the matching of `end_keyword`. This works in most cases, but is actually incorrect. What if `end` occurs at the end of the buffer? In that case, it occurs in isolation but will not match the above expression. What we really mean is that `'end'` cannot be followed by a _non-space_ character.
+
+     rule end_keyword
+       'end' !(!' ' .)
+     end
+
+ In general, when the syntax gets tough, it helps to focus on what you really mean. A keyword is a string of characters not followed by another character that isn't a space.
@@ -0,0 +1,187 @@
+ #Semantic Interpretation
+ Let's use the grammar below as an example. It describes parentheses wrapping a single character to an arbitrary depth.
+
+     grammar ParenLanguage
+       rule parenthesized_letter
+         '(' parenthesized_letter ')'
+         /
+         [a-z]
+       end
+     end
+
+ Matches:
+
+ * `'a'`
+ * `'(a)'`
+ * `'((a))'`
+ * etc.
+
+ Output from a parser for this grammar looks like this:
+
+ ![Tree Returned By ParenLanguageParser](./images/paren_language_output.png)
+
+ This is a parse tree whose nodes are instances of `Treetop::Runtime::SyntaxNode`. What if we could define methods on these node objects? We would then have an object-oriented program whose structure corresponded to the structure of our language. Treetop provides two techniques for doing just this.
+
+ ##Associating Methods with Node-Instantiating Expressions
+ Sequences and all types of terminals are node-instantiating expressions. When they match, they create instances of `Treetop::Runtime::SyntaxNode`. Methods can be added to these nodes in the following ways:
+
+ ###Inline Method Definition
+ Methods can be added to the nodes instantiated by the successful match of an expression:
+
+     grammar ParenLanguage
+       rule parenthesized_letter
+         '(' parenthesized_letter ')' {
+           def depth
+             parenthesized_letter.depth + 1
+           end
+         }
+         /
+         [a-z] {
+           def depth
+             0
+           end
+         }
+       end
+     end
+
+ Note that each alternative expression is followed by a block containing a method definition. A `depth` method is defined for both expressions. The recursive `depth` method defined in the block following the first expression determines the depth of the nested parentheses and adds one to it. The base case is implemented in the block following the second expression; a single character has a depth of 0.
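A usage sketch, assuming the grammar above has been saved and compiled (the file name is assumed; the parser class name follows Treetop's grammar-to-class naming convention):

    require 'rubygems'
    require 'treetop'

    Treetop.load 'paren_language'

    parser = ParenLanguageParser.new
    parser.parse('((a))').depth  # => 2
    parser.parse('a').depth      # => 0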
+
+
+ ###Custom `SyntaxNode` Subclass Declarations
+ You can instruct the parser to instantiate a custom subclass of Treetop::Runtime::SyntaxNode for an expression by following it with the name of that class enclosed in angle brackets (`<>`). The above inline method definitions could have been moved out into a single class like so.
+
+     # in .treetop file
+     grammar ParenLanguage
+       rule parenthesized_letter
+         '(' parenthesized_letter ')' <ParenNode>
+         /
+         [a-z] <ParenNode>
+       end
+     end
+
+     # in separate .rb file
+     class ParenNode < Treetop::Runtime::SyntaxNode
+       def depth
+         if nonterminal?
+           parenthesized_letter.depth + 1
+         else
+           0
+         end
+       end
+     end
+
+ ##Automatic Extension of Results
+ Nonterminal and ordered choice expressions do not instantiate new nodes, but rather pass through nodes that are instantiated by other expressions. They can extend the nodes they propagate with anonymous or declared modules, using constructs similar to those used with expressions that instantiate their own syntax nodes.
+
+ ###Extending a Propagated Node with an Anonymous Module
+
+     rule parenthesized_letter
+       ('(' parenthesized_letter ')' / [a-z]) {
+         def depth
+           if nonterminal?
+             parenthesized_letter.depth + 1
+           else
+             0
+           end
+         end
+       }
+     end
+
+ The parenthesized choice above can result in a node matching either of the two choices. That node will be extended with the methods defined in the subsequent block. Note that a choice must always be parenthesized to be associated with a following block.
+
+ ###Extending A Propagated Node with a Declared Module
+
+     # in .treetop file
+     rule parenthesized_letter
+       ('(' parenthesized_letter ')' / [a-z]) <ParenNode>
+     end
+
+     # in separate .rb file
+     module ParenNode
+       def depth
+         if nonterminal?
+           parenthesized_letter.depth + 1
+         else
+           0
+         end
+       end
+     end
+
+ Here the result is extended with the `ParenNode` module. Note that, unlike in the previous example for node-instantiating expressions, the constant in the declaration must be a module, because the result is extended with it rather than instantiated from it.
+
+ ##Automatically-Defined Element Accessor Methods
+ ###Default Accessors
+ Nodes instantiated upon the matching of sequences have methods automatically defined for any nonterminals in the sequence.
+
+     rule abc
+       a b c {
+         def to_s
+           a.to_s + b.to_s + c.to_s
+         end
+       }
+     end
+
+ In the above code, the `to_s` method calls automatically-defined element accessors for the nodes returned by parsing nonterminals `a`, `b`, and `c`.
+
+ ###Labels
+ Subexpressions can be given an explicit label to have an element accessor method defined for them. This is useful in cases of ambiguity between two references to the same nonterminal, or when you need to access an unnamed subexpression.
+
+     rule labels
+       first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
+         def letters
+           [first_letter] + rest_letters.elements.map { |comma_and_letter| comma_and_letter.letter }
+         end
+       }
+     end
+
+ The above grammar uses label-derived accessors to determine the letters in a comma-delimited list of letters. The labeled expressions _could_ have been extracted to their own rules, but if they aren't used elsewhere, labels still enable them to be referenced by name within the expression's methods.
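A usage sketch, assuming the `labels` rule above is the root rule of a compiled grammar whose generated parser class is `LabelsParser` (both the grammar and the class name are assumed for illustration):

    node = LabelsParser.new.parse('a, b, c')
    node.letters.map { |letter| letter.text_value }  # => ["a", "b", "c"]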
+
+ ###Overriding Element Accessors
+ The module containing automatically defined element accessor methods is an ancestor of the module in which you define your own methods, meaning you can override them with access to the `super` keyword. Here's an example of how this fact can improve the readability of the example above.
+
+     rule labels
+       first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
+         def letters
+           [first_letter] + rest_letters
+         end
+
+         def rest_letters
+           super.elements.map { |comma_and_letter| comma_and_letter.letter }
+         end
+       }
+     end
+
+ ##Methods Available on `Treetop::Runtime::SyntaxNode`
+
+ <table>
+   <tr>
+     <td><code>terminal?</code></td>
+     <td>Was this node produced by the matching of a terminal symbol?</td>
+   </tr>
+   <tr>
+     <td><code>nonterminal?</code></td>
+     <td>Was this node produced by the matching of a nonterminal symbol?</td>
+   </tr>
+   <tr>
+     <td><code>text_value</code></td>
+     <td>The substring of the input represented by this node.</td>
+   </tr>
+   <tr>
+     <td><code>elements</code></td>
+     <td>Available only on nonterminal nodes; returns the nodes parsed by the elements of the matched sequence.</td>
+   </tr>
+ </table>