treetop 1.4.14 → 1.4.15

Binary file
Binary file
Binary file
Binary file
data/doc/site/index.html DELETED
@@ -1,102 +0,0 @@
1
- <html><head><link href="./screen.css" rel="stylesheet" type="text/css" />
2
- <script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
3
- </script>
4
- <script type="text/javascript">
5
- _uacct = "UA-3418876-1";
6
- urchinTracker();
7
- </script>
8
- </head><body><div id="top"><div id="main_navigation"><ul><li><a href="syntactic_recognition.html">Documentation</a></li><li><a href="contribute.html">Contribute</a></li><li>Home</li></ul></div></div><div id="middle"><div id="main_content"><p class="intro_text">
9
-
10
- Treetop is a language for describing languages. Combining the elegance of Ruby with cutting-edge <em>parsing expression grammars</em>, it helps you analyze syntax with revolutionary ease.
11
-
12
- </p>
13
-
14
-
15
- <pre><code>sudo gem install treetop
16
- </code></pre>
17
-
18
- <h1>Intuitive Grammar Specifications</h1>
19
-
20
- <p>Parsing expression grammars (PEGs) are simple to write and easy to maintain. They are a simple but powerful generalization of regular expressions that are easier to work with than the LALR or LR-1 grammars of traditional parser generators. There's no need for a tokenization phase, and <em>lookahead assertions</em> can be used for a limited degree of context-sensitivity. Here's an extremely simple Treetop grammar that matches a subset of arithmetic, respecting operator precedence:</p>
21
-
22
- <pre><code>grammar Arithmetic
23
- rule additive
24
- multitive ( '+' multitive )*
25
- end
26
-
27
- rule multitive
28
- primary ( [*/%] primary )*
29
- end
30
-
31
- rule primary
32
- '(' additive ')' / number
33
- end
34
-
35
- rule number
36
- '-'? [1-9] [0-9]*
37
- end
38
- end
39
- </code></pre>
40
-
41
- <h1>Syntax-Oriented Programming</h1>
42
-
43
- <p>Rather than implementing semantic actions that construct parse trees, Treetop lets you define methods on trees that it constructs for you automatically. You can define these methods directly within the grammar...</p>
44
-
45
- <pre><code>grammar Arithmetic
46
- rule additive
47
- multitive a:( '+' multitive )* {
48
- def value
49
- a.elements.inject(multitive.value) { |sum, e|
50
- sum+e.multitive.value
51
- }
52
- end
53
- }
54
- end
55
-
56
- # other rules below ...
57
- end
58
- </code></pre>
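To see what the inline `value` method above computes, here is a minimal plain-Ruby sketch. The `Num`, `Tail`, `Repetition`, and `Additive` structs are hypothetical stand-ins for the syntax nodes Treetop would build for this rule; they are assumptions for illustration and not part of Treetop's API.

```ruby
# Hypothetical stand-ins for Treetop syntax nodes (not Treetop's real classes).
Num        = Struct.new(:value)      # a parsed multitive with a known value
Tail       = Struct.new(:multitive)  # one ('+' multitive) repetition
Repetition = Struct.new(:elements)   # the node labeled a: in the grammar

Additive = Struct.new(:multitive, :a) do
  # Same body as the inline method in the grammar above: start from the
  # first operand and add the operand of each repetition in turn.
  def value
    a.elements.inject(multitive.value) { |sum, e| sum + e.multitive.value }
  end
end

# Models the input 1+2+3.
tree = Additive.new(
  Num.new(1),
  Repetition.new([Tail.new(Num.new(2)), Tail.new(Num.new(3))])
)
puts tree.value
```

When the repetition matched nothing, `a.elements` is empty and `inject` simply returns the value of the first operand.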
59
-
60
- <p>...or associate rules with classes of nodes you wish your parsers to instantiate upon matching a rule.</p>
61
-
62
- <pre><code>grammar Arithmetic
63
- rule additive
64
- multitive ('+' multitive)* &lt;AdditiveNode&gt;
65
- end
66
-
67
- # other rules below ...
68
- end
69
- </code></pre>
70
-
71
- <h1>Reusable, Composable Language Descriptions</h1>
72
-
73
- <p>Because PEGs are closed under composition, Treetop grammars can be treated like Ruby modules. You can mix them into one another and override rules with access to the <code>super</code> keyword. You can break large grammars down into coherent units or make your language's syntax modular. This is especially useful if you want other programmers to be able to reuse your work.</p>
74
-
75
- <pre><code>grammar RubyWithEmbeddedSQL
76
- include SQL
77
-
78
- rule string
79
- quote sql_expression quote / super
80
- end
81
- end
82
- </code></pre>
83
-
84
- <h1>Acknowledgements</h1>
85
-
86
- <p><a href="http://pivotallabs.com"><img id="pivotal_logo" src="./images/pivotal.gif"></a></p>
87
-
88
- <p>First, thank you to my employer Rob Mee of <a href="http://pivotallabs.com">Pivotal Labs</a> for funding a substantial portion of Treetop's development. He gets it.</p>
89
-
90
- <p>I'd also like to thank:</p>
91
-
92
- <ul>
93
- <li>Damon McCormick for several hours of pair programming.</li>
94
- <li>Nick Kallen for lots of well-considered feedback and a few afternoons of programming.</li>
95
- <li>Brian Takita for a night of pair programming.</li>
96
- <li>Eliot Miranda for urging me to rewrite as a compiler right away rather than putting it off.</li>
97
- <li>Ryan Davis and Eric Hodel for hurting my code.</li>
98
- <li>Dav Yaginuma for kicking me into action on my idea.</li>
99
- <li>Bryan Ford for his seminal work on Packrat Parsers.</li>
100
- <li>The editors of Lambda the Ultimate, where I discovered parsing expression grammars.</li>
101
- </ul>
102
- </div></div><div id="bottom"></div></body></html>
@@ -1,68 +0,0 @@
1
- <html><head><link href="./screen.css" rel="stylesheet" type="text/css" />
2
- <script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
3
- </script>
4
- <script type="text/javascript">
5
- _uacct = "UA-3418876-1";
6
- urchinTracker();
7
- </script>
8
- </head><body><div id="top"><div id="main_navigation"><ul><li>Documentation</li><li><a href="contribute.html">Contribute</a></li><li><a href="index.html">Home</a></li></ul></div></div><div id="middle"><div id="main_content"><div id="secondary_navigation"><ul><li><a href="syntactic_recognition.html">Syntax</a></li><li><a href="semantic_interpretation.html">Semantics</a></li><li><a href="using_in_ruby.html">Using In Ruby</a></li><li>Advanced Techniques</li></ul></div><div id="documentation_content"><h1>Pitfalls</h1>
9
-
10
- <h2>Left Recursion</h2>
11
-
12
- <p>A weakness shared by all recursive descent parsers is the inability to parse left-recursive rules. Consider the following rule:</p>
13
-
14
- <pre><code>rule left_recursive
15
- left_recursive 'a' / 'a'
16
- end
17
- </code></pre>
18
-
19
- <p>Logically it should match a list of 'a' characters. But it never consumes anything, because attempting to recognize <code>left_recursive</code> begins by attempting to recognize <code>left_recursive</code>, and so it falls into infinite recursion. There's always a way to eliminate these types of structures from your grammar. There's a mechanistic transformation called <em>left factorization</em> that can eliminate it, but it isn't always pretty, especially in combination with automatically constructed syntax trees. So far, I have found more thoughtful ways around the problem. For instance, in the interpreter example I parse inherently left-recursive function application right-recursively in the syntax, then correct the directionality in my semantic interpretation. You may have to be clever.</p>
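The problem and one workaround can be sketched in plain Ruby. These are hand-written recognizers, not code generated by Treetop; the names `left_recursive`, `a_list`, and `left_assoc` and the depth guard are assumptions made for illustration.

```ruby
# Direct transcription of: left_recursive 'a' / 'a'
# The first step is a call to itself at the same position, so no input is
# ever consumed. The depth guard stands in for the eventual stack overflow.
def left_recursive(input, pos, depth = 0)
  raise "no progress at position #{pos}" if depth > 50
  rest = left_recursive(input, pos, depth + 1) # same pos: nothing consumed
  return rest + 1 if rest && input[rest] == 'a'
  input[pos] == 'a' ? pos + 1 : nil
end

# The same language without left recursion, equivalent to the rule 'a'+.
# Returns the position after the match, or nil when nothing matched.
def a_list(input, pos)
  start = pos
  pos += 1 while input[pos] == 'a'
  pos > start ? pos : nil
end

# Correcting directionality afterwards: take a flat list of parsed items
# and fold it from the left, so "f x y" is read as ((f x) y).
def left_assoc(items)
  items.drop(1).inject(items.first) { |fn, arg| [fn, arg] }
end

p a_list('aaab', 0)
p left_assoc(%w[f x y])
```

The fold in `left_assoc` is the kind of semantic correction the paragraph describes: the syntax stays free of left recursion, and the left-associative structure is rebuilt once the items are known.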
20
-
21
- <h1>Advanced Techniques</h1>
22
-
23
- <p>Here are a few interesting problems I've encountered. I figure sharing them may give you insight into how these types of issues are addressed with the tools of parsing expressions.</p>
24
-
25
- <h2>Matching a String</h2>
26
-
27
- <pre><code>rule string
28
- '"' ('\"' / !'"' .)* '"'
29
- end
30
- </code></pre>
31
-
32
- <p>This expression says: match a quote, then zero or more occurrences of either an escaped quote or any character other than a quote, then a closing quote. Lookahead assertions are essential for these types of problems.</p>
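For comparison, the same matching logic can be written by hand with Ruby's standard `StringScanner`. This is only a sketch of what the expression means, assuming ASCII input; `match_string` is a hypothetical helper and not how Treetop implements the rule.

```ruby
require 'strscan'

# Hand-written equivalent of: '"' ('\"' / !'"' .)* '"'
# Returns the matched literal (quotes included) or nil if there is no match.
def match_string(input)
  s = StringScanner.new(input)
  return nil unless s.scan(/"/)            # opening quote
  loop do
    next if s.scan(/\\"/)                  # first alternative: escaped quote
    next if !s.check(/"/) && s.scan(/./m)  # second alternative: !'"' .
    break                                  # neither alternative matched
  end
  return nil unless s.scan(/"/)            # closing quote
  input.byteslice(0, s.pos)
end

puts match_string('"a\"b" rest')
p match_string('"unterminated')
```

The order of the two alternatives matters, just as in the grammar: trying the escaped quote first keeps the backslash from being consumed on its own by the second alternative.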
33
-
34
- <h2>Matching Nested Structures With Non-Unique Delimiters</h2>
35
-
36
- <p>Say I want to parse a diabolical wiki syntax in which the following interpretations apply.</p>
37
-
38
- <pre><code>** *hello* ** --&gt; &lt;strong&gt;&lt;em&gt;hello&lt;/em&gt;&lt;/strong&gt;
39
- * **hello** * --&gt; &lt;em&gt;&lt;strong&gt;hello&lt;/strong&gt;&lt;/em&gt;
40
-
41
- rule strong
42
- '**' (em / !'*' . / '\*')+ '**'
43
- end
44
-
45
- rule em
46
- '*' (strong / !'*' . / '\*')+ '*'
47
- end
48
- </code></pre>
49
-
50
- <p>Emphasized text is allowed within strong text by virtue of <code>em</code> being the first alternative. Since <code>em</code> will only successfully parse if a matching <code>*</code> is found, it is permitted, but other than that, no <code>*</code> characters are allowed unless they are escaped.</p>
51
-
52
- <h2>Matching a Keyword But Not Words Prefixed Therewith</h2>
53
-
54
- <p>Say I want to consider a given string of characters a keyword only when it occurs in isolation. Let's use the <code>end</code> keyword as an example. We don't want the prefix of <code>'enders_game'</code> to be considered a keyword. A naive implementation might be the following.</p>
55
-
56
- <pre><code>rule end_keyword
57
- 'end' &amp;space
58
- end
59
- </code></pre>
60
-
61
- <p>This says that <code>'end'</code> must be followed by a space, but this space is not consumed as part of the matching of <code>end_keyword</code>. This works in most cases, but is actually incorrect. What if <code>end</code> occurs at the end of the buffer? In that case, it occurs in isolation but will not match the above expression. What we really mean is that <code>'end'</code> cannot be followed by a <em>non-space</em> character.</p>
62
-
63
- <pre><code>rule end_keyword
64
- 'end' !(!' ' .)
65
- end
66
- </code></pre>
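The difference between the two `end_keyword` rules above can be approximated with Ruby regular expressions. This is an analogy rather than Treetop output, and it assumes the `space` rule matches a single space character; the constant names are hypothetical.

```ruby
# 'end' &space      : a positive lookahead that requires a following space
NAIVE_END   = /\Aend(?= )/
# 'end' !(!' ' .)   : a negative lookahead that forbids a following non-space
CORRECT_END = /\Aend(?![^ ])/

['end', 'end of block', 'enders_game'].each do |input|
  puts format('%-14s naive=%-5s correct=%s',
              input.inspect, NAIVE_END.match?(input), CORRECT_END.match?(input))
end
```

Only the second pattern accepts `end` at the very end of the input, because a negative lookahead succeeds when there is nothing left to look at.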
67
-
68
- <p>In general, when the syntax gets tough, it helps to focus on what you really mean. A keyword is a sequence of characters not followed by another character that isn't a space.</p></div></div></div><div id="bottom"></div></body></html>
data/doc/site/robots.txt DELETED
@@ -1,5 +0,0 @@
1
- User-agent: *
2
- Disallow: /softwaremap/ # This is an infinite virtual URL space
3
- Disallow: /statcvs/ # This is an infinite virtual URL space
4
- Disallow: /usage/ # This is an infinite virtual URL space
5
- Disallow: /wiki/ # This is an infinite virtual URL space
data/doc/site/screen.css DELETED
@@ -1,134 +0,0 @@
1
- body {
2
- margin: 0;
3
- padding: 0;
4
- background: #666666;
5
- font-family: "Lucida Grande", Geneva, Arial, Verdana, sans-serif;
6
- color: #333333;
7
- }
8
-
9
- div {
10
- margin: 0;
11
- background-position: center;
12
- background-repeat: no-repeat;
13
- }
14
-
15
- h1 {
16
- font-size: 125%;
17
- margin-top: 1.5em;
18
- margin-bottom: .5em;
19
- }
20
-
21
- h2 {
22
- font-size: 115%;
23
- margin-top: 3em;
24
- margin-bottom: .5em;
25
- }
26
-
27
- h3 {
28
- font-size: 105%;
29
- margin-top: 1.5em;
30
- margin-bottom: .5em;
31
- }
32
-
33
- a {
34
- color: #ff8429;
35
- }
36
-
37
-
38
- div#top {
39
- background-image: url( "images/top_background.png" );
40
- height: 200px;
41
- width: 100%;
42
- }
43
-
44
- div#middle {
45
- padding-top: 10px;
46
- background-image: url( "images/middle_background.png" );
47
- background-repeat: repeat-y;
48
- }
49
-
50
- div#bottom {
51
- background-image: url( "images/bottom_background.png" );
52
- height: 13px;
53
- margin-bottom: 30px;
54
- }
55
-
56
- div#main_navigation {
57
- width: 300px;
58
- margin: 0px auto 0 auto;
59
- padding-top: 43px;
60
- padding-right: 10px;
61
- position: relative;
62
- right: 500px;
63
- text-align: right;
64
- line-height: 130%;
65
- font-size: 90%;
66
- }
67
-
68
- div#main_navigation ul {
69
- list-style-type: none;
70
- padding: 0;
71
- }
72
-
73
- div#main_navigation a, div#main_navigation a:visited {
74
- color: white;
75
- text-decoration: none;
76
- }
77
-
78
- div#main_navigation a:hover {
79
- text-decoration: underline;
80
- }
81
-
82
- div#secondary_navigation {
83
- position: relative;
84
- font-size: 90%;
85
- margin: 0 auto 0 auto;
86
- padding: 0px;
87
- text-align: center;
88
- position: relative;
89
- top: -10px;
90
- }
91
-
92
- div#secondary_navigation ul {
93
- list-style-type: none;
94
- padding: 0;
95
- }
96
-
97
- div#secondary_navigation li {
98
- display: inline;
99
- margin-left: 10px;
100
- margin-right: 10px;
101
- }
102
-
103
- div#main_content {
104
- width: 545px;
105
- margin: 0 auto 0 auto;
106
- padding: 0 60px 25px 60px;
107
- }
108
-
109
- pre {
110
- background: #333333;
111
- color: white;
112
- padding: 15px;
113
- border: 1px solid #666666;
114
- }
115
-
116
- p {
117
- line-height: 150%;
118
- }
119
-
120
- p.intro_text {
121
- color: #C45900;
122
- font-size: 115%;
123
- }
124
-
125
- img#pivotal_logo {
126
- border: none;
127
- margin-left: auto;
128
- margin-right: auto;
129
- }
130
-
131
- tr td {
132
- vertical-align: top;
133
- padding: 0.3em;
134
- }
@@ -1,245 +0,0 @@
1
- <html><head><link href="./screen.css" rel="stylesheet" type="text/css" />
2
- <script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
3
- </script>
4
- <script type="text/javascript">
5
- _uacct = "UA-3418876-1";
6
- urchinTracker();
7
- </script>
8
- </head><body><div id="top"><div id="main_navigation"><ul><li>Documentation</li><li><a href="contribute.html">Contribute</a></li><li><a href="index.html">Home</a></li></ul></div></div><div id="middle"><div id="main_content"><div id="secondary_navigation"><ul><li><a href="syntactic_recognition.html">Syntax</a></li><li>Semantics</li><li><a href="using_in_ruby.html">Using In Ruby</a></li><li><a href="pitfalls_and_advanced_techniques.html">Advanced Techniques</a></li></ul></div><div id="documentation_content"><h1>Semantic Interpretation</h1>
9
-
10
- <p>Let's use the grammar below as an example. It describes parentheses wrapping a single character to an arbitrary depth.</p>
11
-
12
- <pre><code>grammar ParenLanguage
13
- rule parenthesized_letter
14
- '(' parenthesized_letter ')'
15
- /
16
- [a-z]
17
- end
18
- end
19
- </code></pre>
20
-
21
- <p>Matches:</p>
22
-
23
- <ul>
24
- <li><code>'a'</code></li>
25
- <li><code>'(a)'</code></li>
26
- <li><code>'((a))'</code></li>
27
- <li>etc.</li>
28
- </ul>
29
-
30
-
31
- <p>Output from a parser for this grammar looks like this:</p>
32
-
33
- <p><img src="./images/paren_language_output.png" alt="Tree Returned By ParenLanguageParser" /></p>
34
-
35
- <p>This is a parse tree whose nodes are instances of <code>Treetop::Runtime::SyntaxNode</code>. What if we could define methods on these node objects? We would then have an object-oriented program whose structure corresponded to the structure of our language. Treetop provides two techniques for doing just this.</p>
36
-
37
- <h2>Associating Methods with Node-Instantiating Expressions</h2>
38
-
39
- <p>Sequences and all types of terminals are node-instantiating expressions. When they match, they create instances of <code>Treetop::Runtime::SyntaxNode</code>. Methods can be added to these nodes in the following ways:</p>
40
-
41
- <h3>Inline Method Definition</h3>
42
-
43
- <p>Methods can be added to the nodes instantiated by the successful match of an expression:</p>
44
-
45
- <pre><code>grammar ParenLanguage
46
- rule parenthesized_letter
47
- '(' parenthesized_letter ')' {
48
- def depth
49
- parenthesized_letter.depth + 1
50
- end
51
- }
52
- /
53
- [a-z] {
54
- def depth
55
- 0
56
- end
57
- }
58
- end
59
- end
60
- </code></pre>
61
-
62
- <p>Note that each alternative expression is followed by a block containing a method definition. A <code>depth</code> method is defined on both expressions. The recursive <code>depth</code> method defined in the block following the first expression determines the depth of the nested parentheses and adds one to it. The base case is implemented in the block following the second expression; a single character has a depth of 0.</p>
63
-
64
- <h3>Custom <code>SyntaxNode</code> Subclass Declarations</h3>
65
-
66
- <p>You can instruct the parser to instantiate a custom subclass of <code>Treetop::Runtime::SyntaxNode</code> for an expression by following it with the name of that class enclosed in angle brackets (<code>&lt;&gt;</code>). The above inline method definitions could have been moved out into a single class like so.</p>
67
-
68
- <pre><code># in .treetop file
69
- grammar ParenLanguage
70
- rule parenthesized_letter
71
- '(' parenthesized_letter ')' &lt;ParenNode&gt;
72
- /
73
- [a-z] &lt;ParenNode&gt;
74
- end
75
- end
76
-
77
- # in separate .rb file
78
- class ParenNode &lt; Treetop::Runtime::SyntaxNode
79
- def depth
80
- if nonterminal?
81
- parenthesized_letter.depth + 1
82
- else
83
- 0
84
- end
85
- end
86
- end
87
- </code></pre>
88
-
89
- <h2>Automatic Extension of Results</h2>
90
-
91
- <p>Nonterminal and ordered choice expressions do not instantiate new nodes, but rather pass through nodes that are instantiated by other expressions. They can extend the nodes they propagate with anonymous or declared modules, using constructs similar to those used with expressions that instantiate their own syntax nodes.</p>
92
-
93
- <h3>Extending a Propagated Node with an Anonymous Module</h3>
94
-
95
- <pre><code>rule parenthesized_letter
96
- ('(' parenthesized_letter ')' / [a-z]) {
97
- def depth
98
- if nonterminal?
99
- parenthesized_letter.depth + 1
100
- else
101
- 0
102
- end
103
- end
104
- }
105
- end
106
- </code></pre>
107
-
108
- <p>The parenthesized choice above can result in a node matching either of the two choices. The node will be extended with methods defined in the subsequent block. Note that a choice must always be parenthesized to be associated with a following block, otherwise the block will apply to just the last alternative.</p>
109
-
110
- <h3>Extending A Propagated Node with a Declared Module</h3>
111
-
112
- <pre><code># in .treetop file
113
- rule parenthesized_letter
114
- ('(' parenthesized_letter ')' / [a-z]) &lt;ParenNode&gt;
115
- end
116
-
117
- # in separate .rb file
118
- module ParenNode
119
- def depth
120
- if nonterminal?
121
- parenthesized_letter.depth + 1
122
- else
123
- 0
124
- end
125
- end
126
- end
127
- </code></pre>
128
-
129
- <p>Here the result is extended with the <code>ParenNode</code> module. Note that, unlike in the previous example for node-instantiating expressions, the constant in the declaration must be a module because the result is extended with it.</p>
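The module requirement follows from how Ruby's `Object#extend` works, which the plain-Ruby sketch below illustrates. `FakeNode` and `ParenClass` are hypothetical stand-ins introduced only for this example; real nodes would be `Treetop::Runtime::SyntaxNode` instances created by the parser.

```ruby
# Hypothetical stand-in for a syntax node already created by another expression.
class FakeNode
  attr_reader :parenthesized_letter

  def initialize(inner = nil)
    @parenthesized_letter = inner
  end

  # In this sketch a node is nonterminal when it wraps an inner node.
  def nonterminal?
    !@parenthesized_letter.nil?
  end
end

# Same method as in the declared module above.
module ParenNode
  def depth
    nonterminal? ? parenthesized_letter.depth + 1 : 0
  end
end

# Extending existing objects with a module adds the methods to them.
letter  = FakeNode.new.extend(ParenNode)
inner   = FakeNode.new(letter).extend(ParenNode)
wrapped = FakeNode.new(inner).extend(ParenNode) # models '((a))'
puts wrapped.depth

# Extending with a class is rejected by Ruby itself.
class ParenClass; end
begin
  FakeNode.new.extend(ParenClass)
rescue TypeError => e
  puts "extend rejected a class: #{e.class}"
end
```

A class can only be instantiated, which is why class names are reserved for node-instantiating expressions, while propagated nodes, which already exist, can only be extended.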
130
-
131
- <h2>Automatically-Defined Element Accessor Methods</h2>
132
-
133
- <h3>Default Accessors</h3>
134
-
135
- <p>Nodes instantiated upon the matching of sequences have methods automatically defined for any nonterminals in the sequence.</p>
136
-
137
- <pre><code>rule abc
138
- a b c {
139
- def to_s
140
- a.to_s + b.to_s + c.to_s
141
- end
142
- }
143
- end
144
- </code></pre>
145
-
146
- <p>In the above code, the <code>to_s</code> method calls automatically-defined element accessors for the nodes returned by parsing nonterminals <code>a</code>, <code>b</code>, and <code>c</code>.</p>
147
-
148
- <h3>Labels</h3>
149
-
150
- <p>Subexpressions can be given an explicit label to have an element accessor method defined for them. This is useful in cases of ambiguity between two references to the same nonterminal or when you need to access an unnamed subexpression.</p>
151
-
152
- <pre><code>rule labels
153
- first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
154
- def letters
155
- [first_letter] + rest_letters.elements.map do |comma_and_letter|
156
- comma_and_letter.letter
157
- end
158
- end
159
- }
160
- end
161
- </code></pre>
162
-
163
- <p>The above grammar uses label-derived accessors to determine the letters in a comma-delimited list of letters. The labeled expressions <em>could</em> have been extracted to their own rules, but if they aren't used elsewhere, labels still enable them to be referenced by a name within the expression's methods.</p>
164
-
165
- <h3>Overriding Element Accessors</h3>
166
-
167
- <p>The module containing automatically defined element accessor methods is an ancestor of the module in which you define your own methods, meaning you can override them with access to the <code>super</code> keyword. Here's an example of how this fact can improve the readability of the example above.</p>
168
-
169
- <pre><code>rule labels
170
- first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
171
- def letters
172
- [first_letter] + rest_letters
173
- end
174
-
175
- def rest_letters
176
- super.elements.map { |comma_and_letter| comma_and_letter.letter }
177
- end
178
- }
179
- end
180
- </code></pre>
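The ancestor relationship that makes `super` work here can be reproduced in plain Ruby. In this sketch `GeneratedAccessors`, `InlineMethods`, and `Node` are hypothetical names standing in for the modules and node Treetop would create; the real generated names and internals may differ.

```ruby
# Stand-in for the module of automatically defined element accessors.
module GeneratedAccessors
  def rest_letters
    @rest # the raw repetition: here an array of [comma, letter] pairs
  end
end

# Stand-in for the module built from the inline block in the grammar.
module InlineMethods
  include GeneratedAccessors # makes GeneratedAccessors an ancestor

  def letters
    [@first] + rest_letters
  end

  # Overrides the generated accessor and reaches the original via super.
  def rest_letters
    super.map { |_comma, letter| letter }
  end
end

class Node
  def initialize(first, rest)
    @first = first
    @rest = rest
  end
end

node = Node.new('a', [[', ', 'b'], [', ', 'c']]).extend(InlineMethods)
p node.letters
p InlineMethods.ancestors
```

Because the override sits in front of the generated accessor in the method lookup chain, `letters` can stay short and every caller of `rest_letters` sees the cleaned-up list.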
181
-
182
- <h2>Methods Available on <code>Treetop::Runtime::SyntaxNode</code></h2>
183
-
184
- <table>
185
- <tr>
186
- <td>
187
- <code>terminal?</code>
188
- </td>
189
- <td>
190
- Was this node produced by the matching of a terminal symbol?
191
- </td>
192
- </tr>
193
- <tr>
194
- <td>
195
- <code>nonterminal?</code>
196
- </td>
197
- <td>
198
- Was this node produced by the matching of a nonterminal symbol?
199
- </td>
200
- </tr><tr>
201
- <td>
202
- <code>text_value</code>
203
- </td>
204
- <td>
205
- The substring of the input represented by this node.
206
- </td>
207
- </tr><tr>
208
- <td>
209
- <code>elements</code>
210
- </td>
211
- <td>
212
- Available only on nonterminal nodes; returns the nodes parsed by the elements of the matched sequence.
213
- </td>
214
- </tr><tr>
215
- <td>
216
- <code>input</code>
217
- </td>
218
- <td>
219
- The entire input string, which is useful mainly in conjunction with <code>interval</code>
220
- </td>
221
- </tr><tr>
222
- <td>
223
- <code>interval</code>
224
- </td>
225
- <td>
226
- The Range of characters in <code>input</code> matched by this rule
227
- </td>
228
- </tr><tr>
229
- <td>
230
- <code>empty?</code>
231
- </td>
232
- <td>
233
- Returns true if this rule matched no characters of input.
234
- </td>
235
- </tr><tr>
236
- <td>
237
- <code>inspect</code>
238
- </td>
239
- <td>
240
- Handy-dandy method that returns an indented subtree dump of the syntax tree starting here.
241
- This dump includes, for each node, the offset and a snippet of the text this rule matched, and the names of mixin modules and the accessor and extension methods.
242
- </td>
243
- </tr>
244
- </table>
245
- </div></div></div><div id="bottom"></div></body></html>