cognita-treetop 1.2.4

Files changed (61)
  1. data/README +164 -0
  2. data/Rakefile +35 -0
  3. data/bin/tt +25 -0
  4. data/doc/contributing_and_planned_features.markdown +103 -0
  5. data/doc/grammar_composition.markdown +65 -0
  6. data/doc/index.markdown +90 -0
  7. data/doc/pitfalls_and_advanced_techniques.markdown +51 -0
  8. data/doc/semantic_interpretation.markdown +189 -0
  9. data/doc/site.rb +110 -0
  10. data/doc/sitegen.rb +60 -0
  11. data/doc/syntactic_recognition.markdown +100 -0
  12. data/doc/using_in_ruby.markdown +21 -0
  13. data/examples/lambda_calculus/arithmetic.rb +551 -0
  14. data/examples/lambda_calculus/arithmetic.treetop +97 -0
  15. data/examples/lambda_calculus/arithmetic_node_classes.rb +7 -0
  16. data/examples/lambda_calculus/arithmetic_test.rb +54 -0
  17. data/examples/lambda_calculus/lambda_calculus +0 -0
  18. data/examples/lambda_calculus/lambda_calculus.rb +718 -0
  19. data/examples/lambda_calculus/lambda_calculus.treetop +132 -0
  20. data/examples/lambda_calculus/lambda_calculus_node_classes.rb +5 -0
  21. data/examples/lambda_calculus/lambda_calculus_test.rb +89 -0
  22. data/examples/lambda_calculus/test_helper.rb +18 -0
  23. data/lib/treetop.rb +8 -0
  24. data/lib/treetop/bootstrap_gen_1_metagrammar.rb +45 -0
  25. data/lib/treetop/compiler.rb +6 -0
  26. data/lib/treetop/compiler/grammar_compiler.rb +40 -0
  27. data/lib/treetop/compiler/lexical_address_space.rb +17 -0
  28. data/lib/treetop/compiler/metagrammar.rb +2887 -0
  29. data/lib/treetop/compiler/metagrammar.treetop +404 -0
  30. data/lib/treetop/compiler/node_classes.rb +19 -0
  31. data/lib/treetop/compiler/node_classes/anything_symbol.rb +18 -0
  32. data/lib/treetop/compiler/node_classes/atomic_expression.rb +14 -0
  33. data/lib/treetop/compiler/node_classes/character_class.rb +19 -0
  34. data/lib/treetop/compiler/node_classes/choice.rb +31 -0
  35. data/lib/treetop/compiler/node_classes/declaration_sequence.rb +24 -0
  36. data/lib/treetop/compiler/node_classes/grammar.rb +28 -0
  37. data/lib/treetop/compiler/node_classes/inline_module.rb +27 -0
  38. data/lib/treetop/compiler/node_classes/nonterminal.rb +13 -0
  39. data/lib/treetop/compiler/node_classes/optional.rb +19 -0
  40. data/lib/treetop/compiler/node_classes/parenthesized_expression.rb +9 -0
  41. data/lib/treetop/compiler/node_classes/parsing_expression.rb +138 -0
  42. data/lib/treetop/compiler/node_classes/parsing_rule.rb +55 -0
  43. data/lib/treetop/compiler/node_classes/predicate.rb +45 -0
  44. data/lib/treetop/compiler/node_classes/repetition.rb +55 -0
  45. data/lib/treetop/compiler/node_classes/sequence.rb +68 -0
  46. data/lib/treetop/compiler/node_classes/terminal.rb +20 -0
  47. data/lib/treetop/compiler/node_classes/transient_prefix.rb +9 -0
  48. data/lib/treetop/compiler/node_classes/treetop_file.rb +9 -0
  49. data/lib/treetop/compiler/ruby_builder.rb +113 -0
  50. data/lib/treetop/ruby_extensions.rb +2 -0
  51. data/lib/treetop/ruby_extensions/string.rb +42 -0
  52. data/lib/treetop/runtime.rb +5 -0
  53. data/lib/treetop/runtime/compiled_parser.rb +87 -0
  54. data/lib/treetop/runtime/interval_skip_list.rb +4 -0
  55. data/lib/treetop/runtime/interval_skip_list/head_node.rb +15 -0
  56. data/lib/treetop/runtime/interval_skip_list/interval_skip_list.rb +200 -0
  57. data/lib/treetop/runtime/interval_skip_list/node.rb +164 -0
  58. data/lib/treetop/runtime/syntax_node.rb +72 -0
  59. data/lib/treetop/runtime/terminal_parse_failure.rb +16 -0
  60. data/lib/treetop/runtime/terminal_syntax_node.rb +17 -0
  61. metadata +119 -0
@@ -0,0 +1,51 @@
1
+ #Pitfalls
2
+ ##Left Recursion
3
+ A weakness shared by all recursive descent parsers is the inability to parse left-recursive rules. Consider the following rule:
4
+
5
+ rule left_recursive
6
+ left_recursive 'a' / 'a'
7
+ end
8
+
9
+ Logically it should match a list of 'a' characters. But it never consumes anything, because attempting to recognize `left_recursive` begins by attempting to recognize `left_recursive`, and so on in an infinite recursion. There's always a way to eliminate these types of structures from your grammar. There's a mechanical transformation, usually called _left-recursion elimination_, that can remove them, but it isn't always pretty, especially in combination with automatically constructed syntax trees. So far, I have found more thoughtful ways around the problem. For instance, in the interpreter example I interpret inherently left-recursive function application right recursively in syntax, then correct the directionality in my semantic interpretation. You may have to be clever.
10
+
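One way to picture the workaround described above is to collect the repeated items without recursion and only then nest them to the left. The following is a minimal plain-Ruby sketch, not Treetop code; `parse_as` is a hypothetical helper invented for this illustration.

```ruby
# Hypothetical helper (not part of Treetop): recognizes a run of 'a'
# characters without left recursion, then rebuilds the left-nested
# structure that the left-recursive rule was meant to describe.
def parse_as(input)
  items = input.chars
  # Reject empty input and anything that is not made up solely of 'a'.
  return nil if items.empty? || items.any? { |c| c != 'a' }

  # Fold from the left: "aaa" becomes [["a", "a"], "a"].
  items.drop(1).inject(items.first) { |left, item| [left, item] }
end

p parse_as('aaa')
```

The flat list plays the role of the right-recursive syntax, and the `inject` call plays the role of the semantic step that restores left associativity.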
11
+ #Advanced Techniques
12
+ Here are a few interesting problems I've encountered. I figure sharing them may give you insight into how these types of issues are addressed with the tools of parsing expressions.
13
+
14
+ ##Matching a String
15
+
16
+ rule string
17
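+ '"' ('\"' / !'"' .)* '"'
18
+ end
19
+
20
+ This expression says: match a quote, then zero or more of either an escaped quote or any character other than a quote, followed by a closing quote. The escaped quote must be the first alternative; otherwise `!'"' .` would consume the backslash on its own and the following quote would end the string early. Lookahead assertions are essential for these types of problems.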
21
+
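To see how the order of alternatives plays out on real input, here is a small sketch that walks a string with Ruby's standard `StringScanner`, trying the escaped quote before the any-character-but-a-quote alternative. It is an illustration only: `match_string` is a hypothetical helper, and Treetop itself generates a different kind of parser.

```ruby
require 'strscan'

# Hypothetical helper (not part of Treetop): returns the string literal
# at the start of +input+, quotes included, or nil if there is none.
def match_string(input)
  scanner = StringScanner.new(input)
  return nil unless scanner.scan(/"/)

  # Ordered choice, repeated zero or more times: try the escaped quote
  # first, then any single character that is not a quote.
  loop do
    break unless scanner.scan(/\\"/) || scanner.scan(/[^"]/)
  end

  return nil unless scanner.scan(/"/)
  input[0, scanner.charpos]
end

p match_string('"say \"hi\"" and more')
```

An unterminated literal such as `'"abc'` yields `nil`, because the closing quote is never found.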
22
+ ##Matching Nested Structures With Non-Unique Delimiters
23
+ Say I want to parse a diabolical wiki syntax in which the following interpretations apply.
24
+
25
+ ** *hello* ** --> <strong><em>hello</em></strong>
26
+ * **hello** * --> <em><strong>hello</strong></em>
27
+
28
+ rule strong
29
+ '**' (em / '\*' / !'*' .)+ '**'
30
+ end
31
+
32
+ rule em
33
+ '*' (strong / '\*' / !'*' .)+ '*'
34
+ end
35
+
36
+ Emphasized text is allowed within strong text by virtue of `em` being the first alternative. Since `em` will only successfully parse if a matching `*` is found, it is permitted, but other than that, no `*` characters are allowed unless they are escaped.
37
+
38
+ ##Matching a Keyword But Not Words Prefixed Therewith
39
+ Say I want to treat a given string of characters as a keyword only when it occurs in isolation. Let's use the `end` keyword as an example. We don't want the prefix of `'enders_game'` to be considered a keyword. A naive implementation might be the following.
40
+
41
+ rule end_keyword
42
+ 'end' &space
43
+ end
44
+
45
+ This says that `'end'` must be followed by a space, but this space is not consumed as part of the matching of `end_keyword`. This works in most cases, but is actually incorrect. What if `end` occurs at the end of the buffer? In that case, it occurs in isolation but will not match the above expression. What we really mean is that `'end'` cannot be followed by a _non-space_ character.
46
+
47
+ rule end_keyword
48
+ 'end' !(!' ' .)
49
+ end
50
+
51
+ In general, when the syntax gets tough, it helps to focus on what you really mean. A keyword is a string of characters not followed by another character that isn't a space.
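The difference between the two formulations can be checked with Ruby regular expressions, whose lookahead syntax is close to the parsing-expression assertions. This is only an analogy, and it assumes the `space` rule matches a single space character; the constant names are invented for the example.

```ruby
# Assumption: Ruby regexps stand in for the two Treetop rules here.
NAIVE_END  = /\Aend(?= )/     # like: 'end' &space
STRICT_END = /\Aend(?![^ ])/  # like: 'end' !(!' ' .)

['end', 'end if', 'enders_game'].each do |input|
  naive  = NAIVE_END.match?(input)
  strict = STRICT_END.match?(input)
  puts "#{input.inspect}: naive=#{naive} strict=#{strict}"
end
```

Only the second pattern accepts a bare `end` at the end of the input, while both reject `enders_game`.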
@@ -0,0 +1,189 @@
1
+ #Semantic Interpretation
2
+ Let's use the grammar below as an example. It describes parentheses wrapping a single character to an arbitrary depth.
3
+
4
+ grammar ParenLanguage
5
+ rule parenthesized_letter
6
+ '(' parenthesized_letter ')'
7
+ /
8
+ [a-z]
9
+ end
10
+ end
11
+
12
+ Matches:
13
+
14
+ * `'a'`
15
+ * `'(a)'`
16
+ * `'((a))'`
17
+ * etc.
18
+
19
+
20
+ Output from a parser for this grammar looks like this:
21
+
22
+ ![Tree Returned By ParenLanguageParser](./images/paren_language_output.png)
23
+
24
+ This is a parse tree whose nodes are instances of `Treetop::Runtime::SyntaxNode`. What if we could define methods on these node objects? We would then have an object-oriented program whose structure corresponded to the structure of our language. Treetop provides two techniques for doing just this.
25
+
26
+ ##Associating Methods with Node-Instantiating Expressions
27
+ Sequences and all types of terminals are node-instantiating expressions. When they match, they create instances of `Treetop::Runtime::SyntaxNode`. Methods can be added to these nodes in the following ways:
28
+
29
+ ###Inline Method Definition
30
+ Methods can be added to the nodes instantiated by the successful match of an expression:
31
+
32
+ grammar ParenLanguage
33
+ rule parenthesized_letter
34
+ '(' parenthesized_letter ')' {
35
+ def depth
36
+ parenthesized_letter.depth + 1
37
+ end
38
+ }
39
+ /
40
+ [a-z] {
41
+ def depth
42
+ 0
43
+ end
44
+ }
45
+ end
46
+ end
47
+
48
+ Note that each alternative expression is followed by a block containing a method definition. A `depth` method is defined on both expressions. The recursive `depth` method defined in the block following the first expression determines the depth of the nested parentheses and adds one to it. The base case is implemented in the block following the second expression; a single character has a depth of 0.
49
+
50
+
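The recursion described above can be mimicked in plain Ruby without Treetop. In this sketch, `Letter`, `Parens`, and `parse_paren` are hypothetical names standing in for the two kinds of node and for the generated parser; it is meant to show the shape of the `depth` computation, not how Treetop builds its nodes.

```ruby
# Hypothetical stand-in for the node produced by the [a-z] alternative.
Letter = Struct.new(:text) do
  def depth
    0
  end
end

# Hypothetical stand-in for the node produced by the parenthesized sequence.
Parens = Struct.new(:inner) do
  def depth
    inner.depth + 1
  end
end

# Hypothetical hand-written recognizer for the ParenLanguage grammar.
# Returns a node, or nil when the input is not in the language.
def parse_paren(input)
  if input.start_with?('(') && input.end_with?(')')
    inner = parse_paren(input[1..-2])
    inner && Parens.new(inner)
  elsif input.match?(/\A[a-z]\z/)
    Letter.new(input)
  end
end

p parse_paren('((a))').depth
```

Each `Parens` node asks its inner node for a depth and adds one, and the `Letter` node ends the recursion with 0.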
51
+ ###Custom `SyntaxNode` Subclass Declarations
52
+ You can instruct the parser to instantiate a custom subclass of `Treetop::Runtime::SyntaxNode` for an expression by following it with the name of that class enclosed in angle brackets (`<>`). The above inline method definitions could have been moved out into a single class like so.
53
+
54
+ # in .treetop file
55
+ grammar ParenLanguage
56
+ rule parenthesized_letter
57
+ '(' parenthesized_letter ')' <ParenNode>
58
+ /
59
+ [a-z] <ParenNode>
60
+ end
61
+ end
62
+
63
+ # in separate .rb file
64
+ class ParenNode < Treetop::Runtime::SyntaxNode
65
+ def depth
66
+ if nonterminal?
67
+ parenthesized_letter.depth + 1
68
+ else
69
+ 0
70
+ end
71
+ end
72
+ end
73
+
74
+ ##Automatic Extension of Results
75
+ Nonterminal and ordered choice expressions do not instantiate new nodes, but rather pass through nodes that are instantiated by other expressions. They can extend the nodes they propagate with anonymous or declared modules, using constructs similar to those used with expressions that instantiate their own syntax nodes.
76
+
77
+ ###Extending a Propagated Node with an Anonymous Module
78
+ rule parenthesized_letter
79
+ ('(' parenthesized_letter ')' / [a-z]) {
80
+ def depth
81
+ if nonterminal?
82
+ parenthesized_letter.depth + 1
83
+ else
84
+ 0
85
+ end
86
+ end
87
+ }
88
+ end
89
+
90
+ The parenthesized choice above can result in a node matching either of the two choices. That node will be extended with the methods defined in the subsequent block. Note that a choice must always be parenthesized to be associated with a following block.
91
+
92
+ ###Extending a Propagated Node with a Declared Module
93
+ # in .treetop file
94
+ rule parenthesized_letter
95
+ ('(' parenthesized_letter ')' / [a-z]) <ParenNode>
96
+ end
97
+
98
+ # in separate .rb file
99
+ module ParenNode
100
+ def depth
101
+ if nonterminal?
102
+ parenthesized_letter.depth + 1
103
+ else
104
+ 0
105
+ end
106
+ end
107
+ end
108
+
109
+ Here the result is extended with the `ParenNode` module. Note that, unlike the previous example for node-instantiating expressions, the constant in the declaration must be a module because the result is extended with it.
110
+
111
+ ##Automatically-Defined Element Accessor Methods
112
+ ###Default Accessors
113
+ Nodes instantiated upon the matching of sequences have methods automatically defined for any nonterminals in the sequence.
114
+
115
+ rule abc
116
+ a b c {
117
+ def to_s
118
+ a.to_s + b.to_s + c.to_s
119
+ end
120
+ }
121
+ end
122
+
123
+ In the above code, the `to_s` method calls automatically-defined element accessors for the nodes returned by parsing nonterminals `a`, `b`, and `c`.
124
+
125
+ ###Labels
126
+ Subexpressions can be given an explicit label to have an element accessor method defined for them. This is useful in cases of ambiguity between two references to the same nonterminal or when you need to access an unnamed subexpression.
127
+
128
+ rule labels
129
+ first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
130
+ def letters
131
+ [first_letter] + rest_letters.map do |comma_and_letter|
132
+ comma_and_letter.letter
133
+ end
134
+ end
135
+ }
136
+ end
137
+
138
+ The above grammar uses label-derived accessors to determine the letters in a comma-delimited list of letters. The labeled expressions _could_ have been extracted to their own rules, but if they aren't used elsewhere, labels still enable them to be referenced by a name within the expression's methods.
139
+
140
+ ###Overriding Element Accessors
141
+ The module containing automatically defined element accessor methods is an ancestor of the module in which you define your own methods, meaning you can override them with access to the `super` keyword. Here's an example of how this fact can improve the readability of the example above.
142
+
143
+ rule labels
144
+ first_letter:[a-z] rest_letters:(', ' letter:[a-z])* {
145
+ def letters
146
+ [first_letter] + rest_letters
147
+ end
148
+
149
+ def rest_letters
150
+ super.map { |comma_and_letter| comma_and_letter.letter }
151
+ end
152
+ }
153
+ end
154
+
155
+
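The `super` call works because of ordinary Ruby module ancestry. The sketch below reproduces that mechanism in isolation; `GeneratedAccessors` and `CustomMethods` are hypothetical stand-ins, and the way Treetop actually names and wires its modules is not shown in this document.

```ruby
# Hypothetical stand-in for the module of generated element accessors.
module GeneratedAccessors
  def rest_letters
    @elements[1]
  end
end

# Hypothetical stand-in for the module built from an inline block. Because
# GeneratedAccessors is one of its ancestors, `super` reaches the accessor.
module CustomMethods
  include GeneratedAccessors

  def rest_letters
    super.map { |comma_and_letter| comma_and_letter.last }
  end
end

# A bare object plays the role of the syntax node; nested arrays play the
# role of its child nodes.
node = Object.new
node.instance_variable_set(:@elements, ['a', [[', ', 'b'], [', ', 'c']]])
node.extend(CustomMethods)

p node.rest_letters
```

The overriding method keeps the accessor's name while returning only the letters, which is what lets `letters` in the grammar above stay short.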
156
+ ##Methods Available on `Treetop::Runtime::SyntaxNode`
157
+
158
+ <table>
159
+ <tr>
160
+ <td>
161
+ <code>terminal?</code>
162
+ </td>
163
+ <td>
164
+ Was this node produced by the matching of a terminal symbol?
165
+ </td>
166
+ </tr>
167
+ <tr>
168
+ <td>
169
+ <code>nonterminal?</code>
170
+ </td>
171
+ <td>
172
+ Was this node produced by the matching of a nonterminal symbol?
173
+ </td>
174
+ </tr>
+ <tr>
175
+ <td>
176
+ <code>text_value</code>
177
+ </td>
178
+ <td>
179
+ The substring of the input represented by this node.
180
+ </td>
181
+ </tr>
+ <tr>
182
+ <td>
183
+ <code>elements</code>
184
+ </td>
185
+ <td>
186
+ Available only on nonterminal nodes, returns the nodes parsed by the elements of the matched sequence.
187
+ </td>
188
+ </tr>
189
+ </table>
data/doc/site.rb ADDED
@@ -0,0 +1,110 @@
1
+ require 'rubygems'
2
+ require 'erector'
3
+ require "#{File.dirname(__FILE__)}/sitegen"
4
+
5
+ class Layout < Erector::Widget
6
+ def render
7
+ html do
8
+ head do
9
+ link :rel => "stylesheet",
10
+ :type => "text/css",
11
+ :href => "./screen.css"
12
+
13
+ rawtext %(
14
+ <script src="http://www.google-analytics.com/urchin.js" type="text/javascript">
15
+ </script>
16
+ <script type="text/javascript">
17
+ _uacct = "UA-3418876-1";
18
+ urchinTracker();
19
+ </script>
20
+ )
21
+ end
22
+
23
+ body do
24
+ div :id => 'top' do
25
+ div :id => 'main_navigation' do
26
+ main_navigation
27
+ end
28
+ end
29
+ div :id => 'middle' do
30
+ div :id => 'content' do
31
+ content
32
+ end
33
+ end
34
+ div :id => 'bottom' do
35
+
36
+ end
37
+ end
38
+ end
39
+ end
40
+
41
+ def main_navigation
42
+ ul do
43
+ li { link_to "Documentation", SyntacticRecognition, Documentation }
44
+ li { link_to "Contribute", Contribute }
45
+ li { link_to "Home", Index }
46
+ end
47
+ end
48
+
49
+ def content
50
+ end
51
+ end
52
+
53
+ class Index < Layout
54
+ def content
55
+ bluecloth "index.markdown"
56
+ end
57
+ end
58
+
59
+ class Documentation < Layout
60
+ abstract
61
+
62
+ def content
63
+ div :id => 'secondary_navigation' do
64
+ ul do
65
+ li { link_to 'Syntax', SyntacticRecognition }
66
+ li { link_to 'Semantics', SemanticInterpretation }
67
+ li { link_to 'Using In Ruby', UsingInRuby }
68
+ li { link_to 'Advanced Techniques', PitfallsAndAdvancedTechniques }
69
+ end
70
+ end
71
+
72
+ div :id => 'documentation_content' do
73
+ documentation_content
74
+ end
75
+ end
76
+ end
77
+
78
+ class SyntacticRecognition < Documentation
79
+ def documentation_content
80
+ bluecloth "syntactic_recognition.markdown"
81
+ end
82
+ end
83
+
84
+ class SemanticInterpretation < Documentation
85
+ def documentation_content
86
+ bluecloth "semantic_interpretation.markdown"
87
+ end
88
+ end
89
+
90
+ class UsingInRuby < Documentation
91
+ def documentation_content
92
+ bluecloth "using_in_ruby.markdown"
93
+ end
94
+ end
95
+
96
+ class PitfallsAndAdvancedTechniques < Documentation
97
+ def documentation_content
98
+ bluecloth "pitfalls_and_advanced_techniques.markdown"
99
+ end
100
+ end
101
+
102
+
103
+ class Contribute < Layout
104
+ def content
105
+ bluecloth "contributing_and_planned_features.markdown"
106
+ end
107
+ end
108
+
109
+
110
+ Layout.generate_site
data/doc/sitegen.rb ADDED
@@ -0,0 +1,60 @@
1
+ class Layout < Erector::Widget
2
+
3
+ class << self
4
+ def inherited(page_class)
5
+ puts page_class
6
+ (@@page_classes ||= []) << page_class
7
+ end
8
+
9
+ def generate_site
10
+ @@page_classes.each do |page_class|
11
+ page_class.generate_html unless page_class.abstract?
12
+ puts page_class
13
+ end
14
+ end
15
+
16
+ def generate_html
17
+ File.open(absolute_path, 'w') do |file|
18
+ file.write(new.render)
19
+ end
20
+ end
21
+
22
+ def absolute_path
23
+ absolutize(relative_path)
24
+ end
25
+
26
+ def relative_path
27
+ "#{name.gsub('::', '_').underscore}.html"
28
+ end
29
+
30
+ def absolutize(relative_path)
31
+ File.join(File.dirname(__FILE__), "site", relative_path)
32
+ end
33
+
34
+ def abstract
35
+ @abstract = true
36
+ end
37
+
38
+ def abstract?
39
+ @abstract
40
+ end
41
+ end
42
+
43
+ def bluecloth(relative_path)
44
+ File.open(File.join(File.dirname(__FILE__), relative_path)) do |file|
45
+ rawtext BlueCloth.new(file.read).to_html
46
+ end
47
+ end
48
+
49
+ def absolutize(relative_path)
50
+ self.class.absolutize(relative_path)
51
+ end
52
+
53
+ def link_to(link_text, page_class, section_class=nil)
54
+ if instance_of?(page_class) || section_class && is_a?(section_class)
55
+ text link_text
56
+ else
57
+ a link_text, :href => page_class.relative_path
58
+ end
59
+ end
60
+ end
@@ -0,0 +1,100 @@
1
+ #Syntactic Recognition
2
+ Treetop grammars are written in a custom language based on parsing expression grammars. Literature on the subject of <a href="http://en.wikipedia.org/wiki/Parsing_expression_grammar">parsing expression grammars</a> is useful in writing Treetop grammars.
3
+
4
+ #Grammar Structure
5
+ Treetop grammars look like this:
6
+
7
+ grammar GrammarName
8
+ rule rule_name
9
+ ...
10
+ end
11
+
12
+ rule rule_name
13
+ ...
14
+ end
15
+
16
+ ...
17
+ end
18
+
19
+ The main keywords are:
20
+
21
+ * `grammar` : This introduces a new grammar. It is followed by a constant name to which the grammar will be bound when it is loaded.
22
+
23
+ * `rule` : This defines a parsing rule within the grammar. It is followed by a name by which this rule can be referenced within other rules. It is then followed by a parsing expression defining the rule.
24
+
25
+ #Parsing Expressions
26
+ Each rule associates a name with a _parsing expression_. Parsing expressions are a generalization of vanilla regular expressions. Their key feature is the ability to reference other expressions in the grammar by name.
27
+
28
+ ##Terminal Symbols
29
+ ###Strings
30
+ Strings are enclosed in double or single quotes and must be matched exactly.
31
+
32
+ * `"foo"`
33
+ * `'foo'`
34
+
35
+ ###Character Classes
36
+ Character classes are surrounded by brackets. Their semantics are identical to those used in Ruby's regular expressions.
37
+
38
+ * `[a-zA-Z]`
39
+ * `[0-9]`
40
+
41
+ ###The Anything Symbol
42
+ The anything symbol is represented by a dot (`.`) and matches any single character.
43
+
44
+ ##Nonterminal Symbols
45
+ Nonterminal symbols are unquoted references to other named rules. They are equivalent to an inline substitution of the named expression.
46
+
47
+ rule foo
48
+ "the dog " bar
49
+ end
50
+
51
+ rule bar
52
+ "jumped"
53
+ end
54
+
55
+ The above grammar is equivalent to:
56
+
57
+ rule foo
58
+ "the dog jumped"
59
+ end
60
+
61
+ ##Ordered Choice
62
+ Parsers attempt to match ordered choices in left-to-right order, and stop after the first successful match.
63
+
64
+ "foobar" / "foo" / "bar"
65
+
66
+ Note that if `"foo"` in the above expression came first, `"foobar"` would never be matched.
67
+
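The effect of ordering can be reproduced in a few lines of plain Ruby. `ordered_choice` is a hypothetical helper written for this illustration; it only handles literal strings and is not how Treetop implements choices.

```ruby
# Hypothetical helper: returns the first alternative that matches at the
# start of +input+, mimicking how an ordered choice commits to its first
# successful alternative.
def ordered_choice(input, alternatives)
  alternatives.find { |alternative| input.start_with?(alternative) }
end

p ordered_choice('foobar', %w[foobar foo bar])
p ordered_choice('foobar', %w[foo foobar bar])
```

With `"foo"` listed first, the second call returns `"foo"` and the longer alternative is never tried.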
68
+ ##Sequences
69
+
70
+ Sequences are a space-separated list of parsing expressions. They have higher precedence than choices, so choices must be parenthesized to be used as the elements of a sequence.
71
+
72
+ "foo" "bar" ("baz" / "bop")
73
+
74
+ ##Zero or More
75
+ Parsers will greedily match an expression zero or more times if it is followed by the star (`*`) symbol.
76
+
77
+ * `'foo'*` matches the empty string, `"foo"`, `"foofoo"`, etc.
78
+
79
+ ##One or More
80
+ Parsers will greedily match an expression one or more times if it is followed by the plus (`+`) symbol.
81
+
82
+ * `'foo'+` does not match the empty string, but matches `"foo"`, `"foofoo"`, etc.
83
+
84
+ ##Optional Expressions
85
+ An expression can be declared optional by following it with a question mark (`?`).
86
+
87
+ * `'foo'?` matches `"foo"` or the empty string.
88
+
89
+ ##Lookahead Assertions
90
+ Lookahead assertions can be used to give parsing expressions a limited degree of context-sensitivity. The parser will look ahead into the buffer and attempt to match an expression without consuming input.
91
+
92
+ ###Positive Lookahead Assertion
93
+ Preceding an expression with an ampersand (`&`) indicates that it must match, but no input will be consumed in the process of determining whether this is true.
94
+
95
+ * `"foo" &"bar"` matches `"foobar"` but only consumes up to the end of `"foo"`. It will not match `"foobaz"`.
96
+
97
+ ###Negative Lookahead Assertion
98
+ Preceding an expression with a bang (`!`) indicates that the expression must not match, but no input will be consumed in the process of determining whether this is true.
99
+
100
+ * `"foo" !"bar"` matches `"foobaz"` but only consumes up to the end of `"foo"`. It will not match `"foobar"`.
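Ruby's regular expressions offer the same two assertions, which makes it easy to try the examples above. This is an analogy rather than Treetop code, and the constant names are invented for the sketch.

```ruby
# Assumption: Ruby regexps stand in for the Treetop expressions.
POSITIVE = /\Afoo(?=bar)/  # like: "foo" &"bar"
NEGATIVE = /\Afoo(?!bar)/  # like: "foo" !"bar"

# String#[] with a regexp returns the consumed text, or nil on failure.
p 'foobar'[POSITIVE]
p 'foobaz'[POSITIVE]
p 'foobaz'[NEGATIVE]
p 'foobar'[NEGATIVE]
```

In both successful cases the returned text is just `"foo"`, since the lookahead inspects the following characters without consuming them.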