ruby-ll 1.0.0
- checksums.yaml +7 -0
- data/.yardopts +13 -0
- data/LICENSE +19 -0
- data/README.md +380 -0
- data/bin/ruby-ll +5 -0
- data/doc/DCO.md +25 -0
- data/doc/changelog.md +8 -0
- data/doc/css/common.css +77 -0
- data/ext/c/driver.c +258 -0
- data/ext/c/driver.h +28 -0
- data/ext/c/driver_config.c +209 -0
- data/ext/c/driver_config.h +53 -0
- data/ext/c/extconf.rb +13 -0
- data/ext/c/khash.h +619 -0
- data/ext/c/kvec.h +90 -0
- data/ext/c/libll.c +7 -0
- data/ext/c/libll.h +9 -0
- data/ext/c/macros.h +6 -0
- data/ext/java/Libll.java +12 -0
- data/ext/java/org/libll/Driver.java +247 -0
- data/ext/java/org/libll/DriverConfig.java +193 -0
- data/lib/ll.rb +26 -0
- data/lib/ll/ast/node.rb +13 -0
- data/lib/ll/branch.rb +57 -0
- data/lib/ll/cli.rb +118 -0
- data/lib/ll/code_generator.rb +32 -0
- data/lib/ll/compiled_configuration.rb +35 -0
- data/lib/ll/compiled_grammar.rb +167 -0
- data/lib/ll/configuration_compiler.rb +204 -0
- data/lib/ll/driver.rb +46 -0
- data/lib/ll/driver_config.rb +36 -0
- data/lib/ll/driver_template.erb +51 -0
- data/lib/ll/epsilon.rb +23 -0
- data/lib/ll/erb_context.rb +23 -0
- data/lib/ll/grammar_compiler.rb +359 -0
- data/lib/ll/lexer.rb +582 -0
- data/lib/ll/message.rb +102 -0
- data/lib/ll/parser.rb +280 -0
- data/lib/ll/parser_error.rb +8 -0
- data/lib/ll/rule.rb +53 -0
- data/lib/ll/setup.rb +11 -0
- data/lib/ll/source_line.rb +46 -0
- data/lib/ll/terminal.rb +29 -0
- data/lib/ll/token.rb +30 -0
- data/lib/ll/version.rb +3 -0
- data/ruby-ll.gemspec +47 -0
- metadata +217 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: ef87cedfa33d3340b77133abff1bdb11f5ad767e
|
4
|
+
data.tar.gz: 6e25ecbd4a78bc7f3469bba05d85a1cd9634b13a
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 07561fc9d28c285ec101c5f9d2c67f16f15b700efd814af5a49174e9e47de90f44f2ffb0859d039f7022bd494f74b8867677c5fd93b6189dd44c40367e5dc62d
|
7
|
+
data.tar.gz: 35a3352d8d2207937d5e1d8657f2c366f8417e6604fc2aef88effe9ed6d94a710e01b8ce4830881ca439fb7fa25a4b326e49a5157a925277d6cc260e4ae5c3f1
|
data/.yardopts
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
Copyright (c) 2015, Yorick Peterse
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
4
|
+
of this software and associated documentation files (the "Software"), to deal
|
5
|
+
in the Software without restriction, including without limitation the rights
|
6
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
7
|
+
copies of the Software, and to permit persons to whom the Software is
|
8
|
+
furnished to do so, subject to the following conditions:
|
9
|
+
|
10
|
+
The above copyright notice and this permission notice shall be included in
|
11
|
+
all copies or substantial portions of the Software.
|
12
|
+
|
13
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
14
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
15
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
16
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
17
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
18
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
19
|
+
THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,380 @@
|
|
1
|
+
# ruby-ll
|
2
|
+
|
3
|
+
ruby-ll is a high performance LL(1) table based parser generator for Ruby. The
|
4
|
+
parser driver is written in C/Java to ensure good runtime performance, while the
|
5
|
+
compiler is written entirely in Ruby.
|
6
|
+
|
7
|
+
ruby-ll was written to serve as a fast and easy to use alternative to
|
8
|
+
[Racc][racc] for the various parsers used in [Oga][oga]. However, ruby-ll isn't
|
9
|
+
limited to just Oga, you can use it to write a parser for any language that can
|
10
|
+
be represented using an LL(1) grammar.
|
11
|
+
|
12
|
+
ruby-ll is self-hosting, which allows one to use ruby-ll to modify its own
|
13
|
+
parser. Self-hosting was achieved by bootstrapping the parser using a Racc
|
14
|
+
parser that outputs the same AST as the ruby-ll parser. The Racc parser remains
|
15
|
+
in the repository for historical purposes and in case it's ever needed again, it
|
16
|
+
can be found in [bootstrap/parser.y](lib/ll/bootstrap/parser.y).
|
17
|
+
|
18
|
+
For more information on LL parsing, see
|
19
|
+
<https://en.wikipedia.org/wiki/LL_parser>.
|
20
|
+
|
21
|
+
## Features
|
22
|
+
|
23
|
+
* Support for detecting first/first and first/follow conflicts
|
24
|
+
* clang-like error/warning messages to ease debugging parsers
|
25
|
+
* High performance and a low memory footprint
|
26
|
+
|
27
|
+
## Requirements
|
28
|
+
|
29
|
+
| Ruby | Required | Recommended |
|
30
|
+
|:---------|:--------------|:------------|
|
31
|
+
| MRI | >= 1.9.3 | >= 2.1.0 |
|
32
|
+
| Rubinius | >= 2.2 | >= 2.5.0 |
|
33
|
+
| JRuby | >= 1.7 | >= 1.7.0 |
|
34
|
+
| Maglev | Not supported | |
|
35
|
+
| Topaz | Not supported | |
|
36
|
+
| mruby | Not supported | |
|
37
|
+
|
38
|
+
For MRI/Rubinius you'll need a C90 compatible compiler such as clang or gcc. For
|
39
|
+
JRuby you don't need any compilers to be installed as the .jar is packaged with
|
40
|
+
the Gem itself.
|
41
|
+
|
42
|
+
When hacking on ruby-ll you'll also need to have the following installed:
|
43
|
+
|
44
|
+
* Ragel 6 for building the grammar lexer
|
45
|
+
* javac for building the JRuby extension
|
46
|
+
|
47
|
+
## Usage
|
48
|
+
|
49
|
+
The CLI takes a grammar input file (see below for the exact syntax) with the
|
50
|
+
extension `.rll` and turns it into a corresponding Ruby file. For example:
|
51
|
+
|
52
|
+
ruby-ll lib/my-gem/parser.rll
|
53
|
+
|
54
|
+
This would result in the parser being written to `lib/my-gem/parser.rb`. If you
|
55
|
+
want to customize the output path you can do so using the `-o` / `--output`
|
56
|
+
options:
|
57
|
+
|
58
|
+
ruby-ll lib/my-gem/parser.rll -o lib/my-gem/my-parser.rb
|
59
|
+
|
60
|
+
By default ruby-ll adds various `require` calls to ensure you can load the
|
61
|
+
parser _without_ having to load all of ruby-ll (e.g. the compiler code). If you
|
62
|
+
want to disable this behaviour you can use the `--no-requires` option when
|
63
|
+
processing a grammar:
|
64
|
+
|
65
|
+
ruby-ll lib/my-gem/parser.rll --no-requires
|
66
|
+
|
67
|
+
Once generated you can use the parser class like any other parser. To start
|
68
|
+
parsing simply call the `parse` method:
|
69
|
+
|
70
|
+
parser = MyGem::Parser.new
|
71
|
+
|
72
|
+
parser.parse
|
73
|
+
|
74
|
+
The return value of this method is whatever the root rule (= the first rule
|
75
|
+
defined) returned.
|
76
|
+
|
77
|
+
## Grammar Syntax
|
78
|
+
|
79
|
+
The syntax of a ruby-ll grammar file is fairly simple and consists of
|
80
|
+
directives, rules, comments and code blocks.
|
81
|
+
|
82
|
+
Directives can be seen as configuration options, for example to set the name of
|
83
|
+
the parser class. Rules are, well, the parsing rules. Code blocks can be used to
|
84
|
+
associate Ruby code with either a branch of a rule or a certain section of the
|
85
|
+
parser (the header or its inner body).
|
86
|
+
|
87
|
+
Directives and rules must be terminated using a semicolon; this is not needed
|
88
|
+
for `%inner` / `%header` blocks.
|
89
|
+
|
90
|
+
For a full example, see ruby-ll's own parser located at
|
91
|
+
[lib/ll/parser.rll](lib/ll/parser.rll).
|
92
|
+
|
93
|
+
### Comments
|
94
|
+
|
95
|
+
Comments start with a hash (`#`) sign and continue until the end of the line,
|
96
|
+
just like Ruby. Example:
|
97
|
+
|
98
|
+
# Some say comments are a code smell.
|
99
|
+
|
100
|
+
### %name
|
101
|
+
|
102
|
+
The `%name` directive is used to set the full name/namespace of the parser
|
103
|
+
class. The name consists of a single identifier or multiple identifiers
|
104
|
+
separated by `::` (just like Ruby). Some examples:
|
105
|
+
|
106
|
+
%name A;
|
107
|
+
%name A::B;
|
108
|
+
%name A::B::C;
|
109
|
+
|
110
|
+
The last identifier is used as the actual class name. This class will be nested
|
111
|
+
inside a module for every other segment leading up to the last one. For example,
|
112
|
+
this:
|
113
|
+
|
114
|
+
%name A;
|
115
|
+
|
116
|
+
Gets turned into this:
|
117
|
+
|
118
|
+
class A < LL::Driver
|
119
|
+
|
120
|
+
end
|
121
|
+
|
122
|
+
While this:
|
123
|
+
|
124
|
+
%name A::B::C;
|
125
|
+
|
126
|
+
Gets turned into this:
|
127
|
+
|
128
|
+
module A
|
129
|
+
module B
|
130
|
+
class C < LL::Driver
|
131
|
+
|
132
|
+
end
|
133
|
+
end
|
134
|
+
end
|
135
|
+
|
136
|
+
By nesting the parser class in modules any constants in the scope can be
|
137
|
+
referred to without requiring the use of a full namespace. For example, the
|
138
|
+
constant `A::B::X` can just be referred to as `X` in the above example.
|
139
|
+
|
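The constant lookup behaviour described above can be checked with plain Ruby, without ruby-ll installed. The following is a minimal sketch: the names `Nested`, `Compact` and the value of `X` are made up for illustration, and the classes do not inherit from `LL::Driver` so that the snippet stays self-contained.

```ruby
module A
  module B
    X = 10

    # Nested definition, mirroring the generated layout: the lexical scope
    # includes A::B, so a bare `X` resolves to A::B::X.
    class Nested
      def x
        X
      end
    end
  end
end

# Compact definition for comparison: the lexical scope only contains the
# class itself, so a bare `X` is not found.
class A::B::Compact
  def x
    X
  end
end

nested_value = A::B::Nested.new.x

compact_error = begin
  A::B::Compact.new.x
rescue NameError => error
  error.class
end
```

With the nested form `nested_value` is `10`, while the compact form raises a `NameError`, which is the difference the nested module layout avoids.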
140
|
+
Multiple calls to this directive will result in previous values being
|
141
|
+
overwritten.
|
142
|
+
|
143
|
+
### %terminals
|
144
|
+
|
145
|
+
The `%terminals` directive is used to list one or more terminals of the grammar.
|
146
|
+
Each terminal is an identifier separated by a space. For example:
|
147
|
+
|
148
|
+
%terminals A B C;
|
149
|
+
|
150
|
+
This would define 3 terminals: `A`, `B` and `C`. While there's no specific
|
151
|
+
requirement as to how you name your terminals it's common practise to capitalize
|
152
|
+
them and prefix them with `T_`, like so:
|
153
|
+
|
154
|
+
%terminals T_A T_B T_C;
|
155
|
+
|
156
|
+
Multiple calls to this directive will result in the terminals being appended to
|
157
|
+
the existing list.
|
158
|
+
|
159
|
+
### %inner
|
160
|
+
|
161
|
+
The `%inner` directive can be used to specify a code block that should be placed
|
162
|
+
inside the parser's body, just after the section containing all parsing tables.
|
163
|
+
This directive should be used for adding custom methods and such to the parser.
|
164
|
+
For example:
|
165
|
+
|
166
|
+
%inner
|
167
|
+
{
|
168
|
+
def initialize(input)
|
169
|
+
@input = input
|
170
|
+
end
|
171
|
+
}
|
172
|
+
|
173
|
+
This would result in the following:
|
174
|
+
|
175
|
+
class A < LL::Driver
|
176
|
+
def initialize(input)
|
177
|
+
@input = input
|
178
|
+
end
|
179
|
+
end
|
180
|
+
|
181
|
+
Curly braces can either be placed on the same line as the `%inner` directive or
|
182
|
+
on a new line, it's up to you.
|
183
|
+
|
184
|
+
Unlike regular directives this directive should not be terminated using a
|
185
|
+
semicolon.
|
186
|
+
|
187
|
+
### %header
|
188
|
+
|
189
|
+
The `%header` directive is similar to the `%inner` directive in that it can be
|
190
|
+
used to add a code block to the parser. The code of this directive is placed
|
191
|
+
just before the `class` definition of the parser. This directive can be used to
|
192
|
+
add documentation to the parser class. For example:
|
193
|
+
|
194
|
+
%header
|
195
|
+
{
|
196
|
+
# Hello world
|
197
|
+
}
|
198
|
+
|
199
|
+
This would result in the following:
|
200
|
+
|
201
|
+
# Hello world
|
202
|
+
class A < LL::Driver
|
203
|
+
end
|
204
|
+
|
205
|
+
### Rules
|
206
|
+
|
207
|
+
Rules consist of a name followed by an equals sign (`=`) followed by 1 or
|
208
|
+
more branches. Each branch is separated using a pipe (`|`). A branch can consist
|
209
|
+
of 1 or more steps, or an epsilon. Branches can be followed by a code block
|
210
|
+
starting with `{` and ending with `}`. A rule must be terminated using a
|
211
|
+
semicolon.
|
212
|
+
|
213
|
+
An epsilon is represented as a single underscore (`_`) and is used to denote a
|
214
|
+
wildcard/nothingness.
|
215
|
+
|
216
|
+
A simple example:
|
217
|
+
|
218
|
+
%terminals A B;
|
219
|
+
|
220
|
+
numbers = A | B;
|
221
|
+
|
222
|
+
Here the rule `numbers` is defined and has two branches. If we wanted a rule
|
223
|
+
that would match terminal `A` or nothing we'd use the following:
|
224
|
+
|
225
|
+
%terminals A;
|
226
|
+
|
227
|
+
numbers = A | _;
|
228
|
+
|
229
|
+
Code blocks can also be added:
|
230
|
+
|
231
|
+
numbers
|
232
|
+
= A { 'A' }
|
233
|
+
| B { 'B' }
|
234
|
+
;
|
235
|
+
|
236
|
+
When the terminal `A` would be processed the returned value would be "A", for
|
237
|
+
terminal `B` the returned value would be "B".
|
238
|
+
|
239
|
+
Code blocks have access to an array called `val` which contains the values of
|
240
|
+
every step of a branch. For example:
|
241
|
+
|
242
|
+
numbers = A B { val };
|
243
|
+
|
244
|
+
Here `val` would return `[A, B]`. Since `val` is just an Array you can also
|
245
|
+
return specific elements from it:
|
246
|
+
|
247
|
+
numbers = A B { val[0] };
|
248
|
+
|
249
|
+
Values returned by code blocks are passed to whatever other rule called it. This
|
250
|
+
allows code blocks to be used for building ASTs and the likes. If no explicit
|
251
|
+
code block is defined `val` is returned as is.
|
252
|
+
|
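The way `val` behaves can be mimicked in plain Ruby. This is only a sketch of the semantics described above, not how ruby-ll invokes code blocks internally: `run_branch` is a hypothetical helper, and the strings stand in for the values of the steps `A` and `B`.

```ruby
# Hypothetical stand-in for a branch: `step_values` plays the role of `val`
# and the optional block plays the role of the branch's code block.
def run_branch(step_values, &action)
  val = step_values

  # Without a code block `val` is returned as is.
  action ? action.call(val) : val
end

all_values    = run_branch(%w[a b]) { |val| val }    # like `{ val }`
first_value   = run_branch(%w[a b]) { |val| val[0] } # like `{ val[0] }`
default_value = run_branch(%w[a b])                  # no code block
```

Here `all_values` and `default_value` both hold the full array, while `first_value` holds only the value of the first step.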
253
|
+
ruby-ll parsers recurse into rules before unwinding, which means that the
|
254
|
+
inner-most rule is processed first.
|
255
|
+
|
256
|
+
Branches of a rule can also refer to other rules:
|
257
|
+
|
258
|
+
numbers = A other_rule;
|
259
|
+
other_rule = B;
|
260
|
+
|
261
|
+
The value for `other_rule` in the `numbers` rule would be whatever the
|
262
|
+
`other_rule` below it returns.
|
263
|
+
|
264
|
+
The grammar compiler adds errors whenever it encounters a rule with the same
|
265
|
+
name as a terminal, as such the following is invalid:
|
266
|
+
|
267
|
+
%terminals A B;
|
268
|
+
|
269
|
+
A = B;
|
270
|
+
|
271
|
+
It's also an error to re-define an existing rule.
|
272
|
+
|
273
|
+
## Conflicts
|
274
|
+
|
275
|
+
LL(1) grammars can have two kinds of conflicts in a rule:
|
276
|
+
|
277
|
+
* first/first
|
278
|
+
* first/follow
|
279
|
+
|
280
|
+
### first/first
|
281
|
+
|
282
|
+
A first/first conflict means that multiple branches of a rule start with the
|
283
|
+
same terminal, resulting in the parser being unable to choose what branch to
|
284
|
+
use. For example:
|
285
|
+
|
286
|
+
%terminals A B;
|
287
|
+
|
288
|
+
rule = A | A B;
|
289
|
+
|
290
|
+
This would result in the following output:
|
291
|
+
|
292
|
+
example.rll:5:1:error: first/first conflict, multiple branches start with the same terminals
|
293
|
+
rule = A | A B;
|
294
|
+
^
|
295
|
+
example.rll:5:8:error: branch starts with: A
|
296
|
+
rule = A | A B;
|
297
|
+
^
|
298
|
+
example.rll:5:12:error: branch starts with: A
|
299
|
+
rule = A | A B;
|
300
|
+
^
|
301
|
+
|
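The check behind this error can be sketched in a few lines of plain Ruby. This is an illustration of the idea only and not ruby-ll's actual implementation: the Hash-based grammar representation and the helpers `first_of` and `first_first_conflicts` are assumptions made for this example, and epsilon handling is reduced to the bare minimum.

```ruby
# Returns the terminals a branch can start with. A branch is an Array of
# step names; an empty Array stands for an epsilon branch.
def first_of(branch, grammar, terminals, seen = [])
  step = branch.first

  return [] if step.nil?                   # epsilon: starts with nothing
  return [step] if terminals.include?(step)
  return [] if seen.include?(step)         # guard against left recursion

  grammar.fetch(step, []).flat_map do |sub_branch|
    first_of(sub_branch, grammar, terminals, seen + [step])
  end.uniq
end

# Maps every conflicting rule name to the terminals shared by its branches.
def first_first_conflicts(grammar, terminals)
  grammar.each_with_object({}) do |(name, branches), conflicts|
    counts = Hash.new(0)

    branches.each do |branch|
      first_of(branch, grammar, terminals).each { |term| counts[term] += 1 }
    end

    shared = counts.select { |_, count| count > 1 }.keys

    conflicts[name] = shared unless shared.empty?
  end
end

terminals = %w[A B]

# rule = A | A B;
conflicts = first_first_conflicts({ 'rule' => [%w[A], %w[A B]] }, terminals)

# rule = A rule_follow; rule_follow = B | _;
factored = first_first_conflicts(
  { 'rule' => [%w[A rule_follow]], 'rule_follow' => [%w[B], []] },
  terminals
)
```

For the conflicting grammar `conflicts` reports terminal `A` for `rule`, while the left-factored variant yields an empty Hash.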
302
|
+
To solve a first/first conflict you'll have to factor out the common left
|
303
|
+
factor. For example:
|
304
|
+
|
305
|
+
%name Example;
|
306
|
+
|
307
|
+
%terminals A B;
|
308
|
+
|
309
|
+
rule = A rule_follow;
|
310
|
+
rule_follow = B | _;
|
311
|
+
|
312
|
+
Here the `rule` rule starts with terminal `A` and can optionally be followed by
|
313
|
+
`B`, without introducing any first/first conflicts.
|
314
|
+
|
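To see why the factored grammar can be parsed with a single token of lookahead, here is a hand-written predictive parser for it in plain Ruby. It is a sketch for illustration, not the table-driven code ruby-ll generates: tokens are assumed to be plain symbols and the method names `parse_rule` and `parse_rule_follow` are made up.

```ruby
# Hand-written predictive parser for:
#
#     rule        = A rule_follow;
#     rule_follow = B | _;
def parse_rule(tokens)
  unless tokens.first == :A
    raise ArgumentError, "expected A, got #{tokens.first.inspect}"
  end

  tokens.shift

  [:A, *parse_rule_follow(tokens)]
end

def parse_rule_follow(tokens)
  # One token of lookahead is enough to pick a branch.
  if tokens.first == :B
    [tokens.shift]
  else
    [] # epsilon branch: consume nothing
  end
end

with_b    = parse_rule([:A, :B])
without_b = parse_rule([:A])
```

Both inputs are accepted: `with_b` contains `:A` and `:B`, `without_b` contains only `:A`. With the original `rule = A | A B;` the parser would have to choose a branch before knowing whether a `B` follows.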
315
|
+
### first/follow
|
316
|
+
|
317
|
+
A first/follow conflict occurs when a branch in a rule starts with an epsilon
|
318
|
+
and is followed by one or more terminals and/or rules. An example of a
|
319
|
+
first/follow conflict:
|
320
|
+
|
321
|
+
%name Example;
|
322
|
+
|
323
|
+
%terminals A B;
|
324
|
+
|
325
|
+
rule = other_rule B;
|
326
|
+
other_rule = A | _;
|
327
|
+
|
328
|
+
This produces the following errors:
|
329
|
+
|
330
|
+
example.rll:5:14:error: first/follow conflict, branch can start with epsilon and is followed by (non) terminals
|
331
|
+
rule = other_rule B;
|
332
|
+
^
|
333
|
+
example.rll:6:18:error: epsilon originates from here
|
334
|
+
other_rule = A | _;
|
335
|
+
^
|
336
|
+
|
337
|
+
There's no specific procedure for solving such a conflict other than simply
|
338
|
+
removing the starting epsilon.
|
339
|
+
|
340
|
+
## Performance
|
341
|
+
|
342
|
+
One of the goals of ruby-ll is to be faster than existing parser generators,
|
343
|
+
Racc in particular. How much faster ruby-ll will be depends on the use case. For
|
344
|
+
example, for the benchmark
|
345
|
+
[benchmark/ll/simple\_json\_bench.rb](benchmark/ll/simple_json_bench.rb) the
|
346
|
+
performance gains of ruby-ll over Racc are as follows:
|
347
|
+
|
348
|
+
| Ruby | Speed |
|
349
|
+
|:----------------|:------|
|
350
|
+
| MRI 2.2 | 1.75x |
|
351
|
+
| Rubinius 2.5.2 | 3.85x |
|
352
|
+
| JRuby 1.7.18 | 6.44x |
|
353
|
+
| JRuby 9000 pre1 | 7.50x |
|
354
|
+
|
355
|
+
This benchmark was run on a Thinkpad T520 laptop so it's probably best to run
|
356
|
+
the benchmark yourself to see how it behaves on your platform.
|
357
|
+
|
358
|
+
Depending on the complexity of your parser you might end up with different
|
359
|
+
numbers. The above metrics are simply an indication of the maximum
|
360
|
+
performance gain of ruby-ll compared to Racc.
|
361
|
+
|
362
|
+
## Thread Safety
|
363
|
+
|
364
|
+
Parsers generated by ruby-ll share an internal, mutable state on a per instance
|
365
|
+
basis. As a result of this a single instance of your parser _can not_ be used by
|
366
|
+
multiple threads in parallel. If it wasn't for MRI's C API (specifically due to
|
367
|
+
how `rb_block_call` works) this wouldn't have been an issue.
|
368
|
+
|
369
|
+
To mitigate the above simply create a new instance of your parser every time you
|
370
|
+
need it and have the GC clean it up once you're done. This _will_ introduce a
|
371
|
+
slight allocation overhead but it beats having to deal with race conditions.
|
372
|
+
|
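The one-instance-per-use pattern can be sketched as follows. The class below is a stand-in written for this example, not a parser generated by ruby-ll: it merely keeps mutable per-instance state (`@stack`) the way a real parser would, and its `initialize(input)` signature follows the `%inner` example shown earlier.

```ruby
# Stand-in for a generated parser with mutable per-instance state.
parser_class = Class.new do
  def initialize(input)
    @input = input
    @stack = []
  end

  def parse
    @input.each_char { |char| @stack << char }

    @stack.join
  end
end

inputs = %w[abc def ghi]

# Every thread builds its own parser instead of sharing a single instance.
threads = inputs.map do |input|
  Thread.new { parser_class.new(input).parse }
end

results = threads.map(&:value)
```

Because no instance is visible to more than one thread, each entry in `results` corresponds to exactly one input and no locking is required.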
373
|
+
## License
|
374
|
+
|
375
|
+
All source code in this repository is licensed under the MIT license unless
|
376
|
+
specified otherwise. A copy of this license can be found in the file "LICENSE"
|
377
|
+
in the root directory of this repository.
|
378
|
+
|
379
|
+
[racc]: https://github.com/tenderlove/racc
|
380
|
+
[oga]: https://github.com/yorickpeterse/oga
|
data/bin/ruby-ll
ADDED
data/doc/DCO.md
ADDED
@@ -0,0 +1,25 @@
|
|
1
|
+
# Developer's Certificate of Origin 1.0
|
2
|
+
|
3
|
+
By making a contribution to this project, I certify that:
|
4
|
+
|
5
|
+
1. The contribution was created in whole or in part by me and I
|
6
|
+
have the right to submit it under the open source license
|
7
|
+
indicated in the file LICENSE; or
|
8
|
+
|
9
|
+
2. The contribution is based upon previous work that, to the best
|
10
|
+
of my knowledge, is covered under an appropriate open source
|
11
|
+
license and I have the right under that license to submit that
|
12
|
+
work with modifications, whether created in whole or in part
|
13
|
+
by me, under the same open source license (unless I am
|
14
|
+
permitted to submit under a different license), as indicated
|
15
|
+
in the file LICENSE; or
|
16
|
+
|
17
|
+
3. The contribution was provided directly to me by some other
|
18
|
+
person who certified (1), (2) or (3) and I have not modified
|
19
|
+
it.
|
20
|
+
|
21
|
+
4. I understand and agree that this project and the contribution
|
22
|
+
are public and that a record of the contribution (including all
|
23
|
+
personal information I submit with it, including my sign-off) is
|
24
|
+
maintained indefinitely and may be redistributed consistent with
|
25
|
+
this project or the open source license(s) involved.
|