ruby-ll 1.0.0
- checksums.yaml +7 -0
- data/.yardopts +13 -0
- data/LICENSE +19 -0
- data/README.md +380 -0
- data/bin/ruby-ll +5 -0
- data/doc/DCO.md +25 -0
- data/doc/changelog.md +8 -0
- data/doc/css/common.css +77 -0
- data/ext/c/driver.c +258 -0
- data/ext/c/driver.h +28 -0
- data/ext/c/driver_config.c +209 -0
- data/ext/c/driver_config.h +53 -0
- data/ext/c/extconf.rb +13 -0
- data/ext/c/khash.h +619 -0
- data/ext/c/kvec.h +90 -0
- data/ext/c/libll.c +7 -0
- data/ext/c/libll.h +9 -0
- data/ext/c/macros.h +6 -0
- data/ext/java/Libll.java +12 -0
- data/ext/java/org/libll/Driver.java +247 -0
- data/ext/java/org/libll/DriverConfig.java +193 -0
- data/lib/ll.rb +26 -0
- data/lib/ll/ast/node.rb +13 -0
- data/lib/ll/branch.rb +57 -0
- data/lib/ll/cli.rb +118 -0
- data/lib/ll/code_generator.rb +32 -0
- data/lib/ll/compiled_configuration.rb +35 -0
- data/lib/ll/compiled_grammar.rb +167 -0
- data/lib/ll/configuration_compiler.rb +204 -0
- data/lib/ll/driver.rb +46 -0
- data/lib/ll/driver_config.rb +36 -0
- data/lib/ll/driver_template.erb +51 -0
- data/lib/ll/epsilon.rb +23 -0
- data/lib/ll/erb_context.rb +23 -0
- data/lib/ll/grammar_compiler.rb +359 -0
- data/lib/ll/lexer.rb +582 -0
- data/lib/ll/message.rb +102 -0
- data/lib/ll/parser.rb +280 -0
- data/lib/ll/parser_error.rb +8 -0
- data/lib/ll/rule.rb +53 -0
- data/lib/ll/setup.rb +11 -0
- data/lib/ll/source_line.rb +46 -0
- data/lib/ll/terminal.rb +29 -0
- data/lib/ll/token.rb +30 -0
- data/lib/ll/version.rb +3 -0
- data/ruby-ll.gemspec +47 -0
- metadata +217 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
|
|
1
|
+
---
|
2
|
+
SHA1:
|
3
|
+
metadata.gz: ef87cedfa33d3340b77133abff1bdb11f5ad767e
|
4
|
+
data.tar.gz: 6e25ecbd4a78bc7f3469bba05d85a1cd9634b13a
|
5
|
+
SHA512:
|
6
|
+
metadata.gz: 07561fc9d28c285ec101c5f9d2c67f16f15b700efd814af5a49174e9e47de90f44f2ffb0859d039f7022bd494f74b8867677c5fd93b6189dd44c40367e5dc62d
|
7
|
+
data.tar.gz: 35a3352d8d2207937d5e1d8657f2c366f8417e6604fc2aef88effe9ed6d94a710e01b8ce4830881ca439fb7fa25a4b326e49a5157a925277d6cc260e4ae5c3f1
|
data/.yardopts
ADDED
data/LICENSE
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
Copyright (c) 2015, Yorick Peterse
|
2
|
+
|
3
|
+
Permission is hereby granted, free of charge, to any person obtaining a copy
|
4
|
+
of this software and associated documentation files (the "Software"), to deal
|
5
|
+
in the Software without restriction, including without limitation the rights
|
6
|
+
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
7
|
+
copies of the Software, and to permit persons to whom the Software is
|
8
|
+
furnished to do so, subject to the following conditions:
|
9
|
+
|
10
|
+
The above copyright notice and this permission notice shall be included in
|
11
|
+
all copies or substantial portions of the Software.
|
12
|
+
|
13
|
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
14
|
+
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
15
|
+
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
16
|
+
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
17
|
+
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
18
|
+
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
19
|
+
THE SOFTWARE.
|
data/README.md
ADDED
@@ -0,0 +1,380 @@
|
|
1
|
+
# ruby-ll
|
2
|
+
|
3
|
+
ruby-ll is a high performance LL(1) table based parser generator for Ruby. The
|
4
|
+
parser driver is written in C/Java to ensure good runtime performance, the
|
5
|
+
compiler is written entirely in Ruby.
|
6
|
+
|
7
|
+
ruby-ll was written to serve as a fast and easy to use alternative to
|
8
|
+
[Racc][racc] for the various parsers used in [Oga][oga]. However, ruby-ll isn't
|
9
|
+
limited to just Oga, you can use it to write a parser for any language that can
|
10
|
+
be represented using an LL(1) grammar.
|
11
|
+
|
12
|
+
ruby-ll is self-hosting, this allows one to use ruby-ll to modify its own
|
13
|
+
parser. Self-hosting was achieved by bootstrapping the parser using a Racc
|
14
|
+
parser that outputs the same AST as the ruby-ll parser. The Racc parser remains
|
15
|
+
in the repository for historical purposes and in case it's ever needed again, it
|
16
|
+
can be found in [bootstrap/parser.y](lib/ll/bootstrap/parser.y).
|
17
|
+
|
18
|
+
For more information on LL parsing, see
|
19
|
+
<https://en.wikipedia.org/wiki/LL_parser>.
|
20
|
+
|
21
|
+
## Features
|
22
|
+
|
23
|
+
* Support for detecting first/first and first/follow conflicts
|
24
|
+
* clang-like error/warning messages to ease debugging parsers
|
25
|
+
* High performance and a low memory footprint
|
26
|
+
|
27
|
+
## Requirements
|
28
|
+
|
29
|
+
| Ruby | Required | Recommended |
|
30
|
+
|:---------|:--------------|:------------|
|
31
|
+
| MRI | >= 1.9.3 | >= 2.1.0 |
|
32
|
+
| Rubinius | >= 2.2 | >= 2.5.0 |
|
33
|
+
| JRuby | >= 1.7 | >= 1.7.0 |
|
34
|
+
| Maglev | Not supported | |
|
35
|
+
| Topaz | Not supported | |
|
36
|
+
| mruby | Not supported | |
|
37
|
+
|
38
|
+
For MRI/Rubinius you'll need a C90 compatible compiler such as clang or gcc. For
|
39
|
+
JRuby you don't need any compilers to be installed as the .jar is packaged with
|
40
|
+
the Gem itself.
|
41
|
+
|
42
|
+
When hacking on ruby-ll you'll also need to have the following installed:
|
43
|
+
|
44
|
+
* Ragel 6 for building the grammar lexer
|
45
|
+
* javac for building the JRuby extension
|
46
|
+
|
47
|
+
## Usage
|
48
|
+
|
49
|
+
The CLI takes a grammar input file (see below for the exact syntax) with the
|
50
|
+
extension `.rll` and turns it into a corresponding Ruby file. For example:
|
51
|
+
|
52
|
+
ruby-ll lib/my-gem/parser.rll
|
53
|
+
|
54
|
+
This would result in the parser being written to `lib/my-gem/parser.rb`. If you
|
55
|
+
want to customize the output path you can do so using the `-o` / `--output`
|
56
|
+
options:
|
57
|
+
|
58
|
+
ruby-ll lib/my-gem/parser.rll -o lib/my-gem/my-parser.rb
|
59
|
+
|
60
|
+
By default ruby-ll adds various `require` calls to ensure you can load the
|
61
|
+
parser _without_ having to load all of ruby-ll (e.g. the compiler code). If you
|
62
|
+
want to disable this behaviour you can use the `--no-requires` option when
|
63
|
+
processing a grammar:
|
64
|
+
|
65
|
+
ruby-ll lib/my-gem/parser.rll --no-requires
|
66
|
+
|
67
|
+
Once generated you can use the parser class like any other parser. To start
|
68
|
+
parsing simply call the `parse` method:
|
69
|
+
|
70
|
+
parser = MyGem::Parser.new
|
71
|
+
|
72
|
+
parser.parse
|
73
|
+
|
74
|
+
The return value of this method is whatever the root rule (= the first rule
|
75
|
+
defined) returned.
|
76
|
+
|
77
|
+
## Grammar Syntax
|
78
|
+
|
79
|
+
The syntax of a ruby-ll grammar file is fairly simple and consists of
|
80
|
+
directives, rules, comments and code blocks.
|
81
|
+
|
82
|
+
Directives can be seen as configuration options, for example to set the name of
|
83
|
+
the parser class. Rules are, well, the parsing rules. Code blocks can be used to
|
84
|
+
associate Ruby code with either a branch of a rule or a certain section of the
|
85
|
+
parser (the header or its inner body).
|
86
|
+
|
87
|
+
Directives and rules must be terminated using a semicolon, this is not needed
|
88
|
+
for `%inner` / `%header` blocks.
|
89
|
+
|
90
|
+
For a full example, see ruby-ll's own parser located at
|
91
|
+
[lib/ll/parser.rll](lib/ll/parser.rll).
|
92
|
+
|
93
|
+
### Comments
|
94
|
+
|
95
|
+
Comments start with a hash (`#`) sign and continue until the end of the line,
|
96
|
+
just like Ruby. Example:
|
97
|
+
|
98
|
+
# Some say comments are a code smell.
|
99
|
+
|
100
|
+
### %name
|
101
|
+
|
102
|
+
The `%name` directive is used to set the full name/namespace of the parser
|
103
|
+
class. The name consists of a single identifier or multiple identifiers
|
104
|
+
separated by `::` (just like Ruby). Some examples:
|
105
|
+
|
106
|
+
%name A;
|
107
|
+
%name A::B;
|
108
|
+
%name A::B::C;
|
109
|
+
|
110
|
+
The last identifier is used as the actual class name. This class will be nested
|
111
|
+
inside a module for every other segment leading up to the last one. For example,
|
112
|
+
this:
|
113
|
+
|
114
|
+
%name A;
|
115
|
+
|
116
|
+
Gets turned into this:
|
117
|
+
|
118
|
+
class A < LL::Driver
|
119
|
+
|
120
|
+
end
|
121
|
+
|
122
|
+
While this:
|
123
|
+
|
124
|
+
%name A::B::C;
|
125
|
+
|
126
|
+
Gets turned into this:
|
127
|
+
|
128
|
+
module A
|
129
|
+
module B
|
130
|
+
class C < LL::Driver
|
131
|
+
|
132
|
+
end
|
133
|
+
end
|
134
|
+
end
|
135
|
+
|
136
|
+
By nesting the parser class in modules any constants in the scope can be
|
137
|
+
referred to without requiring the use of a full namespace. For example, the
|
138
|
+
constant `A::B::X` can just be referred to as `X` in the above example.
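The constant lookup described above can be sketched in plain Ruby. This is only an illustration: the constant `X`, its value and the `answer` method are hypothetical, and the `LL::Driver` superclass is left out so the snippet runs without ruby-ll installed.

```ruby
module A
  module B
    X = 42 # hypothetical constant defined in A::B

    # A generated parser would inherit from LL::Driver; the superclass is
    # omitted here so the sketch is self-contained.
    class C
      def answer
        # Resolved through lexical scope, no need to write A::B::X.
        X
      end
    end
  end
end

value = A::B::C.new.answer
```

Had the class been defined as `class A::B::C` at the top level, the bare `X` would not be resolved, because `A::B` would not be part of the lexical scope.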
|
139
|
+
|
140
|
+
Multiple calls to this directive will result in previous values being
|
141
|
+
overwritten.
|
142
|
+
|
143
|
+
### %terminals
|
144
|
+
|
145
|
+
The `%terminals` directive is used to list one or more terminals of the grammar.
|
146
|
+
Each terminal is an identifier separated by a space. For example:
|
147
|
+
|
148
|
+
%terminals A B C;
|
149
|
+
|
150
|
+
This would define 3 terminals: `A`, `B` and `C`. While there's no specific
|
151
|
+
requirement as to how you name your terminals it's common practise to capitalize
|
152
|
+
them and prefix them with `T_`, like so:
|
153
|
+
|
154
|
+
%terminals T_A T_B T_C;
|
155
|
+
|
156
|
+
Multiple calls to this directive will result in the terminals being appended to
|
157
|
+
the existing list.
|
158
|
+
|
159
|
+
### %inner
|
160
|
+
|
161
|
+
The `%inner` directive can be used to specify a code block that should be placed
|
162
|
+
inside the parser's body, just after the section containing all parsing tables.
|
163
|
+
This directive should be used for adding custom methods and such to the parser.
|
164
|
+
For example:
|
165
|
+
|
166
|
+
%inner
|
167
|
+
{
|
168
|
+
def initialize(input)
|
169
|
+
@input = input
|
170
|
+
end
|
171
|
+
}
|
172
|
+
|
173
|
+
This would result in the following:
|
174
|
+
|
175
|
+
class A < LL::Driver
|
176
|
+
def initialize(input)
|
177
|
+
@input = input
|
178
|
+
end
|
179
|
+
end
|
180
|
+
|
181
|
+
Curly braces can either be placed on the same line as the `%inner` directive or
|
182
|
+
on a new line, it's up to you.
|
183
|
+
|
184
|
+
Unlike regular directives this directive should not be terminated using a
|
185
|
+
semicolon.
|
186
|
+
|
187
|
+
### %header
|
188
|
+
|
189
|
+
The `%header` directive is similar to the `%inner` directive in that it can be
|
190
|
+
used to add a code block to the parser. The code of this directive is placed
|
191
|
+
just before the `class` definition of the parser. This directive can be used to
|
192
|
+
add documentation to the parser class. For example:
|
193
|
+
|
194
|
+
%header
|
195
|
+
{
|
196
|
+
# Hello world
|
197
|
+
}
|
198
|
+
|
199
|
+
This would result in the following:
|
200
|
+
|
201
|
+
# Hello world
|
202
|
+
class A < LL::Driver
|
203
|
+
end
|
204
|
+
|
205
|
+
### Rules
|
206
|
+
|
207
|
+
Rules consist of a name followed by an equals sign (`=`) followed by 1 or
|
208
|
+
more branches. Each branch is separated using a pipe (`|`). A branch can consist
|
209
|
+
of 1 or more steps, or an epsilon. Branches can be followed by a code block
|
210
|
+
starting with `{` and ending with `}`. A rule must be terminated using a
|
211
|
+
semicolon.
|
212
|
+
|
213
|
+
An epsilon is represented as a single underscore (`_`) and is used to denote a
|
214
|
+
branch that matches nothing (the empty input).
|
215
|
+
|
216
|
+
A simple example:
|
217
|
+
|
218
|
+
%terminals A B;
|
219
|
+
|
220
|
+
numbers = A | B;
|
221
|
+
|
222
|
+
Here the rule `numbers` is defined and has two branches. If we wanted a rule
|
223
|
+
that would match terminal `A` or nothing we'd use the following:
|
224
|
+
|
225
|
+
%terminals A;
|
226
|
+
|
227
|
+
numbers = A | _;
|
228
|
+
|
229
|
+
Code blocks can also be added:
|
230
|
+
|
231
|
+
numbers
|
232
|
+
= A { 'A' }
|
233
|
+
| B { 'B' }
|
234
|
+
;
|
235
|
+
|
236
|
+
When the terminal `A` would be processed the returned value would be "A", for
|
237
|
+
terminal `B` the returned value would be "B".
|
238
|
+
|
239
|
+
Code blocks have access to an array called `val` which contains the values of
|
240
|
+
every step of a branch. For example:
|
241
|
+
|
242
|
+
numbers = A B { val };
|
243
|
+
|
244
|
+
Here `val` would return `[A, B]`. Since `val` is just an Array you can also
|
245
|
+
return specific elements from it:
|
246
|
+
|
247
|
+
numbers = A B { val[0] };
|
248
|
+
|
249
|
+
Values returned by code blocks are passed to whatever other rule called it. This
|
250
|
+
allows code blocks to be used for building ASTs and the likes. If no explicit
|
251
|
+
code block is defined `val` is returned as is.
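As a rough model of the behaviour described above, a branch's code block can be thought of as a callable that receives `val`. This is not how ruby-ll implements actions internally; the lambdas and the sample values below are hypothetical and only mirror the two grammar snippets.

```ruby
# Hypothetical model: each branch action is a callable that receives `val`.
explicit_action = ->(val) { val[0] } # like `numbers = A B { val[0] };`
default_action  = ->(val) { val }    # no code block: `val` is returned as is

# Illustrative values for the steps A and B of the branch.
val = ['a', 'b']

first_value = explicit_action.call(val)
all_values  = default_action.call(val)
```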
|
252
|
+
|
253
|
+
ruby-ll parsers recurse into rules before unwinding, this means that the
|
254
|
+
inner-most rule is processed first.
|
255
|
+
|
256
|
+
Branches of a rule can also refer to other rules:
|
257
|
+
|
258
|
+
numbers = A other_rule;
|
259
|
+
other_rule = B;
|
260
|
+
|
261
|
+
The value for `other_rule` in the `numbers` rule would be whatever the
|
262
|
+
`other_rule` below it returns.
|
263
|
+
|
264
|
+
The grammar compiler adds errors whenever it encounters a rule with the same
|
265
|
+
name as a terminal, as such the following is invalid:
|
266
|
+
|
267
|
+
%terminals A B;
|
268
|
+
|
269
|
+
A = B;
|
270
|
+
|
271
|
+
It's also an error to re-define an existing rule.
|
272
|
+
|
273
|
+
## Conflicts
|
274
|
+
|
275
|
+
LL(1) grammars can have two kinds of conflicts in a rule:
|
276
|
+
|
277
|
+
* first/first
|
278
|
+
* first/follow
|
279
|
+
|
280
|
+
### first/first
|
281
|
+
|
282
|
+
A first/first conflict means that multiple branches of a rule start with the
|
283
|
+
same terminal, resulting in the parser being unable to choose what branch to
|
284
|
+
use. For example:
|
285
|
+
|
286
|
+
%terminals A B;
|
287
|
+
|
288
|
+
rule = A | A B;
|
289
|
+
|
290
|
+
This would result in the following output:
|
291
|
+
|
292
|
+
example.rll:5:1:error: first/first conflict, multiple branches start with the same terminals
|
293
|
+
rule = A | A B;
|
294
|
+
^
|
295
|
+
example.rll:5:8:error: branch starts with: A
|
296
|
+
rule = A | A B;
|
297
|
+
^
|
298
|
+
example.rll:5:12:error: branch starts with: A
|
299
|
+
rule = A | A B;
|
300
|
+
^
|
301
|
+
|
302
|
+
To solve a first/first conflict you'll have to factor out the common left
|
303
|
+
factor. For example:
|
304
|
+
|
305
|
+
%name Example;
|
306
|
+
|
307
|
+
%terminals A B;
|
308
|
+
|
309
|
+
rule = A rule_follow;
|
310
|
+
rule_follow = B | _;
|
311
|
+
|
312
|
+
Here the `rule` rule starts with terminal `A` and can optionally be followed by
|
313
|
+
`B`, without introducing any first/first conflicts.
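To make the idea concrete, here is a minimal sketch of a first/first check in plain Ruby. It is not ruby-ll's implementation. It assumes a hypothetical encoding where a grammar is a Hash mapping rule names to branches, `:_` stands for epsilon, and only the first step of each branch is inspected (a real check would also look past leading rules that can be empty).

```ruby
require 'set'

# Returns the set of terminals a symbol can start with. Symbols that are not
# keys of the grammar are treated as terminals (or epsilon).
def first_of(grammar, symbol, seen = Set.new)
  return Set[symbol] unless grammar.key?(symbol)
  return Set.new if seen.include?(symbol) # guard against left recursion

  seen |= [symbol]

  grammar[symbol].reduce(Set.new) do |acc, branch|
    acc | first_of(grammar, branch.first, seen)
  end
end

# Returns the terminals that start more than one branch of the given rule.
def first_first_conflicts(grammar, rule)
  counts = Hash.new(0)

  grammar[rule].each do |branch|
    first_of(grammar, branch.first).each do |terminal|
      counts[terminal] += 1 unless terminal == :_
    end
  end

  counts.select { |_, count| count > 1 }.keys
end

# `rule = A | A B;`
conflicting = { rule: [[:A], [:A, :B]] }

# `rule = A rule_follow; rule_follow = B | _;`
factored = { rule: [[:A, :rule_follow]], rule_follow: [[:B], [:_]] }

before = first_first_conflicts(conflicting, :rule)
after  = first_first_conflicts(factored, :rule)
```

With the conflicting grammar both branches start with `A`, so `A` is reported. After left factoring no terminal starts more than one branch.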
|
314
|
+
|
315
|
+
### first/follow
|
316
|
+
|
317
|
+
A first/follow conflict occurs when a branch in a rule starts with an epsilon
|
318
|
+
and is followed by one or more terminals and/or rules. An example of a
|
319
|
+
first/follow conflict:
|
320
|
+
|
321
|
+
%name Example;
|
322
|
+
|
323
|
+
%terminals A B;
|
324
|
+
|
325
|
+
rule = other_rule B;
|
326
|
+
other_rule = A | _;
|
327
|
+
|
328
|
+
This produces the following errors:
|
329
|
+
|
330
|
+
example.rll:5:14:error: first/follow conflict, branch can start with epsilon and is followed by (non) terminals
|
331
|
+
rule = other_rule B;
|
332
|
+
^
|
333
|
+
example.rll:6:18:error: epsilon originates from here
|
334
|
+
other_rule = A | _;
|
335
|
+
^
|
336
|
+
|
337
|
+
There's no specific procedure for solving such a conflict other than simply
|
338
|
+
removing the starting epsilon.
|
339
|
+
|
340
|
+
## Performance
|
341
|
+
|
342
|
+
One of the goals of ruby-ll is to be faster than existing parser generators,
|
343
|
+
Racc in particular. How much faster ruby-ll will be depends on the use case. For
|
344
|
+
example, for the benchmark
|
345
|
+
[benchmark/ll/simple\_json\_bench.rb](benchmark/ll/simple_json_bench.rb) the
|
346
|
+
performance gains of ruby-ll over Racc are as follows:
|
347
|
+
|
348
|
+
| Ruby | Speed |
|
349
|
+
|:----------------|:------|
|
350
|
+
| MRI 2.2 | 1.75x |
|
351
|
+
| Rubinius 2.5.2 | 3.85x |
|
352
|
+
| JRuby 1.7.18 | 6.44x |
|
353
|
+
| JRuby 9000 pre1 | 7.50x |
|
354
|
+
|
355
|
+
This benchmark was run on a Thinkpad T520 laptop so it's probably best to run
|
356
|
+
the benchmark yourself to see how it behaves on your platform.
|
357
|
+
|
358
|
+
Depending on the complexity of your parser you might end up with different
|
359
|
+
numbers. The above metrics are simply an indication of the maximum
|
360
|
+
performance gain of ruby-ll compared to Racc.
|
361
|
+
|
362
|
+
## Thread Safety
|
363
|
+
|
364
|
+
Parsers generated by ruby-ll share an internal, mutable state on a per instance
|
365
|
+
basis. As a result of this a single instance of your parser _can not_ be used by
|
366
|
+
multiple threads in parallel. If it wasn't for MRI's C API (specifically due to
|
367
|
+
how `rb_block_call` works) this wouldn't have been an issue.
|
368
|
+
|
369
|
+
To mitigate the above simply create a new instance of your parser every time you
|
370
|
+
need it and have the GC clean it up once you're done. This _will_ introduce a
|
371
|
+
slight allocation overhead but it beats having to deal with race conditions.
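The one-instance-per-use pattern could look roughly like the sketch below. `SketchParser` is a hypothetical stand-in with per-instance mutable state, used so the example runs without a generated parser; with ruby-ll you would instantiate your own generated class instead.

```ruby
# Hypothetical stand-in for a generated parser: it keeps mutable state in
# instance variables, so a single instance must not be shared across threads.
class SketchParser
  def initialize(input)
    @input = input
    @stack = []
  end

  def parse
    @input.each_char { |char| @stack << char }
    @stack.join
  end
end

inputs = %w[abc def ghi]

# Every thread builds its own parser, so no state is shared between them.
threads = inputs.map do |input|
  Thread.new { SketchParser.new(input).parse }
end

results = threads.map(&:value)
```

Each instance becomes garbage once its thread finishes, which is the allocation overhead mentioned above.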
|
372
|
+
|
373
|
+
## License
|
374
|
+
|
375
|
+
All source code in this repository is licensed under the MIT license unless
|
376
|
+
specified otherwise. A copy of this license can be found in the file "LICENSE"
|
377
|
+
in the root directory of this repository.
|
378
|
+
|
379
|
+
[racc]: https://github.com/tenderlove/racc
|
380
|
+
[oga]: https://github.com/yorickpeterse/oga
|
data/bin/ruby-ll
ADDED
data/doc/DCO.md
ADDED
@@ -0,0 +1,25 @@
|
|
1
|
+
# Developer's Certificate of Origin 1.0
|
2
|
+
|
3
|
+
By making a contribution to this project, I certify that:
|
4
|
+
|
5
|
+
1. The contribution was created in whole or in part by me and I
|
6
|
+
have the right to submit it under the open source license
|
7
|
+
indicated in the file LICENSE; or
|
8
|
+
|
9
|
+
2. The contribution is based upon previous work that, to the best
|
10
|
+
of my knowledge, is covered under an appropriate open source
|
11
|
+
license and I have the right under that license to submit that
|
12
|
+
work with modifications, whether created in whole or in part
|
13
|
+
by me, under the same open source license (unless I am
|
14
|
+
permitted to submit under a different license), as indicated
|
15
|
+
in the file LICENSE; or
|
16
|
+
|
17
|
+
3. The contribution was provided directly to me by some other
|
18
|
+
person who certified (1), (2) or (3) and I have not modified
|
19
|
+
it.
|
20
|
+
|
21
|
+
4. I understand and agree that this project and the contribution
|
22
|
+
are public and that a record of the contribution (including all
|
23
|
+
personal information I submit with it, including my sign-off) is
|
24
|
+
maintained indefinitely and may be redistributed consistent with
|
25
|
+
this project or the open source license(s) involved.
|