ruby-ll 1.1.3-java → 2.0.0-java
- checksums.yaml +4 -4
- data/README.md +68 -2
- data/doc/driver_architecture.md +32 -0
- data/ext/c/driver.c +71 -2
- data/ext/c/driver_config.c +1 -1
- data/ext/c/driver_config.h +1 -1
- data/ext/java/org/libll/Driver.java +63 -7
- data/lib/libll.jar +0 -0
- data/lib/ll.rb +1 -0
- data/lib/ll/branch.rb +7 -1
- data/lib/ll/configuration_compiler.rb +34 -9
- data/lib/ll/driver.rb +4 -0
- data/lib/ll/grammar_compiler.rb +77 -27
- data/lib/ll/lexer.rb +75 -51
- data/lib/ll/operator.rb +26 -0
- data/lib/ll/parser.rb +129 -62
- data/lib/ll/version.rb +1 -1
- metadata +59 -57
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA1:
-metadata.gz:
-data.tar.gz:
+metadata.gz: 0f307cb5b874f4f151092083942884b5077e5fab
+data.tar.gz: 1a9dbbcd10bd6898a709aa5759d5f91030272e8f
 SHA512:
-metadata.gz:
-data.tar.gz:
+metadata.gz: d397ac3dac1792bb4310508aaec7ee448cfe0f09f9fa1072daa5c53f1b8ea290c2ff342897e8edfcced314d5b9fa0715970dccadea8ac0c9f0b23be1dced959b
+data.tar.gz: 76771a39c8fdebf2e64f5b0535a019c22ac5aff0907b96f2a8d91ec5d5f054f742aa3fc75dd847cb1c050c1e1a60c79269308a2eb6aa596cd9923479a21b3665
data/README.md
CHANGED
@@ -278,8 +278,30 @@ return specific elements from it:
 numbers = A B { val[0] };
 
 Values returned by code blocks are passed to whatever other rule called it. This
-allows code blocks to be used for building ASTs and the likes.
-
+allows code blocks to be used for building ASTs and the likes.
+
+If no explicit code block is defined then ruby-ll will generate one for you. If
+a branch consists out of only a single step (e.g. `A = B;`) then only the first
+value is returned, otherwise all values are returned.
+
+This means that in the following example the output will be whatever value "C"
+contains:
+
+A = B { p val[0] };
+B = C;
+
+However, here the output would be `[C, D]` as the `B` rule's branch contains
+multiple steps:
+
+A = B { p val[0] };
+B = C D;
+
+To summarize (`# =>` denotes the return value):
+
+A = B; # => B
+A = B C; # => [B, C]
+
+You can override this behaviour simply by defining your own code block.
 
 ruby-ll parsers recurse into rules before unwinding, this means that the
 inner-most rule is processed first.
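Review note: the default return value rule added to the README in this hunk can be modeled in a few lines of Ruby. This is a minimal sketch and not code from ruby-ll; `default_return` is a hypothetical helper, and it ignores the epsilon case.

```ruby
# Hypothetical helper (not part of ruby-ll): models the value a branch yields
# when the grammar defines no code block for it, as described in the README.
def default_return(step_values)
  # A single-step branch returns the bare value; anything else returns the
  # complete list of step values.
  step_values.length == 1 ? step_values[0] : step_values
end

p default_return(['C'])      # models "A = B; B = C;"
p default_return(['C', 'D']) # models "A = B; B = C D;"
```

An explicit code block in the grammar would replace this default entirely.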
@@ -301,6 +323,50 @@ name as a terminal, as such the following is invalid:
 
 It's also an error to re-define an existing rule.
 
+### Operators
+
+Grammars can use two operators to define a sequence of terminals/non-terminals:
+the star (`*`) and plus (`+`) operators.
+
+The star operator indicates that something should occur 0 or more times. Here
+the "B" identifier could occur 0 times, once, twice or many more times:
+
+A = B*;
+
+The plus operator indicates that something should occur at least once followed
+by any number of more occurrences. For example, this grammar states that "B"
+should occur at least once but can also occur, say, 10 times:
+
+A = B+;
+
+Operators can be applied either to a single terminal/rule or a series of
+terminals/rules grouped together using parenthesis. For example, both are
+perfectly valid:
+
+A = B+;
+A = (B C)+;
+
+When calling an operator on a single terminal/rule the corresponding entry in
+the `val` array is simply set to the terminal/rule value. For example:
+
+A = B+ { p val[0] };
+
+For input `B B B` this would output `[B, B, B]`.
+
+However, when grouping multiple terminals/rules using parenthesis every
+occurrence is wrapped in an Array. For example:
+
+A = (B C)+ { p val[0] };
+
+For input `B C B C` this would output `[[B, C], [B, C]]`. To work around this
+you can simply move the group of identifiers to its own rule and only return
+whatever you need:
+
+A = A1+ { p val[0] };
+A1 = B C { val[0] }; # only return "B"
+
+For input `B C B C` this would output `[B, B]`.
+
 ## Conflicts
 
 LL(1) grammars can have two kinds of conflicts in a rule:
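Review note: the shapes of `val[0]` described in the new "Operators" section can be reproduced with plain Ruby arrays. The sketch below is only an illustration of those shapes; `operator_value` is a hypothetical helper and does not exist in ruby-ll.

```ruby
# Hypothetical helper: given the flat list of matched values and the number of
# steps in the repeated group, build what the README says val[0] would hold.
def operator_value(matched, group_size)
  # A single terminal/rule is collected as-is.
  return matched.dup if group_size == 1

  # A parenthesised group wraps every occurrence in its own Array.
  matched.each_slice(group_size).to_a
end

p operator_value(%w[B B B], 1)   # models "A = B+" on input B B B
p operator_value(%w[B C B C], 2) # models "A = (B C)+" on input B C B C

# Moving the group into its own rule that returns only val[0]:
p operator_value(%w[B C B C], 2).map { |occurrence| occurrence[0] }
```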
data/doc/driver_architecture.md
ADDED
@@ -0,0 +1,32 @@
+# Driver Architecture
+
+The actual parsing of input is handled by a so called "driver" represented as
+the class `LL::Driver`. This class is written in either C or Java depending on
+the Ruby platform that's being used. The rationale for this is simple:
+performance. While Ruby is a great language it's sadly not fast enough to handle
+parsing of large inputs in a way that doesn't either require lots of memory,
+time or both.
+
+Both the C and Java drivers try to use native data structures as much as
+possible instead of using Ruby structures. For example, their internal parsing
+stacks are native stacks. In case of Java this is an ArrayDeque, in case of C
+this is a vector created using the [kvec][kvec] library as C doesn't have a
+native vector structure.
+
+The driver operates by iterating over every token supplied by the `each_token`
+method (this method must be defined by a parser itself). For every input token a
+callback function in C/Java is executed that determines what to parse and how to
+parse it.
+
+The parsing process largely operates on integers, only using Ruby objects where
+absolutely required. For example, all steps of a rule's branch are represented
+as integers. Lookup tables are also simply arrays of integers with terminals
+being mapped directly to the indexes of these arrays. See ruby-ll's own parser
+for examples. Note that the integers for the `rules` Array are in reverse order,
+so everything that comes first is processed last.
+
+For more information on the internals its best to refer to the C driver code
+located in `ext/c/driver.c`. The Java code is largely based on this code safe
+for some code comments here and there.
+
+[kvec]: https://github.com/attractivechaos/klib/blob/master/kvec.h
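Review note: the integer-based, table-driven approach described in this new document can be sketched in pure Ruby. The code below is a deliberately simplified toy and not the driver's implementation. The `T_*` values match the constants in the diff, but the grammar, the `TABLE`/`RULES` layout and the `parse` method are assumptions made for illustration.

```ruby
T_EOF      = -1
T_RULE     = 0
T_TERMINAL = 1

# Toy grammar (hypothetical): pair = A rest; rest = B;
# Terminals map to column indexes: A => 0, B => 1.
# TABLE[rule][terminal] holds the production to use, or T_EOF if none applies.
TABLE = [
  [0, T_EOF], # rule 0 (pair)
  [T_EOF, 1]  # rule 1 (rest)
].freeze

# Each production is a flat list of (type, value) pairs in reverse order, so
# the first step of a branch ends up on top of the stack.
RULES = [
  [T_RULE, 1, T_TERMINAL, 0], # pair = A rest
  [T_TERMINAL, 1]             # rest = B
].freeze

def parse(token_ids)
  stack   = [T_RULE, 0] # start with rule 0
  matched = []

  token_ids.each do |token_id|
    loop do
      raise 'unexpected input' if stack.empty?

      value = stack.pop
      type  = stack.pop

      if type == T_RULE
        production = TABLE[value][token_id]
        raise 'unexpected token' if production == T_EOF

        stack.concat(RULES[production])
      elsif type == T_TERMINAL
        raise 'unexpected token' unless value == token_id

        matched << token_id
        break # this token is consumed, move on to the next one
      end
    end
  end

  raise 'unexpected end of input' unless stack.empty?

  matched
end

p parse([0, 1]) # the token sequence "A B"
```

Because only integers are pushed and popped, the real drivers can keep this loop in native data structures and touch Ruby objects only for token values and actions.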
data/ext/c/driver.c
CHANGED
@@ -5,6 +5,10 @@
 #define T_TERMINAL 1
 #define T_EPSILON 2
 #define T_ACTION 3
+#define T_STAR 4
+#define T_PLUS 5
+#define T_ADD_VALUE_STACK 6
+#define T_APPEND_VALUE_STACK 7
 
 ID id_config_const;
 ID id_each_token;
@@ -66,6 +70,8 @@ VALUE ll_driver_each_token(VALUE token, VALUE self)
 VALUE method;
 VALUE action_args;
 VALUE action_retval;
+VALUE operator_buffer;
+VALUE last_value;
 long num_args;
 long args_i;
 
@@ -113,8 +119,8 @@ VALUE ll_driver_each_token(VALUE token, VALUE self)
 }
 }
 
-/*
-if ( stack_type == T_RULE )
+/* A rule or the "+" operator */
+if ( stack_type == T_RULE || stack_type == T_PLUS )
 {
 production_i = state->config->table[stack_value][token_id];
 
@@ -132,6 +138,19 @@ VALUE ll_driver_each_token(VALUE token, VALUE self)
 }
 else
 {
+/*
+Append a "*" operator for all following occurrences as they are
+optional
+*/
+if ( stack_type == T_PLUS )
+{
+kv_push(long, state->stack, T_STAR);
+kv_push(long, state->stack, stack_value);
+
+kv_push(long, state->stack, T_APPEND_VALUE_STACK);
+kv_push(long, state->stack, 0);
+}
+
 FOR(rule_i, state->config->rule_lengths[production_i])
 {
 kv_push(
@@ -142,6 +161,56 @@ VALUE ll_driver_each_token(VALUE token, VALUE self)
 }
 }
 }
+/* "*" operator */
+else if ( stack_type == T_STAR )
+{
+production_i = state->config->table[stack_value][token_id];
+
+if ( production_i != T_EOF )
+{
+kv_push(long, state->stack, T_STAR);
+kv_push(long, state->stack, stack_value);
+
+kv_push(long, state->stack, T_APPEND_VALUE_STACK);
+kv_push(long, state->stack, 0);
+
+FOR(rule_i, state->config->rule_lengths[production_i])
+{
+kv_push(
+long,
+state->stack,
+state->config->rules[production_i][rule_i]
+);
+}
+}
+}
+/*
+Adds a new array to the value stack that can be used to group operator
+values together
+*/
+else if ( stack_type == T_ADD_VALUE_STACK )
+{
+operator_buffer = rb_ary_new();
+
+kv_push(VALUE, state->value_stack, operator_buffer);
+
+RB_GC_GUARD(operator_buffer);
+}
+/*
+Appends the last value on the value stack to the operator buffer that
+preceeds it.
+*/
+else if ( stack_type == T_APPEND_VALUE_STACK )
+{
+last_value = kv_pop(state->value_stack);
+
+operator_buffer = kv_A(
+state->value_stack,
+kv_size(state->value_stack) - 1
+);
+
+rb_ary_push(operator_buffer, last_value);
+}
 /* Terminal */
 else if ( stack_type == T_TERMINAL )
 {
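Review note: the two new value stack instructions are easier to follow when replayed on a plain Ruby Array. The sketch below mirrors what the `T_ADD_VALUE_STACK` and `T_APPEND_VALUE_STACK` branches above do, under the assumption that every occurrence of the repeated step pushes exactly one value. The method names are hypothetical and only model the C code.

```ruby
# Models T_ADD_VALUE_STACK: push an empty buffer that will collect the values
# of every occurrence matched by an operator.
def add_value_stack(value_stack)
  value_stack.push([])
end

# Models T_APPEND_VALUE_STACK: pop the most recent value and append it to the
# buffer that sits directly below it on the value stack.
def append_value_stack(value_stack)
  last_value = value_stack.pop

  value_stack.last.push(last_value)
end

value_stack = []

add_value_stack(value_stack)

# Replays "A = B+" on the input B B B: each occurrence pushes its value, after
# which the append instruction folds it into the buffer.
%w[B B B].each do |value|
  value_stack.push(value)
  append_value_stack(value_stack)
end

p value_stack
```

After the loop the value stack holds a single buffer, which is what a code block would later receive as one entry of `val`.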
data/ext/c/driver_config.c
CHANGED
@@ -166,7 +166,7 @@ VALUE ll_driver_config_set_actions(VALUE self, VALUE array)
 
 Data_Get_Struct(self, DriverConfig, config);
 
-config->action_names = ALLOC_N(
+config->action_names = ALLOC_N(VALUE, row_count);
 config->action_arg_amounts = ALLOC_N(long, row_count);
 
 FOR(rindex, row_count)
data/ext/c/driver_config.h
CHANGED
data/ext/java/org/libll/Driver.java
CHANGED
@@ -27,11 +27,15 @@ import org.jruby.runtime.builtin.IRubyObject;
 @JRubyClass(name="LL::Driver", parent="Object")
 public class Driver extends RubyObject
 {
-private static long T_EOF
-private static long T_RULE
-private static long T_TERMINAL
-private static long T_EPSILON
-private static long T_ACTION
+private static long T_EOF = -1;
+private static long T_RULE = 0;
+private static long T_TERMINAL = 1;
+private static long T_EPSILON = 2;
+private static long T_ACTION = 3;
+private static long T_STAR = 4;
+private static long T_PLUS = 5;
+private static long T_ADD_VALUE_STACK = 6;
+private static long T_APPEND_VALUE_STACK = 7;
 
 /**
 * The current Ruby runtime.
@@ -132,8 +136,8 @@ public class Driver extends RubyObject
 token_id = self.config.terminals.get(type);
 }
 
-//
-if ( stack_type == self.T_RULE )
+// A rule or the "+" operator
+if ( stack_type == self.T_RULE || stack_type == self.T_PLUS )
 {
 Long production_i = self.config.table
 .get(stack_value.intValue())
@@ -152,6 +156,17 @@ public class Driver extends RubyObject
 }
 else
 {
+// Append a "*" operator for all following
+// occurrences as they are optional
+if ( stack_type == self.T_PLUS )
+{
+stack.push(self.T_STAR);
+stack.push(stack_value);
+
+stack.push(self.T_APPEND_VALUE_STACK);
+stack.push(Long.valueOf(0));
+}
+
 ArrayList<Long> row = self.config.rules
 .get(production_i.intValue());
 
@@ -161,6 +176,47 @@ public class Driver extends RubyObject
 }
 }
 }
+// "*" operator
+else if ( stack_type == self.T_STAR )
+{
+Long production_i = self.config.table
+.get(stack_value.intValue())
+.get(token_id.intValue());
+
+if ( production_i != self.T_EOF )
+{
+stack.push(self.T_STAR);
+stack.push(stack_value);
+
+stack.push(self.T_APPEND_VALUE_STACK);
+stack.push(Long.valueOf(0));
+
+ArrayList<Long> row = self.config.rules
+.get(production_i.intValue());
+
+for ( int index = 0; index < row.size(); index++ )
+{
+stack.push(row.get(index));
+}
+}
+}
+// Adds a new array to the value stack that can be used to
+// group operator values together
+else if ( stack_type == self.T_ADD_VALUE_STACK )
+{
+RubyArray operator_buffer = self.runtime.newArray();
+
+value_stack.push(operator_buffer);
+}
+// Appends the last value on the value stack to the operator
+// buffer that preceeds it.
+else if ( stack_type == self.T_APPEND_VALUE_STACK )
+{
+IRubyObject last_value = value_stack.pop();
+RubyArray operator_buffer = (RubyArray) value_stack.peek();
+
+operator_buffer.append(last_value);
+}
 // Terminal
 else if ( stack_type == self.T_TERMINAL )
 {
data/lib/libll.jar
CHANGED
Binary file
data/lib/ll.rb
CHANGED
@@ -18,6 +18,7 @@ require_relative 'll/rule'
 require_relative 'll/branch'
 require_relative 'll/terminal'
 require_relative 'll/epsilon'
+require_relative 'll/operator'
 require_relative 'll/message'
 require_relative 'll/ast/node'
 require_relative 'll/erb_context'
data/lib/ll/branch.rb
CHANGED
data/lib/ll/configuration_compiler.rb
CHANGED
@@ -8,11 +8,15 @@ module LL
 # @return [Hash]
 #
 TYPES = {
-:eof
-:rule
-:terminal
-:epsilon
-:action
+:eof => -1,
+:rule => 0,
+:terminal => 1,
+:epsilon => 2,
+:action => 3,
+:star => 4,
+:plus => 5,
+:add_value_stack => 6,
+:append_value_stack => 7
 }.freeze
 
 ##
@@ -105,7 +109,21 @@ module LL
 
 grammar.rules.each do |rule|
 rule.branches.each do |branch|
-
+if branch.ruby_code
+code = branch.ruby_code
+
+# If a branch only contains a single, non-epsilon step we can just
+# return that value as-is. This makes parsing code a little bit
+# easier.
+elsif !branch.ruby_code and branch.steps.length == 1 \
+and !branch.steps[0].is_a?(Epsilon)
+code = 'val[0]'
+
+else
+code = DEFAULT_RUBY_CODE
+end
+
+bodies[:"_rule_#{index}"] = code
 
 index += 1
 end
@@ -133,17 +151,24 @@ module LL
 action_index += 1
 
 branch.steps.reverse_each do |step|
-if step.is_a?(
+if step.is_a?(Terminal)
 row << TYPES[:terminal]
 row << term_indices[step] + 1
 
-elsif step.is_a?(
+elsif step.is_a?(Rule)
 row << TYPES[:rule]
 row << rule_indices[step]
 
-elsif step.is_a?(
+elsif step.is_a?(Epsilon)
 row << TYPES[:epsilon]
 row << 0
+
+elsif step.is_a?(Operator)
+row << TYPES[step.type]
+row << rule_indices[step.receiver]
+
+row << TYPES[:add_value_stack]
+row << 0
 end
 end
 
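Review note: the last hunk encodes every operator step as two (type, value) pairs. The sketch below rebuilds such a row for a hypothetical branch `A = X B+` to show the order in which a stack-based driver would see the entries. `Step`, `compile_row` and `pop_order` are illustrative names and not ruby-ll APIs. The indexes are made up, and the assumption that the driver pops the value before the type is inferred from the push order in the C and Java hunks.

```ruby
# Subset of the TYPES table from the diff.
TYPES = {
  :rule            => 0,
  :terminal        => 1,
  :star            => 4,
  :plus            => 5,
  :add_value_stack => 6
}.freeze

# Hypothetical stand-in for the Terminal/Rule/Operator step objects.
Step = Struct.new(:kind, :index)

# Builds the flat integer row for a branch, walking the steps in reverse.
def compile_row(steps)
  row = []

  steps.reverse_each do |step|
    row << TYPES[step.kind] << step.index

    # Operators are followed by an instruction that creates their buffer.
    if step.kind == :star || step.kind == :plus
      row << TYPES[:add_value_stack] << 0
    end
  end

  row
end

# Order in which the pairs come off a stack the row was pushed onto.
def pop_order(row)
  row.each_slice(2).to_a.reverse
end

# A = X B+ with terminal X at index 1 and rule B at index 2 (made-up values).
row = compile_row([Step.new(:terminal, 1), Step.new(:plus, 2)])

p row
p pop_order(row)
```

Under these assumptions the terminal is handled first, then the buffer is created, and only then is the `+` operator itself processed, so the buffer already exists when the first occurrence is appended.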