ruby-ll 1.1.3-java → 2.0.0-java
- checksums.yaml +4 -4
- data/README.md +68 -2
- data/doc/driver_architecture.md +32 -0
- data/ext/c/driver.c +71 -2
- data/ext/c/driver_config.c +1 -1
- data/ext/c/driver_config.h +1 -1
- data/ext/java/org/libll/Driver.java +63 -7
- data/lib/libll.jar +0 -0
- data/lib/ll.rb +1 -0
- data/lib/ll/branch.rb +7 -1
- data/lib/ll/configuration_compiler.rb +34 -9
- data/lib/ll/driver.rb +4 -0
- data/lib/ll/grammar_compiler.rb +77 -27
- data/lib/ll/lexer.rb +75 -51
- data/lib/ll/operator.rb +26 -0
- data/lib/ll/parser.rb +129 -62
- data/lib/ll/version.rb +1 -1
- metadata +59 -57
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA1:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: 0f307cb5b874f4f151092083942884b5077e5fab
|
4
|
+
data.tar.gz: 1a9dbbcd10bd6898a709aa5759d5f91030272e8f
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: d397ac3dac1792bb4310508aaec7ee448cfe0f09f9fa1072daa5c53f1b8ea290c2ff342897e8edfcced314d5b9fa0715970dccadea8ac0c9f0b23be1dced959b
|
7
|
+
data.tar.gz: 76771a39c8fdebf2e64f5b0535a019c22ac5aff0907b96f2a8d91ec5d5f054f742aa3fc75dd847cb1c050c1e1a60c79269308a2eb6aa596cd9923479a21b3665
|
data/README.md
CHANGED
@@ -278,8 +278,30 @@ return specific elements from it:
|
|
278
278
|
numbers = A B { val[0] };
|
279
279
|
|
280
280
|
Values returned by code blocks are passed to whatever other rule called it. This
|
281
|
-
allows code blocks to be used for building ASTs and the likes.
|
282
|
-
|
281
|
+
allows code blocks to be used for building ASTs and the likes.
|
282
|
+
|
283
|
+
If no explicit code block is defined then ruby-ll will generate one for you. If
|
284
|
+
a branch consists of only a single step (e.g. `A = B;`) then only the first
|
285
|
+
value is returned, otherwise all values are returned.
|
286
|
+
|
287
|
+
This means that in the following example the output will be whatever value "C"
|
288
|
+
contains:
|
289
|
+
|
290
|
+
A = B { p val[0] };
|
291
|
+
B = C;
|
292
|
+
|
293
|
+
However, here the output would be `[C, D]` as the `B` rule's branch contains
|
294
|
+
multiple steps:
|
295
|
+
|
296
|
+
A = B { p val[0] };
|
297
|
+
B = C D;
|
298
|
+
|
299
|
+
To summarize (`# =>` denotes the return value):
|
300
|
+
|
301
|
+
A = B; # => B
|
302
|
+
A = B C; # => [B, C]
|
303
|
+
|
304
|
+
You can override this behaviour simply by defining your own code block.
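The default return value described in this README addition can be sketched in plain Ruby. This is only an illustration: `default_return` is a hypothetical helper, not part of ruby-ll, and it ignores the epsilon special case that the compiler change in this diff also checks for.

```ruby
# Hypothetical helper mirroring the documented default; not ruby-ll API.
def default_return(val)
  # A single-step branch yields that step's value; any other branch yields
  # the whole array of step values.
  val.length == 1 ? val[0] : val
end

p default_return(['B'])      # models `A = B;`
p default_return(['B', 'C']) # models `A = B C;`
```

Defining an explicit code block on a branch replaces this default entirely.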
|
283
305
|
|
284
306
|
ruby-ll parsers recurse into rules before unwinding; this means that the
|
285
307
|
inner-most rule is processed first.
|
@@ -301,6 +323,50 @@ name as a terminal, as such the following is invalid:
|
|
301
323
|
|
302
324
|
It's also an error to re-define an existing rule.
|
303
325
|
|
326
|
+
### Operators
|
327
|
+
|
328
|
+
Grammars can use two operators to define a sequence of terminals/non-terminals:
|
329
|
+
the star (`*`) and plus (`+`) operators.
|
330
|
+
|
331
|
+
The star operator indicates that something should occur 0 or more times. Here
|
332
|
+
the "B" identifier could occur 0 times, once, twice or many more times:
|
333
|
+
|
334
|
+
A = B*;
|
335
|
+
|
336
|
+
The plus operator indicates that something should occur at least once followed
|
337
|
+
by any number of more occurrences. For example, this grammar states that "B"
|
338
|
+
should occur at least once but can also occur, say, 10 times:
|
339
|
+
|
340
|
+
A = B+;
|
341
|
+
|
342
|
+
Operators can be applied either to a single terminal/rule or a series of
|
343
|
+
terminals/rules grouped together using parentheses. For example, both are
|
344
|
+
perfectly valid:
|
345
|
+
|
346
|
+
A = B+;
|
347
|
+
A = (B C)+;
|
348
|
+
|
349
|
+
When calling an operator on a single terminal/rule the corresponding entry in
|
350
|
+
the `val` array is simply set to the terminal/rule value. For example:
|
351
|
+
|
352
|
+
A = B+ { p val[0] };
|
353
|
+
|
354
|
+
For input `B B B` this would output `[B, B, B]`.
|
355
|
+
|
356
|
+
However, when grouping multiple terminals/rules using parentheses every
|
357
|
+
occurrence is wrapped in an Array. For example:
|
358
|
+
|
359
|
+
A = (B C)+ { p val[0] };
|
360
|
+
|
361
|
+
For input `B C B C` this would output `[[B, C], [B, C]]`. To work around this
|
362
|
+
you can simply move the group of identifiers to its own rule and only return
|
363
|
+
whatever you need:
|
364
|
+
|
365
|
+
A = A1+ { p val[0] };
|
366
|
+
A1 = B C { val[0] }; # only return "B"
|
367
|
+
|
368
|
+
For input `B C B C` this would output `[B, B]`.
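The shape of the values produced by the operators can be modelled with a few lines of Ruby. This is a rough sketch, not how ruby-ll computes them: `repeat_values` is a hypothetical helper, `group_size` stands for the number of steps in the repeated group, and the optional block plays the role of the `A1` rule's code block.

```ruby
# Hypothetical model of the `val[0]` entry produced by `+` or `*`.
def repeat_values(tokens, group_size)
  tokens.each_slice(group_size).map do |group|
    # A single terminal/rule is stored as-is, a parenthesised group is
    # wrapped in an Array per occurrence.
    value = group_size == 1 ? group[0] : group

    block_given? ? yield(value) : value
  end
end

p repeat_values(%w[B B B], 1)                   # models `A = B+`
p repeat_values(%w[B C B C], 2)                 # models `A = (B C)+`
p repeat_values(%w[B C B C], 2) { |v| v[0] }    # models `A = A1+` with `A1 = B C { val[0] }`
```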
|
369
|
+
|
304
370
|
## Conflicts
|
305
371
|
|
306
372
|
LL(1) grammars can have two kinds of conflicts in a rule:
|
data/doc/driver_architecture.md
ADDED
@@ -0,0 +1,32 @@
|
|
1
|
+
# Driver Architecture
|
2
|
+
|
3
|
+
The actual parsing of input is handled by a so-called "driver" represented as
|
4
|
+
the class `LL::Driver`. This class is written in either C or Java depending on
|
5
|
+
the Ruby platform that's being used. The rationale for this is simple:
|
6
|
+
performance. While Ruby is a great language it's sadly not fast enough to handle
|
7
|
+
parsing of large inputs in a way that doesn't either require lots of memory,
|
8
|
+
time or both.
|
9
|
+
|
10
|
+
Both the C and Java drivers try to use native data structures as much as
|
11
|
+
possible instead of using Ruby structures. For example, their internal parsing
|
12
|
+
stacks are native stacks. In case of Java this is an ArrayDeque, in case of C
|
13
|
+
this is a vector created using the [kvec][kvec] library as C doesn't have a
|
14
|
+
native vector structure.
|
15
|
+
|
16
|
+
The driver operates by iterating over every token supplied by the `each_token`
|
17
|
+
method (this method must be defined by a parser itself). For every input token a
|
18
|
+
callback function in C/Java is executed that determines what to parse and how to
|
19
|
+
parse it.
|
20
|
+
|
21
|
+
The parsing process largely operates on integers, only using Ruby objects where
|
22
|
+
absolutely required. For example, all steps of a rule's branch are represented
|
23
|
+
as integers. Lookup tables are also simply arrays of integers with terminals
|
24
|
+
being mapped directly to the indexes of these arrays. See ruby-ll's own parser
|
25
|
+
for examples. Note that the integers for the `rules` Array are in reverse order,
|
26
|
+
so everything that comes first is processed last.
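To make the integer-based approach more concrete, here is a small Ruby sketch of a table-driven recognizer. It is an illustration under stated assumptions, not the driver itself: the grammar (`S = X c; X = a b;`), the `RULES` and `TABLE` contents and the `recognize` method are all made up for this example, and actions and values are left out. Each step is stored as a type/value pair and each row lists its steps in reverse, so pushing a row leaves the first step on top of the stack.

```ruby
T_RULE     = 0
T_TERMINAL = 1

# Hypothetical terminal IDs: a = 1, b = 2, c = 3.
# Rows hold [type, value] pairs in reverse step order.
RULES = [
  [T_TERMINAL, 3, T_RULE, 1],     # S = X c;
  [T_TERMINAL, 2, T_TERMINAL, 1]  # X = a b;
].freeze

# TABLE[rule][token_id] gives the row to use, -1 means "no match".
TABLE = [
  [-1, 0, -1, -1],
  [-1, 1, -1, -1]
].freeze

def recognize(token_ids)
  stack = [T_RULE, 0] # start with rule S

  token_ids.each do |token_id|
    loop do
      return false if stack.empty?

      value = stack.pop
      type  = stack.pop

      if type == T_RULE
        production = TABLE[value][token_id]
        return false if production == -1

        stack.concat(RULES[production])
      elsif type == T_TERMINAL
        return false unless value == token_id

        break # token consumed, move on to the next one
      end
    end
  end

  stack.empty?
end

p recognize([1, 2, 3]) # input "a b c"
p recognize([1, 3])    # input "a c"
```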
|
27
|
+
|
28
|
+
For more information on the internals it's best to refer to the C driver code
|
29
|
+
located in `ext/c/driver.c`. The Java code is largely based on this code save
|
30
|
+
for some code comments here and there.
|
31
|
+
|
32
|
+
[kvec]: https://github.com/attractivechaos/klib/blob/master/kvec.h
|
data/ext/c/driver.c
CHANGED
@@ -5,6 +5,10 @@
|
|
5
5
|
#define T_TERMINAL 1
|
6
6
|
#define T_EPSILON 2
|
7
7
|
#define T_ACTION 3
|
8
|
+
#define T_STAR 4
|
9
|
+
#define T_PLUS 5
|
10
|
+
#define T_ADD_VALUE_STACK 6
|
11
|
+
#define T_APPEND_VALUE_STACK 7
|
8
12
|
|
9
13
|
ID id_config_const;
|
10
14
|
ID id_each_token;
|
@@ -66,6 +70,8 @@ VALUE ll_driver_each_token(VALUE token, VALUE self)
|
|
66
70
|
VALUE method;
|
67
71
|
VALUE action_args;
|
68
72
|
VALUE action_retval;
|
73
|
+
VALUE operator_buffer;
|
74
|
+
VALUE last_value;
|
69
75
|
long num_args;
|
70
76
|
long args_i;
|
71
77
|
|
@@ -113,8 +119,8 @@ VALUE ll_driver_each_token(VALUE token, VALUE self)
|
|
113
119
|
}
|
114
120
|
}
|
115
121
|
|
116
|
-
/*
|
117
|
-
if ( stack_type == T_RULE )
|
122
|
+
/* A rule or the "+" operator */
|
123
|
+
if ( stack_type == T_RULE || stack_type == T_PLUS )
|
118
124
|
{
|
119
125
|
production_i = state->config->table[stack_value][token_id];
|
120
126
|
|
@@ -132,6 +138,19 @@ VALUE ll_driver_each_token(VALUE token, VALUE self)
|
|
132
138
|
}
|
133
139
|
else
|
134
140
|
{
|
141
|
+
/*
|
142
|
+
Append a "*" operator for all following occurrences as they are
|
143
|
+
optional
|
144
|
+
*/
|
145
|
+
if ( stack_type == T_PLUS )
|
146
|
+
{
|
147
|
+
kv_push(long, state->stack, T_STAR);
|
148
|
+
kv_push(long, state->stack, stack_value);
|
149
|
+
|
150
|
+
kv_push(long, state->stack, T_APPEND_VALUE_STACK);
|
151
|
+
kv_push(long, state->stack, 0);
|
152
|
+
}
|
153
|
+
|
135
154
|
FOR(rule_i, state->config->rule_lengths[production_i])
|
136
155
|
{
|
137
156
|
kv_push(
|
@@ -142,6 +161,56 @@ VALUE ll_driver_each_token(VALUE token, VALUE self)
|
|
142
161
|
}
|
143
162
|
}
|
144
163
|
}
|
164
|
+
/* "*" operator */
|
165
|
+
else if ( stack_type == T_STAR )
|
166
|
+
{
|
167
|
+
production_i = state->config->table[stack_value][token_id];
|
168
|
+
|
169
|
+
if ( production_i != T_EOF )
|
170
|
+
{
|
171
|
+
kv_push(long, state->stack, T_STAR);
|
172
|
+
kv_push(long, state->stack, stack_value);
|
173
|
+
|
174
|
+
kv_push(long, state->stack, T_APPEND_VALUE_STACK);
|
175
|
+
kv_push(long, state->stack, 0);
|
176
|
+
|
177
|
+
FOR(rule_i, state->config->rule_lengths[production_i])
|
178
|
+
{
|
179
|
+
kv_push(
|
180
|
+
long,
|
181
|
+
state->stack,
|
182
|
+
state->config->rules[production_i][rule_i]
|
183
|
+
);
|
184
|
+
}
|
185
|
+
}
|
186
|
+
}
|
187
|
+
/*
|
188
|
+
Adds a new array to the value stack that can be used to group operator
|
189
|
+
values together
|
190
|
+
*/
|
191
|
+
else if ( stack_type == T_ADD_VALUE_STACK )
|
192
|
+
{
|
193
|
+
operator_buffer = rb_ary_new();
|
194
|
+
|
195
|
+
kv_push(VALUE, state->value_stack, operator_buffer);
|
196
|
+
|
197
|
+
RB_GC_GUARD(operator_buffer);
|
198
|
+
}
|
199
|
+
/*
|
200
|
+
Appends the last value on the value stack to the operator buffer that
|
201
|
+
precedes it.
|
202
|
+
*/
|
203
|
+
else if ( stack_type == T_APPEND_VALUE_STACK )
|
204
|
+
{
|
205
|
+
last_value = kv_pop(state->value_stack);
|
206
|
+
|
207
|
+
operator_buffer = kv_A(
|
208
|
+
state->value_stack,
|
209
|
+
kv_size(state->value_stack) - 1
|
210
|
+
);
|
211
|
+
|
212
|
+
rb_ary_push(operator_buffer, last_value);
|
213
|
+
}
|
145
214
|
/* Terminal */
|
146
215
|
else if ( stack_type == T_TERMINAL )
|
147
216
|
{
|
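The two value-stack instructions added in this hunk can be modelled in plain Ruby. This is a simplified sketch: the method names are hypothetical, a Ruby Array stands in for the kvec-based value stack, and the way step values end up on the stack is reduced to a plain `push`.

```ruby
# Models T_ADD_VALUE_STACK: push a fresh buffer that will collect the values
# of every occurrence matched by an operator.
def add_value_stack(value_stack)
  value_stack.push([])
end

# Models T_APPEND_VALUE_STACK: pop the most recent value and append it to the
# buffer that sits directly below it.
def append_value_stack(value_stack)
  last_value = value_stack.pop

  value_stack.last.push(last_value)
end

value_stack = []

add_value_stack(value_stack)

# Two occurrences of "B", each followed by an append instruction.
2.times do
  value_stack.push('B')
  append_value_stack(value_stack)
end

p value_stack
```

Because the buffer stays on the value stack while occurrences are appended to it, the code block of the enclosing branch sees it as a single `val` entry.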
data/ext/c/driver_config.c
CHANGED
@@ -166,7 +166,7 @@ VALUE ll_driver_config_set_actions(VALUE self, VALUE array)
|
|
166
166
|
|
167
167
|
Data_Get_Struct(self, DriverConfig, config);
|
168
168
|
|
169
|
-
config->action_names = ALLOC_N(
|
169
|
+
config->action_names = ALLOC_N(VALUE, row_count);
|
170
170
|
config->action_arg_amounts = ALLOC_N(long, row_count);
|
171
171
|
|
172
172
|
FOR(rindex, row_count)
|
data/ext/c/driver_config.h
CHANGED
data/ext/java/org/libll/Driver.java
CHANGED
@@ -27,11 +27,15 @@ import org.jruby.runtime.builtin.IRubyObject;
|
|
27
27
|
@JRubyClass(name="LL::Driver", parent="Object")
|
28
28
|
public class Driver extends RubyObject
|
29
29
|
{
|
30
|
-
private static long T_EOF
|
31
|
-
private static long T_RULE
|
32
|
-
private static long T_TERMINAL
|
33
|
-
private static long T_EPSILON
|
34
|
-
private static long T_ACTION
|
30
|
+
private static long T_EOF = -1;
|
31
|
+
private static long T_RULE = 0;
|
32
|
+
private static long T_TERMINAL = 1;
|
33
|
+
private static long T_EPSILON = 2;
|
34
|
+
private static long T_ACTION = 3;
|
35
|
+
private static long T_STAR = 4;
|
36
|
+
private static long T_PLUS = 5;
|
37
|
+
private static long T_ADD_VALUE_STACK = 6;
|
38
|
+
private static long T_APPEND_VALUE_STACK = 7;
|
35
39
|
|
36
40
|
/**
|
37
41
|
* The current Ruby runtime.
|
@@ -132,8 +136,8 @@ public class Driver extends RubyObject
|
|
132
136
|
token_id = self.config.terminals.get(type);
|
133
137
|
}
|
134
138
|
|
135
|
-
//
|
136
|
-
if ( stack_type == self.T_RULE )
|
139
|
+
// A rule or the "+" operator
|
140
|
+
if ( stack_type == self.T_RULE || stack_type == self.T_PLUS )
|
137
141
|
{
|
138
142
|
Long production_i = self.config.table
|
139
143
|
.get(stack_value.intValue())
|
@@ -152,6 +156,17 @@ public class Driver extends RubyObject
|
|
152
156
|
}
|
153
157
|
else
|
154
158
|
{
|
159
|
+
// Append a "*" operator for all following
|
160
|
+
// occurrences as they are optional
|
161
|
+
if ( stack_type == self.T_PLUS )
|
162
|
+
{
|
163
|
+
stack.push(self.T_STAR);
|
164
|
+
stack.push(stack_value);
|
165
|
+
|
166
|
+
stack.push(self.T_APPEND_VALUE_STACK);
|
167
|
+
stack.push(Long.valueOf(0));
|
168
|
+
}
|
169
|
+
|
155
170
|
ArrayList<Long> row = self.config.rules
|
156
171
|
.get(production_i.intValue());
|
157
172
|
|
@@ -161,6 +176,47 @@ public class Driver extends RubyObject
|
|
161
176
|
}
|
162
177
|
}
|
163
178
|
}
|
179
|
+
// "*" operator
|
180
|
+
else if ( stack_type == self.T_STAR )
|
181
|
+
{
|
182
|
+
Long production_i = self.config.table
|
183
|
+
.get(stack_value.intValue())
|
184
|
+
.get(token_id.intValue());
|
185
|
+
|
186
|
+
if ( production_i != self.T_EOF )
|
187
|
+
{
|
188
|
+
stack.push(self.T_STAR);
|
189
|
+
stack.push(stack_value);
|
190
|
+
|
191
|
+
stack.push(self.T_APPEND_VALUE_STACK);
|
192
|
+
stack.push(Long.valueOf(0));
|
193
|
+
|
194
|
+
ArrayList<Long> row = self.config.rules
|
195
|
+
.get(production_i.intValue());
|
196
|
+
|
197
|
+
for ( int index = 0; index < row.size(); index++ )
|
198
|
+
{
|
199
|
+
stack.push(row.get(index));
|
200
|
+
}
|
201
|
+
}
|
202
|
+
}
|
203
|
+
// Adds a new array to the value stack that can be used to
|
204
|
+
// group operator values together
|
205
|
+
else if ( stack_type == self.T_ADD_VALUE_STACK )
|
206
|
+
{
|
207
|
+
RubyArray operator_buffer = self.runtime.newArray();
|
208
|
+
|
209
|
+
value_stack.push(operator_buffer);
|
210
|
+
}
|
211
|
+
// Appends the last value on the value stack to the operator
|
212
|
+
// buffer that precedes it.
|
213
|
+
else if ( stack_type == self.T_APPEND_VALUE_STACK )
|
214
|
+
{
|
215
|
+
IRubyObject last_value = value_stack.pop();
|
216
|
+
RubyArray operator_buffer = (RubyArray) value_stack.peek();
|
217
|
+
|
218
|
+
operator_buffer.append(last_value);
|
219
|
+
}
|
164
220
|
// Terminal
|
165
221
|
else if ( stack_type == self.T_TERMINAL )
|
166
222
|
{
|
data/lib/libll.jar
CHANGED
Binary file
|
data/lib/ll.rb
CHANGED
@@ -18,6 +18,7 @@ require_relative 'll/rule'
|
|
18
18
|
require_relative 'll/branch'
|
19
19
|
require_relative 'll/terminal'
|
20
20
|
require_relative 'll/epsilon'
|
21
|
+
require_relative 'll/operator'
|
21
22
|
require_relative 'll/message'
|
22
23
|
require_relative 'll/ast/node'
|
23
24
|
require_relative 'll/erb_context'
|
data/lib/ll/branch.rb
CHANGED
data/lib/ll/configuration_compiler.rb
CHANGED
@@ -8,11 +8,15 @@ module LL
|
|
8
8
|
# @return [Hash]
|
9
9
|
#
|
10
10
|
TYPES = {
|
11
|
-
:eof
|
12
|
-
:rule
|
13
|
-
:terminal
|
14
|
-
:epsilon
|
15
|
-
:action
|
11
|
+
:eof => -1,
|
12
|
+
:rule => 0,
|
13
|
+
:terminal => 1,
|
14
|
+
:epsilon => 2,
|
15
|
+
:action => 3,
|
16
|
+
:star => 4,
|
17
|
+
:plus => 5,
|
18
|
+
:add_value_stack => 6,
|
19
|
+
:append_value_stack => 7
|
16
20
|
}.freeze
|
17
21
|
|
18
22
|
##
|
@@ -105,7 +109,21 @@ module LL
|
|
105
109
|
|
106
110
|
grammar.rules.each do |rule|
|
107
111
|
rule.branches.each do |branch|
|
108
|
-
|
112
|
+
if branch.ruby_code
|
113
|
+
code = branch.ruby_code
|
114
|
+
|
115
|
+
# If a branch only contains a single, non-epsilon step we can just
|
116
|
+
# return that value as-is. This makes parsing code a little bit
|
117
|
+
# easier.
|
118
|
+
elsif !branch.ruby_code and branch.steps.length == 1 \
|
119
|
+
and !branch.steps[0].is_a?(Epsilon)
|
120
|
+
code = 'val[0]'
|
121
|
+
|
122
|
+
else
|
123
|
+
code = DEFAULT_RUBY_CODE
|
124
|
+
end
|
125
|
+
|
126
|
+
bodies[:"_rule_#{index}"] = code
|
109
127
|
|
110
128
|
index += 1
|
111
129
|
end
|
@@ -133,17 +151,24 @@ module LL
|
|
133
151
|
action_index += 1
|
134
152
|
|
135
153
|
branch.steps.reverse_each do |step|
|
136
|
-
if step.is_a?(
|
154
|
+
if step.is_a?(Terminal)
|
137
155
|
row << TYPES[:terminal]
|
138
156
|
row << term_indices[step] + 1
|
139
157
|
|
140
|
-
elsif step.is_a?(
|
158
|
+
elsif step.is_a?(Rule)
|
141
159
|
row << TYPES[:rule]
|
142
160
|
row << rule_indices[step]
|
143
161
|
|
144
|
-
elsif step.is_a?(
|
162
|
+
elsif step.is_a?(Epsilon)
|
145
163
|
row << TYPES[:epsilon]
|
146
164
|
row << 0
|
165
|
+
|
166
|
+
elsif step.is_a?(Operator)
|
167
|
+
row << TYPES[step.type]
|
168
|
+
row << rule_indices[step.receiver]
|
169
|
+
|
170
|
+
row << TYPES[:add_value_stack]
|
171
|
+
row << 0
|
147
172
|
end
|
148
173
|
end
|
149
174
|
|
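The row-building loop in the hunk above can be tried out in isolation with a few stand-ins. This is a sketch only: `build_row` is a hypothetical wrapper, the `Struct` classes replace ruby-ll's real `Terminal`, `Rule`, `Epsilon` and `Operator` classes, and `TYPES` lists just the entries used here. The branch bodies follow the diff.

```ruby
TYPES = {
  rule: 0, terminal: 1, epsilon: 2, star: 4, plus: 5, add_value_stack: 6
}.freeze

# Stand-ins for the real ruby-ll classes (assumptions for this sketch).
Terminal = Struct.new(:name)
Rule     = Struct.new(:name)
Epsilon  = Class.new
Operator = Struct.new(:type, :receiver)

def build_row(steps, term_indices, rule_indices)
  row = []

  # Steps are emitted in reverse so that the first step ends up on top of the
  # driver's stack once the row is pushed.
  steps.reverse_each do |step|
    case step
    when Terminal
      row << TYPES[:terminal]
      row << term_indices[step] + 1
    when Rule
      row << TYPES[:rule]
      row << rule_indices[step]
    when Epsilon
      row << TYPES[:epsilon]
      row << 0
    when Operator
      row << TYPES[step.type]
      row << rule_indices[step.receiver]
      row << TYPES[:add_value_stack]
      row << 0
    end
  end

  row
end

a = Terminal.new('a')
x = Rule.new('X')

# Models a branch such as `a X+`.
p build_row([a, Operator.new(:plus, x)], { a => 0 }, { x => 1 })
```

Since the add-value-stack pair is appended after the operator pair, the driver pops it first and creates the buffer before the operator itself is processed.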