yarp 0.6.0 → 0.8.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +55 -0
- data/CONTRIBUTING.md +4 -0
- data/{Makefile.in → Makefile} +5 -4
- data/README.md +6 -3
- data/config.yml +83 -274
- data/docs/build_system.md +4 -15
- data/docs/building.md +1 -5
- data/docs/encoding.md +1 -0
- data/docs/{extension.md → ruby_api.md} +6 -3
- data/docs/serialization.md +71 -24
- data/ext/yarp/api_node.c +173 -585
- data/ext/yarp/extconf.rb +15 -10
- data/ext/yarp/extension.c +4 -2
- data/ext/yarp/extension.h +1 -1
- data/include/yarp/ast.h +167 -306
- data/include/yarp/defines.h +5 -15
- data/include/yarp/enc/yp_encoding.h +1 -0
- data/include/yarp/unescape.h +1 -1
- data/include/yarp/util/yp_buffer.h +9 -0
- data/include/yarp/util/yp_constant_pool.h +3 -0
- data/include/yarp/util/yp_list.h +7 -7
- data/include/yarp/util/yp_newline_list.h +4 -0
- data/include/yarp/util/yp_state_stack.h +1 -1
- data/include/yarp/util/yp_string.h +5 -1
- data/include/yarp/version.h +2 -3
- data/include/yarp.h +4 -2
- data/lib/yarp/ffi.rb +226 -0
- data/lib/yarp/lex_compat.rb +16 -2
- data/lib/yarp/node.rb +594 -1437
- data/lib/yarp/ripper_compat.rb +3 -3
- data/lib/yarp/serialize.rb +312 -149
- data/lib/yarp.rb +167 -2
- data/src/enc/yp_unicode.c +9 -0
- data/src/node.c +92 -250
- data/src/prettyprint.c +81 -206
- data/src/serialize.c +124 -149
- data/src/unescape.c +29 -35
- data/src/util/yp_buffer.c +18 -0
- data/src/util/yp_list.c +7 -16
- data/src/util/yp_state_stack.c +0 -6
- data/src/util/yp_string.c +8 -17
- data/src/yarp.c +444 -717
- data/yarp.gemspec +5 -5
- metadata +6 -6
- data/config.h.in +0 -25
- data/configure +0 -4487
data/docs/build_system.md
CHANGED
@@ -16,8 +16,6 @@ The main solution for the second point seems a Makefile, otherwise many of the u
|
|
16
16
|
## General Design
|
17
17
|
|
18
18
|
1. Templates are generated by `templates/template.rb`
|
19
|
-
2. `autoconf` creates `./configure` and `autoheader` creates `config.h.in` (both files are platform-independent)
|
20
|
-
3. `./configure` creates `include/yarp/config.h` (which contains `HAVE_*` macros, platform-specific) and the `Makefile`
|
21
19
|
4. The `Makefile` compiles both `librubyparser.a` and `librubyparser.{so,dylib,dll}` from the `src/**/*.c` and `include/**/*.h` files
|
22
20
|
5. The `Rakefile` `:compile` task ensures the above prerequisites are done, then calls `make`,
|
23
21
|
and uses `Rake::ExtensionTask` to compile the C extension (using its `extconf.rb`), which uses `librubyparser.a`
|
@@ -36,14 +34,13 @@ loaded per process (i.e., at most one version of the yarp *gem* loaded in a proc
|
|
36
34
|
|
37
35
|
### Building the yarp gem by `gem install/bundle install`
|
38
36
|
|
39
|
-
The gem contains the pre-generated templates
|
37
|
+
The gem contains the pre-generated templates.
|
40
38
|
When installing the gem, `extconf.rb` is used and that:
|
41
|
-
* runs `./configure` which creates the `Makefile` and `include/yarp/config.h`
|
42
39
|
* runs `make build/librubyparser.a`
|
43
40
|
* compiles the C extension with mkmf
|
44
41
|
|
45
42
|
When installing the gem on JRuby and TruffleRuby, no C extension is built, so instead of the last step,
|
46
|
-
there is Ruby code using
|
43
|
+
there is Ruby code using FFI which uses `librubyparser.{so,dylib,dll}`
|
47
44
|
to implement the same methods as the C extension, but using serialization instead of many native calls/accesses
|
48
45
|
(JRuby does not support C extensions, serialization is faster on TruffleRuby than the C extension).
|
49
46
|
|
@@ -51,7 +48,6 @@ to implement the same methods as the C extension, but using serialization instea
|
|
51
48
|
|
52
49
|
The same as above, except the `extconf.rb` additionally runs first:
|
53
50
|
* `templates/template.rb` to generate the templates
|
54
|
-
* `autoconf` and `autoheader` to generate `configure` and `config.h.in`
|
55
51
|
|
56
52
|
Because of course those files are not part of the git repository.
|
57
53
|
|
@@ -61,21 +57,14 @@ Because of course those files are not part of the git repository.
|
|
61
57
|
|
62
58
|
The script generates the templates when importing.
|
63
59
|
|
64
|
-
`
|
65
|
-
It is assumed that CRuby's `./configure` is a superset of YARP's configure checks.
|
66
|
-
|
67
|
-
YARP's `autotools` is not used at all in CRuby and in fact YARP's `Makefile` is not used either.
|
68
|
-
Instead, CRuby's `autotools` setup is used, and `CRuby`'s Makefiles are used.
|
60
|
+
YARP's `Makefile` is not used at all in CRuby. Instead, CRuby's `Makefile` is used.
|
69
61
|
|
70
62
|
### Building YARP as part of TruffleRuby
|
71
63
|
|
72
64
|
[This script](https://github.com/oracle/truffleruby/blob/master/tool/import-yarp.sh) imports YARP sources in TruffleRuby.
|
73
65
|
The script generates the templates when importing.
|
74
|
-
It also generates `configure` and `config.h.in` (to avoid needing `autotools` on every machine building TruffleRuby).
|
75
66
|
|
76
|
-
Then when `mx build` builds TruffleRuby and the `yarp` mx project inside, it
|
77
|
-
* runs `./configure`
|
78
|
-
* runs `make`
|
67
|
+
Then when `mx build` builds TruffleRuby and the `yarp` mx project inside, it runs `make`.
|
79
68
|
|
80
69
|
Then the `yarp bindings` mx project is built, which contains the [bindings](https://github.com/oracle/truffleruby/blob/master/src/main/c/yarp_bindings/src/yarp_bindings.c)
|
81
70
|
and links to `librubyparser.a` (to avoid exporting symbols, so no conflict when installing the yarp gem).
|
data/docs/building.md
CHANGED
@@ -1,6 +1,7 @@
|
|
1
1
|
# Building
|
2
2
|
|
3
3
|
The following describes how to build YARP from source.
|
4
|
+
This comes directly from the [Makefile](../Makefile).
|
4
5
|
|
5
6
|
## Common
|
6
7
|
|
@@ -13,11 +14,6 @@ The following flags should be used to compile YARP:
|
|
13
14
|
* `-Werror` - Treat warnings as errors
|
14
15
|
* `-fvisibility=hidden` - Hide all symbols by default
|
15
16
|
|
16
|
-
The following flags can be used to compile YARP:
|
17
|
-
|
18
|
-
* `-DHAVE_MMAP` - Should be passed if the system has the `mmap` function
|
19
|
-
* `-DHAVE_SNPRINTF` - Should be passed if the system has the `snprintf` function
|
20
|
-
|
21
17
|
## Shared
|
22
18
|
|
23
19
|
If you want to build YARP as a shared library and link against it, you should compile with:
|
data/docs/encoding.md
CHANGED
data/docs/{extension.md → ruby_api.md}
CHANGED
@@ -1,6 +1,6 @@
|
|
1
|
-
#
|
1
|
+
# Ruby API
|
2
2
|
|
3
|
-
|
3
|
+
The `yarp` gem provides a Ruby API for accessing the syntax tree.
|
4
4
|
|
5
5
|
For the most part, the API for accessing the tree mirrors that found in the [Syntax Tree](https://github.com/ruby-syntax-tree/syntax_tree) project. This means:
|
6
6
|
|
@@ -8,7 +8,9 @@ For the most part, the API for accessing the tree mirrors that found in the [Syn
|
|
8
8
|
* Nodes in the tree respond to named methods for accessing their children as well as `#child_nodes`
|
9
9
|
* Nodes respond to the pattern matching interfaces `#deconstruct` and `#deconstruct_keys`
|
10
10
|
|
11
|
-
Every entry in `config.yml` will generate a Ruby class as well as the code that builds the nodes themselves.
|
11
|
+
Every entry in `config.yml` will generate a Ruby class as well as the code that builds the nodes themselves.
|
12
|
+
Creating a syntax tree involves calling one of the class methods on the `YARP` module.
|
13
|
+
The full API is documented below.
|
12
14
|
|
13
15
|
## API
|
14
16
|
|
@@ -18,3 +20,4 @@ Every entry in `config.yml` will generate a Ruby class as well as the code that
|
|
18
20
|
* `YARP.lex_file(filepath)` - parse the tokens corresponding to the given source file and return them as an array within a parse result
|
19
21
|
* `YARP.parse(source)` - parse the syntax tree corresponding to the given source string and return it within a parse result
|
20
22
|
* `YARP.parse_file(filepath)` - parse the syntax tree corresponding to the given source file and return it within a parse result
|
23
|
+
* `YARP.load(source, serialized)` - load the serialized syntax tree using the source as a reference into a syntax tree
|
data/docs/serialization.md
CHANGED
@@ -1,10 +1,58 @@
|
|
1
1
|
# Serialization
|
2
2
|
|
3
|
-
YARP ships with the ability to serialize a syntax tree to a single string.
|
3
|
+
YARP ships with the ability to serialize a syntax tree to a single string.
|
4
|
+
The string can then be deserialized back into a syntax tree using a language other than C.
|
5
|
+
This is useful for using the parsing logic in other tools without having to write a parser in that language.
|
6
|
+
The syntax tree still requires a copy of the original source, as for the most part it just contains byte offsets into the source string.
|
7
|
+
|
8
|
+
## Types
|
9
|
+
|
10
|
+
Let us define some simple types for readability.
|
11
|
+
|
12
|
+
### varint
|
13
|
+
|
14
|
+
A variable-length integer with the value fitting in `uint32_t` using between 1 and 5 bytes, using the [LEB128](https://en.wikipedia.org/wiki/LEB128) encoding.
|
15
|
+
This drastically cuts down on the size of the serialized string, especially when the source file is large.
|
16
|
+
|
17
|
+
### string
|
18
|
+
|
19
|
+
| # bytes | field |
|
20
|
+
| --- | --- |
|
21
|
+
| varint | the length of the string in bytes |
|
22
|
+
| ... | the string bytes |
|
23
|
+
|
24
|
+
### location
|
25
|
+
|
26
|
+
| # bytes | field |
|
27
|
+
| --- | --- |
|
28
|
+
| varint | byte offset into the source string where this location begins |
|
29
|
+
| varint | length of the location in bytes in the source string |
|
30
|
+
|
31
|
+
### comment
|
32
|
+
|
33
|
+
The comment type is one of:
|
34
|
+
* 0=`INLINE` (`# comment`)
|
35
|
+
* 1=`EMBEDDED_DOCUMENT` (`=begin`/`=end`)
|
36
|
+
* 2=`__END__` (after `__END__`)
|
37
|
+
|
38
|
+
| # bytes | field |
|
39
|
+
| --- | --- |
|
40
|
+
| `1` | comment type |
|
41
|
+
| location | the location in the source of this comment |
|
42
|
+
|
43
|
+
### diagnostic
|
44
|
+
|
45
|
+
| # bytes | field |
|
46
|
+
| --- | --- |
|
47
|
+
| string | diagnostic message (ASCII-only characters) |
|
48
|
+
| location | the location in the source this diagnostic applies to |
|
4
49
|
|
5
50
|
## Structure
|
6
51
|
|
7
|
-
The serialized string representing the syntax tree is composed of three parts: the header, the body, and the constant pool.
|
52
|
+
The serialized string representing the syntax tree is composed of three parts: the header, the body, and the constant pool.
|
53
|
+
The header contains information like the version of YARP that serialized the tree.
|
54
|
+
The body contains the actual nodes in the tree.
|
55
|
+
The constant pool contains constants that were interned while parsing.
|
8
56
|
|
9
57
|
The header is structured like the following table:
|
10
58
|
|
@@ -14,32 +62,28 @@ The header is structured like the following table:
|
|
14
62
|
| `1` | major version number |
|
15
63
|
| `1` | minor version number |
|
16
64
|
| `1` | patch version number |
|
17
|
-
| varint | the length of the encoding name |
|
18
65
|
| string | the encoding name |
|
66
|
+
| varint | number of comments |
|
67
|
+
| comment* | comments |
|
19
68
|
| varint | number of errors |
|
20
|
-
|
|
21
|
-
| string | error string, as byte[] in source encoding |
|
22
|
-
| varint | location in the source code - start |
|
23
|
-
| varint | location in the source code - length |
|
24
|
-
| ... | more errors |
|
69
|
+
| diagnostic* | errors |
|
25
70
|
| varint | number of warnings |
|
26
|
-
|
|
27
|
-
| string | warning string, as byte[] in source encoding |
|
28
|
-
| varint | location in the source code - start |
|
29
|
-
| varint | location in the source code - length |
|
30
|
-
| ... | more warnings |
|
71
|
+
| diagnostic* | warnings |
|
31
72
|
| `4` | content pool offset |
|
32
73
|
| varint | content pool size |
|
33
74
|
|
34
|
-
After the header comes the body of the serialized string.
|
75
|
+
After the header comes the body of the serialized string.
|
76
|
+
The body consistents of a sequence of nodes that is built using a prefix traversal order of the syntax tree.
|
77
|
+
Each node is structured like the following table:
|
35
78
|
|
36
79
|
| # bytes | field |
|
37
80
|
| --- | --- |
|
38
81
|
| `1` | node type |
|
39
|
-
|
|
40
|
-
| varint | length of the node in bytes in the source string |
|
82
|
+
| location | node location |
|
41
83
|
|
42
|
-
Each node's child is then appended to the serialized string.
|
84
|
+
Each node's child is then appended to the serialized string.
|
85
|
+
The child node types can be determined by referencing `config.yml`.
|
86
|
+
Depending on the type of child node, it could take a couple of different forms, described below:
|
43
87
|
|
44
88
|
* `node` - A child node that is a node itself. This is structured just as like parent node.
|
45
89
|
* `node?` - A child node that is optionally present. If the node is not present, then a single `0` byte will be written in its place. If it is present, then it will be structured just as like parent node.
|
@@ -52,7 +96,10 @@ Each node's child is then appended to the serialized string. The child node type
|
|
52
96
|
* `location[]` - A child node that is an array of locations. This is structured as a `4` byte length, followed by the locations themselves.
|
53
97
|
* `uint32` - A child node that is a 32-bit unsigned integer. This is structured as a variable-length integer.
|
54
98
|
|
55
|
-
After the syntax tree, the content pool is serialized.
|
99
|
+
After the syntax tree, the content pool is serialized.
|
100
|
+
This is a list of constants that were referenced from within the tree.
|
101
|
+
The content pool begins at the offset specified in the header.
|
102
|
+
Each constant is structured as:
|
56
103
|
|
57
104
|
| # bytes | field |
|
58
105
|
| --- | --- |
|
@@ -61,10 +108,6 @@ After the syntax tree, the content pool is serialized. This is a list of constan
|
|
61
108
|
|
62
109
|
At the end of the serialization, the buffer is null terminated.
|
63
110
|
|
64
|
-
## Variable-length integers
|
65
|
-
|
66
|
-
Variable-length integers are used throughout the serialized format, using the [LEB128](https://en.wikipedia.org/wiki/LEB128) encoding. This drastically cuts down on the size of the serialized string, especially when the source file is large.
|
67
|
-
|
68
111
|
## APIs
|
69
112
|
|
70
113
|
The relevant APIs and struct definitions are listed below:
|
@@ -105,7 +148,10 @@ serialize(const char *source, size_t length) {
|
|
105
148
|
}
|
106
149
|
```
|
107
150
|
|
108
|
-
The final argument to `yp_parse_serialize` controls the metadata of the source.
|
151
|
+
The final argument to `yp_parse_serialize` controls the metadata of the source.
|
152
|
+
This includes the filepath that the source is associated with, and any nested local variables scopes that are necessary to properly parse the file (in the case of parsing an `eval`).
|
153
|
+
Note that no `varint` are used here to make it easier to produce the metadata for the caller, and also serialized size is less important here.
|
154
|
+
The metadata is a serialized format itself, and is structured as follows:
|
109
155
|
|
110
156
|
| # bytes | field |
|
111
157
|
| --- | --- |
|
@@ -127,4 +173,5 @@ Each local variable within each scope is encoded as:
|
|
127
173
|
| `4` | the size of the local variable name |
|
128
174
|
| | the local variable name |
|
129
175
|
|
130
|
-
The metadata can be `NULL` (as seen in the example above).
|
176
|
+
The metadata can be `NULL` (as seen in the example above).
|
177
|
+
If it is not null, then a minimal metadata string would be `"\0\0\0\0\0\0\0\0"` which would use 4 bytes to indicate an empty filepath string and 4 bytes to indicate that there were no local variable scopes.
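The metadata layout described above can be sketched in Ruby. This is only an illustration, not code from the gem: `build_metadata` is a hypothetical helper, the per-scope count field and the use of native byte order (`pack("L")`) are assumptions based on the fixed 4-byte fields mentioned in the text, and the exact layout should be checked against the full table in `docs/serialization.md`.

```ruby
# Hypothetical helper: builds a metadata string with fixed-width 4-byte fields.
# Assumed layout: filepath size, filepath bytes, number of scopes, then for
# each scope a count of locals followed by (size, name) pairs.
# Native byte order is an assumption.
def build_metadata(filepath, scopes = [])
  out = [filepath.bytesize].pack("L") + filepath.b
  out << [scopes.length].pack("L")

  scopes.each do |locals|
    out << [locals.length].pack("L")
    locals.each do |name|
      name = name.to_s
      out << [name.bytesize].pack("L") << name.b
    end
  end

  out
end

# An empty filepath and no scopes yields the 8 zero bytes mentioned above.
minimal = build_metadata("")
```

With one scope containing a single local `x`, `build_metadata("a.rb", [["x"]])` would produce 21 bytes under these assumptions: 4 + 4 for the filepath, 4 for the scope count, 4 for the local count, and 4 + 1 for the local name.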
|
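As a reading aid for the `varint` and `location` types introduced in `docs/serialization.md`, here is a minimal Ruby sketch of unsigned LEB128 encoding and decoding. The method names (`write_varint`, `read_varint`, `read_location`) are hypothetical and are not part of the `yarp` gem; the 5-byte limit follows from the `uint32_t` range stated in the text.

```ruby
# Encode an unsigned 32-bit integer as LEB128: 7 payload bits per byte,
# with the high bit set on every byte except the last.
def write_varint(value)
  raise ArgumentError, "out of uint32 range" unless (0..0xffff_ffff).cover?(value)

  bytes = []
  loop do
    byte = value & 0x7f
    value >>= 7
    if value.zero?
      bytes << byte
      break
    end
    bytes << (byte | 0x80)
  end
  bytes.pack("C*")
end

# Decode one varint from `bytes` starting at `offset`.
# Returns [value, offset_after_varint].
def read_varint(bytes, offset = 0)
  value = 0
  shift = 0
  loop do
    byte = bytes.getbyte(offset)
    raise ArgumentError, "truncated varint" if byte.nil?

    offset += 1
    value |= (byte & 0x7f) << shift
    return [value, offset] if (byte & 0x80).zero?

    shift += 7
    raise ArgumentError, "varint longer than 5 bytes" if shift > 28
  end
end

# A location is two consecutive varints: start offset, then length.
def read_location(bytes, offset = 0)
  start, offset = read_varint(bytes, offset)
  length, offset = read_varint(bytes, offset)
  [{ start: start, length: length }, offset]
end
```

For example, the value 300 needs two bytes (`0xAC 0x02`), while any offset below 128 fits in one, which is why the format stays compact for typical source files.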