yarp 0.6.0 → 0.8.0

Files changed (47)
  1. checksums.yaml +4 -4
  2. data/CHANGELOG.md +55 -0
  3. data/CONTRIBUTING.md +4 -0
  4. data/{Makefile.in → Makefile} +5 -4
  5. data/README.md +6 -3
  6. data/config.yml +83 -274
  7. data/docs/build_system.md +4 -15
  8. data/docs/building.md +1 -5
  9. data/docs/encoding.md +1 -0
  10. data/docs/{extension.md → ruby_api.md} +6 -3
  11. data/docs/serialization.md +71 -24
  12. data/ext/yarp/api_node.c +173 -585
  13. data/ext/yarp/extconf.rb +15 -10
  14. data/ext/yarp/extension.c +4 -2
  15. data/ext/yarp/extension.h +1 -1
  16. data/include/yarp/ast.h +167 -306
  17. data/include/yarp/defines.h +5 -15
  18. data/include/yarp/enc/yp_encoding.h +1 -0
  19. data/include/yarp/unescape.h +1 -1
  20. data/include/yarp/util/yp_buffer.h +9 -0
  21. data/include/yarp/util/yp_constant_pool.h +3 -0
  22. data/include/yarp/util/yp_list.h +7 -7
  23. data/include/yarp/util/yp_newline_list.h +4 -0
  24. data/include/yarp/util/yp_state_stack.h +1 -1
  25. data/include/yarp/util/yp_string.h +5 -1
  26. data/include/yarp/version.h +2 -3
  27. data/include/yarp.h +4 -2
  28. data/lib/yarp/ffi.rb +226 -0
  29. data/lib/yarp/lex_compat.rb +16 -2
  30. data/lib/yarp/node.rb +594 -1437
  31. data/lib/yarp/ripper_compat.rb +3 -3
  32. data/lib/yarp/serialize.rb +312 -149
  33. data/lib/yarp.rb +167 -2
  34. data/src/enc/yp_unicode.c +9 -0
  35. data/src/node.c +92 -250
  36. data/src/prettyprint.c +81 -206
  37. data/src/serialize.c +124 -149
  38. data/src/unescape.c +29 -35
  39. data/src/util/yp_buffer.c +18 -0
  40. data/src/util/yp_list.c +7 -16
  41. data/src/util/yp_state_stack.c +0 -6
  42. data/src/util/yp_string.c +8 -17
  43. data/src/yarp.c +444 -717
  44. data/yarp.gemspec +5 -5
  45. metadata +6 -6
  46. data/config.h.in +0 -25
  47. data/configure +0 -4487
data/docs/build_system.md CHANGED
@@ -16,8 +16,6 @@ The main solution for the second point seems a Makefile, otherwise many of the u
16
16
  ## General Design
17
17
 
18
18
  1. Templates are generated by `templates/template.rb`
19
- 2. `autoconf` creates `./configure` and `autoheader` creates `config.h.in` (both files are platform-independent)
20
- 3. `./configure` creates `include/yarp/config.h` (which contains `HAVE_*` macros, platform-specific) and the `Makefile`
21
19
  4. The `Makefile` compiles both `librubyparser.a` and `librubyparser.{so,dylib,dll}` from the `src/**/*.c` and `include/**/*.h` files
22
20
  5. The `Rakefile` `:compile` task ensures the above prerequisites are done, then calls `make`,
23
21
  and uses `Rake::ExtensionTask` to compile the C extension (using its `extconf.rb`), which uses `librubyparser.a`
@@ -36,14 +34,13 @@ loaded per process (i.e., at most one version of the yarp *gem* loaded in a proc
36
34
 
37
35
  ### Building the yarp gem by `gem install/bundle install`
38
36
 
39
- The gem contains the pre-generated templates, as well as `configure` and `config.h.in`
37
+ The gem contains the pre-generated templates.
40
38
  When installing the gem, `extconf.rb` is used and that:
41
- * runs `./configure` which creates the `Makefile` and `include/yarp/config.h`
42
39
  * runs `make build/librubyparser.a`
43
40
  * compiles the C extension with mkmf
44
41
 
45
42
  When installing the gem on JRuby and TruffleRuby, no C extension is built, so instead of the last step,
46
- there is Ruby code using Fiddle which uses `librubyparser.{so,dylib,dll}`
43
+ there is Ruby code using FFI which uses `librubyparser.{so,dylib,dll}`
47
44
  to implement the same methods as the C extension, but using serialization instead of many native calls/accesses
48
45
  (JRuby does not support C extensions, serialization is faster on TruffleRuby than the C extension).
49
46
 
@@ -51,7 +48,6 @@ to implement the same methods as the C extension, but using serialization instea
51
48
 
52
49
  The same as above, except the `extconf.rb` additionally runs first:
53
50
  * `templates/template.rb` to generate the templates
54
- * `autoconf` and `autoheader` to generate `configure` and `config.h.in`
55
51
 
56
52
  Because of course those files are not part of the git repository.
57
53
 
@@ -61,21 +57,14 @@ Because of course those files are not part of the git repository.
61
57
 
62
58
  The script generates the templates when importing.
63
59
 
64
- `include/yarp/config.h` is replaced by `#include "ruby/config.h"`.
65
- It is assumed that CRuby's `./configure` is a superset of YARP's configure checks.
66
-
67
- YARP's `autotools` is not used at all in CRuby and in fact YARP's `Makefile` is not used either.
68
- Instead, CRuby's `autotools` setup is used, and `CRuby`'s Makefiles are used.
60
+ YARP's `Makefile` is not used at all in CRuby. Instead, CRuby's `Makefile` is used.
69
61
 
70
62
  ### Building YARP as part of TruffleRuby
71
63
 
72
64
  [This script](https://github.com/oracle/truffleruby/blob/master/tool/import-yarp.sh) imports YARP sources in TruffleRuby.
73
65
  The script generates the templates when importing.
74
- It also generates `configure` and `config.h.in` (to avoid needing `autotools` on every machine building TruffleRuby).
75
66
 
76
- Then when `mx build` builds TruffleRuby and the `yarp` mx project inside, it:
77
- * runs `./configure`
78
- * runs `make`
67
+ Then when `mx build` builds TruffleRuby and the `yarp` mx project inside, it runs `make`.
79
68
 
80
69
  Then the `yarp bindings` mx project is built, which contains the [bindings](https://github.com/oracle/truffleruby/blob/master/src/main/c/yarp_bindings/src/yarp_bindings.c)
81
70
  and links to `librubyparser.a` (to avoid exporting symbols, so no conflict when installing the yarp gem).
data/docs/building.md CHANGED
@@ -1,6 +1,7 @@
1
1
  # Building
2
2
 
3
3
  The following describes how to build YARP from source.
4
+ This comes directly from the [Makefile](../Makefile).
4
5
 
5
6
  ## Common
6
7
 
@@ -13,11 +14,6 @@ The following flags should be used to compile YARP:
13
14
  * `-Werror` - Treat warnings as errors
14
15
  * `-fvisibility=hidden` - Hide all symbols by default
15
16
 
16
- The following flags can be used to compile YARP:
17
-
18
- * `-DHAVE_MMAP` - Should be passed if the system has the `mmap` function
19
- * `-DHAVE_SNPRINTF` - Should be passed if the system has the `snprintf` function
20
-
21
17
  ## Shared
22
18
 
23
19
  If you want to build YARP as a shared library and link against it, you should compile with:
data/docs/encoding.md CHANGED
@@ -39,6 +39,7 @@ The key of the comment can be either "encoding" or "coding". The value of the co
39
39
  * `sjis`
40
40
  * `us-ascii`
41
41
  * `utf-8`
42
+ * `utf8-mac`
42
43
  * `windows-31j`
43
44
  * `windows-1251`
44
45
  * `windows-1252`
data/docs/{extension.md → ruby_api.md} CHANGED
@@ -1,6 +1,6 @@
1
- # Extension
1
+ # Ruby API
2
2
 
3
- Part of this parser project is a native extension that provides a Ruby API that wraps calls to the C API. This allows you to invoke parsing from Ruby code. It also provides a Ruby API for accessing the syntax tree.
3
+ The `yarp` gem provides a Ruby API for accessing the syntax tree.
4
4
 
5
5
  For the most part, the API for accessing the tree mirrors that found in the [Syntax Tree](https://github.com/ruby-syntax-tree/syntax_tree) project. This means:
6
6
 
@@ -8,7 +8,9 @@ For the most part, the API for accessing the tree mirrors that found in the [Syn
8
8
  * Nodes in the tree respond to named methods for accessing their children as well as `#child_nodes`
9
9
  * Nodes respond to the pattern matching interfaces `#deconstruct` and `#deconstruct_keys`
10
10
 
11
- Every entry in `config.yml` will generate a Ruby class as well as the code that builds the nodes themselves. Creating a syntax tree involves calling one of the class methods on the `YARP` module. The full API is documented below.
11
+ Every entry in `config.yml` will generate a Ruby class as well as the code that builds the nodes themselves.
12
+ Creating a syntax tree involves calling one of the class methods on the `YARP` module.
13
+ The full API is documented below.
12
14
 
13
15
  ## API
14
16
 
@@ -18,3 +20,4 @@ Every entry in `config.yml` will generate a Ruby class as well as the code that
18
20
  * `YARP.lex_file(filepath)` - parse the tokens corresponding to the given source file and return them as an array within a parse result
19
21
  * `YARP.parse(source)` - parse the syntax tree corresponding to the given source string and return it within a parse result
20
22
  * `YARP.parse_file(filepath)` - parse the syntax tree corresponding to the given source file and return it within a parse result
23
+ * `YARP.load(source, serialized)` - load the serialized syntax tree using the source as a reference into a syntax tree
data/docs/serialization.md CHANGED
@@ -1,10 +1,58 @@
1
1
  # Serialization
2
2
 
3
- YARP ships with the ability to serialize a syntax tree to a single string. The string can then be deserialized back into a syntax tree using a language other than C. This is useful for using the parsing logic in other tools without having to write a parser in that language. The syntax tree still requires a copy of the original source, as for the most part it just contains byte offsets into the source string.
3
+ YARP ships with the ability to serialize a syntax tree to a single string.
4
+ The string can then be deserialized back into a syntax tree using a language other than C.
5
+ This is useful for using the parsing logic in other tools without having to write a parser in that language.
6
+ The syntax tree still requires a copy of the original source, as for the most part it just contains byte offsets into the source string.
7
+
8
+ ## Types
9
+
10
+ Let us define some simple types for readability.
11
+
12
+ ### varint
13
+
14
+ A variable-length integer with the value fitting in `uint32_t` using between 1 and 5 bytes, using the [LEB128](https://en.wikipedia.org/wiki/LEB128) encoding.
15
+ This drastically cuts down on the size of the serialized string, especially when the source file is large.
16
+
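To make the `varint` definition above concrete, here is a minimal Ruby sketch of unsigned LEB128 encoding and decoding, limited to values that fit in `uint32_t` (1 to 5 bytes). The helper names `write_varint` and `read_varint` are hypothetical and are not part of the yarp gem; they only illustrate the encoding the documentation refers to.

```ruby
# Hypothetical helpers (not part of yarp) illustrating unsigned LEB128.

# Encode an Integer in 0..0xFFFF_FFFF as a binary String of 1 to 5 bytes.
def write_varint(value)
  raise ArgumentError, "value out of uint32_t range" unless (0..0xFFFF_FFFF).cover?(value)

  bytes = []
  loop do
    byte = value & 0x7F          # low 7 bits carry the payload
    value >>= 7
    if value.zero?
      bytes << byte              # last byte: continuation bit clear
      break
    end
    bytes << (byte | 0x80)       # more bytes follow: continuation bit set
  end
  bytes.pack("C*")
end

# Decode one varint from a binary String starting at `offset`.
# Returns [value, offset_after_varint].
def read_varint(data, offset = 0)
  value = 0
  shift = 0
  loop do
    byte = data.getbyte(offset)
    raise ArgumentError, "truncated varint" if byte.nil?

    offset += 1
    value |= (byte & 0x7F) << shift
    return [value, offset] if (byte & 0x80).zero?

    shift += 7
    raise ArgumentError, "varint longer than 5 bytes" if shift > 28
  end
end
```

Small values such as short lengths and nearby offsets take a single byte, which is where the size savings come from.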
17
+ ### string
18
+
19
+ | # bytes | field |
20
+ | --- | --- |
21
+ | varint | the length of the string in bytes |
22
+ | ... | the string bytes |
23
+
24
+ ### location
25
+
26
+ | # bytes | field |
27
+ | --- | --- |
28
+ | varint | byte offset into the source string where this location begins |
29
+ | varint | length of the location in bytes in the source string |
30
+
31
+ ### comment
32
+
33
+ The comment type is one of:
34
+ * 0=`INLINE` (`# comment`)
35
+ * 1=`EMBEDDED_DOCUMENT` (`=begin`/`=end`)
36
+ * 2=`__END__` (after `__END__`)
37
+
38
+ | # bytes | field |
39
+ | --- | --- |
40
+ | `1` | comment type |
41
+ | location | the location in the source of this comment |
42
+
43
+ ### diagnostic
44
+
45
+ | # bytes | field |
46
+ | --- | --- |
47
+ | string | diagnostic message (ASCII-only characters) |
48
+ | location | the location in the source this diagnostic applies to |
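The types defined above compose: a `diagnostic` is a `string` followed by a `location`, and a `comment` is a type byte followed by a `location`. The following Ruby sketch reads those types from a binary string. The class name `SketchReader`, its method names, and the returned hash shapes are assumptions made for illustration; this is not the deserializer shipped in `lib/yarp/serialize.rb`.

```ruby
# Hypothetical reader for the simple types described above.
class SketchReader
  # Index in this array matches the comment type byte (0, 1, 2).
  COMMENT_TYPES = %i[inline embedded_document __end__].freeze

  def initialize(serialized)
    @bytes = serialized.b
    @pos = 0
  end

  # Read a single raw byte and advance.
  def byte
    value = @bytes.getbyte(@pos)
    raise EOFError, "unexpected end of input" if value.nil?

    @pos += 1
    value
  end

  # varint: unsigned LEB128, at most 5 bytes.
  def varint
    value = 0
    shift = 0
    loop do
      b = byte
      value |= (b & 0x7F) << shift
      return value if (b & 0x80).zero?

      shift += 7
      raise ArgumentError, "varint longer than 5 bytes" if shift > 28
    end
  end

  # string: varint byte length, then that many bytes.
  def string
    length = varint
    bytes = @bytes.byteslice(@pos, length)
    raise EOFError, "truncated string" if bytes.nil? || bytes.bytesize < length

    @pos += length
    bytes
  end

  # location: varint start offset, then varint length.
  def location
    start_offset = varint
    length = varint
    { start_offset: start_offset, length: length }
  end

  # comment: 1 byte type, then a location.
  def comment
    type = COMMENT_TYPES.fetch(byte)
    { type: type, location: location }
  end

  # diagnostic: string message, then a location.
  def diagnostic
    message = string
    { message: message, location: location }
  end
end
```

For example, `SketchReader.new("\x04oops\xAC\x02\x02".b).diagnostic` reads a 4-byte message followed by a location starting at byte 300 with length 2.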
4
49
 
5
50
  ## Structure
6
51
 
7
- The serialized string representing the syntax tree is composed of three parts: the header, the body, and the constant pool. The header contains information like the version of YARP that serialized the tree. The body contains the actual nodes in the tree. The constant pool contains constants that were interned while parsing.
52
+ The serialized string representing the syntax tree is composed of three parts: the header, the body, and the constant pool.
53
+ The header contains information like the version of YARP that serialized the tree.
54
+ The body contains the actual nodes in the tree.
55
+ The constant pool contains constants that were interned while parsing.
8
56
 
9
57
  The header is structured like the following table:
10
58
 
@@ -14,32 +62,28 @@ The header is structured like the following table:
14
62
  | `1` | major version number |
15
63
  | `1` | minor version number |
16
64
  | `1` | patch version number |
17
- | varint | the length of the encoding name |
18
65
  | string | the encoding name |
66
+ | varint | number of comments |
67
+ | comment* | comments |
19
68
  | varint | number of errors |
20
- | varint | byte length of error |
21
- | string | error string, as byte[] in source encoding |
22
- | varint | location in the source code - start |
23
- | varint | location in the source code - length |
24
- | ... | more errors |
69
+ | diagnostic* | errors |
25
70
  | varint | number of warnings |
26
- | varint | byte length of warning |
27
- | string | warning string, as byte[] in source encoding |
28
- | varint | location in the source code - start |
29
- | varint | location in the source code - length |
30
- | ... | more warnings |
71
+ | diagnostic* | warnings |
31
72
  | `4` | content pool offset |
32
73
  | varint | content pool size |
33
74
 
34
- After the header comes the body of the serialized string. The body consistents of a sequence of nodes that is built using a prefix traversal order of the syntax tree. Each node is structured like the following table:
75
+ After the header comes the body of the serialized string.
76
+ The body consistents of a sequence of nodes that is built using a prefix traversal order of the syntax tree.
77
+ Each node is structured like the following table:
35
78
 
36
79
  | # bytes | field |
37
80
  | --- | --- |
38
81
  | `1` | node type |
39
- | varint | byte offset into the source string where this node begins |
40
- | varint | length of the node in bytes in the source string |
82
+ | location | node location |
41
83
 
42
- Each node's child is then appended to the serialized string. The child node types can be determined by referencing `config.yml`. Depending on the type of child node, it could take a couple of different forms, described below:
84
+ Each node's child is then appended to the serialized string.
85
+ The child node types can be determined by referencing `config.yml`.
86
+ Depending on the type of child node, it could take a couple of different forms, described below:
43
87
 
44
88
  * `node` - A child node that is a node itself. This is structured just as like parent node.
45
89
  * `node?` - A child node that is optionally present. If the node is not present, then a single `0` byte will be written in its place. If it is present, then it will be structured just as like parent node.
@@ -52,7 +96,10 @@ Each node's child is then appended to the serialized string. The child node type
52
96
  * `location[]` - A child node that is an array of locations. This is structured as a `4` byte length, followed by the locations themselves.
53
97
  * `uint32` - A child node that is a 32-bit unsigned integer. This is structured as a variable-length integer.
54
98
 
55
- After the syntax tree, the content pool is serialized. This is a list of constants that were referenced from within the tree. The content pool begins at the offset specified in the header. Each constant is structured as:
99
+ After the syntax tree, the content pool is serialized.
100
+ This is a list of constants that were referenced from within the tree.
101
+ The content pool begins at the offset specified in the header.
102
+ Each constant is structured as:
56
103
 
57
104
  | # bytes | field |
58
105
  | --- | --- |
@@ -61,10 +108,6 @@ After the syntax tree, the content pool is serialized. This is a list of constan
61
108
 
62
109
  At the end of the serialization, the buffer is null terminated.
63
110
 
64
- ## Variable-length integers
65
-
66
- Variable-length integers are used throughout the serialized format, using the [LEB128](https://en.wikipedia.org/wiki/LEB128) encoding. This drastically cuts down on the size of the serialized string, especially when the source file is large.
67
-
68
111
  ## APIs
69
112
 
70
113
  The relevant APIs and struct definitions are listed below:
@@ -105,7 +148,10 @@ serialize(const char *source, size_t length) {
105
148
  }
106
149
  ```
107
150
 
108
- The final argument to `yp_parse_serialize` controls the metadata of the source. This includes the filepath that the source is associated with, and any nested local variables scopes that are necessary to properly parse the file (in the case of parsing an `eval`). The metadata is a serialized format itself, and is structured as follows:
151
+ The final argument to `yp_parse_serialize` controls the metadata of the source.
152
+ This includes the filepath that the source is associated with, and any nested local variables scopes that are necessary to properly parse the file (in the case of parsing an `eval`).
153
+ Note that no `varint` are used here to make it easier to produce the metadata for the caller, and also serialized size is less important here.
154
+ The metadata is a serialized format itself, and is structured as follows:
109
155
 
110
156
  | # bytes | field |
111
157
  | --- | --- |
@@ -127,4 +173,5 @@ Each local variable within each scope is encoded as:
127
173
  | `4` | the size of the local variable name |
128
174
  | | the local variable name |
129
175
 
130
- The metadata can be `NULL` (as seen in the example above). If it is not null, then a minimal metadata string would be `"\0\0\0\0\0\0\0\0"` which would use 4 bytes to indicate an empty filepath string and 4 bytes to indicate that there were no local variable scopes.
176
+ The metadata can be `NULL` (as seen in the example above).
177
+ If it is not null, then a minimal metadata string would be `"\0\0\0\0\0\0\0\0"` which would use 4 bytes to indicate an empty filepath string and 4 bytes to indicate that there were no local variable scopes.
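As a rough illustration of the metadata layout, the Ruby sketch below builds a metadata string for a given filepath with no local variable scopes, plus the encoding of a single local variable name. The helper names are hypothetical. The sketch also assumes the 4-byte fields are unsigned integers in native byte order (`pack("L")`), which the text above does not specify. The per-scope header is not shown in this hunk, so scopes are deliberately left out.

```ruby
# Hypothetical helpers; byte order of the 4-byte fields is an assumption.

# Metadata with a filepath and zero local variable scopes:
#   4 bytes  size of the filepath
#   n bytes  the filepath
#   4 bytes  number of local variable scopes (0 here)
def build_metadata(filepath = "")
  [filepath.bytesize].pack("L") + filepath.b + [0].pack("L")
end

# One local variable entry within a scope:
#   4 bytes  size of the local variable name
#   n bytes  the local variable name
def encode_local_name(name)
  [name.bytesize].pack("L") + name.b
end
```

With an empty filepath, `build_metadata` yields the eight zero bytes described above as the minimal metadata string.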