prism 0.13.0
- checksums.yaml +7 -0
- data/CHANGELOG.md +172 -0
- data/CODE_OF_CONDUCT.md +76 -0
- data/CONTRIBUTING.md +62 -0
- data/LICENSE.md +7 -0
- data/Makefile +84 -0
- data/README.md +89 -0
- data/config.yml +2481 -0
- data/docs/build_system.md +74 -0
- data/docs/building.md +22 -0
- data/docs/configuration.md +60 -0
- data/docs/design.md +53 -0
- data/docs/encoding.md +117 -0
- data/docs/fuzzing.md +93 -0
- data/docs/heredocs.md +36 -0
- data/docs/mapping.md +117 -0
- data/docs/ripper.md +36 -0
- data/docs/ruby_api.md +25 -0
- data/docs/serialization.md +181 -0
- data/docs/testing.md +55 -0
- data/ext/prism/api_node.c +4725 -0
- data/ext/prism/api_pack.c +256 -0
- data/ext/prism/extconf.rb +136 -0
- data/ext/prism/extension.c +626 -0
- data/ext/prism/extension.h +18 -0
- data/include/prism/ast.h +1932 -0
- data/include/prism/defines.h +45 -0
- data/include/prism/diagnostic.h +231 -0
- data/include/prism/enc/pm_encoding.h +95 -0
- data/include/prism/node.h +41 -0
- data/include/prism/pack.h +141 -0
- data/include/prism/parser.h +418 -0
- data/include/prism/regexp.h +19 -0
- data/include/prism/unescape.h +48 -0
- data/include/prism/util/pm_buffer.h +51 -0
- data/include/prism/util/pm_char.h +91 -0
- data/include/prism/util/pm_constant_pool.h +78 -0
- data/include/prism/util/pm_list.h +67 -0
- data/include/prism/util/pm_memchr.h +14 -0
- data/include/prism/util/pm_newline_list.h +61 -0
- data/include/prism/util/pm_state_stack.h +24 -0
- data/include/prism/util/pm_string.h +61 -0
- data/include/prism/util/pm_string_list.h +25 -0
- data/include/prism/util/pm_strpbrk.h +29 -0
- data/include/prism/version.h +4 -0
- data/include/prism.h +82 -0
- data/lib/prism/compiler.rb +465 -0
- data/lib/prism/debug.rb +157 -0
- data/lib/prism/desugar_compiler.rb +206 -0
- data/lib/prism/dispatcher.rb +2051 -0
- data/lib/prism/dsl.rb +750 -0
- data/lib/prism/ffi.rb +251 -0
- data/lib/prism/lex_compat.rb +838 -0
- data/lib/prism/mutation_compiler.rb +718 -0
- data/lib/prism/node.rb +14540 -0
- data/lib/prism/node_ext.rb +55 -0
- data/lib/prism/node_inspector.rb +68 -0
- data/lib/prism/pack.rb +185 -0
- data/lib/prism/parse_result/comments.rb +172 -0
- data/lib/prism/parse_result/newlines.rb +60 -0
- data/lib/prism/parse_result.rb +266 -0
- data/lib/prism/pattern.rb +239 -0
- data/lib/prism/ripper_compat.rb +174 -0
- data/lib/prism/serialize.rb +662 -0
- data/lib/prism/visitor.rb +470 -0
- data/lib/prism.rb +64 -0
- data/prism.gemspec +113 -0
- data/src/diagnostic.c +287 -0
- data/src/enc/pm_big5.c +52 -0
- data/src/enc/pm_euc_jp.c +58 -0
- data/src/enc/pm_gbk.c +61 -0
- data/src/enc/pm_shift_jis.c +56 -0
- data/src/enc/pm_tables.c +507 -0
- data/src/enc/pm_unicode.c +2324 -0
- data/src/enc/pm_windows_31j.c +56 -0
- data/src/node.c +2633 -0
- data/src/pack.c +493 -0
- data/src/prettyprint.c +2136 -0
- data/src/prism.c +14587 -0
- data/src/regexp.c +580 -0
- data/src/serialize.c +1899 -0
- data/src/token_type.c +349 -0
- data/src/unescape.c +637 -0
- data/src/util/pm_buffer.c +103 -0
- data/src/util/pm_char.c +272 -0
- data/src/util/pm_constant_pool.c +252 -0
- data/src/util/pm_list.c +41 -0
- data/src/util/pm_memchr.c +33 -0
- data/src/util/pm_newline_list.c +134 -0
- data/src/util/pm_state_stack.c +19 -0
- data/src/util/pm_string.c +200 -0
- data/src/util/pm_string_list.c +29 -0
- data/src/util/pm_strncasecmp.c +17 -0
- data/src/util/pm_strpbrk.c +66 -0
- metadata +138 -0
@@ -0,0 +1,181 @@
# Serialization

Prism ships with the ability to serialize a syntax tree to a single string.
The string can then be deserialized back into a syntax tree using a language other than C.
This is useful for using the parsing logic in other tools without having to write a parser in that language.
The syntax tree still requires a copy of the original source, as for the most part it just contains byte offsets into the source string.

## Types

Let us define some simple types for readability.

### varint

A variable-length integer whose value fits in a `uint32_t`, encoded in 1 to 5 bytes using [LEB128](https://en.wikipedia.org/wiki/LEB128).
This drastically cuts down on the size of the serialized string, especially when the source file is large.
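
As a rough illustration of the encoding described above, the following sketch decodes one unsigned LEB128 value. `read_varint` is a hypothetical helper written for this document, not a function from prism's API, and the choice to reject values wider than 32 bits is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: decode one unsigned LEB128 varint from `data`, which
// holds `size` readable bytes. Stores the value in *out and returns the number
// of bytes consumed, or 0 if the input is truncated or exceeds a uint32_t.
static size_t
read_varint(const uint8_t *data, size_t size, uint32_t *out) {
    uint32_t value = 0;

    for (size_t index = 0; index < size && index < 5; index++) {
        uint8_t byte = data[index];

        // A fifth byte may only contribute the top 4 bits of a uint32_t and
        // must not have its continuation bit set.
        if (index == 4 && (byte & 0xF0) != 0) return 0;

        // The low 7 bits carry the payload, least significant group first.
        value |= (uint32_t) (byte & 0x7F) << (7 * index);

        // A clear high bit marks the final byte of the varint.
        if ((byte & 0x80) == 0) {
            *out = value;
            return index + 1;
        }
    }

    return 0;
}
```

Values below 128 take a single byte, which is the common case for lengths and small offsets.
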

### string

| # bytes | field |
| --- | --- |
| varint | the length of the string in bytes |
| ... | the string bytes |

### location

| # bytes | field |
| --- | --- |
| varint | byte offset into the source string where this location begins |
| varint | length of the location in bytes in the source string |
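
The `string` and `location` layouts above could be read with a small cursor-based reader along these lines. Every name here (`reader_t`, `string_view_t`, `location_t`, and the `read_*` functions) is hypothetical and chosen for illustration; none of them come from prism.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical types for this sketch; they are not part of prism's API.
typedef struct { const uint8_t *data; size_t size; size_t pos; } reader_t;
typedef struct { const uint8_t *bytes; uint32_t length; } string_view_t;
typedef struct { uint32_t start; uint32_t length; } location_t;

// Read one LEB128 varint. Returns 1 on success, 0 on truncated or oversized
// input.
static int
read_varint(reader_t *reader, uint32_t *out) {
    uint32_t value = 0;

    for (unsigned shift = 0; shift < 35 && reader->pos < reader->size; shift += 7) {
        uint8_t byte = reader->data[reader->pos++];
        if (shift == 28 && (byte & 0xF0) != 0) return 0;

        value |= (uint32_t) (byte & 0x7F) << shift;
        if ((byte & 0x80) == 0) {
            *out = value;
            return 1;
        }
    }

    return 0;
}

// string: a varint byte length followed by that many bytes. The view points
// into the serialized data, so nothing is copied.
static int
read_string(reader_t *reader, string_view_t *out) {
    uint32_t length;
    if (!read_varint(reader, &length)) return 0;
    if (length > reader->size - reader->pos) return 0;

    out->bytes = reader->data + reader->pos;
    out->length = length;
    reader->pos += length;
    return 1;
}

// location: a varint start offset followed by a varint length, both of which
// refer to the source string rather than to the serialized data.
static int
read_location(reader_t *reader, location_t *out) {
    return read_varint(reader, &out->start) && read_varint(reader, &out->length);
}
```
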

### comment

The comment type is one of:
* 0=`INLINE` (`# comment`)
* 1=`EMBEDDED_DOCUMENT` (`=begin`/`=end`)
* 2=`__END__` (after `__END__`)

| # bytes | field |
| --- | --- |
| `1` | comment type |
| location | the location in the source of this comment |

### diagnostic

| # bytes | field |
| --- | --- |
| string | diagnostic message (ASCII-only characters) |
| location | the location in the source this diagnostic applies to |

## Structure

The serialized string representing the syntax tree is composed of three parts: the header, the body, and the constant pool.
The header contains information like the version of prism that serialized the tree.
The body contains the actual nodes in the tree.
The constant pool contains constants that were interned while parsing.

The header is structured like the following table:

| # bytes | field |
| --- | --- |
| `5` | "PRISM" |
| `1` | major version number |
| `1` | minor version number |
| `1` | patch version number |
| `1` | 1 indicates only semantics fields were serialized, 0 indicates all fields were serialized (including location fields) |
| string | the encoding name |
| varint | number of comments |
| comment* | comments |
| varint | number of errors |
| diagnostic* | errors |
| varint | number of warnings |
| diagnostic* | warnings |
| `4` | constant pool offset |
| varint | constant pool size |
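
The first nine bytes of the header have a fixed layout, so a reader can validate them before touching any variable-length field. The sketch below checks the magic bytes and extracts the version and the semantics-only flag. `header_prefix_t` and `read_header_prefix` are hypothetical names used only for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical struct holding the fixed-size prefix of the header.
typedef struct {
    uint8_t major;
    uint8_t minor;
    uint8_t patch;
    uint8_t semantics_only; // 1 when only semantics fields were serialized
} header_prefix_t;

// Validate the "PRISM" magic bytes and copy out the four bytes that follow.
// Returns the number of bytes consumed (9) on success and 0 on failure.
static size_t
read_header_prefix(const uint8_t *data, size_t size, header_prefix_t *out) {
    if (size < 9 || memcmp(data, "PRISM", 5) != 0) return 0;

    out->major = data[5];
    out->minor = data[6];
    out->patch = data[7];
    out->semantics_only = data[8];
    return 9;
}
```

A deserializer would typically compare the three version bytes against the version it was built for before reading the encoding name, comments, errors, and warnings that follow.
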

After the header comes the body of the serialized string.
The body consists of a sequence of nodes that is built using a prefix (pre-order) traversal of the syntax tree.
Each node is structured like the following table:

| # bytes | field |
| --- | --- |
| `1` | node type |
| location | node location |

Every field on the node is then appended to the serialized string. The fields can be determined by referencing `config.yml`. Each type of field has its own encoding, described below:

* `node` - A field that is a node. This is structured just like the parent node.
* `node?` - A field that is a node that is optionally present. If the node is not present, then a single `0` byte will be written in its place. If it is present, then it will be structured just like the parent node.
* `node[]` - A field that is an array of nodes. This is structured as a variable-length integer length, followed by the child nodes themselves.
* `string` - A field that is a string. For example, this is used as the name of the method in a call node, since it cannot directly reference the source string (as in `@-` or `foo=`). This is structured as a variable-length integer byte length, followed by the string itself (_without_ a trailing null byte).
* `constant` - A variable-length integer that represents an index in the constant pool.
* `constant?` - An optional variable-length integer that represents an index in the constant pool. If it's not present, then a single `0` byte will be written in its place.
* `location` - A field that is a location. This is structured as a variable-length integer start followed by a variable-length integer length.
* `location?` - A field that is a location that is optionally present. If the location is not present, then a single `0` byte will be written in its place. If it is present, then it will be structured just like a `location` field.
* `uint32` - A field that is a 32-bit unsigned integer. This is structured as a variable-length integer.

After the syntax tree, the constant pool is serialized. This is a list of constants that were referenced from within the tree. The constant pool begins at the offset specified in the header. Constants can be either "owned" (in which case their contents are embedded in the serialization) or "shared" (in which case their contents represent a slice of the source string). The most significant bit of the constant's byte offset indicates whether it is owned or shared.

In the case that it is owned, the constant is structured as follows:

| # bytes | field |
| --- | --- |
| `4` | the byte offset in the serialization for the contents of the constant |
| `4` | the byte length in the serialization |

Note that you will need to mask off the most significant bit for the byte offset in the serialization. In the case that it is shared, the constant is structured as follows:

| # bytes | field |
| --- | --- |
| `4` | the byte offset in the source string for the contents of the constant |
| `4` | the byte length in the source string |

After the constant pool, the contents of the owned constants are serialized. This is just a sequence of bytes that represent the contents of the constants. At the end of the serialization, the buffer is null terminated.
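
A constant pool entry could be decoded roughly as follows. This sketch rests on two assumptions that the text above does not state outright: the 4-byte fields are in the host's byte order, and a set most significant bit marks an owned constant (inferred from the note about masking that bit off for owned constants). `constant_entry_t`, `read_constant_entry`, and `constant_contents` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical decoded form of one 8-byte constant pool entry.
typedef struct {
    int owned;       // 1: contents are in the serialization, 0: in the source
    uint32_t offset; // byte offset with the flag bit masked off
    uint32_t length; // byte length of the contents
} constant_entry_t;

// Decode the 8-byte entry starting at `entry`.
// Assumption: both fields use host byte order.
// Assumption: a set most significant bit on the offset means "owned".
static constant_entry_t
read_constant_entry(const uint8_t *entry) {
    uint32_t offset;
    uint32_t length;
    memcpy(&offset, entry, sizeof(offset));
    memcpy(&length, entry + 4, sizeof(length));

    constant_entry_t result;
    result.owned = (offset & 0x80000000u) != 0;
    result.offset = offset & 0x7FFFFFFFu;
    result.length = length;
    return result;
}

// Return a pointer to the first byte of the constant's contents, choosing
// between the serialized buffer and the original source string.
static const uint8_t *
constant_contents(const constant_entry_t *constant, const uint8_t *serialized, const uint8_t *source) {
    return (constant->owned ? serialized : source) + constant->offset;
}
```

Using `memcpy` rather than a pointer cast avoids unaligned reads, since pool entries are not guaranteed to sit on a 4-byte boundary within the buffer.
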

## APIs

The relevant APIs and struct definitions are listed below:

```c
// A pm_buffer_t is a simple memory buffer that stores data in a contiguous
// block of memory. It is used to store the serialized representation of a
// prism tree.
typedef struct {
    char *value;
    size_t length;
    size_t capacity;
} pm_buffer_t;

// Initialize a pm_buffer_t with its default values.
bool pm_buffer_init(pm_buffer_t *);

// Free the memory associated with the buffer.
void pm_buffer_free(pm_buffer_t *);

// Parse and serialize the AST represented by the given source to the given
// buffer.
void pm_parse_serialize(const uint8_t *source, size_t length, pm_buffer_t *buffer, const char *metadata);
```

Typically you would use a stack-allocated `pm_buffer_t` and call `pm_parse_serialize`, as in:

```c
#include "prism.h"

void
serialize(const uint8_t *source, size_t length) {
    // Bail out if the buffer's backing memory could not be allocated.
    pm_buffer_t buffer;
    if (!pm_buffer_init(&buffer)) return;

    // Passing NULL means no metadata (no filepath, no local variable scopes).
    pm_parse_serialize(source, length, &buffer, NULL);
    // Do something with the serialized string.

    pm_buffer_free(&buffer);
}
```

The final argument to `pm_parse_serialize` controls the metadata of the source.
This includes the filepath that the source is associated with, and any nested local variable scopes that are necessary to properly parse the file (in the case of parsing an `eval`).
Note that `varint` is not used here: fixed-width fields make the metadata easier for the caller to produce, and serialized size is less important here.
The metadata is a serialized format itself, and is structured as follows:

| # bytes | field |
| --- | --- |
| `4` | the size of the filepath string |
| | the filepath string |
| `4` | the number of local variable scopes |

Then, each local variable scope is encoded as:

| # bytes | field |
| --- | --- |
| `4` | the number of local variables in the scope |
| | the local variables |

Each local variable within each scope is encoded as:

| # bytes | field |
| --- | --- |
| `4` | the size of the local variable name |
| | the local variable name |

The metadata can be `NULL` (as seen in the example above).
If it is not null, then a minimal metadata string would be `"\0\0\0\0\0\0\0\0"` which would use 4 bytes to indicate an empty filepath string and 4 bytes to indicate that there were no local variable scopes.
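
To make the metadata layout concrete, here is a minimal sketch that builds a metadata string for a filepath and a single local variable scope. `build_metadata` and its helpers are hypothetical, and the sketch assumes the 4-byte fields are written in the host's byte order, which the tables above do not specify.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Write a 4-byte integer at `cursor` and return the position just past it.
// Assumption: host byte order.
static char *
write_uint32(char *cursor, uint32_t value) {
    memcpy(cursor, &value, sizeof(value));
    return cursor + sizeof(value);
}

// Write a 4-byte size followed by the bytes themselves (no trailing null).
static char *
write_sized(char *cursor, const char *bytes) {
    uint32_t size = (uint32_t) strlen(bytes);
    cursor = write_uint32(cursor, size);
    memcpy(cursor, bytes, size);
    return cursor + size;
}

// Hypothetical helper: build metadata for `filepath` with one scope holding
// the `count` names in `locals`. Stores the total byte size in *size and
// returns a malloc'd buffer that the caller must free, or NULL on failure.
static char *
build_metadata(const char *filepath, const char **locals, uint32_t count, size_t *size) {
    // filepath size + filepath + scope count + local count for the one scope
    size_t total = 4 + strlen(filepath) + 4 + 4;
    for (uint32_t index = 0; index < count; index++) {
        total += 4 + strlen(locals[index]);
    }

    char *metadata = malloc(total);
    if (metadata == NULL) return NULL;

    char *cursor = write_sized(metadata, filepath);
    cursor = write_uint32(cursor, 1);     // number of local variable scopes
    cursor = write_uint32(cursor, count); // number of locals in that scope
    for (uint32_t index = 0; index < count; index++) {
        cursor = write_sized(cursor, locals[index]);
    }

    *size = total;
    return metadata;
}
```

The resulting buffer would be passed as the final argument to `pm_parse_serialize` in place of `NULL` and freed once the call returns.
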
data/docs/testing.md
ADDED
@@ -0,0 +1,55 @@
# Testing

This document explains how to test prism, both locally and against existing test suites.

## Test suite

`rake test` runs all of the files in the `test/` directory. The suite has two parts: unit tests and snapshot tests.

### Unit tests

These test specific prism implementation details like comments, errors, and regular expressions. There are corresponding files for each thing being tested (like `test/errors_test.rb`).

### Snapshot tests

Snapshot tests ensure that parsed output is equivalent to previous parsed output. There are many categorized examples of valid syntax within the `test/prism/fixtures/` directory. When the test suite runs, it will parse all of this syntax, and compare it against corresponding files in the `test/prism/snapshots/` directory. For example, `test/prism/fixtures/strings.txt` has a corresponding `test/prism/snapshots/strings.txt`.

If the parsed output does not match its snapshot, the test raises an error. If there is not a corresponding file in the `test/prism/snapshots/` directory, one will be created so that it exists for the next test run.

### Testing against repositories

To test the parser against a repository, you can run `FILEPATHS='/path/to/repository/**/*.rb' rake lex`. This will run the parser against every file matched by the glob pattern and check its generated tokens against those generated by Ripper.

## Local testing

As you are working, you will likely want to test your code locally. `test.rb` is ignored by git, so it can be used for local testing. There are also two executables which may help you:

1. **bin/lex** takes a filepath and compares prism's lexed output to Ripper's lexed output. It prints any lexed output that doesn't match. It applies some minor transformations to the lexed output in order to compare them, such as splitting prism's heredoc tokens to mirror Ripper's.

```
$ bin/lex test.rb
```

If you would like to see the full lexed comparison, and not only the output that doesn't match, you can run with `VERBOSE=1`:

```
$ VERBOSE=1 bin/lex test.rb
```

`bin/lex` can also be used with `-e` and then source code, like this:

```
$ bin/lex -e "1 + 2"
```

2. **bin/parse** takes a filepath and outputs prism's parsed node structure generated from reading the file.

```
$ bin/parse test.rb
```

`bin/parse` can also be used with `-e` and then source code, like this:

```
$ bin/parse -e "1 + 2"
```