jruby-prism-parser 0.23.0.pre.SNAPSHOT-java
- checksums.yaml +7 -0
- data/CHANGELOG.md +401 -0
- data/CODE_OF_CONDUCT.md +76 -0
- data/CONTRIBUTING.md +62 -0
- data/LICENSE.md +7 -0
- data/Makefile +101 -0
- data/README.md +98 -0
- data/config.yml +2902 -0
- data/docs/build_system.md +91 -0
- data/docs/configuration.md +64 -0
- data/docs/cruby_compilation.md +27 -0
- data/docs/design.md +53 -0
- data/docs/encoding.md +121 -0
- data/docs/fuzzing.md +88 -0
- data/docs/heredocs.md +36 -0
- data/docs/javascript.md +118 -0
- data/docs/local_variable_depth.md +229 -0
- data/docs/mapping.md +117 -0
- data/docs/parser_translation.md +34 -0
- data/docs/parsing_rules.md +19 -0
- data/docs/releasing.md +98 -0
- data/docs/ripper.md +36 -0
- data/docs/ruby_api.md +43 -0
- data/docs/ruby_parser_translation.md +19 -0
- data/docs/serialization.md +209 -0
- data/docs/testing.md +55 -0
- data/ext/prism/api_node.c +5098 -0
- data/ext/prism/api_pack.c +267 -0
- data/ext/prism/extconf.rb +110 -0
- data/ext/prism/extension.c +1155 -0
- data/ext/prism/extension.h +18 -0
- data/include/prism/ast.h +5807 -0
- data/include/prism/defines.h +102 -0
- data/include/prism/diagnostic.h +339 -0
- data/include/prism/encoding.h +265 -0
- data/include/prism/node.h +57 -0
- data/include/prism/options.h +230 -0
- data/include/prism/pack.h +152 -0
- data/include/prism/parser.h +732 -0
- data/include/prism/prettyprint.h +26 -0
- data/include/prism/regexp.h +33 -0
- data/include/prism/util/pm_buffer.h +155 -0
- data/include/prism/util/pm_char.h +205 -0
- data/include/prism/util/pm_constant_pool.h +209 -0
- data/include/prism/util/pm_list.h +97 -0
- data/include/prism/util/pm_memchr.h +29 -0
- data/include/prism/util/pm_newline_list.h +93 -0
- data/include/prism/util/pm_state_stack.h +42 -0
- data/include/prism/util/pm_string.h +150 -0
- data/include/prism/util/pm_string_list.h +44 -0
- data/include/prism/util/pm_strncasecmp.h +32 -0
- data/include/prism/util/pm_strpbrk.h +46 -0
- data/include/prism/version.h +29 -0
- data/include/prism.h +289 -0
- data/jruby-prism.jar +0 -0
- data/lib/prism/compiler.rb +486 -0
- data/lib/prism/debug.rb +206 -0
- data/lib/prism/desugar_compiler.rb +207 -0
- data/lib/prism/dispatcher.rb +2150 -0
- data/lib/prism/dot_visitor.rb +4634 -0
- data/lib/prism/dsl.rb +785 -0
- data/lib/prism/ffi.rb +346 -0
- data/lib/prism/lex_compat.rb +908 -0
- data/lib/prism/mutation_compiler.rb +753 -0
- data/lib/prism/node.rb +17864 -0
- data/lib/prism/node_ext.rb +212 -0
- data/lib/prism/node_inspector.rb +68 -0
- data/lib/prism/pack.rb +224 -0
- data/lib/prism/parse_result/comments.rb +177 -0
- data/lib/prism/parse_result/newlines.rb +64 -0
- data/lib/prism/parse_result.rb +498 -0
- data/lib/prism/pattern.rb +250 -0
- data/lib/prism/serialize.rb +1354 -0
- data/lib/prism/translation/parser/compiler.rb +1838 -0
- data/lib/prism/translation/parser/lexer.rb +335 -0
- data/lib/prism/translation/parser/rubocop.rb +37 -0
- data/lib/prism/translation/parser.rb +178 -0
- data/lib/prism/translation/ripper.rb +577 -0
- data/lib/prism/translation/ruby_parser.rb +1521 -0
- data/lib/prism/translation.rb +11 -0
- data/lib/prism/version.rb +3 -0
- data/lib/prism/visitor.rb +495 -0
- data/lib/prism.rb +99 -0
- data/prism.gemspec +135 -0
- data/rbi/prism.rbi +7767 -0
- data/rbi/prism_static.rbi +207 -0
- data/sig/prism.rbs +4773 -0
- data/sig/prism_static.rbs +201 -0
- data/src/diagnostic.c +400 -0
- data/src/encoding.c +5132 -0
- data/src/node.c +2786 -0
- data/src/options.c +213 -0
- data/src/pack.c +493 -0
- data/src/prettyprint.c +8881 -0
- data/src/prism.c +18406 -0
- data/src/regexp.c +638 -0
- data/src/serialize.c +1554 -0
- data/src/token_type.c +700 -0
- data/src/util/pm_buffer.c +190 -0
- data/src/util/pm_char.c +318 -0
- data/src/util/pm_constant_pool.c +322 -0
- data/src/util/pm_list.c +49 -0
- data/src/util/pm_memchr.c +35 -0
- data/src/util/pm_newline_list.c +84 -0
- data/src/util/pm_state_stack.c +25 -0
- data/src/util/pm_string.c +203 -0
- data/src/util/pm_string_list.c +28 -0
- data/src/util/pm_strncasecmp.c +24 -0
- data/src/util/pm_strpbrk.c +180 -0
- metadata +156 -0
data/docs/build_system.md
ADDED
@@ -0,0 +1,91 @@
# Build System

There are many ways to build prism, which means the build system is a bit more complicated than usual.

## Requirements

* It must work to build prism for all 6 use-cases below.
* It must be possible to build prism without needing ruby/rake/etc.
  Once prism is the single parser in TruffleRuby, JRuby or CRuby, there won't be another Ruby parser around to parse such Ruby code.
  Most, if not all, Ruby implementations want to avoid depending on another Ruby during the build process, as that is very brittle.
* It is desirable to compile prism with the same or very similar compiler flags for all use-cases (e.g. optimization level, warning flags, etc).
  Otherwise, there is the risk that prism does not work correctly with those different compiler flags.

The main solution for the second point seems to be a Makefile; otherwise many of the usages would have to duplicate the logic to build prism.

## General Design

1. Templates are generated by `templates/template.rb`
4. The `Makefile` compiles both `libprism.a` and `libprism.{so,dylib,dll}` from the `src/**/*.c` and `include/**/*.h` files
5. The `Rakefile` `:compile` task ensures the above prerequisites are done, then calls `make`,
   and uses `Rake::ExtensionTask` to compile the C extension (using its `extconf.rb`), which uses `libprism.a`

This way there is minimal duplication, and each layer builds on the previous one and has its own responsibilities.

The static library exports no symbols, to avoid any conflict.
The shared library exports some symbols, and this is fine since there should only be one libprism shared library
loaded per process (i.e., at most one version of the prism *gem* loaded in a process, only the gem uses the shared library).

## The various ways to build prism

### Building from ruby/prism repository with `bundle exec rake`

`rake` calls `make` and then uses `Rake::ExtensionTask` to compile the C extension (see above).

### Building the prism gem by `gem install/bundle install`

The gem contains the pre-generated templates.
When installing the gem, `extconf.rb` is used, which:
* runs `make build/libprism.a`
* compiles the C extension with mkmf

When installing the gem on JRuby and TruffleRuby, no C extension is built, so instead of the last step,
there is Ruby code using FFI which uses `libprism.{so,dylib,dll}`
to implement the same methods as the C extension, but using serialization instead of many native calls/accesses
(JRuby does not support C extensions, and serialization is faster on TruffleRuby than the C extension).

### Building the prism gem from git, e.g. `gem "prism", github: "ruby/prism"`

The same as above, except the `extconf.rb` additionally runs first:
* `templates/template.rb` to generate the templates

This is needed because the generated files are not part of the git repository.

### Building prism as part of CRuby

[This script](https://github.com/ruby/ruby/blob/5124f9ac7513eb590c37717337c430cb93caa151/tool/sync_default_gems.rb#L399-L422) imports prism sources into CRuby.

The script generates the templates when importing.

prism's `Makefile` is not used at all in CRuby. Instead, CRuby's `Makefile` is used.

### Building prism as part of TruffleRuby

[This script](https://github.com/oracle/truffleruby/blob/master/tool/import-prism.sh) imports prism sources into TruffleRuby.
The script generates the templates when importing.

Then when `mx build` builds TruffleRuby and the `prism` mx project inside, it runs `make`.

Then the `prism bindings` mx project is built, which contains the [bindings](https://github.com/oracle/truffleruby/blob/master/src/main/c/prism_bindings/src/prism_bindings.c)
and links to `libprism.a` (to avoid exporting symbols, so no conflict when installing the prism gem).

### Building prism as part of JRuby

TODO, similar to TruffleRuby.

### Building prism from source as a C library

All of the source files match `src/**/*.c` and all of the headers match `include/**/*.h`.

If you want to build prism as a shared library and link against it, you should compile with:

* `-fPIC -shared` - Compile as a shared library
* `-DPRISM_EXPORT_SYMBOLS` - Export the symbols (by default nothing is exported)
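To illustrate what a define such as `-DPRISM_EXPORT_SYMBOLS` typically controls, here is a minimal sketch of a symbol-visibility macro. The names `SKETCH_EXPORTED_FUNCTION` and `sketch_version` are hypothetical and exist only for this example; prism's own export macro may be defined differently.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for an export macro. When PRISM_EXPORT_SYMBOLS is
 * defined, functions marked with it are given public visibility; otherwise
 * the marker expands to nothing and the function is treated like any other. */
#if defined(PRISM_EXPORT_SYMBOLS) && defined(_WIN32)
#   define SKETCH_EXPORTED_FUNCTION __declspec(dllexport) extern
#elif defined(PRISM_EXPORT_SYMBOLS)
#   define SKETCH_EXPORTED_FUNCTION __attribute__((__visibility__("default"))) extern
#else
#   define SKETCH_EXPORTED_FUNCTION
#endif

/* A function that users of the shared library should be able to call. */
SKETCH_EXPORTED_FUNCTION const char *sketch_version(void);

const char *sketch_version(void) {
    return "0.0.0-sketch";
}
```

On GCC and Clang, the visibility attribute only changes anything when symbols are hidden by default, for example by also passing `-fvisibility=hidden`. That flag is an assumption of this sketch, not something stated above.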

#### Flags

`make` respects the `MAKEFLAGS` environment variable. As such, to speed up the build you can run:

```
MAKEFLAGS="-j10" bundle exec rake compile
```
data/docs/configuration.md
ADDED
@@ -0,0 +1,64 @@
# Configuration

A lot of code in prism's repository is templated from a single configuration file, [config.yml](../config.yml). This file is used to generate the following files:

* `ext/prism/api_node.c` - for defining how to build Ruby objects for the nodes out of C structs
* `include/prism/ast.h` - for defining the C structs that represent the nodes
* `javascript/src/deserialize.js` - for defining how to deserialize the nodes in JavaScript
* `javascript/src/nodes.js` - for defining the nodes in JavaScript
* `java/org/prism/AbstractNodeVisitor.java` - for defining the visitor interface for the nodes in Java
* `java/org/prism/Loader.java` - for defining how to deserialize the nodes in Java
* `java/org/prism/Nodes.java` - for defining the nodes in Java
* `lib/prism/compiler.rb` - for defining the compiler for the nodes in Ruby
* `lib/prism/dispatcher.rb` - for defining the dispatch visitors for the nodes in Ruby
* `lib/prism/dot_visitor.rb` - for defining the dot visitor for the nodes in Ruby
* `lib/prism/dsl.rb` - for defining the DSL for the nodes in Ruby
* `lib/prism/mutation_compiler.rb` - for defining the mutation compiler for the nodes in Ruby
* `lib/prism/node.rb` - for defining the nodes in Ruby
* `lib/prism/serialize.rb` - for defining how to deserialize the nodes in Ruby
* `lib/prism/visitor.rb` - for defining the visitor interface for the nodes in Ruby
* `src/node.c` - for defining how to free the nodes in C and calculate the size in memory in C
* `src/prettyprint.c` - for defining how to prettyprint the nodes in C
* `src/serialize.c` - for defining how to serialize the nodes in C
* `src/token_type.c` - for defining the names of the token types

Whenever the structure of the nodes changes, you can run `rake templates` to regenerate these files. Alternatively, tasks like `rake test` should pick up on these changes automatically. Every file that is templated will include a comment at the top indicating that it was generated and that changes should be made to the template and not the generated file.

`config.yml` has a couple of top-level fields, which we'll describe below.

## `tokens`

This is a list of tokens to be used by the lexer. It is shared here so that it can be templated out into both an enum and a function that is used for debugging that returns the name of the token.

Each token is expected to have a `name` key and a `comment` key (both as strings). Optionally, they can have a `value` key (an integer) which is used to represent the value in the enum.

In C these tokens will be templated out with the prefix `PM_TOKEN_`. For example, if you have a `name` key with the value `PERCENT`, you can access this in C through `PM_TOKEN_PERCENT`.
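As a rough illustration of the two outputs described above (an enum and a debugging function that returns a token's name), the generated C might look something like the following sketch. Only `PM_TOKEN_PERCENT` comes from the text; the other entries, the explicit value, and the `sketch_` names are assumptions made for this example.

```c
#include <assert.h>
#include <string.h>

/* Illustrative subset only: the real enum is generated from config.yml. */
typedef enum {
    PM_TOKEN_EOF = 1,  /* a token with an explicit `value` key (assumed) */
    PM_TOKEN_PERCENT,  /* a token whose value follows from its position */
    PM_TOKEN_PLUS
} sketch_token_type_t;

/* Debugging helper: map a token type back to the `name` from the config. */
const char *sketch_token_type_name(sketch_token_type_t type) {
    switch (type) {
        case PM_TOKEN_EOF: return "EOF";
        case PM_TOKEN_PERCENT: return "PERCENT";
        case PM_TOKEN_PLUS: return "PLUS";
    }
    return "UNKNOWN";
}
```

Generating both from one list keeps the enum and the name table from drifting apart when a token is added.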

## `flags`

Sometimes we need to communicate more information in the tree than can be represented by the types of the nodes themselves. For example, we need to represent the flags passed to a regular expression or the type of call that a call node is performing. In these circumstances, it's helpful to reference a bitset of flags. This field is a list of flags that can be used in the nodes.

Each flag is expected to have a `name` key (a string) and a `values` key (an array). Each value in the `values` key should be an object that contains both a `name` key (a string) that represents the name of the flag and a `comment` key (a string) that represents the comment for the flag.

In C these flags will get templated out with a `PM_` prefix, then an upper-snake-case version of the flag name, then the flag itself. For example, if you have a flag with the name `RegularExpressionFlags` and a value with the name `IGNORE_CASE`, you can access this in C through `PM_REGULAR_EXPRESSION_FLAGS_IGNORE_CASE`.
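A bitset of flags can be sketched as an enum in which every value occupies its own bit. In the example below only `PM_REGULAR_EXPRESSION_FLAGS_IGNORE_CASE` is taken from the text; the second value, the bit positions, and the helper `sketch_flag_set` are assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Each value occupies one bit so that several can be combined with `|`.
 * The bit positions are illustrative, not the ones prism generates. */
typedef enum {
    PM_REGULAR_EXPRESSION_FLAGS_IGNORE_CASE = 1 << 0,
    PM_REGULAR_EXPRESSION_FLAGS_EXTENDED = 1 << 1 /* assumed second value */
} sketch_regular_expression_flags_t;

/* Test whether a given flag is set in a node's flags field. */
bool sketch_flag_set(uint16_t flags, uint16_t flag) {
    return (flags & flag) != 0;
}
```

Under these assumptions, a regular expression written with both options would store `IGNORE_CASE | EXTENDED` in a single integer field on the node.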

## `nodes`

Every node in the tree is defined in `config.yml`. Each node is expected to have a `name` key (a string) and a `comment` key (a string). By convention, the `comment` key uses the multi-line syntax of `: |` because the newlines will get templated into the comments of various files.

Optionally, every node can define a `child_nodes` key that is an array. This array represents each part of the node that isn't communicated through the type and location of the node itself. Within the `child_nodes` key, each entry should be an object with a `name` key (a string) and a `type` key (a string). The `name` key represents the name of the child node and the `type` is used to determine how it should be represented in each language.

The available values for `type` are:

* `node` - A field that is a node. This is a `pm_node_t *` in C.
* `node?` - A field that is a node that is optionally present. This is also a `pm_node_t *` in C, but can be `NULL`.
* `node[]` - A field that is an array of nodes. This is a `pm_node_list_t` in C.
* `string` - A field that is a string. For example, this is used as the name of the method in a call node, since it cannot directly reference the source string (as in `@-` or `foo=`). This is a `pm_string_t` in C.
* `constant` - A field that is an integer that represents an index in the constant pool. This is a `pm_constant_id_t` in C.
* `constant[]` - A field that is an array of constants. This is a `pm_constant_id_list_t` in C.
* `location` - A field that is a location. This is a `pm_location_t` in C.
* `location?` - A field that is a location that is optionally present. This is a `pm_location_t` in C, but if the value is not present then the `start` and `end` fields will be `NULL`.
* `uint8` - A field that is an 8-bit unsigned integer. This is a `uint8_t` in C.
* `uint32` - A field that is a 32-bit unsigned integer. This is a `uint32_t` in C.

If the type is `node` or `node?` then the value also accepts an optional `kind` key (a string). This key is expected to match the name of another node type within `config.yml`. This changes a couple of places where code is templated out to use the more specific struct name instead of the generic `pm_node_t`. For example, with `kind: StatementsNode` the `pm_node_t *` in C becomes a `pm_statements_node_t *`.
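To make the mapping from `child_nodes` entries to C fields more concrete, here is a sketch of what a struct for a hypothetical node could look like. All `sketch_` types are simplified stand-ins for the `pm_` types named above, and the node and its field names are invented for this example rather than taken from `config.yml`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for pm_location_t, pm_node_t and pm_statements_node_t. */
typedef struct { const uint8_t *start; const uint8_t *end; } sketch_location_t;
typedef struct { uint16_t type; sketch_location_t location; } sketch_node_t;
typedef struct { sketch_node_t base; } sketch_statements_node_t;

/* A hypothetical node declared with these child_nodes:
 *   - name: statements,      type: node?, kind: StatementsNode
 *   - name: keyword_loc,     type: location
 *   - name: end_keyword_loc, type: location?
 */
typedef struct {
    sketch_node_t base;
    sketch_statements_node_t *statements; /* node? with kind: may be NULL */
    sketch_location_t keyword_loc;        /* location: always present */
    sketch_location_t end_keyword_loc;    /* location?: start/end NULL if absent */
} sketch_example_node_t;

/* An optional location is absent when its start pointer is NULL. */
bool sketch_location_present(const sketch_location_t *location) {
    return location->start != NULL;
}
```

The `kind` key is what lets the `statements` field use the specific struct pointer instead of a generic node pointer, which saves a cast at each use.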
data/docs/cruby_compilation.md
ADDED
@@ -0,0 +1,27 @@
# Compiling Prism's AST

One important class of consumers of Prism's AST is compilers. Currently [CRuby](https://github.com/ruby/ruby), [JRuby](https://github.com/jruby/jruby), [TruffleRuby](https://github.com/oracle/truffleruby), and [Natalie](https://github.com/natalie-lang/natalie) have all built compilation code on top of Prism's AST.

This document will describe, at a high level, how CRuby's compilation of Prism's AST works.

As described in the [build system documentation](build_system.md), there is a "push" webhook set up within the Prism repo, triggered on each new commit, that sends information about the commit to [git.ruby-lang.org](https://github.com/ruby/git.ruby-lang.org). This in turn runs [a script](https://github.com/ruby/ruby/blob/master/tool/sync_default_gems.rb) to sync over new changes in Prism to their corresponding files in Ruby. Any failures in this sync script will show alerts in the #alerts-sync channel in the RubyLang Slack. The result of this step is that files are synced from Prism into ruby/ruby for its use. It is also worth noting that [`common.mk`](https://github.com/ruby/ruby/blob/master/common.mk) contains a list of Prism files which it needs to correctly compile. If new Prism files are added, this file should also be updated.

ruby/ruby uses the Prism code to generate an AST from which it can generate instruction sequences. Compilation in ruby/ruby has three main steps:

1. Compute an AST

Syncing over the Prism code allows ruby/ruby to compute the AST using Prism. It currently does this within [`iseq.c`](https://github.com/ruby/ruby/blob/master/iseq.c) using the `pm_parser_init` function.

2. Run a first pass of compilation

Once the AST has been created, it is recursively descended in order to compute the appropriate instruction sequences. This is the crux of compilation, and we go into more detail about nuances in the following paragraphs.

The code for this step is almost exclusively in [`prism_compile.c`](https://github.com/ruby/ruby/blob/master/prism_compile.c). The main function used for compilation is `pm_compile_node`, which is essentially a huge switch statement over practically every node type which computes the appropriate instruction sequences for that node type. There are several convenience helpers, such as `PM_COMPILE`, `PM_COMPILE_POPPED`, and `PM_COMPILE_NOT_POPPED`, which all call into the `pm_compile_node` function.

There are also several functions, such as `parse_string` and `parse_integer`, which consume Prism nodes and return CRuby values. These are all called for their relevant types within the big switch statement.

The Prism compiler also uses a concept of "scope nodes" which are not standard Prism nodes in the AST, but instead nodes constructed within the compiler for the sole purpose of making compilation easier. Scope nodes are defined in [`prism_compile.h`](https://github.com/ruby/ruby/blob/master/prism_compile.h) and store information such as locals, local table size, local depth offset and the index lookup tables. Scope nodes can be generated for node types which have their own "scope".

3. Run an optimization pass of compilation

After the instruction sequences are initially computed, there is an existing (non-Prism based) optimization pass of the instruction sequences. There are several optimizations currently inlined into step 2; however, most of them happen in this step. Specifically, any peephole optimizations happen in this step. By the end of step 2, however, the instruction sequences take the same form regardless of whether the initial AST was generated by Prism or not. Therefore, step 3 is agnostic to the parser, and should not require any Prism-specific code.
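The shape of step 2, a recursive descent driven by one switch over node types with "popped" and "not popped" variants, can be sketched on a toy AST. Nothing below is CRuby's code: the node types, the string-based instruction output, and every `sketch_` name are assumptions for illustration, and the instruction names are only meant to resemble what a bytecode compiler would emit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef enum { SKETCH_INTEGER_NODE, SKETCH_ADD_NODE } sketch_node_type_t;

typedef struct sketch_node {
    sketch_node_type_t type;
    int value;                       /* used by SKETCH_INTEGER_NODE */
    const struct sketch_node *left;  /* used by SKETCH_ADD_NODE */
    const struct sketch_node *right; /* used by SKETCH_ADD_NODE */
} sketch_node_t;

/* Append one instruction, followed by ';', to a NUL-terminated buffer. */
static void sketch_emit(char *out, size_t size, const char *insn) {
    size_t used = strlen(out);
    snprintf(out + used, size - used, "%s;", insn);
}

/* Recursively compile a node. When `popped` is true the caller does not need
 * the resulting value, so it is either never pushed or popped right away. */
void sketch_compile_node(const sketch_node_t *node, char *out, size_t size, bool popped) {
    char insn[32];
    switch (node->type) {
        case SKETCH_INTEGER_NODE:
            if (!popped) {
                snprintf(insn, sizeof(insn), "putobject %d", node->value);
                sketch_emit(out, size, insn);
            }
            break;
        case SKETCH_ADD_NODE:
            /* The operands are always needed, so they are compiled not popped. */
            sketch_compile_node(node->left, out, size, false);
            sketch_compile_node(node->right, out, size, false);
            sketch_emit(out, size, "opt_plus");
            if (popped) sketch_emit(out, size, "pop");
            break;
    }
}
```

Threading a `popped` flag through the recursion lets a literal in statement position emit nothing at all, which is one example of the small optimizations that can be inlined into the first pass.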
data/docs/design.md
ADDED
@@ -0,0 +1,53 @@
# Design

There are three overall goals for this project:

* to provide a documented and maintainable parser
* to provide an error-tolerant parser suitable for use in an IDE
* to provide a portable parser that can be used in projects that don't link against CRuby

The design of the parser is based around these main goals.

## Structure

The first piece to understand about the parser is the design of its syntax tree. This is documented in `config.yml`. Every token and node is defined in that file, along with comments about where they are found in what kinds of syntax. This file is used to template out a lot of different files, all found in the `templates` directory. The `templates/template.rb` script performs the templating and outputs all files matching the directory structure found in the templates directory.

The templated files contain all of the code required to allocate and initialize nodes, pretty print nodes, and serialize nodes. This means for the most part, you will only need to then hook up the parser to call the templated functions to create the nodes in the correct position. That means editing the parser itself, which is housed in `prism.c`.

## Pratt parsing

In order to provide the best possible error tolerance, the parser is hand-written. It is structured using Pratt parsing, a technique developed by Vaughan Pratt back in the 1970s. Below are a bunch of links to articles and papers that explain Pratt parsing in more detail.

* https://web.archive.org/web/20151223215421/http://hall.org.ua/halls/wizzard/pdf/Vaughan.Pratt.TDOP.pdf
* https://tdop.github.io/
* https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
* https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
* https://chidiwilliams.com/post/on-recursive-descent-and-pratt-parsing/

You can find most of the functions that correspond to constructs in the Pratt parsing algorithm in `prism.c`. As a couple of examples:

* `parse` corresponds to the `parse_expression` function
* `nud` (null denotation) corresponds to the `parse_expression_prefix` function
* `led` (left denotation) corresponds to the `parse_expression_infix` function
* `lbp` (left binding power) corresponds to accessing the `left` field of an element in the `binding_powers` array
* `rbp` (right binding power) corresponds to accessing the `right` field of an element in the `binding_powers` array
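The correspondence above can be seen in a very small Pratt parser. The sketch below is not taken from `prism.c`: it evaluates arithmetic directly instead of building nodes, looks operators up by character rather than by token type, skips no whitespace, and all `sketch_` names and binding power values are assumptions chosen for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Binding powers per operator. A right power above the left one makes the
 * operator left-associative; `^` is right-associative, so it is reversed. */
typedef struct { char op; int left; int right; } sketch_binding_power_t;

static const sketch_binding_power_t binding_powers[] = {
    { '+', 1, 2 }, { '-', 1, 2 }, { '*', 3, 4 }, { '/', 3, 4 }, { '^', 6, 5 }
};

static const sketch_binding_power_t *sketch_lookup(char op) {
    for (size_t i = 0; i < sizeof(binding_powers) / sizeof(binding_powers[0]); i++) {
        if (binding_powers[i].op == op) return &binding_powers[i];
    }
    return NULL;
}

static long sketch_parse_expression(const char **cursor, int min_power);

/* nud: parse a construct that starts an expression (a number or parentheses). */
static long sketch_parse_expression_prefix(const char **cursor) {
    long value = 0;
    if (**cursor == '(') {
        (*cursor)++;
        value = sketch_parse_expression(cursor, 0);
        if (**cursor == ')') (*cursor)++;
        return value;
    }
    while (isdigit((unsigned char) **cursor)) {
        value = value * 10 + (**cursor - '0');
        (*cursor)++;
    }
    return value;
}

/* led: combine an already parsed left side with an operator and right side. */
static long sketch_parse_expression_infix(char op, long left, long right) {
    switch (op) {
        case '+': return left + right;
        case '-': return left - right;
        case '*': return left * right;
        case '/': return right == 0 ? 0 : left / right;
        default: { /* '^' */
            long result = 1;
            for (long i = 0; i < right; i++) result *= left;
            return result;
        }
    }
}

/* parse: keep consuming operators while their lbp is at least min_power. */
static long sketch_parse_expression(const char **cursor, int min_power) {
    long left = sketch_parse_expression_prefix(cursor);
    for (;;) {
        const sketch_binding_power_t *power = sketch_lookup(**cursor);
        if (power == NULL || power->left < min_power) break;
        char op = *(*cursor)++;
        long right = sketch_parse_expression(cursor, power->right);
        left = sketch_parse_expression_infix(op, left, right);
    }
    return left;
}

long sketch_evaluate(const char *source) {
    return sketch_parse_expression(&source, 0);
}
```

Under these binding powers, `1-2-3` groups to the left and `2^3^2` groups to the right, which is exactly the behavior the `left` and `right` fields encode.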

## Portability

In order to enable using this parser in other projects, the parser is written in C99, and uses only the standard library. This means it can be embedded in most any other project without having to link against CRuby. It can be used directly through its C API to access individual fields, or it can be used to parse a syntax tree and then serialize it to a single blob. For more information on serialization, see the [docs/serialization.md](serialization.md) file.

## Error tolerance

The design of the error tolerance of this parser is still very much in flux. We are experimenting with various approaches as the parser is being developed to try to determine the best approach. Below are a bunch of links to articles and papers that explain error tolerance in more detail, as well as document some of the approaches that we're evaluating.

* https://tratt.net/laurie/blog/2020/automatic_syntax_error_recovery.html
* https://diekmann.uk/diekmann_phd.pdf
* https://eelcovisser.org/publications/2012/JongeKVS12.pdf
* https://www.antlr.org/papers/allstar-techreport.pdf
* https://github.com/microsoft/tolerant-php-parser/blob/main/docs/HowItWorks.md

Currently, there are a couple of mechanisms for error tolerance that are in place:

* If the parser expects a token in a particular position (for example the `in` keyword in a for loop or the `{` after `BEGIN` or `END`) then it will insert a missing token if one can't be found and continue parsing.
* If the parser expects an expression in a particular position but encounters a token that can't be used as that expression, it checks up the stack to see if that token would close out a parent node. If so, it will close out all of its parent nodes using missing nodes wherever necessary and continue parsing.
* If the parser cannot understand a token in any capacity, it will skip past the token.
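The first mechanism, inserting a missing token and carrying on, can be sketched as an "expect" helper. The token types, parser struct, and function name below are all hypothetical and much simpler than anything in the real parser.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    SKETCH_TOKEN_FOR,
    SKETCH_TOKEN_IDENTIFIER,
    SKETCH_TOKEN_IN,
    SKETCH_TOKEN_MISSING,
    SKETCH_TOKEN_EOF
} sketch_token_type_t;

typedef struct {
    const sketch_token_type_t *tokens; /* token stream ending in SKETCH_TOKEN_EOF */
    size_t index;                      /* position of the current token */
    size_t error_count;                /* number of errors recorded so far */
} sketch_parser_t;

/* Consume the expected token if it is there. Otherwise record an error and
 * hand back a MISSING token without advancing, so that parsing can continue
 * from the token that is actually present. */
sketch_token_type_t sketch_expect(sketch_parser_t *parser, sketch_token_type_t expected) {
    if (parser->tokens[parser->index] == expected) {
        parser->index++;
        return expected;
    }
    parser->error_count++;
    return SKETCH_TOKEN_MISSING;
}
```

For input resembling `for x y`, the call that expects `in` records one error and returns the missing token, and the following identifier is still parsed as the collection being iterated.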
data/docs/encoding.md
ADDED
@@ -0,0 +1,121 @@
# Encoding

When parsing a Ruby file, there are times when the parser must parse identifiers. Identifiers are names of variables, methods, classes, etc. To determine the start of an identifier, the parser must be able to tell if the subsequent bytes form an alphabetic character. To determine the rest of the identifier, the parser must look forward through all alphanumeric characters.

Determining whether a sequence of bytes comprises an alphabetic or alphanumeric character is encoding-dependent. By default, the parser assumes that all source files are encoded in UTF-8. If the file is not encoded in UTF-8, it must be encoded using an encoding that is "ASCII compatible" (i.e., all of the codepoints below 128 match the corresponding codepoints in ASCII and the minimum number of bytes required to represent a codepoint is 1 byte).

If the file is not encoded in UTF-8, the user must specify the encoding in a "magic" comment at the top of the file. The comment looks like:

```ruby
# encoding: iso-8859-9
```

The key of the comment can be either "encoding" or "coding". The value of the comment must be a string that is a valid encoding name. The encodings that prism supports by default are:

* `ASCII-8BIT`
* `Big5`
* `Big5-HKSCS`
* `Big5-UAO`
* `CESU-8`
* `CP51932`
* `CP850`
* `CP852`
* `CP855`
* `CP949`
* `CP950`
* `CP951`
* `Emacs-Mule`
* `EUC-JP`
* `eucJP-ms`
* `EUC-JIS-2004`
* `EUC-KR`
* `EUC-TW`
* `GB12345`
* `GB18030`
* `GB1988`
* `GB2312`
* `GBK`
* `IBM437`
* `IBM720`
* `IBM737`
* `IBM775`
* `IBM852`
* `IBM855`
* `IBM857`
* `IBM860`
* `IBM861`
* `IBM862`
* `IBM863`
* `IBM864`
* `IBM865`
* `IBM866`
* `IBM869`
* `ISO-8859-1`
* `ISO-8859-2`
* `ISO-8859-3`
* `ISO-8859-4`
* `ISO-8859-5`
* `ISO-8859-6`
* `ISO-8859-7`
* `ISO-8859-8`
* `ISO-8859-9`
* `ISO-8859-10`
* `ISO-8859-11`
* `ISO-8859-13`
* `ISO-8859-14`
* `ISO-8859-15`
* `ISO-8859-16`
* `KOI8-R`
* `KOI8-U`
* `macCentEuro`
* `macCroatian`
* `macCyrillic`
* `macGreek`
* `macIceland`
* `MacJapanese`
* `macRoman`
* `macRomania`
* `macThai`
* `macTurkish`
* `macUkraine`
* `Shift_JIS`
* `SJIS-DoCoMo`
* `SJIS-KDDI`
* `SJIS-SoftBank`
* `stateless-ISO-2022-JP`
* `stateless-ISO-2022-JP-KDDI`
* `TIS-620`
* `US-ASCII`
* `UTF-8`
* `UTF8-MAC`
* `UTF8-DoCoMo`
* `UTF8-KDDI`
* `UTF8-SoftBank`
* `Windows-1250`
* `Windows-1251`
* `Windows-1252`
* `Windows-1253`
* `Windows-1254`
* `Windows-1255`
* `Windows-1256`
* `Windows-1257`
* `Windows-1258`
* `Windows-31J`
* `Windows-874`

For each of these encodings, prism provides functions for checking if the subsequent bytes can be interpreted as a character, and then if that character is alphabetic, alphanumeric, or uppercase.
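To show how per-encoding character checks can drive identifier scanning, here is a minimal sketch with an ASCII-only encoding. The `sketch_encoding_t` struct, its function pointer names, and `sketch_identifier_length` are assumptions for this example and are not prism's `pm_encoding_t` interface; the sketch also ignores details such as underscores.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified encoding interface. Each function returns the
 * width in bytes of the character at `b` if it qualifies, or 0 otherwise.
 * `n` is the number of bytes remaining in the input. */
typedef struct {
    size_t (*alpha_char)(const uint8_t *b, ptrdiff_t n);
    size_t (*alnum_char)(const uint8_t *b, ptrdiff_t n);
} sketch_encoding_t;

static size_t sketch_ascii_alpha_char(const uint8_t *b, ptrdiff_t n) {
    if (n < 1) return 0;
    return ((*b >= 'a' && *b <= 'z') || (*b >= 'A' && *b <= 'Z')) ? 1 : 0;
}

static size_t sketch_ascii_alnum_char(const uint8_t *b, ptrdiff_t n) {
    if (n < 1) return 0;
    return (sketch_ascii_alpha_char(b, n) || (*b >= '0' && *b <= '9')) ? 1 : 0;
}

const sketch_encoding_t sketch_encoding_ascii = {
    sketch_ascii_alpha_char,
    sketch_ascii_alnum_char
};

/* Return the length in bytes of the identifier starting at `start`, or 0 if
 * the bytes there do not begin with an alphabetic character. */
size_t sketch_identifier_length(const sketch_encoding_t *encoding, const uint8_t *start, const uint8_t *end) {
    const uint8_t *cursor = start;
    size_t width = encoding->alpha_char(cursor, end - cursor);
    if (width == 0) return 0;
    do {
        cursor += width;
        width = encoding->alnum_char(cursor, end - cursor);
    } while (width != 0);
    return (size_t) (cursor - start);
}
```

Because each check returns a width rather than a boolean, the same scanning loop would work unchanged for a multi-byte encoding that supplies its own functions.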

## Getting notified when the encoding changes

You may want to get notified when the encoding changes based on the result of parsing an encoding comment. We use this internally for our `lex` function in order to provide the correct encodings for the tokens that are returned. For that you can register a callback with `pm_parser_register_encoding_changed_callback`. The callback will be called with a pointer to the parser. The encoding can be accessed through `parser->encoding`.

```c
// When the encoding that is being used to parse the source is changed by prism,
// we provide the ability here to call out to a user-defined function.
typedef void (*pm_encoding_changed_callback_t)(pm_parser_t *parser);

// Register a callback that will be called whenever prism changes the encoding
// it is using to parse based on the magic comment.
PRISM_EXPORTED_FUNCTION void
pm_parser_register_encoding_changed_callback(pm_parser_t *parser, pm_encoding_changed_callback_t callback);
```
data/docs/fuzzing.md
ADDED
@@ -0,0 +1,88 @@
# Fuzzing

We use fuzzing to test the various entrypoints to the library. The fuzzer we use is [AFL++](https://aflplus.plus). All files related to fuzzing live within the `fuzz` directory, which has the following structure:

```
fuzz
├── corpus
│   ├── parse                 fuzzing corpus for parsing (a symlink to our fixtures)
│   └── regexp                fuzzing corpus for regexp
├── dict                      an AFL++ dictionary containing various tokens
├── docker
│   └── Dockerfile            for building a container with the fuzzer toolchain
├── fuzz.c                    generic entrypoint for fuzzing
├── heisenbug.c               entrypoint for reproducing a crash or hang
├── parse.c                   fuzz handler for parsing
├── parse.sh                  script to run parsing fuzzer
├── regexp.c                  fuzz handler for regular expression parsing
├── regexp.sh                 script to run regexp fuzzer
└── tools
    ├── backtrace.sh          generates backtrace files for a crash directory
    └── minimize.sh           generates minimized crash or hang files
```

## Usage

There are currently two fuzzing targets:

- `pm_serialize_parse` (parse)
- `pm_regexp_named_capture_group_names` (regexp)

Respectively, fuzzing can be performed with

```
make fuzz-run-parse
make fuzz-run-regexp
```

To end a fuzzing job, interrupt with CTRL+C. To enter a container with the fuzzing toolchain and debug utilities, run

```
make fuzz-debug
```
|
43
|
+
|
44
|
+
# Out-of-bounds reads
|
45
|
+
|
46
|
+
Currently, encoding functionality implementing the `pm_encoding_t` interface can read outside of inputs. For the time being, ASAN instrumentation is disabled for functions from src/enc. See `fuzz/asan.ignore`.
|
47
|
+
|
48
|
+
To disable ASAN read instrumentation globally, use the `FUZZ_FLAGS` environment variable e.g.
|
49
|
+
|
50
|
+
```
|
51
|
+
FUZZ_FLAGS="-mllvm -asan-instrument-reads=false" make fuzz-run-parse
|
52
|
+
```
|
53
|
+
|
54
|
+
Note, that this may make reproducing bugs difficult as they may depend on memory outside of the input buffer. In that case, try
|
55
|
+
|
56
|
+
```
|
57
|
+
make fuzz-debug # enter the docker container with build tools
|
58
|
+
make build/fuzz.heisenbug.parse # or .regexp
|
59
|
+
./build/fuzz.heisenbug.parse path-to-problem-input
|
60
|
+
```
|
61
|
+
|
62
|
+
# Triaging Crashes and Hangs
|
63
|
+
|
64
|
+
Triaging crashes and hangs is easier when the inputs are as short as possible. In the fuzz container, an entire crash or hang directory can be minimized using
|
65
|
+
|
66
|
+
```
|
67
|
+
./fuzz/tools/minimize.sh <directory>
|
68
|
+
```
|
69
|
+
|
70
|
+
e.g.
|
71
|
+
```
|
72
|
+
./fuzz/tools/minimize.sh fuzz/output/parse/default/crashes
|
73
|
+
```
|
74
|
+
|
75
|
+
This may take a long time. In the crash/hang directory, for each input file there will appear a minimized version with the extension `.min` appended.
|
76
|
+
|
77
|
+
Backtraces for crashes (not hangs) can be generated en masse with
|
78
|
+
|
79
|
+
```
|
80
|
+
./fuzz/tools/backtrace.sh <directory>
|
81
|
+
```
|
82
|
+
|
83
|
+
For each input file, a backtrace file is created whose name is the input file name with the extension `.bt` appended, e.g.
|
84
|
+
|
85
|
+
```
|
86
|
+
id:000000,sig:06,src:000006+000190,time:8480,execs:18929,op:splice,rep:4
|
87
|
+
id:000000,sig:06,src:000006+000190,time:8480,execs:18929,op:splice,rep:4.bt
|
88
|
+
```
|
data/docs/heredocs.md
ADDED
@@ -0,0 +1,36 @@
|
|
1
|
+
# Heredocs
|
2
|
+
|
3
|
+
Heredocs are one of the most complicated pieces of this parser. There are many different forms, there can be multiple open at the same time, and they can be nested. In order to support parsing them, we keep track of a lot of metadata. Below is a basic overview of how it works.
|
4
|
+
|
5
|
+
## 1. Lexing the identifier
|
6
|
+
|
7
|
+
When a heredoc identifier is encountered in the regular process of lexing, we push the `PM_LEX_HEREDOC` mode onto the stack with the following metadata:
|
8
|
+
|
9
|
+
* `ident_start`: A pointer to the start of the identifier for the heredoc. We need this to match against the end of the heredoc.
|
10
|
+
* `ident_length`: The length of the identifier for the heredoc. We also need this to match.
|
11
|
+
* `next_start`: A pointer to the place in source that the parser should resume lexing once it has completed this heredoc.
|
12
|
+
|
13
|
+
We also set the special `parser.next_start` field which is a pointer to the place in the source where we should start lexing the next token. This is set to the pointer of the character immediately following the next newline.
|
14
|
+
|
15
|
+
Note that if the `parser.heredoc_end` field is already set, then it means we have already encountered a heredoc on this line. In that case the `parser.next_start` field will be set to the `parser.heredoc_end` field. This is because we want to skip past the previous heredocs on this line and instead lex the body of this heredoc.
|
16
|
+
|
17
|
+
## 2. Lexing the body
|
18
|
+
|
19
|
+
The next time the lexer is asked for a token, it will be in the `PM_LEX_HEREDOC` mode. In this mode we are lexing the body of the heredoc. It will start by checking if the `next_start` field is set. If it is, then this is the first token within the body of the heredoc so we'll start lexing from there. Otherwise we'll start lexing from the end of the previous token.
|
20
|
+
|
21
|
+
Lexing the body is extremely similar to lexing an interpolated string. The only difference is that we also check at the beginning of each line whether we have hit the terminator.
|
22
|
+
|
23
|
+
## 3. Lexing the terminator
|
24
|
+
|
25
|
+
At the start of every line within the body of a heredoc, we check whether the line matches the terminator followed by a newline or by a carriage return and a newline. If it does, then we pop the lex mode off the stack and set a couple of fields on the parser:
|
26
|
+
|
27
|
+
* `next_start`: This is set to the value that we previously stored on the heredoc to indicate where the lexer should resume lexing when it is done with this heredoc.
|
28
|
+
* `heredoc_end`: This is set to the end of the heredoc. When a newline character is found, this indicates that the lexer should skip past to this next point.
|
29
|
+
|
30
|
+
## 4. Lexing the rest of the line
|
31
|
+
|
32
|
+
Once the heredoc has been lexed, the lexer resumes from the `next_start` field and continues until the next newline character. At that newline, it checks whether the `heredoc_end` field is set. If it is, the lexer skips to that point, unsets the field, and continues lexing.
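
The four steps above can be modelled with a small simulation. This is only a sketch, not the C implementation: the `visitOrder` function and its variable names are hypothetical, and it assumes bare `<<IDENT` heredocs with upper-case identifiers, no interpolation, and no `<<-` or `<<~` forms. It returns the pieces of the source in the order a lexer following this scheme would visit them.

```js
// Simplified model of the heredoc bookkeeping described above.
// `visitOrder` is a hypothetical helper, not part of Prism.
function visitOrder(source) {
  const visited = [];
  const identifier = /<<([A-Z_]+)/y; // sticky: only matches at lastIndex
  let pos = 0;
  let heredocEnd = null; // plays the role of parser.heredoc_end

  while (pos < source.length) {
    identifier.lastIndex = pos;
    const match = identifier.exec(source);

    if (match) {
      // Step 1: remember where to resume once this heredoc is done.
      const resume = pos + match[0].length;
      visited.push(match[0]);

      // The body starts after the next newline, or after the previous
      // heredoc on this line if one has already been lexed.
      const newline = source.indexOf("\n", pos);
      const afterLine = newline === -1 ? source.length : newline + 1;
      const bodyStart = heredocEnd !== null ? heredocEnd : afterLine;

      // Steps 2 and 3: scan line by line until the terminator is found.
      let lineStart = bodyStart;
      let terminated = false;
      while (lineStart < source.length) {
        const lineEnd = source.indexOf("\n", lineStart);
        const next = lineEnd === -1 ? source.length : lineEnd + 1;
        const line = source.slice(lineStart, next).replace(/\r?\n$/, "");
        if (line === match[1]) {
          visited.push(source.slice(bodyStart, lineStart), line);
          heredocEnd = next;
          terminated = true;
          break;
        }
        lineStart = next;
      }
      if (!terminated) {
        // Unterminated heredoc: the rest of the input is the body.
        visited.push(source.slice(bodyStart));
        heredocEnd = source.length;
      }
      pos = resume;
    } else if (source[pos] === "\n") {
      // Step 4: at the newline, skip past any heredoc bodies.
      visited.push("\n");
      if (heredocEnd !== null) {
        pos = heredocEnd;
        heredocEnd = null;
      } else {
        pos += 1;
      }
    } else {
      // Plain text up to the next newline or possible heredoc start.
      let end = pos + 1;
      while (end < source.length && source[end] !== "\n" && !source.startsWith("<<", end)) {
        end += 1;
      }
      visited.push(source.slice(pos, end));
      pos = end;
    }
  }
  return visited;
}

console.log(visitOrder("foo(<<A, <<B)\na body\nA\nb body\nB\nbar\n"));
```

For the input above, the body and terminator of `<<A` are visited before `, <<B` on the first line, and the newline that ends the first line jumps directly to `bar`.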
|
33
|
+
|
34
|
+
## Compatibility with Ripper
|
35
|
+
|
36
|
+
The order in which tokens are emitted is different from that of Ripper. Ripper emits each token in the file in the order in which it appears. Prism instead emits the tokens in the order that makes the most sense for the lexer, using the process described above. Therefore, to line things up, `Prism.lex_compat` shuffles the tokens around to match Ripper's output.
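
To illustrate the difference in ordering, here is a minimal sketch. It is not how `Prism.lex_compat` is implemented; the token shape (`kind`, `start`) and the `toSourceOrder` helper are hypothetical. The tokens are listed in the order a lexer following the heredoc process would produce them for `x = <<A + 1`, then sorted by start offset to recover source order.

```js
// Hypothetical tokens for the source "x = <<A + 1\nbody\nA\n",
// in the order the heredoc-aware lexer visits them.
const lexerOrder = [
  { kind: "identifier", start: 0 },
  { kind: "equal", start: 2 },
  { kind: "heredoc_start", start: 4 },
  { kind: "heredoc_body", start: 12 },
  { kind: "heredoc_end", start: 17 },
  { kind: "plus", start: 8 },
  { kind: "integer", start: 10 },
  { kind: "newline", start: 11 },
];

// Reorder by start offset without mutating the input array.
function toSourceOrder(tokens) {
  return [...tokens].sort((a, b) => a.start - b.start);
}

console.log(toSourceOrder(lexerOrder).map((token) => token.kind));
```

After sorting, the heredoc body and terminator come after the newline that ends the first line, which is the order in which they appear in the file.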
|
data/docs/javascript.md
ADDED
@@ -0,0 +1,118 @@
|
|
1
|
+
# JavaScript
|
2
|
+
|
3
|
+
Prism provides bindings to JavaScript out of the box.
|
4
|
+
|
5
|
+
## Node
|
6
|
+
|
7
|
+
To use the package from node, install the `@ruby/prism` dependency:
|
8
|
+
|
9
|
+
```sh
|
10
|
+
npm install @ruby/prism
|
11
|
+
```
|
12
|
+
|
13
|
+
Then import the package:
|
14
|
+
|
15
|
+
```js
|
16
|
+
import { loadPrism } from "@ruby/prism";
|
17
|
+
```
|
18
|
+
|
19
|
+
Then call the load function to get a parse function:
|
20
|
+
|
21
|
+
```js
|
22
|
+
const parse = await loadPrism();
|
23
|
+
```
|
24
|
+
|
25
|
+
## Browser
|
26
|
+
|
27
|
+
To use the package from the browser, you will need to do some additional work. The [javascript/example.html](../javascript/example.html) file shows an example of running Prism in the browser. You will need to instantiate the WebAssembly module yourself and then pass it to the `parsePrism` function.
|
28
|
+
|
29
|
+
First, get a shim for WASI since not all browsers support it yet.
|
30
|
+
|
31
|
+
```js
|
32
|
+
import { WASI } from "https://unpkg.com/@bjorn3/browser_wasi_shim@latest/dist/index.js";
|
33
|
+
```
|
34
|
+
|
35
|
+
Next, import the `parsePrism` function from `@ruby/prism`, either through a CDN or by bundling it with your application.
|
36
|
+
|
37
|
+
```js
|
38
|
+
import { parsePrism } from "https://unpkg.com/@ruby/prism@latest/src/parsePrism.js";
|
39
|
+
```
|
40
|
+
|
41
|
+
Next, fetch and instantiate the WebAssembly module. You can access it through a CDN or by bundling it with your application.
|
42
|
+
|
43
|
+
```js
|
44
|
+
const wasm = await WebAssembly.compileStreaming(fetch("https://unpkg.com/@ruby/prism@latest/src/prism.wasm"));
|
45
|
+
```
|
46
|
+
|
47
|
+
Next, instantiate the module and initialize WASI.
|
48
|
+
|
49
|
+
```js
|
50
|
+
const wasi = new WASI([], [], []);
|
51
|
+
const instance = await WebAssembly.instantiate(wasm, { wasi_snapshot_preview1: wasi.wasiImport });
|
52
|
+
wasi.initialize(instance);
|
53
|
+
```
|
54
|
+
|
55
|
+
Finally, you can create a function that will parse a string of Ruby code.
|
56
|
+
|
57
|
+
```js
|
58
|
+
function parse(source) {
|
59
|
+
return parsePrism(instance.exports, source);
|
60
|
+
}
|
61
|
+
```
|
62
|
+
|
63
|
+
## API
|
64
|
+
|
65
|
+
Now that we have access to a `parse` function, we can use it to parse Ruby code:
|
66
|
+
|
67
|
+
```js
|
68
|
+
const parseResult = parse("1 + 2");
|
69
|
+
```
|
70
|
+
|
71
|
+
A ParseResult object is very similar to the Prism::ParseResult object from Ruby. It has the same properties: `value`, `comments`, `magicComments`, `errors`, and `warnings`. Here we can serialize the AST to JSON.
|
72
|
+
|
73
|
+
```js
|
74
|
+
console.log(JSON.stringify(parseResult.value, null, 2));
|
75
|
+
```
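
Before working with the AST, it can be useful to check whether the source parsed cleanly. The helper below is a sketch: `summarize` is a hypothetical name, and it assumes that `errors`, `warnings`, and `comments` are arrays, which the property list above suggests but does not state. The sample object at the end is a stand-in for a real parse result.

```js
// Summarize a parse result using only the properties listed above.
// Assumes `errors`, `warnings`, and `comments` are arrays.
function summarize(parseResult) {
  return {
    ok: parseResult.errors.length === 0,
    errorCount: parseResult.errors.length,
    warningCount: parseResult.warnings.length,
    commentCount: parseResult.comments.length,
  };
}

// Stand-in object with the same property names as a parse result.
const sample = { value: null, comments: [], magicComments: [], errors: [], warnings: [] };
console.log(summarize(sample));
```

With a real parser, this would be called as `summarize(parse("1 + 2"))`.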
|
76
|
+
|
77
|
+
## Visitors
|
78
|
+
|
79
|
+
Prism allows you to traverse the AST of parsed Ruby code using visitors.
|
80
|
+
|
81
|
+
Here's an example of a custom `FooCalls` visitor:
|
82
|
+
|
83
|
+
```js
|
84
|
+
import { loadPrism, Visitor } from "@ruby/prism";
|
85
|
+
|
86
|
+
const parse = await loadPrism();
|
87
|
+
const parseResult = parse("foo()");
|
88
|
+
|
89
|
+
class FooCalls extends Visitor {
|
90
|
+
visitCallNode(node) {
|
91
|
+
if (node.name === "foo") {
|
92
|
+
// Do something with the node
|
93
|
+
}
|
94
|
+
|
95
|
+
// Call super so that the visitor continues walking the tree
|
96
|
+
super.visitCallNode(node);
|
97
|
+
}
|
98
|
+
}
|
99
|
+
|
100
|
+
const fooVisitor = new FooCalls();
|
101
|
+
|
102
|
+
parseResult.value.accept(fooVisitor);
|
103
|
+
```
|
104
|
+
|
105
|
+
## Building
|
106
|
+
|
107
|
+
To build the WASM package yourself, first obtain a copy of `wasi-sdk`. You can retrieve this here: <https://github.com/WebAssembly/wasi-sdk>. Next, run:
|
108
|
+
|
109
|
+
```sh
|
110
|
+
make wasm WASI_SDK_PATH=path/to/wasi-sdk
|
111
|
+
```
|
112
|
+
|
113
|
+
This will generate `javascript/src/prism.wasm`. From there, you can run the tests to verify everything was generated correctly.
|
114
|
+
|
115
|
+
```sh
|
116
|
+
cd javascript
|
117
|
+
node test
|
118
|
+
```
|