prism 0.13.0

Files changed (95)
  1. checksums.yaml +7 -0
  2. data/CHANGELOG.md +172 -0
  3. data/CODE_OF_CONDUCT.md +76 -0
  4. data/CONTRIBUTING.md +62 -0
  5. data/LICENSE.md +7 -0
  6. data/Makefile +84 -0
  7. data/README.md +89 -0
  8. data/config.yml +2481 -0
  9. data/docs/build_system.md +74 -0
  10. data/docs/building.md +22 -0
  11. data/docs/configuration.md +60 -0
  12. data/docs/design.md +53 -0
  13. data/docs/encoding.md +117 -0
  14. data/docs/fuzzing.md +93 -0
  15. data/docs/heredocs.md +36 -0
  16. data/docs/mapping.md +117 -0
  17. data/docs/ripper.md +36 -0
  18. data/docs/ruby_api.md +25 -0
  19. data/docs/serialization.md +181 -0
  20. data/docs/testing.md +55 -0
  21. data/ext/prism/api_node.c +4725 -0
  22. data/ext/prism/api_pack.c +256 -0
  23. data/ext/prism/extconf.rb +136 -0
  24. data/ext/prism/extension.c +626 -0
  25. data/ext/prism/extension.h +18 -0
  26. data/include/prism/ast.h +1932 -0
  27. data/include/prism/defines.h +45 -0
  28. data/include/prism/diagnostic.h +231 -0
  29. data/include/prism/enc/pm_encoding.h +95 -0
  30. data/include/prism/node.h +41 -0
  31. data/include/prism/pack.h +141 -0
  32. data/include/prism/parser.h +418 -0
  33. data/include/prism/regexp.h +19 -0
  34. data/include/prism/unescape.h +48 -0
  35. data/include/prism/util/pm_buffer.h +51 -0
  36. data/include/prism/util/pm_char.h +91 -0
  37. data/include/prism/util/pm_constant_pool.h +78 -0
  38. data/include/prism/util/pm_list.h +67 -0
  39. data/include/prism/util/pm_memchr.h +14 -0
  40. data/include/prism/util/pm_newline_list.h +61 -0
  41. data/include/prism/util/pm_state_stack.h +24 -0
  42. data/include/prism/util/pm_string.h +61 -0
  43. data/include/prism/util/pm_string_list.h +25 -0
  44. data/include/prism/util/pm_strpbrk.h +29 -0
  45. data/include/prism/version.h +4 -0
  46. data/include/prism.h +82 -0
  47. data/lib/prism/compiler.rb +465 -0
  48. data/lib/prism/debug.rb +157 -0
  49. data/lib/prism/desugar_compiler.rb +206 -0
  50. data/lib/prism/dispatcher.rb +2051 -0
  51. data/lib/prism/dsl.rb +750 -0
  52. data/lib/prism/ffi.rb +251 -0
  53. data/lib/prism/lex_compat.rb +838 -0
  54. data/lib/prism/mutation_compiler.rb +718 -0
  55. data/lib/prism/node.rb +14540 -0
  56. data/lib/prism/node_ext.rb +55 -0
  57. data/lib/prism/node_inspector.rb +68 -0
  58. data/lib/prism/pack.rb +185 -0
  59. data/lib/prism/parse_result/comments.rb +172 -0
  60. data/lib/prism/parse_result/newlines.rb +60 -0
  61. data/lib/prism/parse_result.rb +266 -0
  62. data/lib/prism/pattern.rb +239 -0
  63. data/lib/prism/ripper_compat.rb +174 -0
  64. data/lib/prism/serialize.rb +662 -0
  65. data/lib/prism/visitor.rb +470 -0
  66. data/lib/prism.rb +64 -0
  67. data/prism.gemspec +113 -0
  68. data/src/diagnostic.c +287 -0
  69. data/src/enc/pm_big5.c +52 -0
  70. data/src/enc/pm_euc_jp.c +58 -0
  71. data/src/enc/pm_gbk.c +61 -0
  72. data/src/enc/pm_shift_jis.c +56 -0
  73. data/src/enc/pm_tables.c +507 -0
  74. data/src/enc/pm_unicode.c +2324 -0
  75. data/src/enc/pm_windows_31j.c +56 -0
  76. data/src/node.c +2633 -0
  77. data/src/pack.c +493 -0
  78. data/src/prettyprint.c +2136 -0
  79. data/src/prism.c +14587 -0
  80. data/src/regexp.c +580 -0
  81. data/src/serialize.c +1899 -0
  82. data/src/token_type.c +349 -0
  83. data/src/unescape.c +637 -0
  84. data/src/util/pm_buffer.c +103 -0
  85. data/src/util/pm_char.c +272 -0
  86. data/src/util/pm_constant_pool.c +252 -0
  87. data/src/util/pm_list.c +41 -0
  88. data/src/util/pm_memchr.c +33 -0
  89. data/src/util/pm_newline_list.c +134 -0
  90. data/src/util/pm_state_stack.c +19 -0
  91. data/src/util/pm_string.c +200 -0
  92. data/src/util/pm_string_list.c +29 -0
  93. data/src/util/pm_strncasecmp.c +17 -0
  94. data/src/util/pm_strpbrk.c +66 -0
  95. metadata +138 -0
@@ -0,0 +1,74 @@
+ # Build System
+
+ There are many ways to build prism, which means the build system is a bit more complicated than usual.
+
+ ## Requirements
+
+ * It must be possible to build prism for all 6 use-cases below.
+ * It must be possible to build prism without needing ruby/rake/etc.
+   Once prism is the single parser in TruffleRuby, JRuby, or CRuby, there won't be another Ruby parser around to parse such Ruby code.
+   Most (if not all) Ruby implementations want to avoid depending on another Ruby during the build process, as that is very brittle.
+ * It is desirable to compile prism with the same or very similar compiler flags for all use-cases (e.g. optimization level, warning flags, etc.).
+   Otherwise, there is a risk that prism does not work correctly with those different compiler flags.
+
+ The main solution for the second point seems to be a Makefile; otherwise, many of the usages would have to duplicate the logic to build prism.
+
+ ## General Design
+
+ 1. Templates are generated by `templates/template.rb`.
+ 2. The `Makefile` compiles both `librubyparser.a` and `librubyparser.{so,dylib,dll}` from the `src/**/*.c` and `include/**/*.h` files.
+ 3. The `Rakefile` `:compile` task ensures the above prerequisites are done, then calls `make`,
+    and uses `Rake::ExtensionTask` to compile the C extension (using its `extconf.rb`), which uses `librubyparser.a`.
+
+ This way there is minimal duplication, and each layer builds on the previous one and has its own responsibilities.
+
+ The static library exports no symbols, to avoid any conflict.
+ The shared library exports some symbols, and this is fine since there should only be one librubyparser shared library
+ loaded per process (i.e., at most one version of the prism *gem* loaded in a process; only the gem uses the shared library).
+
+ ## The various ways to build prism
+
+ ### Building from the ruby/prism repository with `bundle exec rake`
+
+ `rake` calls `make` and then uses `Rake::ExtensionTask` to compile the C extension (see above).
+
+ ### Building the prism gem with `gem install`/`bundle install`
+
+ The gem contains the pre-generated templates.
+ When installing the gem, `extconf.rb` is used, and it:
+ * runs `make build/librubyparser.a`
+ * compiles the C extension with mkmf
+
+ When installing the gem on JRuby and TruffleRuby, no C extension is built. Instead of the last step,
+ there is Ruby code using FFI which uses `librubyparser.{so,dylib,dll}`
+ to implement the same methods as the C extension, but using serialization instead of many native calls/accesses
+ (JRuby does not support C extensions, and serialization is faster than the C extension on TruffleRuby).
+
+ ### Building the prism gem from git, e.g. `gem "prism", github: "ruby/prism"`
+
+ The same as above, except `extconf.rb` additionally runs first:
+ * `templates/template.rb` to generate the templates
+
+ This is necessary because the generated files are not part of the git repository.
+
+ ### Building prism as part of CRuby
+
+ [This script](https://github.com/ruby/ruby/blob/32e828bb4a6c65a392b2300f3bdf93008c7b6f25/tool/sync_default_gems.rb#L399-L426) imports the prism sources into CRuby.
+
+ The script generates the templates when importing.
+
+ prism's `Makefile` is not used at all in CRuby. Instead, CRuby's `Makefile` is used.
+
+ ### Building prism as part of TruffleRuby
+
+ [This script](https://github.com/oracle/truffleruby/blob/master/tool/import-prism.sh) imports the prism sources into TruffleRuby.
+ The script generates the templates when importing.
+
+ Then when `mx build` builds TruffleRuby and the `prism` mx project inside it, it runs `make`.
+
+ Then the `prism bindings` mx project is built, which contains the [bindings](https://github.com/oracle/truffleruby/blob/master/src/main/c/prism_bindings/src/prism_bindings.c)
+ and links against `librubyparser.a` (to avoid exporting symbols, so there is no conflict when installing the prism gem).
+
+ ### Building prism as part of JRuby
+
+ TODO, probably similar to TruffleRuby.
data/docs/building.md ADDED
@@ -0,0 +1,22 @@
+ # Building
+
+ The following describes how to build prism from source.
+ This comes directly from the [Makefile](../Makefile).
+
+ ## Common
+
+ All of the source files match `src/**/*.c` and all of the headers match `include/**/*.h`.
+
+ The following flags should be used to compile prism:
+
+ * `-std=c99` - Use the C99 standard
+ * `-Wall -Wconversion -Wextra -Wpedantic -Wundef` - Enable the warnings we care about
+ * `-Werror` - Treat warnings as errors
+ * `-fvisibility=hidden` - Hide all symbols by default
+
+ ## Shared
+
+ If you want to build prism as a shared library and link against it, you should compile with:
+
+ * `-fPIC -shared` - Compile as a shared library
+ * `-DPRISM_EXPORT_SYMBOLS` - Export the symbols (by default nothing is exported)
@@ -0,0 +1,60 @@
+ # Configuration
+
+ A lot of the code in prism's repository is templated from a single configuration file, [config.yml](../config.yml). This file is used to generate the following files:
+
+ * `ext/prism/api_node.c` - for defining how to build Ruby objects for the nodes out of C structs
+ * `include/prism/ast.h` - for defining the C structs that represent the nodes
+ * `java/org/prism/AbstractNodeVisitor.java` - for defining the visitor interface for the nodes in Java
+ * `java/org/prism/Loader.java` - for defining how to deserialize the nodes in Java
+ * `java/org/prism/Nodes.java` - for defining the nodes in Java
+ * `lib/prism/compiler.rb` - for defining the compiler for the nodes in Ruby
+ * `lib/prism/dispatcher.rb` - for defining the dispatch visitors for the nodes in Ruby
+ * `lib/prism/dsl.rb` - for defining the DSL for the nodes in Ruby
+ * `lib/prism/mutation_compiler.rb` - for defining the mutation compiler for the nodes in Ruby
+ * `lib/prism/node.rb` - for defining the nodes in Ruby
+ * `lib/prism/serialize.rb` - for defining how to deserialize the nodes in Ruby
+ * `lib/prism/visitor.rb` - for defining the visitor interface for the nodes in Ruby
+ * `src/node.c` - for defining how to free the nodes and calculate their size in memory in C
+ * `src/prettyprint.c` - for defining how to prettyprint the nodes in C
+ * `src/serialize.c` - for defining how to serialize the nodes in C
+ * `src/token_type.c` - for defining the names of the token types
+
+ Whenever the structure of the nodes changes, you can run `rake templates` to regenerate these files. Alternatively, tasks like `rake test` should pick up on these changes automatically. Every templated file includes a comment at the top indicating that it was generated and that changes should be made to the template, not to the generated file.
+
+ `config.yml` has a couple of top-level fields, which are described below.
+
+ ## `tokens`
+
+ This is a list of tokens to be used by the lexer. It is shared here so that it can be templated out into both an enum and a debugging function that returns the name of a token.
+
+ Each token is expected to have a `name` key and a `comment` key (both strings). Optionally, a token can have a `value` key (an integer) which is used to represent its value in the enum.
+
+ In C these tokens will be templated out with the prefix `PM_TOKEN_`. For example, if you have a `name` key with the value `PERCENT`, you can access this in C through `PM_TOKEN_PERCENT`.
+
+ ## `flags`
+
+ Sometimes we need to communicate more information in the tree than can be represented by the types of the nodes themselves. For example, we need to represent the flags passed to a regular expression or the type of call that a call node is performing. In these circumstances, it's helpful to reference a bitset of flags. This field is a list of flags that can be used in the nodes.
+
+ Each flag is expected to have a `name` key (a string) and a `values` key (an array). Each entry in `values` should be an object with a `name` key (a string) that represents the name of the flag and a `comment` key (a string) that represents the comment for the flag.
+
+ In C these flags will get templated out with a `PM_` prefix, then an upper-snake-case version of the flag name, then the flag itself. For example, if you have a flag with the name `RegularExpressionFlags` and a value with the name `IGNORE_CASE`, you can access this in C through `PM_REGULAR_EXPRESSION_FLAGS_IGNORE_CASE`.
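As an illustration of that naming scheme, here is a small Ruby sketch (a hypothetical helper for this document, not prism's actual template code, which lives in `templates/template.rb`):

```ruby
# Convert a CamelCase flag name plus a value name into the C constant name.
# Hypothetical helper for illustration only.
def c_flag_constant(flag_name, value_name)
  # Insert underscores at lower-to-upper case boundaries, then upcase.
  snake = flag_name.gsub(/([a-z0-9])([A-Z])/, '\1_\2').upcase
  "PM_#{snake}_#{value_name}"
end

c_flag_constant("RegularExpressionFlags", "IGNORE_CASE")
# => "PM_REGULAR_EXPRESSION_FLAGS_IGNORE_CASE"
```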
+
+ ## `nodes`
+
+ Every node in the tree is defined in `config.yml`. Each node is expected to have a `name` key (a string) and a `comment` key (a string). By convention, the `comment` key uses the multi-line syntax of `: |` because the newlines will get templated into the comments of various files.
+
+ Optionally, every node can define a `child_nodes` key that is an array. This array represents each part of the node that isn't communicated through the type and location of the node itself. Within the `child_nodes` key, each entry should be an object with a `name` key (a string) and a `type` key (a string). The `name` key represents the name of the child node, and the `type` key is used to determine how it should be represented in each language.
+
+ The available values for `type` are:
+
+ * `node` - A child node that is a node itself. This is a `pm_node_t *` in C.
+ * `node?` - A child node that is optionally present. This is also a `pm_node_t *` in C, but can be `NULL`.
+ * `node[]` - A child node that is an array of nodes. This is a `pm_node_list_t` in C.
+ * `string` - A child node that is a string. For example, this is used as the name of the method in a call node, since it cannot directly reference the source string (as in `@-` or `foo=`). This is a `pm_string_t` in C.
+ * `constant` - A variable-length integer that represents an index in the constant pool. This is a `pm_constant_id_t` in C.
+ * `constant[]` - A child node that is an array of constants. This is a `pm_constant_id_list_t` in C.
+ * `location` - A child node that is a location. This is a `pm_location_t` in C.
+ * `location?` - A child node that is a location that is optionally present. This is a `pm_location_t` in C, but if the value is not present then the `start` and `end` fields will be `NULL`.
+ * `uint32` - A child node that is a 32-bit unsigned integer. This is a `uint32_t` in C.
+
+ If the type is `node` or `node?` then the entry also accepts an optional `kind` key (a string). This key is expected to match the name of another node type within `config.yml`. It changes a couple of places where code is templated out to use the more specific struct name instead of the generic `pm_node_t`. For example, with `kind: StatementsNode` the `pm_node_t *` in C becomes a `pm_statements_node_t *`.
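The `type`-to-C mapping above amounts to a simple lookup, sketched here in Ruby (illustrative only; the real logic lives in prism's templates):

```ruby
# Map a config.yml child-node type string to the C type it becomes.
# Hypothetical table mirroring the list above, for illustration only.
C_TYPES = {
  "node"       => "pm_node_t *",
  "node?"      => "pm_node_t *",
  "node[]"     => "pm_node_list_t",
  "string"     => "pm_string_t",
  "constant"   => "pm_constant_id_t",
  "constant[]" => "pm_constant_id_list_t",
  "location"   => "pm_location_t",
  "location?"  => "pm_location_t",
  "uint32"     => "uint32_t"
}.freeze

C_TYPES.fetch("node[]") # => "pm_node_list_t"
```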
data/docs/design.md ADDED
@@ -0,0 +1,53 @@
+ # Design
+
+ There are three overall goals for this project:
+
+ * to provide a documented and maintainable parser
+ * to provide an error-tolerant parser suitable for use in an IDE
+ * to provide a portable parser that can be used in projects that don't link against CRuby
+
+ The design of the parser is based around these main goals.
+
+ ## Structure
+
+ The first piece to understand about the parser is the design of its syntax tree. This is documented in `config.yml`. Every token and node is defined in that file, along with comments about which kinds of syntax they are found in. This file is used to template out a lot of different files, all found in the `templates` directory. The `templates/template.rb` script performs the templating and outputs all files matching the directory structure found in the templates directory.
+
+ The templated files contain all of the code required to allocate and initialize nodes, pretty-print nodes, and serialize nodes. This means that, for the most part, you will only need to hook up the parser to call the templated functions to create the nodes in the correct position. That means editing the parser itself, which is housed in `prism.c`.
+
+ ## Pratt parsing
+
+ In order to provide the best possible error tolerance, the parser is hand-written. It is structured using Pratt parsing, a technique developed by Vaughan Pratt in the 1970s. Below are a number of links to articles and papers that explain Pratt parsing in more detail.
+
+ * https://web.archive.org/web/20151223215421/http://hall.org.ua/halls/wizzard/pdf/Vaughan.Pratt.TDOP.pdf
+ * https://tdop.github.io/
+ * https://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
+ * https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html
+ * https://chidiwilliams.com/post/on-recursive-descent-and-pratt-parsing/
+
+ You can find most of the functions that correspond to constructs in the Pratt parsing algorithm in `prism.c`. As a couple of examples:
+
+ * `parse` corresponds to the `parse_expression` function
+ * `nud` (null denotation) corresponds to the `parse_expression_prefix` function
+ * `led` (left denotation) corresponds to the `parse_expression_infix` function
+ * `lbp` (left binding power) corresponds to accessing the `left` field of an element in the `binding_powers` array
+ * `rbp` (right binding power) corresponds to accessing the `right` field of an element in the `binding_powers` array
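To get a feel for the technique, here is a minimal Pratt-style expression evaluator in Ruby — a toy sketch of the algorithm only, unrelated to prism's actual C implementation:

```ruby
# A toy Pratt-style evaluator for integer arithmetic with + - * /.
# The binding powers play the role of prism's binding_powers array.
BINDING_POWERS = { "+" => 1, "-" => 1, "*" => 2, "/" => 2 }.freeze

def parse_expression(tokens, min_bp = 0)
  left = Integer(tokens.shift) # "nud": a prefix position must hold a number here

  # "led": fold infix operators while their binding power exceeds min_bp
  while (op = tokens.first) && BINDING_POWERS.fetch(op, 0) > min_bp
    tokens.shift
    right = parse_expression(tokens, BINDING_POWERS[op])
    left = left.public_send(op, right)
  end

  left
end

parse_expression("1 + 2 * 3".split) # => 7
```

Because the recursive call passes the operator's own binding power as the new minimum, `*` binds tighter than `+` and operators of equal power associate to the left.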
+
+ ## Portability
+
+ In order to enable using this parser in other projects, the parser is written in C99 and uses only the standard library. This means it can be embedded in almost any other project without having to link against CRuby. It can be used directly through its C API to access individual fields, or it can be used to parse a syntax tree and then serialize it to a single blob. For more information on serialization, see the [docs/serialization.md](serialization.md) file.
+
+ ## Error tolerance
+
+ The design of the error tolerance of this parser is still very much in flux. We are experimenting with various approaches as the parser is being developed to try to determine the best one. Below are a number of links to articles and papers that explain error tolerance in more detail, as well as document some of the approaches that we're evaluating.
+
+ * https://tratt.net/laurie/blog/2020/automatic_syntax_error_recovery.html
+ * https://diekmann.uk/diekmann_phd.pdf
+ * https://eelcovisser.org/publications/2012/JongeKVS12.pdf
+ * https://www.antlr.org/papers/allstar-techreport.pdf
+ * https://github.com/microsoft/tolerant-php-parser/blob/main/docs/HowItWorks.md
+
+ Currently, there are a couple of mechanisms for error tolerance in place:
+
+ * If the parser expects a token in a particular position (for example the `in` keyword in a for loop or the `{` after `BEGIN` or `END`), it will insert a missing token if one can't be found and continue parsing.
+ * If the parser expects an expression in a particular position but encounters a token that can't be used as that expression, it checks up the stack to see if that token would close out a parent node. If so, it will close out all of its parent nodes, using missing nodes wherever necessary, and continue parsing.
+ * If the parser cannot understand a token in any capacity, it will skip past the token.
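The first mechanism (missing-token insertion) can be sketched in Ruby — a toy illustration of the idea, not prism's C implementation:

```ruby
# Toy "expect" helper: consume the expected token, or synthesize a
# zero-width missing token and record an error so parsing can continue.
MissingToken = Struct.new(:type)

def expect(tokens, errors, expected)
  if tokens.first && tokens.first[:type] == expected
    tokens.shift
  else
    errors << "expected #{expected}"
    MissingToken.new(expected) # inserted token; the input is left untouched
  end
end

tokens = [{ type: :for }, { type: :identifier }, { type: :do }] # note: no `in`
errors = []
expect(tokens, errors, :for)
expect(tokens, errors, :identifier)
expect(tokens, errors, :in) # synthesizes a missing `in` token
errors # => ["expected in"]
```

The key property is that parsing never halts: the caller always receives a token (real or missing), and the error is reported once at the end.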
data/docs/encoding.md ADDED
@@ -0,0 +1,117 @@
+ # Encoding
+
+ When parsing a Ruby file, there are times when the parser must parse identifiers. Identifiers are the names of variables, methods, classes, etc. To determine the start of an identifier, the parser must be able to tell if the subsequent bytes form an alphabetic character. To determine the rest of the identifier, the parser must look forward through all alphanumeric characters.
+
+ Determining if a set of bytes comprises an alphabetic or alphanumeric character is encoding-dependent. By default, the parser assumes that all source files are encoded in UTF-8. If a file is not encoded in UTF-8, it must use an encoding that is "ASCII compatible" (i.e., all of the codepoints below 128 match the corresponding codepoints in ASCII, and the minimum number of bytes required to represent a codepoint is 1).
+
+ If the file is not encoded in UTF-8, the user must specify the encoding in a "magic" comment at the top of the file. The comment looks like:
+
+ ```ruby
+ # encoding: iso-8859-9
+ ```
+
+ The key of the comment can be either "encoding" or "coding". The value of the comment must be a valid encoding name. The encodings that prism supports by default are:
+
+ * `ascii`
+ * `ascii-8bit`
+ * `big5`
+ * `binary`
+ * `cp932`
+ * `euc-jp`
+ * `gbk`
+ * `iso-8859-1`
+ * `iso-8859-2`
+ * `iso-8859-3`
+ * `iso-8859-4`
+ * `iso-8859-5`
+ * `iso-8859-6`
+ * `iso-8859-7`
+ * `iso-8859-8`
+ * `iso-8859-9`
+ * `iso-8859-10`
+ * `iso-8859-11`
+ * `iso-8859-13`
+ * `iso-8859-14`
+ * `iso-8859-15`
+ * `iso-8859-16`
+ * `koi8-r`
+ * `shift_jis`
+ * `sjis`
+ * `us-ascii`
+ * `utf-8`
+ * `utf8-mac`
+ * `windows-31j`
+ * `windows-1251`
+ * `windows-1252`
+
+ For each of these encodings, prism provides a function for checking if the subsequent bytes form an alphabetic or alphanumeric character.
+
+ ## Support for other encodings
+
+ If an encoding is encountered that is not supported by prism, prism will call a user-provided callback function with the name of the encoding, if one is provided. That function can be registered with `pm_parser_register_encoding_decode_callback`. The callback can then provide a pointer to an encoding struct that contains the requisite functions, which prism will use to parse identifiers going forward.
+
+ If the user-provided callback returns `NULL` (the value also provided by the default implementation in case a callback was not registered), an error will be added to the parser's error list and parsing will continue using the default UTF-8 encoding.
+
+ ```c
+ // This struct defines the functions necessary to implement the encoding
+ // interface so we can determine how many bytes the subsequent character takes.
+ // Each callback should return the number of bytes, or 0 if the next bytes are
+ // invalid for the encoding and type.
+ typedef struct {
+     // Return the number of bytes that the next character takes if it is valid
+     // in the encoding. Does not read more than n bytes. It is assumed that n is
+     // at least 1.
+     size_t (*char_width)(const uint8_t *b, ptrdiff_t n);
+
+     // Return the number of bytes that the next character takes if it is valid
+     // in the encoding and is alphabetical. Does not read more than n bytes. It
+     // is assumed that n is at least 1.
+     size_t (*alpha_char)(const uint8_t *b, ptrdiff_t n);
+
+     // Return the number of bytes that the next character takes if it is valid
+     // in the encoding and is alphanumeric. Does not read more than n bytes. It
+     // is assumed that n is at least 1.
+     size_t (*alnum_char)(const uint8_t *b, ptrdiff_t n);
+
+     // Return true if the next character is valid in the encoding and is an
+     // uppercase character. Does not read more than n bytes. It is assumed that
+     // n is at least 1.
+     bool (*isupper_char)(const uint8_t *b, ptrdiff_t n);
+
+     // The name of the encoding. This should correspond to a value that can be
+     // passed to Encoding.find in Ruby.
+     const char *name;
+
+     // True if the encoding is a multibyte encoding.
+     bool multibyte;
+ } pm_encoding_t;
+
+ // When an encoding is encountered that isn't understood by prism, we provide
+ // the ability here to call out to a user-defined function to get an encoding
+ // struct. If the function returns something that isn't NULL, we set that to
+ // our encoding and use it to parse identifiers.
+ typedef pm_encoding_t *(*pm_encoding_decode_callback_t)(pm_parser_t *parser, const uint8_t *name, size_t width);
+
+ // Register a callback that will be called when prism encounters a magic comment
+ // with an encoding referenced that it doesn't understand. The callback should
+ // return NULL if it also doesn't understand the encoding or it should return a
+ // pointer to a pm_encoding_t struct that contains the functions necessary to
+ // parse identifiers.
+ PRISM_EXPORTED_FUNCTION void
+ pm_parser_register_encoding_decode_callback(pm_parser_t *parser, pm_encoding_decode_callback_t callback);
+ ```
+
+ ## Getting notified when the encoding changes
+
+ You may want to be notified when the encoding changes based on the result of parsing an encoding comment. We use this internally for our `lex` function in order to provide the correct encodings for the tokens that are returned. For that you can register a callback with `pm_parser_register_encoding_changed_callback`. The callback will be called with a pointer to the parser. The encoding can be accessed through `parser->encoding`.
+
+ ```c
+ // When the encoding that is being used to parse the source is changed by prism,
+ // we provide the ability here to call out to a user-defined function.
+ typedef void (*pm_encoding_changed_callback_t)(pm_parser_t *parser);
+
+ // Register a callback that will be called whenever prism changes the encoding
+ // it is using to parse based on the magic comment.
+ PRISM_EXPORTED_FUNCTION void
+ pm_parser_register_encoding_changed_callback(pm_parser_t *parser, pm_encoding_changed_callback_t callback);
+ ```
data/docs/fuzzing.md ADDED
@@ -0,0 +1,93 @@
+ # Fuzzing
+
+ We use fuzzing to test the various entrypoints to the library. The fuzzer we use is [AFL++](https://aflplus.plus). All files related to fuzzing live within the `fuzz` directory, which has the following structure:
+
+ ```
+ fuzz
+ ├── corpus
+ │   ├── parse          fuzzing corpus for parsing (a symlink to our fixtures)
+ │   ├── regexp         fuzzing corpus for regexp
+ │   └── unescape       fuzzing corpus for unescaping strings
+ ├── dict               an AFL++ dictionary containing various tokens
+ ├── docker
+ │   └── Dockerfile     for building a container with the fuzzer toolchain
+ ├── fuzz.c             generic entrypoint for fuzzing
+ ├── heisenbug.c        entrypoint for reproducing a crash or hang
+ ├── parse.c            fuzz handler for parsing
+ ├── parse.sh           script to run the parsing fuzzer
+ ├── regexp.c           fuzz handler for regular expression parsing
+ ├── regexp.sh          script to run the regexp fuzzer
+ ├── tools
+ │   ├── backtrace.sh   generates backtrace files for a crash directory
+ │   └── minimize.sh    generates minimized crash or hang files
+ ├── unescape.c         fuzz handler for unescape functionality
+ └── unescape.sh        script to run the unescape fuzzer
+ ```
+
+ ## Usage
+
+ There are currently three fuzzing targets:
+
+ - `pm_parse_serialize` (parse)
+ - `pm_regexp_named_capture_group_names` (regexp)
+ - `pm_unescape_manipulate_string` (unescape)
+
+ Respectively, fuzzing can be performed with:
+
+ ```
+ make fuzz-run-parse
+ make fuzz-run-regexp
+ make fuzz-run-unescape
+ ```
+
+ To end a fuzzing job, interrupt it with CTRL+C. To enter a container with the fuzzing toolchain and debug utilities, run:
+
+ ```
+ make fuzz-debug
+ ```
+
+ ## Out-of-bounds reads
+
+ Currently, encoding functionality implementing the `pm_encoding_t` interface can read outside of its inputs. For the time being, ASAN instrumentation is disabled for functions from `src/enc`. See `fuzz/asan.ignore`.
+
+ To disable ASAN read instrumentation globally, use the `FUZZ_FLAGS` environment variable, e.g.:
+
+ ```
+ FUZZ_FLAGS="-mllvm -asan-instrument-reads=false" make fuzz-run-parse
+ ```
+
+ Note that this may make reproducing bugs difficult, as they may depend on memory outside of the input buffer. In that case, try:
+
+ ```
+ make fuzz-debug # enter the docker container with build tools
+ make build/fuzz.heisenbug.parse # or .unescape or .regexp
+ ./build/fuzz.heisenbug.parse path-to-problem-input
+ ```
+
+ ## Triaging crashes and hangs
+
+ Triaging crashes and hangs is easier when the inputs are as short as possible. In the fuzz container, an entire crash or hang directory can be minimized using:
+
+ ```
+ ./fuzz/tools/minimize.sh <directory>
+ ```
+
+ e.g.
+
+ ```
+ ./fuzz/tools/minimize.sh fuzz/output/parse/default/crashes
+ ```
+
+ This may take a long time. In the crash/hang directory, a minimized version of each input file will appear with the extension `.min` appended.
+
+ Backtraces for crashes (not hangs) can be generated en masse with:
+
+ ```
+ ./fuzz/tools/backtrace.sh <directory>
+ ```
+
+ For each input file, a file with the same basename and the extension `.bt` will be created, e.g.:
+
+ ```
+ id:000000,sig:06,src:000006+000190,time:8480,execs:18929,op:splice,rep:4
+ id:000000,sig:06,src:000006+000190,time:8480,execs:18929,op:splice,rep:4.bt
+ ```
data/docs/heredocs.md ADDED
@@ -0,0 +1,36 @@
+ # Heredocs
+
+ Heredocs are one of the most complicated pieces of this parser. There are many different forms, there can be multiple open at the same time, and they can be nested. In order to support parsing them, we keep track of a lot of metadata. Below is a basic overview of how it works.
+
+ ## 1. Lexing the identifier
+
+ When a heredoc identifier is encountered in the regular process of lexing, we push the `PM_LEX_HEREDOC` mode onto the stack with the following metadata:
+
+ * `ident_start`: A pointer to the start of the identifier for the heredoc. We need this to match against the end of the heredoc.
+ * `ident_length`: The length of the identifier for the heredoc. We also need this for matching.
+ * `next_start`: A pointer to the place in the source where the parser should resume lexing once it has completed this heredoc.
+
+ We also set the special `parser.next_start` field, which is a pointer to the place in the source where we should start lexing the next token. This is set to the character immediately following the next newline.
+
+ Note that if the `parser.heredoc_end` field is already set, it means we have already encountered a heredoc on this line. In that case, the `parser.next_start` field will be set to the `parser.heredoc_end` field, because we want to skip past the previous heredocs on this line and instead lex the body of this heredoc.
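The bookkeeping above can be sketched in Ruby, with hypothetical field names mirroring the C metadata (the real implementation lives in `prism.c` and uses pointers rather than string indices):

```ruby
# Toy lex-mode stack: entering a heredoc pushes a mode carrying the
# identifier bounds and where to resume lexing afterwards.
HeredocMode = Struct.new(:ident_start, :ident_length, :next_start)

stack = []
source = "foo(<<~DOC, bar)\n  body\nDOC\n"

# On seeing the identifier, push PM_LEX_HEREDOC-style metadata:
ident_start = source.index("DOC")
stack.push(HeredocMode.new(ident_start, 3, source.index("\n") + 1))

mode = stack.last
mode_ident = source[mode.ident_start, mode.ident_length] # => "DOC"

stack.pop # terminator found: leave the heredoc mode
stack.empty? # => true
```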
16
+
17
+ ## 2. Lexing the body
18
+
19
+ The next time the lexer is asked for a token, it will be in the `PM_LEX_HEREDOC` mode. In this mode we are lexing the body of the heredoc. It will start by checking if the `next_start` field is set. If it is, then this is the first token within the body of the heredoc so we'll start lexing from there. Otherwise we'll start lexing from the end of the previous token.
20
+
21
+ Lexing these fields is extremely similar to lexing an interpolated string. The only difference is that we also do an additional check at the beginning of each line to check if we have hit the terminator.

## 3. Lexing the terminator

On every newline within the body of a heredoc, we check whether it matches the terminator followed by a newline (or by a carriage return and a newline). If it does, then we pop the lex mode off the stack and set a couple of fields on the parser:

* `next_start`: This is set to the value that we previously stored on the heredoc, indicating where the lexer should resume lexing when it is done with this heredoc.
* `heredoc_end`: This is set to the end of the heredoc. When a newline character is found, it indicates that the lexer should skip ahead to this point.

## 4. Lexing the rest of the line

Once the heredoc has been lexed, the lexer resumes lexing from the `next_start` field. Lexing continues until the next newline character. When the next newline character is found, the lexer checks whether the `heredoc_end` field is set. If it is, the lexer skips to that point, unsets the field, and continues lexing.
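As a concrete illustration of the four steps above (a hypothetical snippet, not taken from the prism test suite), consider a heredoc whose opening identifier is followed by a method call on the same line. The lexer records where lexing should resume so that it can come back and lex `.upcase` after the heredoc body and terminator have been handled:

```ruby
# The lexer sees `<<~GREETING`, pushes the heredoc lex mode, and remembers
# that `.upcase` still needs to be lexed once the body is finished.
greeting = <<~GREETING.upcase
  hello world
GREETING

greeting # => "HELLO WORLD\n"
```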

## Compatibility with Ripper

The order in which tokens are emitted is different from that of Ripper. Ripper emits each token in the file in the order in which it appears. prism instead emits tokens in the order that makes the most sense for the lexer, using the process described above. Therefore, to line things up, `Prism.lex_compat` will shuffle the tokens around to match Ripper's output.
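For example (a sketch, assuming Ruby 3.3+ where `prism` is available), the Ripper-compatible tokens for a heredoc come back in source order, with the body tokens appearing after the remainder of the opening line rather than in the order the lexer produced them:

```ruby
require "prism"

source = "foo = <<~TEXT.upcase\n  bar\nTEXT\n"

# Each entry is a Ripper-style token: [[line, column], event, value, state].
tokens = Prism.lex_compat(source).value
events = tokens.map { |token| token[1] }

events.first                     # :on_ident (for `foo`)
events.include?(:on_heredoc_beg) # true
```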
data/docs/mapping.md ADDED
@@ -0,0 +1,117 @@
# Mapping

When comparing the previous CRuby parser with prism, this document should help you understand how the various concepts are mapped.

## Nodes

The following table shows how the various CRuby nodes are mapped to prism nodes.

| CRuby | prism |
| --- | --- |
| `NODE_SCOPE` | |
| `NODE_BLOCK` | |
| `NODE_IF` | `PM_IF_NODE` |
| `NODE_UNLESS` | `PM_UNLESS_NODE` |
| `NODE_CASE` | `PM_CASE_NODE` |
| `NODE_CASE2` | `PM_CASE_NODE` (with a null predicate) |
| `NODE_CASE3` | |
| `NODE_WHEN` | `PM_WHEN_NODE` |
| `NODE_IN` | `PM_IN_NODE` |
| `NODE_WHILE` | `PM_WHILE_NODE` |
| `NODE_UNTIL` | `PM_UNTIL_NODE` |
| `NODE_ITER` | `PM_CALL_NODE` (with a non-null block) |
| `NODE_FOR` | `PM_FOR_NODE` |
| `NODE_FOR_MASGN` | `PM_FOR_NODE` (with a multi-write node as the index) |
| `NODE_BREAK` | `PM_BREAK_NODE` |
| `NODE_NEXT` | `PM_NEXT_NODE` |
| `NODE_REDO` | `PM_REDO_NODE` |
| `NODE_RETRY` | `PM_RETRY_NODE` |
| `NODE_BEGIN` | `PM_BEGIN_NODE` |
| `NODE_RESCUE` | `PM_RESCUE_NODE` |
| `NODE_RESBODY` | |
| `NODE_ENSURE` | `PM_ENSURE_NODE` |
| `NODE_AND` | `PM_AND_NODE` |
| `NODE_OR` | `PM_OR_NODE` |
| `NODE_MASGN` | `PM_MULTI_WRITE_NODE` |
| `NODE_LASGN` | `PM_LOCAL_VARIABLE_WRITE_NODE` |
| `NODE_DASGN` | `PM_LOCAL_VARIABLE_WRITE_NODE` |
| `NODE_GASGN` | `PM_GLOBAL_VARIABLE_WRITE_NODE` |
| `NODE_IASGN` | `PM_INSTANCE_VARIABLE_WRITE_NODE` |
| `NODE_CDECL` | `PM_CONSTANT_PATH_WRITE_NODE` |
| `NODE_CVASGN` | `PM_CLASS_VARIABLE_WRITE_NODE` |
| `NODE_OP_ASGN1` | |
| `NODE_OP_ASGN2` | |
| `NODE_OP_ASGN_AND` | `PM_OPERATOR_AND_ASSIGNMENT_NODE` |
| `NODE_OP_ASGN_OR` | `PM_OPERATOR_OR_ASSIGNMENT_NODE` |
| `NODE_OP_CDECL` | |
| `NODE_CALL` | `PM_CALL_NODE` |
| `NODE_OPCALL` | `PM_CALL_NODE` (with an operator as the method) |
| `NODE_FCALL` | `PM_CALL_NODE` (with a null receiver and parentheses) |
| `NODE_VCALL` | `PM_CALL_NODE` (with a null receiver and no parentheses or arguments) |
| `NODE_QCALL` | `PM_CALL_NODE` (with a &. operator) |
| `NODE_SUPER` | `PM_SUPER_NODE` |
| `NODE_ZSUPER` | `PM_FORWARDING_SUPER_NODE` |
| `NODE_LIST` | `PM_ARRAY_NODE` |
| `NODE_ZLIST` | `PM_ARRAY_NODE` (with no child elements) |
| `NODE_VALUES` | `PM_ARGUMENTS_NODE` |
| `NODE_HASH` | `PM_HASH_NODE` |
| `NODE_RETURN` | `PM_RETURN_NODE` |
| `NODE_YIELD` | `PM_YIELD_NODE` |
| `NODE_LVAR` | `PM_LOCAL_VARIABLE_READ_NODE` |
| `NODE_DVAR` | `PM_LOCAL_VARIABLE_READ_NODE` |
| `NODE_GVAR` | `PM_GLOBAL_VARIABLE_READ_NODE` |
| `NODE_IVAR` | `PM_INSTANCE_VARIABLE_READ_NODE` |
| `NODE_CONST` | `PM_CONSTANT_PATH_READ_NODE` |
| `NODE_CVAR` | `PM_CLASS_VARIABLE_READ_NODE` |
| `NODE_NTH_REF` | `PM_NUMBERED_REFERENCE_READ_NODE` |
| `NODE_BACK_REF` | `PM_BACK_REFERENCE_READ_NODE` |
| `NODE_MATCH` | |
| `NODE_MATCH2` | `PM_CALL_NODE` (with a regular expression as the receiver) |
| `NODE_MATCH3` | `PM_CALL_NODE` (with a regular expression as the only argument) |
| `NODE_LIT` | |
| `NODE_STR` | `PM_STRING_NODE` |
| `NODE_DSTR` | `PM_INTERPOLATED_STRING_NODE` |
| `NODE_XSTR` | `PM_X_STRING_NODE` |
| `NODE_DXSTR` | `PM_INTERPOLATED_X_STRING_NODE` |
| `NODE_EVSTR` | `PM_STRING_INTERPOLATED_NODE` |
| `NODE_DREGX` | `PM_INTERPOLATED_REGULAR_EXPRESSION_NODE` |
| `NODE_ONCE` | |
| `NODE_ARGS` | `PM_PARAMETERS_NODE` |
| `NODE_ARGS_AUX` | |
| `NODE_OPT_ARG` | `PM_OPTIONAL_PARAMETER_NODE` |
| `NODE_KW_ARG` | `PM_KEYWORD_PARAMETER_NODE` |
| `NODE_POSTARG` | `PM_REQUIRED_PARAMETER_NODE` |
| `NODE_ARGSCAT` | |
| `NODE_ARGSPUSH` | |
| `NODE_SPLAT` | `PM_SPLAT_NODE` |
| `NODE_BLOCK_PASS` | `PM_BLOCK_ARGUMENT_NODE` |
| `NODE_DEFN` | `PM_DEF_NODE` (with a null receiver) |
| `NODE_DEFS` | `PM_DEF_NODE` (with a non-null receiver) |
| `NODE_ALIAS` | `PM_ALIAS_NODE` |
| `NODE_VALIAS` | `PM_ALIAS_NODE` (with a global variable first argument) |
| `NODE_UNDEF` | `PM_UNDEF_NODE` |
| `NODE_CLASS` | `PM_CLASS_NODE` |
| `NODE_MODULE` | `PM_MODULE_NODE` |
| `NODE_SCLASS` | `PM_S_CLASS_NODE` |
| `NODE_COLON2` | `PM_CONSTANT_PATH_NODE` |
| `NODE_COLON3` | `PM_CONSTANT_PATH_NODE` (with a null receiver) |
| `NODE_DOT2` | `PM_RANGE_NODE` (with a .. operator) |
| `NODE_DOT3` | `PM_RANGE_NODE` (with a ... operator) |
| `NODE_FLIP2` | `PM_RANGE_NODE` (with a .. operator) |
| `NODE_FLIP3` | `PM_RANGE_NODE` (with a ... operator) |
| `NODE_SELF` | `PM_SELF_NODE` |
| `NODE_NIL` | `PM_NIL_NODE` |
| `NODE_TRUE` | `PM_TRUE_NODE` |
| `NODE_FALSE` | `PM_FALSE_NODE` |
| `NODE_ERRINFO` | |
| `NODE_DEFINED` | `PM_DEFINED_NODE` |
| `NODE_POSTEXE` | `PM_POST_EXECUTION_NODE` |
| `NODE_DSYM` | `PM_INTERPOLATED_SYMBOL_NODE` |
| `NODE_ATTRASGN` | `PM_CALL_NODE` (with a message that ends with =) |
| `NODE_LAMBDA` | `PM_LAMBDA_NODE` |
| `NODE_ARYPTN` | `PM_ARRAY_PATTERN_NODE` |
| `NODE_HSHPTN` | `PM_HASH_PATTERN_NODE` |
| `NODE_FNDPTN` | `PM_FIND_PATTERN_NODE` |
| `NODE_ERROR` | `PM_MISSING_NODE` |
| `NODE_LAST` | |
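To check which prism node a given construct produces, you can parse a snippet and inspect the resulting tree (a sketch, assuming Ruby 3.3+ where `prism` is available):

```ruby
require "prism"

# NODE_IF in CRuby corresponds to a Prism::IfNode (PM_IF_NODE in C),
# including the modifier form of `if`.
node = Prism.parse("foo if bar").value.statements.body.first
node.class # => Prism::IfNode
```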
data/docs/ripper.md ADDED
@@ -0,0 +1,36 @@
# Ripper

To test the parser, we compare against the output from `Ripper`, both for testing the lexer and for testing the parser. The lexer test suite is much more feature complete at the moment.

To lex source code using `prism`, you would typically run `Prism.lex(source)`. If you instead want the output that `Ripper` would normally produce, you can run `Prism.lex_compat(source)`. This will produce tokens that should be equivalent to `Ripper`.

To parse source code using `prism`, you would typically run `Prism.parse(source)`. If you instead want to use the `Ripper` streaming interface, you can inherit from `Prism::RipperCompat` and override the `on_*` methods. This will produce a syntax tree that should be equivalent to `Ripper`. That would look like:

```ruby
class ArithmeticRipper < Prism::RipperCompat
  def on_binary(left, operator, right)
    left.public_send(operator, right)
  end

  def on_int(value)
    value.to_i
  end

  def on_program(stmts)
    stmts
  end

  def on_stmts_new
    []
  end

  def on_stmts_add(stmts, stmt)
    stmts << stmt
    stmts
  end
end

ArithmeticRipper.new("1 + 2 - 3").parse # => [0]
```

There are also APIs for building trees similar to the s-expression builders in `Ripper`. The method names are the same. These include `Prism::RipperCompat.sexp_raw(source)` and `Prism::RipperCompat.sexp(source)`.
data/docs/ruby_api.md ADDED
@@ -0,0 +1,25 @@
# Ruby API

The `prism` gem provides a Ruby API for accessing the syntax tree.

For the most part, the API for accessing the tree mirrors that found in the [Syntax Tree](https://github.com/ruby-syntax-tree/syntax_tree) project. This means:

* Walking the tree involves creating a visitor and passing it to the `#accept` method on any node in the tree
* Nodes in the tree respond to named methods for accessing their children as well as `#child_nodes`
* Nodes respond to the pattern matching interfaces `#deconstruct` and `#deconstruct_keys`
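For example, a minimal visitor that counts integer literals might look like this (a sketch; the `IntegerCounter` class is hypothetical, and Ruby 3.3+ with `prism` available is assumed):

```ruby
require "prism"

# Override the generated hook for IntegerNode and keep walking the children.
class IntegerCounter < Prism::Visitor
  attr_reader :count

  def initialize
    @count = 0
  end

  def visit_integer_node(node)
    @count += 1
    super # continue visiting this node's children
  end
end

counter = IntegerCounter.new
Prism.parse("1 + 2 + 3").value.accept(counter)
counter.count # => 3
```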

Every entry in `config.yml` generates a Ruby class as well as the code that builds the nodes themselves. Creating a syntax tree involves calling one of the class methods on the `Prism` module. The full API is documented below.

## API

* `Prism.dump(source, filepath)` - parse the syntax tree corresponding to the given source string and filepath and serialize it to a string (the filepath may be nil)
* `Prism.dump_file(filepath)` - parse the syntax tree corresponding to the given source file and serialize it to a string
* `Prism.lex(source)` - parse the tokens corresponding to the given source string and return them as an array within a parse result
* `Prism.lex_file(filepath)` - parse the tokens corresponding to the given source file and return them as an array within a parse result
* `Prism.parse(source)` - parse the syntax tree corresponding to the given source string and return it within a parse result
* `Prism.parse_file(filepath)` - parse the syntax tree corresponding to the given source file and return it within a parse result
* `Prism.parse_lex(source)` - parse the syntax tree corresponding to the given source string and return it within a parse result, along with the tokens
* `Prism.parse_lex_file(filepath)` - parse the syntax tree corresponding to the given source file and return it within a parse result, along with the tokens
* `Prism.load(source, serialized)` - load the serialized syntax tree, using the source as a reference, into a syntax tree
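Putting a few of these together (a sketch, assuming Ruby 3.3+ where `prism` is bundled):

```ruby
require "prism"

result = Prism.parse("1 + 2")
result.success? # => true (no syntax errors)
result.value    # the root Prism::ProgramNode

# Tokens come back inside a result object as [token, lexer state] pairs.
tokens = Prism.lex("1 + 2").value
tokens.empty? # => false
```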