descent 0.7.1
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +7 -0
- data/CHANGELOG.md +285 -0
- data/README.md +583 -0
- data/SYNTAX.md +334 -0
- data/exe/descent +15 -0
- data/lib/descent/ast.rb +69 -0
- data/lib/descent/generator.rb +489 -0
- data/lib/descent/ir.rb +98 -0
- data/lib/descent/ir_builder.rb +1479 -0
- data/lib/descent/lexer.rb +308 -0
- data/lib/descent/parser.rb +450 -0
- data/lib/descent/railroad.rb +272 -0
- data/lib/descent/templates/rust/_command.liquid +174 -0
- data/lib/descent/templates/rust/parser.liquid +1163 -0
- data/lib/descent/tools/debug.rb +115 -0
- data/lib/descent/tools/diagram.rb +48 -0
- data/lib/descent/tools/generate.rb +47 -0
- data/lib/descent/tools/validate.rb +56 -0
- data/lib/descent/validator.rb +231 -0
- data/lib/descent/version.rb +5 -0
- data/lib/descent.rb +34 -0
- metadata +101 -0
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA256:
+  metadata.gz: 75c7bef6464798eee6b29147f663d41fcb237c71338e67f14596bd6ce98d4574
+  data.tar.gz: 9a76dd2385b9087cc95640437fa5be20fcde8468525211c4d1c1cfd6b72bd626
+SHA512:
+  metadata.gz: f3065ddce128bf8864915aef45ba67b2f24403d67ce2d159d7616cdceb9559821080692ee7be330802134868b1819a5a381c40d538addc12852f5abbdbd2f7ce
+  data.tar.gz: 79231da63252a41e0ac42666cbf17001a87397a7885bd3c63bfb79d153656c904a5f6f04b3b9c80591f3f39effec841fc902112748bfafad19cd03dda4297509
data/CHANGELOG.md
ADDED
@@ -0,0 +1,285 @@
+# Changelog
+
+All notable changes to descent will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.7.1] - 2026-01-09
+
+### Fixed
+- **Commands after /error preserved**: Removed stale `filter_unreachable_after_error`
+  that incorrectly dropped commands (including `|return`) after `/error` calls.
+
+## [0.7.0] - 2026-01-09
+
+### Added
+- **Comprehensive test suite**: 148 tests covering Lexer, Parser, IRBuilder, Validator,
+  and Generator modules. Includes integration harness tests that compile and run parsers.
+- **SYNTAX.md reference**: Complete .desc DSL syntax reference document with BNF-style
+  grammar, all directives, actions, and character classes.
+- **Validator checks**: Missing parser name validation, duplicate types, undefined
+  function calls, invalid state transitions, and more.
+- **`streaming:` option**: Generator now accepts `streaming: false` to omit StreamingParser
+  infrastructure (~200 lines) when not needed.
+
+### Changed
+- **`/error` no longer auto-returns**: The `/error(Code)` command now only emits the
+  error event. Add explicit `|return` to exit. This enables error recovery patterns
+  like `/error(NoTabs) | ->['\n'] |>>`.
+- **Escape sequences consolidated**: Single `ESCAPE_SEQUENCES` constant used by both
+  `rust_expr` and `transform_call_args`, eliminating duplication.
+
+### Fixed
+- **COL in function args**: `/element(COL)` now correctly generates `self.parse_element(self.col(), on_event)`
+  instead of broken output. Function calls are processed before COL/LINE/PREV expansion.
+- **Unused variable warnings**: Locals only assigned at entry (not reassigned in body)
+  now emit `let` instead of `let mut`, eliminating `unused_mut` warnings.
+- **Double return warnings**: Fixed template generating two consecutive returns when
+  `/error` was followed by explicit `|return`.
+
+## [0.6.17] - 2026-01-02
+
+### Fixed
+- **O(n²) chained memchr**: scan_to4/5/6 now limit the second search to the range
+  found by the first search. Previously both searches scanned the entire remaining
+  input independently, causing O(n²) behavior on large documents.
+
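Editorial illustration of the 0.6.17 entry above: a minimal Rust sketch of a four-target scan built from two chained searches, where the second search is bounded by the first hit. It is not the gem's generated code. `find_any` is a std-only stand-in for a memchr-style search, and the `scan_to4` signature shown here is an assumption made for the example.

```rust
/// Std-only stand-in for a memchr-style search (assumption for this sketch):
/// returns the first index in `hay` holding any byte from `needles`.
fn find_any(hay: &[u8], needles: &[u8]) -> Option<usize> {
    hay.iter().position(|b| needles.contains(b))
}

/// Hypothetical four-target scan: one search for the first three targets,
/// then a second search for the fourth target.
fn scan_to4(hay: &[u8], targets: [u8; 4]) -> Option<usize> {
    let first = find_any(hay, &targets[..3]);
    // Bound the second search by the first hit. A fourth-target match past
    // that point could never be the earliest match, so scanning further
    // would only repeat work on every call.
    let limit = first.unwrap_or(hay.len());
    let second = find_any(&hay[..limit], &targets[3..]);
    // `second`, when present, lies strictly before `first`.
    second.or(first)
}
```

With the bound in place, a call that finds its first hit after k bytes examines at most about 2k bytes, no matter how much input remains after the hit.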
+## [0.6.16] - 2026-01-02
+
+### Changed
+- **SIMD newline injection for line tracking**: Scannable states now automatically
+  inject `'\n'` into scan targets (if not already present and size < 6). This enables
+  correct line/column tracking during SIMD scans without runtime checks. When the
+  injected newline is hit, the parser updates line/column and continues scanning.
+  Scan functions simplified to just add offset to column, trusting no newlines exist
+  between start and found position.
+
+## [0.6.15] - 2026-01-02
+
+### Fixed
+- **pascalcase preserves PascalCase**: The `pascalcase` filter now correctly handles
+  already-PascalCase input like `UnclosedInterpolation` instead of lowercasing it
+  to `Unclosedinterpolation`. Splits on case transitions in addition to delimiters.
+- **Error code deduplication**: Custom `/error(Code)` calls no longer create duplicate
+  enum variants when the same code is auto-generated from `expects_char` inference.
+
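Editorial illustration of the `pascalcase` entry above: a small Rust sketch of a conversion that splits on delimiters and on lower-to-upper case transitions. The gem's actual filter is written in Ruby and is not reproduced here; this function and its exact rules are assumptions chosen only to show the idea.

```rust
/// Hypothetical PascalCase conversion. A new word starts after any
/// non-alphanumeric delimiter and at every lowercase-to-uppercase transition.
fn pascalcase(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut start_of_word = true;
    let mut prev_lower = false;
    for ch in input.chars() {
        if !ch.is_alphanumeric() {
            // Delimiters are dropped and force a new word.
            start_of_word = true;
            prev_lower = false;
            continue;
        }
        if ch.is_uppercase() && prev_lower {
            // Case transition such as the "dI" in "UnclosedInterpolation".
            start_of_word = true;
        }
        if start_of_word {
            out.extend(ch.to_uppercase());
        } else {
            out.extend(ch.to_lowercase());
        }
        start_of_word = false;
        prev_lower = ch.is_lowercase();
    }
    out
}
```

Without the case-transition rule, an input with no delimiters is treated as one word and everything after the first letter is lowercased, which is the `Unclosedinterpolation` result the entry describes.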
+## [0.6.14] - 2026-01-02
+
+### Added
+- **advance_to validation**: `->[...]` now validates its arguments at IR build time.
+  Errors on: character classes (LETTER, DIGIT, etc.), parameter references (:param),
+  and >6 characters. Only literal bytes are supported (uses SIMD memchr).
+
+### Fixed
+- **advance_to 4-6 chars**: `->[...]` now correctly supports 4-6 characters using
+  chained memchr (scan_to4/5/6). Previously the template generated broken code for >3 chars.
+
+## [0.6.13] - 2026-01-01
+
+### Fixed
+- **BRACKET End event on inline emit**: BRACKET type functions now correctly emit
+  their End event on return even when preceded by inline emits like `RawContent(USE_MARK)`.
+  Previously `suppress_auto_emit` incorrectly skipped the End event for BRACKET types.
+- **If-case break after return**: `|if[cond] |return` followed by `| -> |>> :state` now
+  correctly generates two separate match arms. Previously the bare action case commands
+  were appended to the if-case, causing unreachable code warnings.
+- **Entry actions preserved**: Function-level entry actions like `| val = 5` are now
+  correctly preserved through IR transformations (prepend_values, type coercion).
+- **Local variable initialization**: Locals with entry action assignments now initialize
+  directly with the value (`let mut val: i32 = 5;`) instead of init-then-assign,
+  eliminating "value assigned is never read" warnings.
+- **set_term helper emission**: `TERM` commands now correctly trigger emission of the
+  `set_term` helper method regardless of offset value. Previously `TERM(0)` failed to
+  compile because the helper wasn't generated.
+
+## [0.6.12] - 2026-01-01
+
+### Changed
+- **Conditional helper emission**: Generated parsers now only include helper methods
+  that are actually used, eliminating dead_code warnings. The generator analyzes
+  usage of `col()`, `prev()`, character class methods (`is_letter`, `is_digit`, etc.),
+  and scan methods (`scan_to1` through `scan_to6`) and only emits what's needed.
+
+## [0.6.9] - 2026-01-01
+
+### Fixed
+- **Unconditional state handling**: States with bare action cases (no character match)
+  now execute immediately without waiting for a byte. Previously, `| MARK |>> :next`
+  would generate `Some(_) =>` which waited for a byte before executing MARK.
+
+## [0.6.8] - 2026-01-01
+
+### Fixed
+- **Empty content span bug**: `span_from_mark()` and `term()` now correctly handle
+  empty content where TERM is called at the same position as MARK. Uses sentinel
+  value (`usize::MAX`) to distinguish "TERM not called" from "TERM called with
+  empty content". Fixes spans like `!{{}}` returning 6..8 instead of 6..6.
+- **Example syntax**: Fixed `c[\n]` → `c['\n']` in example .desc files. Bare
+  escape sequences must be quoted per characters.md spec.
+
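Editorial illustration of the empty-span entry above: a minimal Rust sketch of how a `usize::MAX` sentinel can separate "TERM not called" from "TERM called at the MARK position". The `Cursor` type, its fields, and the constant name `TERM_UNSET` are hypothetical; only the method names `span_from_mark` and `set_term` and the sentinel value come from the changelog.

```rust
use std::ops::Range;

/// Sentinel meaning "TERM has not been called since the last MARK".
const TERM_UNSET: usize = usize::MAX;

/// Hypothetical parser cursor, reduced to the three positions that matter here.
struct Cursor {
    pos: usize,
    mark: usize,
    term: usize,
}

impl Cursor {
    fn new() -> Self {
        Cursor { pos: 0, mark: 0, term: TERM_UNSET }
    }

    /// Start a span at the current position and forget any earlier TERM.
    fn mark(&mut self) {
        self.mark = self.pos;
        self.term = TERM_UNSET;
    }

    /// End the span at the current position, even if it equals `mark`.
    fn set_term(&mut self) {
        self.term = self.pos;
    }

    /// Span from MARK to TERM, falling back to the current position only
    /// when TERM was never called.
    fn span_from_mark(&self) -> Range<usize> {
        let end = if self.term == TERM_UNSET { self.pos } else { self.term };
        self.mark..end
    }
}
```

Because a real offset can equal the MARK position, only an out-of-band value such as `usize::MAX` can encode "unset" without colliding with a legitimate empty span like 6..6.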
+## [0.6.7] - 2025-01-01
+
+### Fixed
+- **`:param` in conditionals**: `if[col <= :line_col]` now correctly generates
+  `col <= line_col` instead of literal `:line_col`.
+- **`<>` for `:byte` params**: Empty class now generates `0u8` (never-match sentinel)
+  instead of `b'?'` which incorrectly matched question marks.
+- **Function call arg validation**: `/func(param)` where `param` matches a known
+  parameter now errors with helpful message suggesting `:param` or `'param'`.
+
+## [0.6.6] - 2025-01-01
+
+### Added
+- **Unified CharacterClass parser**: New `CharacterClass` module implements the
+  `characters.md` spec with consistent parsing everywhere (c[...], function args,
+  PREPEND). All character class syntax now goes through a single code path.
+- **Param reference validation**: Bare identifiers matching param names now raise
+  helpful errors in both PREPEND and function calls:
+  - `PREPEND(foo)` → suggests `PREPEND(:foo)` or `PREPEND('foo')`
+  - `/func(foo)` → suggests `/func(:foo)` or `/func('foo')`
+  - This prevents confusing bugs where param names are treated as literal strings
+
+### Fixed
+- **`<>` empty class consistency**: `<>` now correctly means "empty" everywhere:
+  - `PREPEND(<>)` → `b""` (no-op, empty prepend)
+  - `/func(<>)` for `:bytes` param → `b""` (empty byte slice)
+  - Previously `PREPEND(<>)` incorrectly output literal `<>` characters
+- **Type inference for numeric comparisons**: Conditions like `space_term == 0`
+  no longer incorrectly type the param as `:byte`. Numeric flag comparisons stay
+  as `:i32`; only character literal comparisons (e.g., `close == '|'`) set `:byte`.
+- **`:byte` type propagation**: When function A passes `:param` to function B
+  where B's param is `:byte`, A's param now correctly becomes `:byte`. Previously
+  only `:bytes` was propagated.
+- **Hex escapes in literals**: `'\x00'` and other hex escapes now work correctly
+  in PREPEND and function arguments, producing actual byte values.
+
+### Changed
+- Removed duplicate constant definitions (PREDEFINED_RANGES, SINGLE_CHAR_CLASSES)
+  in favor of unified CharacterClass module.
+- `bytes_like_value?` now only matches `<>` - single-char values like `'|'` are
+  typed based on usage, not call-site inference.
+
+## [0.6.5] - 2024-12-31
+
+### Fixed
+- **PREPEND quote stripping**: `PREPEND('|')` now correctly generates `b"|"` (1 byte)
+  instead of `b"'|'"` (3 bytes). Quoted literals are properly unquoted before embedding.
+- **Lexer bracket extraction**: `c[']']` now works correctly - the lexer respects
+  single quotes when extracting bracket content, so `]` inside quotes doesn't close.
+
+### Changed
+- **Stricter character validation**: Characters outside `/A-Za-z0-9_-/` in `c[...]`
+  must now be quoted. This catches common errors and enforces consistent syntax:
+  - `c["]` is invalid, use `c['"']`
+  - `c[#]` is invalid, use `c['#']`
+  - `c[abc]` is valid (alphanumeric)
+  - `c[-_]` is valid (hyphen and underscore allowed bare)
+- **Escape sequences outside class wrapper**: Using `<SQ>`, `<P>` etc. outside a
+  `<...>` class wrapper now raises a clear error suggesting proper syntax.
+
+## [0.6.3] - 2024-12-31
+
+### Fixed
+- **Semicolon in quoted strings**: `PREPEND(';')` no longer treats the semicolon
+  as a comment start. The lexer now tracks quotes when stripping comments.
+- **Pipe in quoted arguments**: Function calls like `/func('|')` now parse correctly.
+  The lexer tracks quotes when splitting on pipe delimiters.
+
+### Changed
+- **Validation for character syntax**: Added comprehensive validation for `c[...]`
+  patterns to catch unterminated quotes, bare special characters, and invalid
+  legacy syntax before parsing.
+
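Editorial illustration of the quote-tracking fixes in 0.6.3: a minimal Rust sketch of splitting a line on `|` while ignoring pipes inside single quotes. The gem's lexer is Ruby and is not shown here. `split_on_pipes` is a hypothetical name, and the sketch assumes quotes are never escaped, which the real lexer may handle differently.

```rust
/// Hypothetical quote-aware splitter: breaks `line` at every `|` that is
/// not enclosed in single quotes. Escaped quotes are not handled.
fn split_on_pipes(line: &str) -> Vec<&str> {
    let mut parts = Vec::new();
    let mut start = 0;
    let mut in_quote = false;
    // `'` and `|` are ASCII, so their byte offsets are valid slice boundaries.
    for (i, b) in line.bytes().enumerate() {
        match b {
            b'\'' => in_quote = !in_quote,
            b'|' if !in_quote => {
                parts.push(&line[start..i]);
                start = i + 1;
            }
            _ => {}
        }
    }
    // Whatever follows the last delimiter is the final part.
    parts.push(&line[start..]);
    parts
}
```

The same single flag works for comment stripping: a `;` seen while `in_quote` is true is content, not a comment start.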
+## [0.6.2] - 2024-12-31
+
+### Fixed
+- **Conditionals in SCAN branches**: Character literals and escape sequences like
+  `<P>` now work correctly in conditional expressions (e.g., `|if[PREV == <P>]`).
+- **Escape sequences in expressions**: `rust_expr` filter now transforms embedded
+  escape sequences like `<P>` to `b'|'` in all expression contexts.
+
+## [0.6.1] - 2024-12-31
+
+### Added
+- **LINE variable**: Access current line number (1-indexed) in expressions.
+  Transforms to `self.line as i32` in generated Rust code.
+
+## [0.6.0] - 2024-12-31
+
+### Changed
+- **PREPEND semantics fixed**: PREPEND now correctly adds bytes to the accumulation
+  buffer instead of emitting a separate Text event. The prepended content is combined
+  with the next `term()` result using `Cow<[u8]>` for zero-copy in the common case.
+- **Event content type**: Content fields in events are now `Cow<'a, [u8]>` instead of
+  `&'a [u8]`. This enables zero-copy when no PREPEND is used, with owned data only
+  when prepending is needed.
+
+### Added
+- **Unicode identifier classes**: `XID_START`, `XID_CONT`, `XLBL_START`, `XLBL_CONT`
+  for Unicode-aware identifier parsing (requires `unicode-xid` crate)
+- **Conditional unicode-xid import**: The crate is only required when Unicode
+  classes are actually used in the parser
+
+### Fixed
+- **PREPEND buffer persistence**: The prepend buffer now persists across nested
+  function calls, allowing `PREPEND(*) | /paragraph` patterns to work correctly
+
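Editorial illustration of the `Cow<'a, [u8]>` change in 0.6.0: a minimal Rust sketch of combining a prepend buffer with a borrowed input slice, borrowing when nothing was prepended and allocating only otherwise. The function name `content` and its parameters are hypothetical and are not taken from the generated parser.

```rust
use std::borrow::Cow;

/// Hypothetical helper: joins pending prepended bytes with the slice
/// produced by the parser. The result borrows from the input when the
/// prepend buffer is empty, and owns a fresh buffer otherwise.
fn content<'a>(prepend: &[u8], slice: &'a [u8]) -> Cow<'a, [u8]> {
    if prepend.is_empty() {
        // Common case: no PREPEND was used, so nothing is copied.
        Cow::Borrowed(slice)
    } else {
        // Prepending needs contiguous bytes, so allocate and copy once.
        let mut buf = Vec::with_capacity(prepend.len() + slice.len());
        buf.extend_from_slice(prepend);
        buf.extend_from_slice(slice);
        Cow::Owned(buf)
    }
}
```

Callers that only read the bytes can dereference either variant to `&[u8]`, so the choice of borrowed or owned stays invisible at most use sites.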
+## [0.2.1] - 2024-12-30
+
+### Added
+- **DIGIT character class**: Matches ASCII digits (0-9) using `is_ascii_digit()`
+- **HEX_DIGIT character class**: Matches hex digits (0-9, a-f, A-F) using `is_ascii_hexdigit()`
+- **`|eof` directive**: Explicit EOF handling with custom actions and inline emits
+- **Parameterized byte terminators**: Functions can take byte parameters for dynamic character matching
+  - Syntax: `|c[:param]|` matches against parameter value
+  - Parameters used in char matches become `u8` type automatically
+  - Enables single functions to handle multiple bracket types ([], {}, ())
+- **Escape sequences**: `<LP>` for `(` and `<RP>` for `)` in function arguments
+- **PREPEND with parameter references**: `PREPEND(:param)` emits parameter value as Text event
+  - `PREPEND()` with empty content is a no-op
+  - `PREPEND(:param)` where param is 0 is also a no-op (runtime check)
+  - Parameters used in PREPEND are inferred as `u8` type
+
+### Fixed
+- **Double emit bug (#11)**: CONTENT functions with inline emits no longer emit twice
+  - Inline emit (e.g., `Integer(USE_MARK)`) followed by bare `|return` now correctly
+    suppresses the auto-emit for the function's return type
+- **EOF bypasses inline emits (#12)**: Use `|eof` directive for explicit EOF behavior
+- **`|eof` not generating code (#13)**: The `|eof` directive now properly generates
+  action code including inline emits
+- **Quote characters in function parameters**: Bare `"` and `'` now correctly convert
+  to `b'"'` and `b'\''` when passed as function arguments
+
+### Changed
+- EOF handling documentation updated to reflect explicit `|eof` support
+- README and CLAUDE.md updated with all character classes and EOF directive
+
+## [0.2.0] - 2024-12-29
+
+### Added
+- Parameterized functions with `:param` syntax
+- Combined character classes: `|c[LETTER'[.?!]|` matches class OR literal chars
+- TERM adjustments: `TERM(-1)` terminates slice before current position
+- PREPEND command: `PREPEND(|)` emits literal as text event
+- Inline literal emits: `TypeName`, `TypeName(literal)`, `TypeName(USE_MARK)`
+- PREV variable for previous byte context
+- Custom error codes via `/error(ErrorCode)`
+
+### Fixed
+- Duplicate error code generation for same return types
+- Local variable scoping across states
+- Functions with no states now handled gracefully
+- Return with value for INTERNAL types
+
+## [0.1.0] - 2024-12-20
+
+### Added
+- Initial release
+- Lexer, Parser, IR Builder, Generator pipeline
+- Rust code generation via Liquid templates
+- SCAN optimization inference (memchr-based SIMD scanning)
+- Type system: BRACKET, CONTENT, INTERNAL
+- Character classes: LETTER, LABEL_CONT
+- Automatic MARK/TERM for CONTENT types
+- Recursive descent with true call stack