descent 0.7.1
- checksums.yaml +7 -0
- data/CHANGELOG.md +285 -0
- data/README.md +583 -0
- data/SYNTAX.md +334 -0
- data/exe/descent +15 -0
- data/lib/descent/ast.rb +69 -0
- data/lib/descent/generator.rb +489 -0
- data/lib/descent/ir.rb +98 -0
- data/lib/descent/ir_builder.rb +1479 -0
- data/lib/descent/lexer.rb +308 -0
- data/lib/descent/parser.rb +450 -0
- data/lib/descent/railroad.rb +272 -0
- data/lib/descent/templates/rust/_command.liquid +174 -0
- data/lib/descent/templates/rust/parser.liquid +1163 -0
- data/lib/descent/tools/debug.rb +115 -0
- data/lib/descent/tools/diagram.rb +48 -0
- data/lib/descent/tools/generate.rb +47 -0
- data/lib/descent/tools/validate.rb +56 -0
- data/lib/descent/validator.rb +231 -0
- data/lib/descent/version.rb +5 -0
- data/lib/descent.rb +34 -0
- metadata +101 -0
data/README.md
ADDED
|
@@ -0,0 +1,583 @@
|
|
|
1
|
+
# `descent` Recursive Descent Parser Generator
|
|
2
|
+
|
|
3
|
+
A recursive descent parser generator that produces high-performance callback-based
|
|
4
|
+
parsers from declarative `.desc` specifications.
|
|
5
|
+
|
|
6
|
+
**See also:**
|
|
7
|
+
- [SYNTAX.md](SYNTAX.md) - Complete `.desc` DSL syntax reference
|
|
8
|
+
- [characters.md](characters.md) - Character, String, and Class literal specification
|
|
9
|
+
|
|
10
|
+
## Philosophy
|
|
11
|
+
|
|
12
|
+
**The DSL describes *what* to parse. The generator figures out *how*.**
|
|
13
|
+
|
|
14
|
+
- **Type-driven emit**: Return types determine events, no explicit `emit()` needed
|
|
15
|
+
- **Inferred EOF**: Default EOF behavior derived from context (`|eof` available for explicit control)
|
|
16
|
+
- **Auto SCAN**: SIMD-accelerated scanning inferred from state structure
|
|
17
|
+
- **True recursion**: Call stack IS the element stack
|
|
18
|
+
- **Minimal state**: Small functions compose into complex parsers
|
|
19
|
+
|
|
20
|
+
## Installation
|
|
21
|
+
|
|
22
|
+
```bash
|
|
23
|
+
gem install descent
|
|
24
|
+
```
|
|
25
|
+
|
|
26
|
+
Or in your Gemfile:
|
|
27
|
+
|
|
28
|
+
```ruby
|
|
29
|
+
gem 'descent'
|
|
30
|
+
```
|
|
31
|
+
|
|
32
|
+
## Usage
|
|
33
|
+
|
|
34
|
+
```bash
|
|
35
|
+
# Generate Rust parser (to stdout)
|
|
36
|
+
descent generate parser.desc
|
|
37
|
+
|
|
38
|
+
# Generate with output file
|
|
39
|
+
descent generate parser.desc -o parser.rs
|
|
40
|
+
|
|
41
|
+
# Generate with debug tracing enabled
|
|
42
|
+
descent generate parser.desc --trace
|
|
43
|
+
|
|
44
|
+
# Validate .desc file without generating
|
|
45
|
+
descent validate parser.desc
|
|
46
|
+
|
|
47
|
+
# Debug: inspect tokens, AST, or IR
|
|
48
|
+
descent debug parser.desc
|
|
49
|
+
descent debug --tokens parser.desc
|
|
50
|
+
descent debug --ast parser.desc
|
|
51
|
+
|
|
52
|
+
# Show help
|
|
53
|
+
descent --help
|
|
54
|
+
descent generate --help
|
|
55
|
+
```
|
|
56
|
+
|
|
57
|
+
## The `.desc` DSL
|
|
58
|
+
|
|
59
|
+
`.desc` files are valid UDON documents (enabling future bootstrapping). The format
|
|
60
|
+
uses pipe-delimited declarations.
|
|
61
|
+
|
|
62
|
+
### Document Structure
|
|
63
|
+
|
|
64
|
+
```
|
|
65
|
+
|parser myparser ; Parser name (required)
|
|
66
|
+
|
|
67
|
+
|type[Element] BRACKET ; Type declarations
|
|
68
|
+
|type[Text] CONTENT
|
|
69
|
+
|
|
70
|
+
|entry-point /document ; Where parsing begins
|
|
71
|
+
|
|
72
|
+
|function[document] ; Function definitions
|
|
73
|
+
|state[:main]
|
|
74
|
+
...
|
|
75
|
+
```
|
|
76
|
+
|
|
77
|
+
### Type System
|
|
78
|
+
|
|
79
|
+
Types declare what happens when a function is entered and when it returns:
|
|
80
|
+
|
|
81
|
+
| Category | On Entry | On Return |
|
|
82
|
+
|------------|-------------------|------------------------------|
|
|
83
|
+
| `BRACKET` | Emit `TypeStart` | Emit `TypeEnd` |
|
|
84
|
+
| `CONTENT` | `MARK` position | Emit `Type` with content |
|
|
85
|
+
| `INTERNAL` | Nothing | Nothing (internal use only) |
|
|
86
|
+
|
|
87
|
+
```
|
|
88
|
+
|type[Element] BRACKET ; ElementStart on entry, ElementEnd on return
|
|
89
|
+
|type[Name] CONTENT ; MARK on entry, emit Name with content on return
|
|
90
|
+
|type[INT] INTERNAL ; No emit - internal computation only
|
|
91
|
+
```
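As a rough illustration of the table above, the sketch below models how `BRACKET` and `CONTENT` functions might behave in Rust. The `Event` variants, the `Cursor` struct, and `collect_events` are hypothetical names chosen for this example; they are not taken from descent's actual output.

```rust
/// Hypothetical event type; real names depend on the types in the `.desc` file.
#[derive(Debug, PartialEq)]
pub enum Event<'a> {
    ElementStart,
    ElementEnd,
    Name(&'a [u8]),
}

/// Minimal cursor over the input (an assumption, not descent's parser struct).
pub struct Cursor<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Cursor<'a> {
    pub fn new(input: &'a [u8]) -> Self {
        Cursor { input, pos: 0 }
    }

    /// CONTENT: MARK on entry, emit the marked slice on return.
    pub fn parse_name<F: FnMut(Event<'a>)>(&mut self, on_event: &mut F) {
        let input = self.input;
        let mark = self.pos; // MARK on entry
        while self.pos < input.len() && input[self.pos].is_ascii_alphanumeric() {
            self.pos += 1;
        }
        on_event(Event::Name(&input[mark..self.pos])); // emit on return
    }

    /// BRACKET: emit Start on entry and End on return.
    pub fn parse_element<F: FnMut(Event<'a>)>(&mut self, on_event: &mut F) {
        on_event(Event::ElementStart);
        self.parse_name(on_event);
        on_event(Event::ElementEnd);
    }
}

/// Convenience wrapper that collects the emitted events into a vector.
pub fn collect_events(input: &[u8]) -> Vec<Event<'_>> {
    let mut events = Vec::new();
    let mut cursor = Cursor::new(input);
    cursor.parse_element(&mut |e| events.push(e));
    events
}
```

An `INTERNAL` function would look like `parse_name` without the final `on_event` call.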
|
|
92
|
+
|
|
93
|
+
### Functions
|
|
94
|
+
|
|
95
|
+
```
|
|
96
|
+
|function[name] ; Void function (no auto-emit)
|
|
97
|
+
|function[name:ReturnType] ; Returns/emits ReturnType
|
|
98
|
+
|function[name:Type] :param ; With parameter
|
|
99
|
+
|function[name:Type] :p1 :p2 ; Multiple parameters
|
|
100
|
+
```
|
|
101
|
+
|
|
102
|
+
Functions returning `BRACKET` types automatically emit Start on entry, End on return.
|
|
103
|
+
Functions returning `CONTENT` types automatically MARK on entry, emit content on return.
|
|
104
|
+
|
|
105
|
+
**Parameterized byte matching:**
|
|
106
|
+
|
|
107
|
+
Parameters used in `|c[:param]|` become byte (`u8`) type, enabling functions to work
|
|
108
|
+
with different terminators:
|
|
109
|
+
|
|
110
|
+
```
|
|
111
|
+
; A single function handles [], {}, () by parameterizing the close char
|
|
112
|
+
|function[bracketed:Bracketed] :close
|
|
113
|
+
|c[:close] | -> |return ; Match against param value
|
|
114
|
+
|default | /content(:close) |>> ; Pass param to nested calls
|
|
115
|
+
```
|
|
116
|
+
|
|
117
|
+
Called as: `/bracketed(']')` for `]`, `/bracketed('}')` for `}`, `/bracketed(')')` for `)`.
|
|
118
|
+
(Legacy syntax `/bracketed(<R>)` etc. also works.)
|
|
119
|
+
|
|
120
|
+
This eliminates duplicate functions that differ only in their terminator character.
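The idea can be sketched in plain Rust: one function takes the terminator as a `u8` and serves every bracket kind. `split_bracketed` is a hypothetical helper written for this example, and it ignores nesting, which the real `/content(:close)` recursion would handle.

```rust
/// Split `input` at the first `close` byte, returning the content before it
/// and the remainder after it, or `None` if the terminator never appears.
/// `close` plays the role of the `:close` parameter.
pub fn split_bracketed(input: &[u8], close: u8) -> Option<(&[u8], &[u8])> {
    let end = input.iter().position(|&b| b == close)?;
    Some((&input[..end], &input[end + 1..]))
}
```

Calling `split_bracketed(rest, b']')`, `split_bracketed(rest, b'}')`, or `split_bracketed(rest, b')')` reuses the same code path for each terminator.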
|
|
121
|
+
|
|
122
|
+
### States
|
|
123
|
+
|
|
124
|
+
```
|
|
125
|
+
|function[element:Element] :col
|
|
126
|
+
|state[:identity]
|
|
127
|
+
|LETTER |.name | /name |>> :after
|
|
128
|
+
|default |.anon | |>> :content
|
|
129
|
+
|
|
130
|
+
|state[:after]
|
|
131
|
+
|c['\n'] |.eol | -> |>> :children
|
|
132
|
+
|default |.inline | /text(col) |>> :children
|
|
133
|
+
```
|
|
134
|
+
|
|
135
|
+
### Cases (Character Matching)
|
|
136
|
+
|
|
137
|
+
Each case has the form `match | actions | transition` (a `.label` column, as in `|.eol` above, may follow the match):
|
|
138
|
+
|
|
139
|
+
```
|
|
140
|
+
|c[x] | actions |>> :state ; Match single char
|
|
141
|
+
|c['\n'] | actions |>> ; Match newline (self-loop)
|
|
142
|
+
|c[' \t'] | actions |return ; Match space or tab (quoted)
|
|
143
|
+
|c[abc] | actions |>> :next ; Match any of a, b, c
|
|
144
|
+
|c[<0-9>] | actions |>> :digit ; Match digit (using class syntax)
|
|
145
|
+
|c[<LETTER '_'>]| actions |>> :ident ; Match letter or underscore
|
|
146
|
+
|LETTER | actions |>> :name ; Match ASCII letter (a-z, A-Z)
|
|
147
|
+
|LABEL_CONT | actions |>> ; Match letter/digit/_/-
|
|
148
|
+
|DIGIT | actions |>> :num ; Match ASCII digit (0-9)
|
|
149
|
+
|HEX_DIGIT | actions |>> :hex ; Match hex digit (0-9, a-f, A-F)
|
|
150
|
+
|default | actions |>> :other ; Fallback case
|
|
151
|
+
```
|
|
152
|
+
|
|
153
|
+
**Note:** Characters outside `/A-Za-z0-9_-/` must be quoted in `c[...]`:
|
|
154
|
+
- `c['"']` for double quote, `c['#']` for hash, `c['.']` for dot
|
|
155
|
+
- `c[<P>]` or `c['|']` for pipe (DSL delimiter)
|
|
156
|
+
- `c[<L>]` or `c['[']` for brackets (DSL delimiters)
|
|
157
|
+
|
|
158
|
+
### Character, String, and Class Literals
|
|
159
|
+
|
|
160
|
+
The DSL supports three literal types. See [characters.md](characters.md) for complete specification.
|
|
161
|
+
|
|
162
|
+
| Type | Syntax | Semantics |
|
|
163
|
+
|------|--------|-----------|
|
|
164
|
+
| **Character** | `'x'` | Single byte or Unicode codepoint |
|
|
165
|
+
| **String** | `'hello'` | Ordered sequence (decomposed to chars in classes) |
|
|
166
|
+
| **Class** | `<...>` | Unordered set of characters |
|
|
167
|
+
|
|
168
|
+
**Character class syntax** (`<...>`):
|
|
169
|
+
|
|
170
|
+
```
|
|
171
|
+
<abc> ; Bare lowercase decomposed: a, b, c
|
|
172
|
+
<'|'> ; Quoted character (for special chars)
|
|
173
|
+
<LETTER> ; Predefined class
|
|
174
|
+
<0-9> ; Predefined range (digits)
|
|
175
|
+
<LETTER 0-9 '_'> ; Combined: letters, digits, underscore
|
|
176
|
+
<:var> ; Include variable's chars
|
|
177
|
+
```
|
|
178
|
+
|
|
179
|
+
**Predefined ranges:**
|
|
180
|
+
|
|
181
|
+
| Name | Characters |
|
|
182
|
+
| ------------- | --------------- |
|
|
183
|
+
| `0-9` | Decimal digits |
|
|
184
|
+
| `0-7` | Octal digits |
|
|
185
|
+
| `0-1` | Binary digits |
|
|
186
|
+
| `a-z` | Lowercase ASCII |
|
|
187
|
+
| `A-Z` | Uppercase ASCII |
|
|
188
|
+
| `a-f` / `A-F` | Hex letters |
|
|
189
|
+
|
|
190
|
+
**Predefined classes:**
|
|
191
|
+
|
|
192
|
+
| Name | Description |
|
|
193
|
+
| ------------ | ------------------------------ |
|
|
194
|
+
| `LETTER` | `a-z` + `A-Z` |
|
|
195
|
+
| `DIGIT` | `0-9` |
|
|
196
|
+
| `HEX_DIGIT` | `0-9` + `a-f` + `A-F` |
|
|
197
|
+
| `LABEL_CONT` | `LETTER` + `DIGIT` + `_` + `-` |
|
|
198
|
+
| `WS` | Space + tab |
|
|
199
|
+
| `NL` | Newline |
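For reference, the predefined classes in the table correspond to byte predicates along these lines. The function names are hypothetical; how descent actually renders class checks in generated code is not shown here.

```rust
/// LETTER: `a-z` + `A-Z`
pub fn is_letter(b: u8) -> bool {
    b.is_ascii_alphabetic()
}

/// DIGIT: `0-9`
pub fn is_digit(b: u8) -> bool {
    b.is_ascii_digit()
}

/// HEX_DIGIT: `0-9` + `a-f` + `A-F`
pub fn is_hex_digit(b: u8) -> bool {
    b.is_ascii_hexdigit()
}

/// LABEL_CONT: LETTER + DIGIT + `_` + `-`
pub fn is_label_cont(b: u8) -> bool {
    is_letter(b) || is_digit(b) || b == b'_' || b == b'-'
}

/// WS: space + tab
pub fn is_ws(b: u8) -> bool {
    b == b' ' || b == b'\t'
}

/// NL: newline
pub fn is_nl(b: u8) -> bool {
    b == b'\n'
}
```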
|
|
200
|
+
|
|
201
|
+
**Unicode classes** (require the `unicode-xid` crate):
|
|
202
|
+
|
|
203
|
+
| Name | Description |
|
|
204
|
+
| ------------ | ----------------------------- |
|
|
205
|
+
| `XID_START` | Unicode identifier start |
|
|
206
|
+
| `XID_CONT` | Unicode identifier continue |
|
|
207
|
+
| `XLBL_START` | = `XID_START` (label start) |
|
|
208
|
+
| `XLBL_CONT` | `XID_CONT` + `-` (kebab-case) |
|
|
209
|
+
|
|
210
|
+
**DSL-reserved single-char classes:**
|
|
211
|
+
|
|
212
|
+
| Name | Char | Name | Char |
|
|
213
|
+
| ---- | ---- | ---- | ---- |
|
|
214
|
+
| `P` | `\|` | `SQ` | `'` |
|
|
215
|
+
| `L` | `[` | `DQ` | `"` |
|
|
216
|
+
| `R` | `]` | `BS` | `\` |
|
|
217
|
+
| `LB` | `{` | `LP` | `(` |
|
|
218
|
+
| `RB` | `}` | `RP` | `)` |
|
|
219
|
+
|
|
220
|
+
**Escape sequences** (in quoted strings):
|
|
221
|
+
|
|
222
|
+
| Syntax | Result |
|
|
223
|
+
| -------- | ----------------- |
|
|
224
|
+
| `\n` | Newline |
|
|
225
|
+
| `\t` | Tab |
|
|
226
|
+
| `\r` | Carriage return |
|
|
227
|
+
| `\\` | Backslash |
|
|
228
|
+
| `\'` | Single quote |
|
|
229
|
+
| `\xHH` | Hex byte |
|
|
230
|
+
| `\uXXXX` | Unicode codepoint |
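A decoder for the escapes in this table could look like the sketch below. It is not descent's lexer: `decode_escapes` is a hypothetical function, and it assumes that `\xHH` takes exactly two hex digits, that `\uXXXX` takes exactly four, and that codepoints are written out as UTF-8.

```rust
/// Decode the escape sequences from the table into raw bytes.
/// Returns `None` for a malformed or unknown escape.
pub fn decode_escapes(literal: &str) -> Option<Vec<u8>> {
    // Helper: accept only strings made of exactly `n` ASCII hex digits.
    fn is_hex(s: &str, n: usize) -> bool {
        s.len() == n && s.bytes().all(|b| b.is_ascii_hexdigit())
    }

    let mut out = Vec::new();
    let mut chars = literal.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            // Ordinary character: copy its UTF-8 bytes through unchanged.
            let mut buf = [0u8; 4];
            out.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
            continue;
        }
        match chars.next()? {
            'n' => out.push(b'\n'),
            't' => out.push(b'\t'),
            'r' => out.push(b'\r'),
            '\\' => out.push(b'\\'),
            '\'' => out.push(b'\''),
            'x' => {
                let hex: String = chars.by_ref().take(2).collect();
                if !is_hex(&hex, 2) {
                    return None;
                }
                out.push(u8::from_str_radix(&hex, 16).ok()?);
            }
            'u' => {
                let hex: String = chars.by_ref().take(4).collect();
                if !is_hex(&hex, 4) {
                    return None;
                }
                let ch = char::from_u32(u32::from_str_radix(&hex, 16).ok()?)?;
                let mut buf = [0u8; 4];
                out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
            }
            _ => return None,
        }
    }
    Some(out)
}
```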
|
|
231
|
+
|
|
232
|
+
### Actions
|
|
233
|
+
|
|
234
|
+
Actions are pipe-separated and execute left to right:
|
|
235
|
+
|
|
236
|
+
```
|
|
237
|
+
| -> ; Advance one character
|
|
238
|
+
| ->['\n'] ; Advance TO newline (SIMD scan, 1-6 chars)
|
|
239
|
+
| ->['"\''] ; Advance TO first " or ' (multi-char scan)
|
|
240
|
+
| MARK ; Mark position for accumulation
|
|
241
|
+
| TERM ; Terminate slice (MARK to current)
|
|
242
|
+
| TERM(-1) ; Terminate slice excluding last N bytes
|
|
243
|
+
| /function ; Call function
|
|
244
|
+
| /function(args) ; Call with arguments
|
|
245
|
+
| /error ; Emit error event (add |return to exit)
|
|
246
|
+
| /error(CustomCode) ; Emit error with custom code
|
|
247
|
+
| var = value ; Assignment
|
|
248
|
+
| var += 1 ; Increment
|
|
249
|
+
| PREPEND('|') ; Prepend literal to next accumulated content
|
|
250
|
+
| PREPEND('**') ; Prepend multi-byte literal
|
|
251
|
+
| PREPEND(:param) ; Prepend parameter bytes (empty = no-op)
|
|
252
|
+
```
|
|
253
|
+
|
|
254
|
+
**Advance-to (`->[...]`)**: SIMD-accelerated scan using memchr. Supports 1-6 literal
|
|
255
|
+
bytes only. Does NOT support character classes (LETTER, DIGIT) or parameter refs (:param).
|
|
256
|
+
Use quoted chars for special bytes: `->['\n\t']`, `->['|']`.
|
|
257
|
+
|
|
258
|
+
**Error handling (`/error`)**: Emits an Error event. Custom error codes
|
|
259
|
+
are converted to PascalCase. Add `|return` to exit after the error:
|
|
260
|
+
|
|
261
|
+
```
|
|
262
|
+
|c['\t'] | /error(no_tabs) |return ; Emits [Error, "NoTabs"]
|
|
263
|
+
|c[' '] | /error(no_spaces) |return ; Emits [Error, "NoSpaces"]
|
|
264
|
+
|default | /error |return ; Emits [Error, "UnexpectedChar"]
|
|
265
|
+
```
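The PascalCase conversion seen in the comments above (`no_tabs` becomes `NoTabs`) can be approximated as follows. `pascal_case` is a hypothetical helper; the exact rules descent applies to digits or already-capitalised codes are not specified here.

```rust
/// Convert a snake_case error code to PascalCase:
/// split on underscores, then uppercase the first letter of each part.
pub fn pascal_case(code: &str) -> String {
    code.split('_')
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                Some(first) => first.to_ascii_uppercase().to_string() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect()
}
```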
|
|
266
|
+
|
|
267
|
+
`PREPEND` adds bytes to the accumulation buffer that will be combined with the
|
|
268
|
+
next `TERM` result. This is useful for restoring consumed characters during
|
|
269
|
+
lookahead. Parameters used in `PREPEND(:param)` become `&'static [u8]` type,
|
|
270
|
+
allowing empty (`<>`), single byte (`'x'`), or multi-byte (`'**'`) values.
|
|
271
|
+
Empty content is naturally a no-op.
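To make the interaction between `MARK`, `TERM`, and `PREPEND` concrete, here is a minimal sketch. The `Accumulator` type and its methods are assumptions made for illustration; the generated parser may organise this state differently.

```rust
use std::borrow::Cow;

/// Hypothetical accumulator illustrating MARK, TERM, and PREPEND.
pub struct Accumulator<'a> {
    input: &'a [u8],
    mark: usize,
    pending: Vec<u8>,
}

impl<'a> Accumulator<'a> {
    pub fn new(input: &'a [u8]) -> Self {
        Accumulator { input, mark: 0, pending: Vec::new() }
    }

    /// MARK: remember where the next slice starts.
    pub fn mark(&mut self, pos: usize) {
        self.mark = pos;
    }

    /// PREPEND: queue bytes to place in front of the next TERM result.
    /// Queuing an empty slice changes nothing, so it is a no-op.
    pub fn prepend(&mut self, bytes: &[u8]) {
        self.pending.extend_from_slice(bytes);
    }

    /// TERM: finish the slice from MARK to `pos`. Borrows from the input
    /// when nothing was prepended, and allocates only otherwise.
    pub fn term(&mut self, pos: usize) -> Cow<'a, [u8]> {
        let input = self.input;
        let slice = &input[self.mark..pos];
        if self.pending.is_empty() {
            Cow::Borrowed(slice)
        } else {
            let mut owned = std::mem::take(&mut self.pending);
            owned.extend_from_slice(slice);
            Cow::Owned(owned)
        }
    }
}
```

Returning a `Cow` keeps the common case (no prepend) zero-copy; that is a design choice of this sketch rather than a statement about descent's internals.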
|
|
272
|
+
|
|
273
|
+
### Inline Literal Events
|
|
274
|
+
|
|
275
|
+
For emitting events directly with literal or accumulated content:
|
|
276
|
+
|
|
277
|
+
```
|
|
278
|
+
| TypeName ; Emit event with no payload (BoolTrue, Nil)
|
|
279
|
+
| TypeName('value') ; Emit with literal value (e.g., Attr('$id'), Attr('?'))
|
|
280
|
+
| TypeName(USE_MARK) ; Emit using current MARK/TERM content
|
|
281
|
+
```
|
|
282
|
+
|
|
283
|
+
Examples:
|
|
284
|
+
```
|
|
285
|
+
|c['?']        | Attr('?') | BoolTrue | -> |return   ; Emit Attr "?", BoolTrue
|
|
286
|
+
|c[<L>]        | Attr('$id') | /value    |>> :next   ; Emit Attr "$id", parse value
|
|
287
|
+
|default | TERM | Text(USE_MARK) |return ; Emit Text with accumulated
|
|
288
|
+
```
|
|
289
|
+
|
|
290
|
+
### Transitions
|
|
291
|
+
|
|
292
|
+
```
|
|
293
|
+
|>> ; Self-loop (stay in current state)
|
|
294
|
+
|>> :state ; Go to named state
|
|
295
|
+
|return ; Return from function
|
|
296
|
+
```
|
|
297
|
+
|
|
298
|
+
### Conditionals
|
|
299
|
+
|
|
300
|
+
Single-line guards only (no block structure):
|
|
301
|
+
|
|
302
|
+
```
|
|
303
|
+
|if[COL <= col] |return ; If true, return
|
|
304
|
+
| |>> :continue ; Else, continue
|
|
305
|
+
```
|
|
306
|
+
|
|
307
|
+
### Special Variables
|
|
308
|
+
|
|
309
|
+
| Variable | Meaning |
|
|
310
|
+
|----------|--------------------------------------|
|
|
311
|
+
| `COL` | Current column (1-indexed) |
|
|
312
|
+
| `LINE` | Current line (1-indexed) |
|
|
313
|
+
| `PREV` | Previous byte (0 at start of input) |
|
|
314
|
+
|
|
315
|
+
### Keywords (phf Perfect Hash)
|
|
316
|
+
|
|
317
|
+
For keyword matching (like `true`/`false`/`null`), use a `phf` perfect hash map for O(1) lookup:
|
|
318
|
+
|
|
319
|
+
```
|
|
320
|
+
|keywords[bare] :fallback /identifier
|
|
321
|
+
| true => BoolTrue
|
|
322
|
+
| false => BoolFalse
|
|
323
|
+
| null => Nil
|
|
324
|
+
| nil => Nil
|
|
325
|
+
|
|
326
|
+
|function[value]
|
|
327
|
+
|state[:main]
|
|
328
|
+
|LABEL_CONT |.cont | -> |>>
|
|
329
|
+
|default |.done | TERM | KEYWORDS(bare) |return
|
|
330
|
+
```
|
|
331
|
+
|
|
332
|
+
- `|keywords[name]` defines a keyword map
|
|
333
|
+
- `:fallback /function` specifies what to call if no match
|
|
334
|
+
- `| keyword => EventType` maps keywords to event types
|
|
335
|
+
- `KEYWORDS(name)` in actions triggers the lookup
|
|
336
|
+
|
|
337
|
+
This generates a `phf_map!` whose perfect hash is computed at compile time, so each lookup at runtime is O(1).
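The lookup-with-fallback behaviour can be sketched without the `phf` dependency by using a `match` on the terminated slice. This is a stand-in for illustration only: `Value` and `lookup_bare` are hypothetical names, and the `Identifier` variant represents whatever the `:fallback /identifier` function would emit.

```rust
/// Hypothetical result of a keyword lookup for the `bare` map above.
#[derive(Debug, PartialEq)]
pub enum Value<'a> {
    BoolTrue,
    BoolFalse,
    Nil,
    /// Stand-in for the fallback path (`/identifier`).
    Identifier(&'a [u8]),
}

/// Map the accumulated bytes to an event, falling back when nothing matches.
pub fn lookup_bare(term: &[u8]) -> Value<'_> {
    match term {
        b"true" => Value::BoolTrue,
        b"false" => Value::BoolFalse,
        b"null" | b"nil" => Value::Nil,
        other => Value::Identifier(other),
    }
}
```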
|
|
338
|
+
|
|
339
|
+
## Automatic Optimizations
|
|
340
|
+
|
|
341
|
+
### SCAN Inference
|
|
342
|
+
|
|
343
|
+
If a state has a self-looping default case (`|default | -> |>>`), the explicit
|
|
344
|
+
character cases become SCAN targets for SIMD-accelerated bulk scanning.
|
|
345
|
+
|
|
346
|
+
```
|
|
347
|
+
|state[:prose]
|
|
348
|
+
|c['\n'] |.eol | ... |>> :next
|
|
349
|
+
|c[<P>] |.pipe | ... |>> :check
|
|
350
|
+
|default |.collect | -> |>> ; ← triggers auto-SCAN
|
|
351
|
+
```
|
|
352
|
+
|
|
353
|
+
The generator detects this and uses `memchr` to scan for `\n` and `|` in bulk.
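Conceptually, the bulk scan replaces a byte-at-a-time loop with a single search for the next interesting byte. The sketch below shows that search with plain iteration so it stays dependency free; in real code the `memchr` crate's `memchr2` function would be the SIMD-accelerated equivalent. `scan_to_target` is a hypothetical name.

```rust
/// Find the position of the next `\n` or `|` at or after `start`.
/// Returns `None` if neither byte occurs (or if `start` is out of range),
/// which corresponds to reaching EOF while collecting.
pub fn scan_to_target(input: &[u8], start: usize) -> Option<usize> {
    input
        .get(start..)?
        .iter()
        .position(|&b| b == b'\n' || b == b'|')
        .map(|offset| start + offset)
}
```

Every byte skipped by the search is one that would have taken the self-looping `default` case, which is why the rewrite preserves behaviour.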
|
|
354
|
+
|
|
355
|
+
### EOF Handling
|
|
356
|
+
|
|
357
|
+
By default, the generator infers EOF behavior based on return type:
|
|
358
|
+
|
|
359
|
+
- `BRACKET` → emit End event
|
|
360
|
+
- `CONTENT` → emit content event (finalizing any active MARK)
|
|
361
|
+
- `INTERNAL` / void → just return
|
|
362
|
+
|
|
363
|
+
The generator also infers unclosed-delimiter errors from the structure of return
|
|
364
|
+
cases (e.g., a string function that only returns on `"` will emit `UnclosedStringValue`
|
|
365
|
+
if EOF is reached).
|
|
366
|
+
|
|
367
|
+
For explicit control, use the `|eof` directive:
|
|
368
|
+
|
|
369
|
+
```
|
|
370
|
+
|function[number:Number]
|
|
371
|
+
|state[:main]
|
|
372
|
+
|DIGIT |.digit | -> |>>
|
|
373
|
+
|default |.done | TERM | Integer(USE_MARK) |return
|
|
374
|
+
|eof |.eof | TERM | Integer(USE_MARK) |return
|
|
375
|
+
```
|
|
376
|
+
|
|
377
|
+
This is useful when EOF should emit a different type than the function's return type,
|
|
378
|
+
or when you need specific actions at EOF (like inline emits).
|
|
379
|
+
|
|
380
|
+
## Example: Line Parser
|
|
381
|
+
|
|
382
|
+
```
|
|
383
|
+
|parser lines
|
|
384
|
+
|
|
385
|
+
|type[Text] CONTENT
|
|
386
|
+
|
|
387
|
+
|entry-point /document
|
|
388
|
+
|
|
389
|
+
|function[document]
|
|
390
|
+
|state[:main]
|
|
391
|
+
|c['\n'] |.blank | -> |>>
|
|
392
|
+
|default |.start | /line |>>
|
|
393
|
+
|
|
394
|
+
|function[line:Text]
|
|
395
|
+
|state[:main]
|
|
396
|
+
|c['\n'] |.eol | -> |return
|
|
397
|
+
|default |.collect | -> |>>
|
|
398
|
+
```
|
|
399
|
+
|
|
400
|
+
What the generator infers:
|
|
401
|
+
- `line` returns `CONTENT` type → MARK on entry, emit Text on return
|
|
402
|
+
- `line:main` has self-looping default → SCAN for `\n`
|
|
403
|
+
- EOF in `line` → emit accumulated Text, return
|
|
404
|
+
- EOF in `document` → just return (void function)
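A hand-written Rust equivalent of these inferences might look like the sketch below. It is not generator output: `parse_lines` is a hypothetical function, and it assumes the emitted Text excludes the trailing newline, which depends on where the generator places the implicit `TERM`.

```rust
/// Hand-written sketch of the `lines` parser's observable behaviour:
/// call `on_text` once per non-empty line.
pub fn parse_lines<'a, F: FnMut(&'a [u8])>(input: &'a [u8], mut on_text: F) {
    let mut pos = 0;
    while pos < input.len() {
        if input[pos] == b'\n' {
            pos += 1; // document:main .blank
            continue;
        }
        let mark = pos; // line: MARK on entry
        while pos < input.len() && input[pos] != b'\n' {
            pos += 1; // line:main .collect (the loop SCAN would replace)
        }
        on_text(&input[mark..pos]); // emit Text on return or at EOF
        if pos < input.len() {
            pos += 1; // line:main .eol consumes the newline
        }
    }
    // EOF in document: nothing to emit, just return.
}
```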
|
|
405
|
+
|
|
406
|
+
## Example: Element Parser
|
|
407
|
+
|
|
408
|
+
```
|
|
409
|
+
|parser elements
|
|
410
|
+
|
|
411
|
+
|type[Element] BRACKET
|
|
412
|
+
|type[Name] CONTENT
|
|
413
|
+
|type[Text] CONTENT
|
|
414
|
+
|
|
415
|
+
|entry-point /document
|
|
416
|
+
|
|
417
|
+
|function[document]
|
|
418
|
+
|state[:main]
|
|
419
|
+
|c['\n'] |.blank |-> |>>
|
|
420
|
+
|c[<P>] |.pipe |-> | /element(0)|>>
|
|
421
|
+
|default |.text |/text(0) |>>
|
|
422
|
+
|
|
423
|
+
|function[element:Element] :col
|
|
424
|
+
|state[:identity]
|
|
425
|
+
|LETTER |.name |/name |>> :after
|
|
426
|
+
|default |.anon | |>> :content
|
|
427
|
+
|state[:after]
|
|
428
|
+
|c['\n'] |.eol |-> |>> :children
|
|
429
|
+
|default |.text |/text(col) |>> :children
|
|
430
|
+
|state[:children]
|
|
431
|
+
|c['\n'] |.blank |-> |>>
|
|
432
|
+
|c[' \t'] |.ws |-> |>> :check
|
|
433
|
+
|default |.dedent | |return
|
|
434
|
+
|state[:check]
|
|
435
|
+
|if[COL <= col] | |return
|
|
436
|
+
|c[<P>] |.child |-> | /element(COL) |>> :children
|
|
437
|
+
|default |.text | /text(COL) |>> :children
|
|
438
|
+
|
|
439
|
+
|function[name:Name]
|
|
440
|
+
|state[:main]
|
|
441
|
+
|LABEL_CONT |.cont |-> |>>
|
|
442
|
+
|default |.done | |return
|
|
443
|
+
|
|
444
|
+
|function[text:Text] :col
|
|
445
|
+
|state[:main]
|
|
446
|
+
|c['\n'] |.eol | |return
|
|
447
|
+
|default |.collect |-> |>>
|
|
448
|
+
```
|
|
449
|
+
|
|
450
|
+
What the generator produces:
|
|
451
|
+
- `element` returns `BRACKET` → emit `ElementStart` on entry, `ElementEnd` on return
|
|
452
|
+
- `name` returns `CONTENT` → MARK on entry, emit `Name` on return
|
|
453
|
+
- Recursive `/element(COL)` calls naturally handle nesting
|
|
454
|
+
- Column-based dedent via `|if[COL <= col]` unwinds the call stack
|
|
455
|
+
|
|
456
|
+
## Generated Code
|
|
457
|
+
|
|
458
|
+
descent generates callback-based parsers (2-7x faster than ring-buffer alternatives):
|
|
459
|
+
|
|
460
|
+
```rust
|
|
461
|
+
impl<'a> Parser<'a> {
|
|
462
|
+
pub fn parse<F>(mut self, mut on_event: F)
|
|
463
|
+
where
|
|
464
|
+
F: FnMut(Event<'a>)
|
|
465
|
+
{
|
|
466
|
+
self.parse_document(&mut on_event);
|
|
467
|
+
}
|
|
468
|
+
|
|
469
|
+
fn parse_element<F: FnMut(Event<'a>)>(&mut self, col: i32, on_event: &mut F) {
|
|
470
|
+
on_event(Event::ElementStart { .. });
|
|
471
|
+
// ... parse content ...
|
|
472
|
+
// ... recursive calls for children ...
|
|
473
|
+
on_event(Event::ElementEnd { .. });
|
|
474
|
+
}
|
|
475
|
+
}
|
|
476
|
+
```
|
|
477
|
+
|
|
478
|
+
### Debug Tracing
|
|
479
|
+
|
|
480
|
+
Generate a parser with `--trace` to make it write detailed execution traces to stderr:
|
|
481
|
+
|
|
482
|
+
```bash
|
|
483
|
+
descent generate --trace parser.desc > parser.rs
|
|
484
|
+
```
|
|
485
|
+
|
|
486
|
+
Trace output shows the exact execution path through the parser:
|
|
487
|
+
|
|
488
|
+
```
|
|
489
|
+
TRACE: L14 ENTER document | byte='H' pos=0
|
|
490
|
+
TRACE: L16 document:main.collect | byte='H' term=[] pos=0
|
|
491
|
+
TRACE: L16 document:main.collect | byte='e' term=["H"] pos=1
|
|
492
|
+
TRACE: L15 document:main EOF | term=["Hello"] pos=5
|
|
493
|
+
```
|
|
494
|
+
|
|
495
|
+
Each line shows:
|
|
496
|
+
- **L14**: Source line number from the `.desc` file
|
|
497
|
+
- **document:main.collect**: Function name, state name, and case label (substate)
|
|
498
|
+
- **byte='H'**: Current byte being processed
|
|
499
|
+
- **term=["H"]**: Accumulated content in the term buffer
|
|
500
|
+
- **pos=0**: Current position in input
|
|
501
|
+
|
|
502
|
+
This is invaluable for debugging parser behavior and understanding how the
|
|
503
|
+
generated state machine processes input.
|
|
504
|
+
|
|
505
|
+
## Architecture
|
|
506
|
+
|
|
507
|
+
```
|
|
508
|
+
┌──────────────┐
|
|
509
|
+
parser.desc ───▶ │ Lexer │ ───▶ Tokens
|
|
510
|
+
└──────────────┘
|
|
511
|
+
│
|
|
512
|
+
▼
|
|
513
|
+
┌──────────────┐
|
|
514
|
+
│ Parser │ ───▶ AST
|
|
515
|
+
└──────────────┘
|
|
516
|
+
│
|
|
517
|
+
▼
|
|
518
|
+
┌──────────────┐
|
|
519
|
+
│ IR Builder │ ───▶ IR (with inferred SCAN, etc.)
|
|
520
|
+
└──────────────┘
|
|
521
|
+
│
|
|
522
|
+
▼
|
|
523
|
+
┌──────────────┐ ┌─────────────────────┐
|
|
524
|
+
│ Generator │◀──────│ templates/rust/ │
|
|
525
|
+
│ (Liquid) │ │ templates/c/ │
|
|
526
|
+
└──────────────┘ └─────────────────────┘
|
|
527
|
+
│
|
|
528
|
+
▼
|
|
529
|
+
parser.rs
|
|
530
|
+
```
|
|
531
|
+
|
|
532
|
+
- **Lexer**: Tokenizes pipe-delimited `.desc` format
|
|
533
|
+
- **Parser**: Builds AST from tokens (pure Data structs)
|
|
534
|
+
- **IR Builder**: Semantic analysis, SCAN inference, validation
|
|
535
|
+
- **Generator**: Renders IR through Liquid templates
|
|
536
|
+
|
|
537
|
+
**Adding new commands**: The parser uses `command_like?()` to detect command tokens
|
|
538
|
+
generically (uppercase words, `/function`, `->`, `>>`). To add a new command like `RESET`:
|
|
539
|
+
|
|
540
|
+
1. `parser.rb`: Add to `classify_command()` - what type is it?
|
|
541
|
+
2. `ir_builder.rb`: Add to command transformation - what args does it have?
|
|
542
|
+
3. `_command.liquid`: Add rendering - what Rust code does it generate?
|
|
543
|
+
|
|
544
|
+
No changes needed to case detection or structural parsing.
|
|
545
|
+
|
|
546
|
+
## Targets
|
|
547
|
+
|
|
548
|
+
| Target | Status | Output |
|
|
549
|
+
|--------|--------|--------|
|
|
550
|
+
| Rust | Working | Single `.rs` file with callback API |
|
|
551
|
+
| C | Not implemented | `.c` + `.h` files (planned) |
|
|
552
|
+
|
|
553
|
+
## Bootstrapping
|
|
554
|
+
|
|
555
|
+
The `.desc` format is valid UDON. When the UDON parser (generated by descent)
|
|
556
|
+
is mature, descent can use it to parse its own input format - the tool generates
|
|
557
|
+
the parser that parses its own specifications.
|
|
558
|
+
|
|
559
|
+
## Development
|
|
560
|
+
|
|
561
|
+
```bash
|
|
562
|
+
# Install dependencies
|
|
563
|
+
bundle install
|
|
564
|
+
|
|
565
|
+
# Run tests
|
|
566
|
+
dx test
|
|
567
|
+
|
|
568
|
+
# Lint
|
|
569
|
+
dx lint
|
|
570
|
+
dx lint --fix
|
|
571
|
+
|
|
572
|
+
# Build and install locally
|
|
573
|
+
dx gem install
|
|
574
|
+
```
|
|
575
|
+
|
|
576
|
+
## Related
|
|
577
|
+
|
|
578
|
+
- [libudon](https://github.com/josephwecker/libudon) - The UDON parser library (uses descent)
|
|
579
|
+
- [UDON Specification](https://github.com/josephwecker/udon) - The UDON markup language
|
|
580
|
+
|
|
581
|
+
## License
|
|
582
|
+
|
|
583
|
+
MIT
|