descent 0.7.1

data/README.md ADDED
@@ -0,0 +1,583 @@
1
+ # `descent` Recursive Descent Parser Generator
2
+
3
+ A recursive descent parser generator that produces high-performance callback-based
4
+ parsers from declarative `.desc` specifications.
5
+
6
+ **See also:**
7
+ - [SYNTAX.md](SYNTAX.md) - Complete `.desc` DSL syntax reference
8
+ - [characters.md](characters.md) - Character, String, and Class literal specification
9
+
10
+ ## Philosophy
11
+
12
+ **The DSL describes *what* to parse. The generator figures out *how*.**
13
+
14
+ - **Type-driven emit**: Return types determine events, no explicit `emit()` needed
15
+ - **Inferred EOF**: Default EOF behavior derived from context (`|eof` available for explicit control)
16
+ - **Auto SCAN**: SIMD-accelerated scanning inferred from state structure
17
+ - **True recursion**: Call stack IS the element stack
18
+ - **Minimal state**: Small functions compose into complex parsers
19
+
20
+ ## Installation
21
+
22
+ ```bash
23
+ gem install descent
24
+ ```
25
+
26
+ Or in your Gemfile:
27
+
28
+ ```ruby
29
+ gem 'descent'
30
+ ```
31
+
32
+ ## Usage
33
+
34
+ ```bash
35
+ # Generate Rust parser (to stdout)
36
+ descent generate parser.desc
37
+
38
+ # Generate with output file
39
+ descent generate parser.desc -o parser.rs
40
+
41
+ # Generate with debug tracing enabled
42
+ descent generate parser.desc --trace
43
+
44
+ # Validate .desc file without generating
45
+ descent validate parser.desc
46
+
47
+ # Debug: inspect tokens, AST, or IR
48
+ descent debug parser.desc
49
+ descent debug --tokens parser.desc
50
+ descent debug --ast parser.desc
51
+
52
+ # Show help
53
+ descent --help
54
+ descent generate --help
55
+ ```
56
+
57
+ ## The `.desc` DSL
58
+
59
+ `.desc` files are valid UDON documents (enabling future bootstrapping). The format
60
+ uses pipe-delimited declarations.
61
+
62
+ ### Document Structure
63
+
64
+ ```
65
+ |parser myparser ; Parser name (required)
66
+
67
+ |type[Element] BRACKET ; Type declarations
68
+ |type[Text] CONTENT
69
+
70
+ |entry-point /document ; Where parsing begins
71
+
72
+ |function[document] ; Function definitions
73
+ |state[:main]
74
+ ...
75
+ ```
76
+
77
+ ### Type System
78
+
79
+ Types declare what happens when a function is entered and when it returns:
80
+
81
+ | Category | On Entry | On Return |
82
+ |------------|-------------------|------------------------------|
83
+ | `BRACKET` | Emit `TypeStart` | Emit `TypeEnd` |
84
+ | `CONTENT` | `MARK` position | Emit `Type` with content |
85
+ | `INTERNAL` | Nothing | Nothing (internal use only) |
86
+
87
+ ```
88
+ |type[Element] BRACKET ; ElementStart on entry, ElementEnd on return
89
+ |type[Name] CONTENT ; MARK on entry, emit Name with content on return
90
+ |type[INT] INTERNAL ; No emit - internal computation only
91
+ ```
92
+
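As a rough illustration of the table above, the following hand-written Rust sketch shows how the two emitting categories could map onto callbacks. The `Event` enum and the functions `parse_name`, `parse_element`, and `collect_events` are hypothetical names chosen for this sketch; the code descent actually generates may look different.

```rust
/// Hypothetical event type for this sketch; the real generated enum may differ.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    ElementStart,
    ElementEnd,
    Name(&'a [u8]),
}

/// CONTENT-style function: MARK on entry, emit the marked slice on return.
fn parse_name<'a, F: FnMut(Event<'a>)>(input: &'a [u8], pos: &mut usize, on_event: &mut F) {
    let mark = *pos; // MARK on entry
    while *pos < input.len() && input[*pos].is_ascii_alphanumeric() {
        *pos += 1;
    }
    on_event(Event::Name(&input[mark..*pos])); // emit content on return
}

/// BRACKET-style function: emit Start on entry and End on return.
fn parse_element<'a, F: FnMut(Event<'a>)>(input: &'a [u8], pos: &mut usize, on_event: &mut F) {
    on_event(Event::ElementStart);
    parse_name(input, pos, on_event);
    on_event(Event::ElementEnd);
}

/// Convenience wrapper that collects every event into a vector.
fn collect_events(input: &[u8]) -> Vec<Event<'_>> {
    let mut events = Vec::new();
    let mut pos = 0;
    parse_element(input, &mut pos, &mut |e| events.push(e));
    events
}
```

An `INTERNAL` function would simply omit both `on_event` calls.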
93
+ ### Functions
94
+
95
+ ```
96
+ |function[name] ; Void function (no auto-emit)
97
+ |function[name:ReturnType] ; Returns/emits ReturnType
98
+ |function[name:Type] :param ; With parameter
99
+ |function[name:Type] :p1 :p2 ; Multiple parameters
100
+ ```
101
+
102
+ Functions returning `BRACKET` types automatically emit Start on entry, End on return.
103
+ Functions returning `CONTENT` types automatically MARK on entry, emit content on return.
104
+
105
+ **Parameterized byte matching:**
106
+
107
+ Parameters used in `|c[:param]|` become byte (`u8`) type, enabling functions to work
108
+ with different terminators:
109
+
110
+ ```
111
+ ; A single function handles [], {}, () by parameterizing the close char
112
+ |function[bracketed:Bracketed] :close
113
+ |c[:close] | -> |return ; Match against param value
114
+ |default | /content(:close) |>> ; Pass param to nested calls
115
+ ```
116
+
117
+ Called as: `/bracketed(']')` for `]`, `/bracketed('}')` for `}`, `/bracketed(')')` for `)`.
118
+ (Legacy syntax `/bracketed(<R>)` etc. also works.)
119
+
120
+ This eliminates duplicate functions that differ only in their terminator character.
121
+
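To make the benefit concrete, here is a minimal Rust sketch of the same idea: one function takes the closing byte as a `u8` parameter instead of existing in three copies. The name `find_close` is hypothetical and the body is only an illustration, not generated output.

```rust
/// Scans forward from `start` and returns the index of the first `close` byte,
/// or `None` if the input ends first (or `start` is out of range).
fn find_close(input: &[u8], start: usize, close: u8) -> Option<usize> {
    input
        .get(start..)?
        .iter()
        .position(|&b| b == close)
        .map(|offset| start + offset)
}
```

The same function serves `]`, `}`, and `)` by passing `b']'`, `b'}'`, or `b')'` as `close`.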
122
+ ### States
123
+
124
+ ```
125
+ |function[element:Element] :col
126
+ |state[:identity]
127
+ |LETTER |.name | /name |>> :after
128
+ |default |.anon | |>> :content
129
+
130
+ |state[:after]
131
+ |c['\n'] |.eol | -> |>> :children
132
+ |default |.inline | /text(col) |>> :children
133
+ ```
134
+
135
+ ### Cases (Character Matching)
136
+
137
+ Each case has: `match | actions | transition`
138
+
139
+ ```
140
+ |c[x] | actions |>> :state ; Match single char
141
+ |c['\n'] | actions |>> ; Match newline (self-loop)
142
+ |c[' \t'] | actions |return ; Match space or tab (quoted)
143
+ |c[abc] | actions |>> :next ; Match any of a, b, c
144
+ |c[<0-9>] | actions |>> :digit ; Match digit (using class syntax)
145
+ |c[<LETTER '_'>]| actions |>> :ident ; Match letter or underscore
146
+ |LETTER | actions |>> :name ; Match ASCII letter (a-z, A-Z)
147
+ |LABEL_CONT | actions |>> ; Match letter/digit/_/-
148
+ |DIGIT | actions |>> :num ; Match ASCII digit (0-9)
149
+ |HEX_DIGIT | actions |>> :hex ; Match hex digit (0-9, a-f, A-F)
150
+ |default | actions |>> :other ; Fallback case
151
+ ```
152
+
153
+ **Note:** Characters outside `/A-Za-z0-9_-/` must be quoted in `c[...]`:
154
+ - `c['"']` for double quote, `c['#']` for hash, `c['.']` for dot
155
+ - `c[<P>]` or `c['|']` for pipe (DSL delimiter)
156
+ - `c[<L>]` or `c['[']`, and `c[<R>]` or `c[']']`, for brackets (DSL delimiters)
157
+
158
+ ### Character, String, and Class Literals
159
+
160
+ The DSL supports three literal types. See [characters.md](characters.md) for complete specification.
161
+
162
+ | Type | Syntax | Semantics |
163
+ |------|--------|-----------|
164
+ | **Character** | `'x'` | Single byte or Unicode codepoint |
165
+ | **String** | `'hello'` | Ordered sequence (decomposed to chars in classes) |
166
+ | **Class** | `<...>` | Unordered set of characters |
167
+
168
+ **Character class syntax** (`<...>`):
169
+
170
+ ```
171
+ <abc> ; Bare lowercase decomposed: a, b, c
172
+ <'|'> ; Quoted character (for special chars)
173
+ <LETTER> ; Predefined class
174
+ <0-9> ; Predefined range (digits)
175
+ <LETTER 0-9 '_'> ; Combined: letters, digits, underscore
176
+ <:var> ; Include variable's chars
177
+ ```
178
+
179
+ **Predefined ranges:**
180
+
181
+ | Name | Characters |
182
+ | ------------- | --------------- |
183
+ | `0-9` | Decimal digits |
184
+ | `0-7` | Octal digits |
185
+ | `0-1` | Binary digits |
186
+ | `a-z` | Lowercase ASCII |
187
+ | `A-Z` | Uppercase ASCII |
188
+ | `a-f` / `A-F` | Hex letters |
189
+
190
+ **Predefined classes:**
191
+
192
+ | Name | Description |
193
+ | ------------ | ------------------------------ |
194
+ | `LETTER` | `a-z` + `A-Z` |
195
+ | `DIGIT` | `0-9` |
196
+ | `HEX_DIGIT` | `0-9` + `a-f` + `A-F` |
197
+ | `LABEL_CONT` | `LETTER` + `DIGIT` + `_` + `-` |
198
+ | `WS` | Space + tab |
199
+ | `NL` | Newline |
200
+
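The predefined classes in the table above correspond to simple byte predicates. The sketch below spells them out in Rust; the function names are hypothetical, and the generated code may express these tests differently (for example as `match` arms).

```rust
/// LETTER: `a-z` + `A-Z`
fn is_letter(b: u8) -> bool {
    b.is_ascii_alphabetic()
}

/// DIGIT: `0-9`
fn is_digit(b: u8) -> bool {
    b.is_ascii_digit()
}

/// HEX_DIGIT: `0-9` + `a-f` + `A-F`
fn is_hex_digit(b: u8) -> bool {
    b.is_ascii_hexdigit()
}

/// LABEL_CONT: LETTER + DIGIT + `_` + `-`
fn is_label_cont(b: u8) -> bool {
    is_letter(b) || is_digit(b) || b == b'_' || b == b'-'
}

/// WS: space + tab
fn is_ws(b: u8) -> bool {
    b == b' ' || b == b'\t'
}

/// NL: newline
fn is_nl(b: u8) -> bool {
    b == b'\n'
}
```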
201
+ **Unicode classes** (requires `unicode-xid` crate):
202
+
203
+ | Name | Description |
204
+ | ------------ | ----------------------------- |
205
+ | `XID_START` | Unicode identifier start |
206
+ | `XID_CONT` | Unicode identifier continue |
207
+ | `XLBL_START` | = `XID_START` (label start) |
208
+ | `XLBL_CONT` | `XID_CONT` + `-` (kebab-case) |
209
+
210
+ **DSL-reserved single-char classes:**
211
+
212
+ | Name | Char | Name | Char |
213
+ | ---- | ---- | ---- | ---- |
214
+ | `P` | `\|` | `SQ` | `'` |
215
+ | `L` | `[` | `DQ` | `"` |
216
+ | `R` | `]` | `BS` | `\` |
217
+ | `LB` | `{` | `LP` | `(` |
218
+ | `RB` | `}` | `RP` | `)` |
219
+
220
+ **Escape sequences** (in quoted strings):
221
+
222
+ | Syntax | Result |
223
+ | -------- | ----------------- |
224
+ | `\n` | Newline |
225
+ | `\t` | Tab |
226
+ | `\r` | Carriage return |
227
+ | `\\` | Backslash |
228
+ | `\'` | Single quote |
229
+ | `\xHH` | Hex byte |
230
+ | `\uXXXX` | Unicode codepoint |
231
+
232
+ ### Actions
233
+
234
+ Actions are pipe-separated and execute left to right:
235
+
236
+ ```
237
+ | -> ; Advance one character
238
+ | ->['\n'] ; Advance TO newline (SIMD scan, 1-6 chars)
239
+ | ->['"\''] ; Advance TO first " or ' (multi-char scan)
240
+ | MARK ; Mark position for accumulation
241
+ | TERM ; Terminate slice (MARK to current)
242
+ | TERM(-1) ; Terminate slice excluding the last byte (TERM(-N) drops the last N bytes)
243
+ | /function ; Call function
244
+ | /function(args) ; Call with arguments
245
+ | /error ; Emit error event (add |return to exit)
246
+ | /error(CustomCode) ; Emit error with custom code
247
+ | var = value ; Assignment
248
+ | var += 1 ; Increment
249
+ | PREPEND('|') ; Prepend literal to next accumulated content
250
+ | PREPEND('**') ; Prepend multi-byte literal
251
+ | PREPEND(:param) ; Prepend parameter bytes (empty = no-op)
252
+ ```
253
+
254
+ **Advance-to (`->[...]`)**: SIMD-accelerated scan using memchr. Supports 1-6 literal
255
+ bytes only. Does NOT support character classes (LETTER, DIGIT) or parameter refs (:param).
256
+ Use quoted chars for special bytes: `->['\n\t']`, `->['|']`.
257
+
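The behaviour of advance-to can be approximated with the standard library alone. The sketch below is a stand-in that uses `Iterator::position` rather than the `memchr` crate named above, so it is not SIMD-accelerated; the function name `advance_to` and the choice to stop at end of input when nothing matches are assumptions of this sketch.

```rust
/// Advances `pos` to the first byte contained in `targets`.
/// Returns the byte that stopped the scan, or `None` after moving `pos`
/// to the end of input when no target byte is found (assumed EOF behaviour).
/// Expects `*pos <= input.len()`.
fn advance_to(input: &[u8], pos: &mut usize, targets: &[u8]) -> Option<u8> {
    // The README limits advance-to to 1-6 literal bytes.
    debug_assert!((1..=6).contains(&targets.len()));
    match input[*pos..].iter().position(|b| targets.contains(b)) {
        Some(offset) => {
            *pos += offset;
            Some(input[*pos])
        }
        None => {
            *pos = input.len();
            None
        }
    }
}
```

Note that the cursor stops on the matching byte rather than after it, mirroring "advance TO".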
258
+ **Error handling (`/error`)**: Emits an Error event. Custom error codes
259
+ are converted to PascalCase. Add `|return` to exit after the error:
260
+
261
+ ```
262
+ |c['\t'] | /error(no_tabs) |return ; Emits [Error, "NoTabs"]
263
+ |c[' '] | /error(no_spaces) |return ; Emits [Error, "NoSpaces"]
264
+ |default | /error |return ; Emits [Error, "UnexpectedChar"]
265
+ ```
266
+
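The PascalCase conversion of custom error codes can be sketched as follows. This is only an illustration of the naming rule shown in the examples above; the generator performs the conversion itself, and the helper name `pascal_case` is hypothetical.

```rust
/// Converts a snake_case error code such as `no_tabs` into PascalCase (`NoTabs`).
fn pascal_case(code: &str) -> String {
    code.split('_')
        .filter(|part| !part.is_empty())
        .map(|part| {
            let mut chars = part.chars();
            match chars.next() {
                // Uppercase the first character and keep the rest unchanged.
                Some(first) => first.to_ascii_uppercase().to_string() + chars.as_str(),
                None => String::new(),
            }
        })
        .collect()
}
```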
267
+ `PREPEND` adds bytes to the accumulation buffer that will be combined with the
268
+ next `TERM` result. This is useful for restoring consumed characters during
269
+ lookahead. Parameters used in `PREPEND(:param)` become `&'static [u8]` type,
270
+ allowing empty (`<>`), single byte (`'x'`), or multi-byte (`'**'`) values.
271
+ Empty content is naturally a no-op.
272
+
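One way to picture `PREPEND` is as a pending byte string that is joined with the next `TERM` slice. The Rust sketch below models that under stated assumptions: the `Accumulator` type and its methods are hypothetical, and returning a `Cow` (borrowed when nothing is pending, owned otherwise) is a design choice of this sketch rather than a description of the generated code.

```rust
use std::borrow::Cow;

/// Hypothetical accumulation state: input, MARK position, cursor, pending PREPEND bytes.
struct Accumulator<'a> {
    input: &'a [u8],
    mark: usize,
    pos: usize,
    prepend: &'static [u8],
}

impl<'a> Accumulator<'a> {
    fn new(input: &'a [u8]) -> Self {
        Accumulator { input, mark: 0, pos: 0, prepend: b"" }
    }

    /// Moves the cursor forward by `n` bytes, clamped to the end of input.
    fn advance(&mut self, n: usize) {
        self.pos = (self.pos + n).min(self.input.len());
    }

    /// MARK: remember the current position as the start of the slice.
    fn mark(&mut self) {
        self.mark = self.pos;
    }

    /// PREPEND: store bytes to put in front of the next TERM result.
    fn prepend(&mut self, bytes: &'static [u8]) {
        self.prepend = bytes;
    }

    /// TERM: return MARK..pos, with any pending PREPEND bytes in front.
    fn term(&mut self) -> Cow<'a, [u8]> {
        let input: &'a [u8] = self.input;
        let slice = &input[self.mark..self.pos];
        if self.prepend.is_empty() {
            Cow::Borrowed(slice) // empty PREPEND is a no-op
        } else {
            let mut owned = self.prepend.to_vec();
            owned.extend_from_slice(slice);
            self.prepend = b"";
            Cow::Owned(owned)
        }
    }
}
```

In this model, a parser that consumed a `|` during lookahead and then decided it was plain text would call `prepend(b"|")` before accumulating the rest, so the emitted content still starts with the pipe.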
273
+ ### Inline Literal Events
274
+
275
+ For emitting events directly with literal or accumulated content:
276
+
277
+ ```
278
+ | TypeName ; Emit event with no payload (BoolTrue, Nil)
279
+ | TypeName('value') ; Emit with literal value (e.g., Attr('$id'), Attr('?'))
280
+ | TypeName(USE_MARK) ; Emit using current MARK/TERM content
281
+ ```
282
+
283
+ Examples:
284
+ ```
285
+ |c['?'] | Attr('?') | BoolTrue | -> |return ; Emit Attr "?", BoolTrue
285
+ |c[<L>] | Attr('$id') | /value |>> :next ; Emit Attr "$id", parse value
287
+ |default | TERM | Text(USE_MARK) |return ; Emit Text with accumulated
288
+ ```
289
+
290
+ ### Transitions
291
+
292
+ ```
293
+ |>> ; Self-loop (stay in current state)
294
+ |>> :state ; Go to named state
295
+ |return ; Return from function
296
+ ```
297
+
298
+ ### Conditionals
299
+
300
+ Single-line guards only (no block structure):
301
+
302
+ ```
303
+ |if[COL <= col] |return ; If true, return
304
+ | |>> :continue ; Else, continue
305
+ ```
306
+
307
+ ### Special Variables
308
+
309
+ | Variable | Meaning |
310
+ |----------|--------------------------------------|
311
+ | `COL` | Current column (1-indexed) |
312
+ | `LINE` | Current line (1-indexed) |
313
+ | `PREV` | Previous byte (0 at start of input) |
314
+
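The table above fixes the starting values of the special variables (`COL` and `LINE` are 1-indexed, `PREV` is 0 at the start of input). A minimal Rust sketch of a cursor that maintains them might look like this; the `Cursor` type is hypothetical, and counting columns in bytes with a reset to 1 after each newline is an assumption of this sketch.

```rust
/// Hypothetical cursor tracking the special variables LINE, COL, and PREV.
struct Cursor<'a> {
    input: &'a [u8],
    pos: usize,
    line: u32, // LINE, 1-indexed
    col: u32,  // COL, 1-indexed (counted in bytes here)
    prev: u8,  // PREV, 0 at start of input
}

impl<'a> Cursor<'a> {
    fn new(input: &'a [u8]) -> Self {
        Cursor { input, pos: 0, line: 1, col: 1, prev: 0 }
    }

    /// Consumes one byte and updates LINE, COL, and PREV. Returns false at EOF.
    fn advance(&mut self) -> bool {
        let byte = match self.input.get(self.pos) {
            Some(&b) => b,
            None => return false,
        };
        self.pos += 1;
        self.prev = byte;
        if byte == b'\n' {
            self.line += 1;
            self.col = 1;
        } else {
            self.col += 1;
        }
        true
    }
}
```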
315
+ ### Keywords (phf Perfect Hash)
316
+
317
+ For keyword matching (like `true`/`false`/`null`), use phf perfect hash for O(1) lookup:
318
+
319
+ ```
320
+ |keywords[bare] :fallback /identifier
321
+ | true => BoolTrue
322
+ | false => BoolFalse
323
+ | null => Nil
324
+ | nil => Nil
325
+
326
+ |function[value]
327
+ |state[:main]
328
+ |LABEL_CONT |.cont | -> |>>
329
+ |default |.done | TERM | KEYWORDS(bare) |return
330
+ ```
331
+
332
+ - `|keywords[name]` defines a keyword map
333
+ - `:fallback /function` specifies what to call if no match
334
+ - `| keyword => EventType` maps keywords to event types
335
+ - `KEYWORDS(name)` in actions triggers the lookup
336
+
337
+ This generates a `phf_map!` whose perfect hash is computed at compile time, so each keyword lookup is O(1) at run time.
338
+
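The effect of `KEYWORDS(bare)` can be illustrated without the `phf` crate. The sketch below uses a plain `match` as a standard-library stand-in for the generated perfect-hash map; the `Event::Identifier` variant and the function `lookup_bare` are hypothetical and merely model the `:fallback /identifier` path.

```rust
/// Hypothetical events for this sketch.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    BoolTrue,
    BoolFalse,
    Nil,
    Identifier(&'a [u8]), // stands in for whatever the fallback function emits
}

/// Maps the terminated slice to a keyword event, or falls back to an identifier.
fn lookup_bare(term: &[u8]) -> Event<'_> {
    match term {
        b"true" => Event::BoolTrue,
        b"false" => Event::BoolFalse,
        b"null" | b"nil" => Event::Nil,
        other => Event::Identifier(other), // no keyword matched: fallback
    }
}
```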
339
+ ## Automatic Optimizations
340
+
341
+ ### SCAN Inference
342
+
343
+ If a state has a self-looping default case (`|default | -> |>>`), the explicit
344
+ character cases become SCAN targets for SIMD-accelerated bulk scanning.
345
+
346
+ ```
347
+ |state[:prose]
348
+ |c['\n'] |.eol | ... |>> :next
349
+ |c[<P>] |.pipe | ... |>> :check
350
+ |default |.collect | -> |>> ; ← triggers auto-SCAN
351
+ ```
352
+
353
+ The generator detects this and uses `memchr` to scan for `\n` and `|` in bulk.
354
+
355
+ ### EOF Handling
356
+
357
+ By default, the generator infers EOF behavior based on return type:
358
+
359
+ - `BRACKET` → emit End event
360
+ - `CONTENT` → emit content event (finalizing any active MARK)
361
+ - `INTERNAL` / void → just return
362
+
363
+ The generator also infers unclosed-delimiter errors from the structure of return
364
+ cases (e.g., a string function that only returns on `"` will emit `UnclosedStringValue`
365
+ if EOF is reached).
366
+
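The default rule above amounts to a small decision table keyed on the function's return category. The following Rust sketch restates it; `Category`, `EofAction`, and `default_eof_action` are hypothetical names, and the sketch does not model the unclosed-delimiter inference.

```rust
/// Type categories from the Type System section.
#[derive(Clone, Copy)]
enum Category {
    Bracket,
    Content,
    Internal,
}

/// What a function does by default when it reaches EOF.
#[derive(Debug, PartialEq)]
enum EofAction {
    EmitEnd,     // BRACKET: emit the End event
    EmitContent, // CONTENT: finalize the active MARK and emit the content
    JustReturn,  // INTERNAL or void: nothing to emit
}

/// `None` represents a void function (no return type).
fn default_eof_action(return_type: Option<Category>) -> EofAction {
    match return_type {
        Some(Category::Bracket) => EofAction::EmitEnd,
        Some(Category::Content) => EofAction::EmitContent,
        Some(Category::Internal) | None => EofAction::JustReturn,
    }
}
```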
367
+ For explicit control, use the `|eof` directive:
368
+
369
+ ```
370
+ |function[number:Number]
371
+ |state[:main]
372
+ |DIGIT |.digit | -> |>>
373
+ |default |.done | TERM | Integer(USE_MARK) |return
374
+ |eof |.eof | TERM | Integer(USE_MARK) |return
375
+ ```
376
+
377
+ This is useful when EOF should emit a different type than the function's return type,
378
+ or when you need specific actions at EOF (like inline emits).
379
+
380
+ ## Example: Line Parser
381
+
382
+ ```
383
+ |parser lines
384
+
385
+ |type[Text] CONTENT
386
+
387
+ |entry-point /document
388
+
389
+ |function[document]
390
+ |state[:main]
391
+ |c['\n'] |.blank | -> |>>
392
+ |default |.start | /line |>>
393
+
394
+ |function[line:Text]
395
+ |state[:main]
396
+ |c['\n'] |.eol | -> |return
397
+ |default |.collect | -> |>>
398
+ ```
399
+
400
+ What the generator infers:
401
+ - `line` returns `CONTENT` type → MARK on entry, emit Text on return
402
+ - `line:main` has self-looping default → SCAN for `\n`
403
+ - EOF in `line` → emit accumulated Text, return
404
+ - EOF in `document` → just return (void function)
405
+
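For comparison, here is a hand-written Rust sketch of a callback parser that behaves like the line parser described above. It is not generator output: `Parser`, `Event`, and `collect_lines` are hypothetical, `Iterator::position` stands in for the inferred SCAN, and the sketch assumes that the emitted `Text` excludes the terminating newline.

```rust
/// Hypothetical event type with the single CONTENT type from the example.
#[derive(Debug, PartialEq)]
enum Event<'a> {
    Text(&'a [u8]),
}

struct Parser<'a> {
    input: &'a [u8],
    pos: usize,
}

impl<'a> Parser<'a> {
    fn new(input: &'a [u8]) -> Self {
        Parser { input, pos: 0 }
    }

    pub fn parse<F: FnMut(Event<'a>)>(mut self, mut on_event: F) {
        self.parse_document(&mut on_event);
    }

    /// Void function: skips blank lines, calls `line` otherwise, just returns at EOF.
    fn parse_document<F: FnMut(Event<'a>)>(&mut self, on_event: &mut F) {
        while self.pos < self.input.len() {
            if self.input[self.pos] == b'\n' {
                self.pos += 1; // .blank
            } else {
                self.parse_line(on_event); // .start
            }
        }
    }

    /// CONTENT function: MARK on entry, scan for `\n`, emit Text on return or EOF.
    fn parse_line<F: FnMut(Event<'a>)>(&mut self, on_event: &mut F) {
        let input = self.input;
        let mark = self.pos; // MARK on entry
        let end = input[mark..]
            .iter()
            .position(|&b| b == b'\n') // stand-in for the SIMD SCAN
            .map_or(input.len(), |offset| mark + offset);
        on_event(Event::Text(&input[mark..end]));
        // Step past the newline if one was found; otherwise stay at EOF.
        self.pos = if end < input.len() { end + 1 } else { end };
    }
}

/// Collects the content of every Text event.
fn collect_lines(input: &[u8]) -> Vec<&[u8]> {
    let mut lines = Vec::new();
    Parser::new(input).parse(|Event::Text(text)| lines.push(text));
    lines
}
```

Because the events borrow directly from the input, collecting them requires no copying.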
406
+ ## Example: Element Parser
407
+
408
+ ```
409
+ |parser elements
410
+
411
+ |type[Element] BRACKET
412
+ |type[Name] CONTENT
413
+ |type[Text] CONTENT
414
+
415
+ |entry-point /document
416
+
417
+ |function[document]
418
+ |state[:main]
419
+ |c['\n'] |.blank |-> |>>
420
+ |c[<P>] |.pipe |-> | /element(0)|>>
421
+ |default |.text |/text(0) |>>
422
+
423
+ |function[element:Element] :col
424
+ |state[:identity]
425
+ |LETTER |.name |/name |>> :after
426
+ |default |.anon | |>> :after
427
+ |state[:after]
428
+ |c['\n'] |.eol |-> |>> :children
429
+ |default |.text |/text(col) |>> :children
430
+ |state[:children]
431
+ |c['\n'] |.blank |-> |>>
432
+ |c[' \t'] |.ws |-> |>> :check
433
+ |default |.dedent | |return
434
+ |state[:check]
435
+ |if[COL <= col] | |return
436
+ |c[<P>] |.child |-> | /element(COL) |>> :children
437
+ |default |.text | /text(COL) |>> :children
438
+
439
+ |function[name:Name]
440
+ |state[:main]
441
+ |LABEL_CONT |.cont |-> |>>
442
+ |default |.done | |return
443
+
444
+ |function[text:Text] :col
445
+ |state[:main]
446
+ |c['\n'] |.eol | |return
447
+ |default |.collect |-> |>>
448
+ ```
449
+
450
+ What the generator produces:
451
+ - `element` returns `BRACKET` → emit `ElementStart` on entry, `ElementEnd` on return
452
+ - `name` returns `CONTENT` → MARK on entry, emit `Name` on return
453
+ - Recursive `/element(COL)` calls naturally handle nesting
454
+ - Column-based dedent via `|if[COL <= col]` unwinds the call stack
455
+
456
+ ## Generated Code
457
+
458
+ descent generates callback-based parsers (2-7x faster than ring-buffer alternatives):
459
+
460
+ ```rust
461
+ impl<'a> Parser<'a> {
461
+     pub fn parse<F>(mut self, mut on_event: F)
462
+     where
463
+         F: FnMut(Event<'a>),
464
+     {
465
+         self.parse_document(&mut on_event);
466
+     }
467
+
468
+     fn parse_element<F: FnMut(Event<'a>)>(&mut self, col: i32, on_event: &mut F) {
469
+         on_event(Event::ElementStart { .. });
470
+         // ... parse content ...
471
+         // ... recursive calls for children ...
472
+         on_event(Event::ElementEnd { .. });
473
+     }
474
+ }
476
+ ```
477
+
478
+ ### Debug Tracing
479
+
480
+ Generate a parser with `--trace` and the generated parser writes detailed execution traces to stderr as it runs:
481
+
482
+ ```bash
483
+ descent generate --trace parser.desc > parser.rs
484
+ ```
485
+
486
+ Trace output shows the exact execution path through the parser:
487
+
488
+ ```
489
+ TRACE: L14 ENTER document | byte='H' pos=0
490
+ TRACE: L16 document:main.collect | byte='H' term=[] pos=0
491
+ TRACE: L16 document:main.collect | byte='e' term=["H"] pos=1
492
+ TRACE: L15 document:main EOF | term=["Hello"] pos=5
493
+ ```
494
+
495
+ Each line shows:
496
+ - **L14**: Source line number from the `.desc` file
497
+ - **document:main.collect**: Function name, state name, and case label (substate)
498
+ - **byte='H'**: Current byte being processed
499
+ - **term=["H"]**: Accumulated content in the term buffer
500
+ - **pos=0**: Current position in input
501
+
502
+ This is invaluable for debugging parser behavior and understanding how the
503
+ generated state machine processes input.
504
+
505
+ ## Architecture
506
+
507
+ ```
508
+ ┌──────────────┐
509
+ parser.desc ───▶ │ Lexer │ ───▶ Tokens
510
+ └──────────────┘
511
+
512
+
513
+ ┌──────────────┐
514
+ │ Parser │ ───▶ AST
515
+ └──────────────┘
516
+
517
+
518
+ ┌──────────────┐
519
+ │ IR Builder │ ───▶ IR (with inferred SCAN, etc.)
520
+ └──────────────┘
521
+
522
+
523
+ ┌──────────────┐ ┌─────────────────────┐
524
+ │ Generator │◀──────│ templates/rust/ │
525
+ │ (Liquid) │ │ templates/c/ │
526
+ └──────────────┘ └─────────────────────┘
527
+
528
+
529
+ parser.rs
530
+ ```
531
+
532
+ - **Lexer**: Tokenizes pipe-delimited `.desc` format
533
+ - **Parser**: Builds AST from tokens (pure Data structs)
534
+ - **IR Builder**: Semantic analysis, SCAN inference, validation
535
+ - **Generator**: Renders IR through Liquid templates
536
+
537
+ **Adding new commands**: The parser uses `command_like?()` to detect command tokens
538
+ generically (uppercase words, `/function`, `->`, `>>`). To add a new command like `RESET`:
539
+
540
+ 1. `parser.rb`: Add to `classify_command()` - what type is it?
541
+ 2. `ir_builder.rb`: Add to command transformation - what args does it have?
542
+ 3. `_command.liquid`: Add rendering - what Rust code does it generate?
543
+
544
+ No changes needed to case detection or structural parsing.
545
+
546
+ ## Targets
547
+
548
+ | Target | Status | Output |
549
+ |--------|--------|--------|
550
+ | Rust | Working | Single `.rs` file with callback API |
551
+ | C | Not implemented | `.c` + `.h` files (planned) |
552
+
553
+ ## Bootstrapping
554
+
555
+ The `.desc` format is valid UDON. When the UDON parser (generated by descent)
556
+ is mature, descent can use it to parse its own input format - the tool generates
557
+ the parser that parses its own specifications.
558
+
559
+ ## Development
560
+
561
+ ```bash
562
+ # Install dependencies
563
+ bundle install
564
+
565
+ # Run tests
566
+ dx test
567
+
568
+ # Lint
569
+ dx lint
570
+ dx lint --fix
571
+
572
+ # Build and install locally
573
+ dx gem install
574
+ ```
575
+
576
+ ## Related
577
+
578
+ - [libudon](https://github.com/josephwecker/libudon) - The UDON parser library (uses descent)
579
+ - [UDON Specification](https://github.com/josephwecker/udon) - The UDON markup language
580
+
581
+ ## License
582
+
583
+ MIT