@plurnk/plurnk-grammar 0.1.1 → 0.2.0

package/SPEC.md DELETED
@@ -1,625 +0,0 @@
1
- # Plurnk Grammar Specification
2
-
3
- ## 1. Overview
4
-
5
- Plurnk extends HEREDOC formatting into a state-machine grammar for LLM
6
- agents. Every plurnk statement is a single self-contained operation: a
7
- canonical open tag, an optional payload, and a colon-fenced opaque body
8
- terminated by a matching close tag. Statements are flat — there is no
9
- composition or substitution. Documents may contain arbitrary
10
- interstatement text, which the parser captures verbatim and surfaces
11
- to consumers without imposing meaning on it.
12
-
13
- The parser produces a typed AST (per OP discriminated union) plus a
14
- list of structured errors. Both are JSON-serializable. Errors are
15
- per-statement; the parser recovers at statement boundaries when it
16
- can, and surfaces an `unparsedTail` when a boundary-destroying error
17
- prevents further recovery. See §12 for the consumer contract.
18
-
19
- Note: SEND status codes (§9) are a *protocol-level* convention for
20
- SEND statements emitted by the model and runtime. They are unrelated
21
- to parse-time `PlurnkParseError` objects produced by this package (§13).
22
-
23
- ## 1.1 Domain Boundary
24
-
25
- The grammar is purely syntactic. A rule belongs in the grammar if and
26
- only if it can be expressed as a shape constraint on character
27
- sequences. A rule has crossed into runtime as soon as it requires any
28
- of:
29
-
30
- - **Variables** — state held across statements, named bindings, references to prior values.
31
- - **Magic numbers** — values that carry semantic weight (`410` means "Gone," `200` means "OK"); the grammar accepts the digit string, not its meaning.
32
- - **Embedded code** — executing language fragments to determine well-formedness (compiling a regex, validating an xpath, resolving a URI).
33
-
34
- Anything that fits inside that constraint belongs in `.g4`. Anything
35
- that needs interpretation belongs in the runtime resolver.
36
-
37
- **Concretely in domain — parser-managed (lexer + parser):**
38
-
39
- - Statement structure: open tag, slots, body, close tag.
40
- - Lexical tokens: `<<`, OP keywords, `[…]`, `(…)`, `<N>`, `:`, body, close tag, interstatement TEXT.
41
- - Slot *shape* constraints: URI shape (scheme grammar + path character class), line-marker integer form, suffix character class, CSV form of `[signal]`.
42
- - HEREDOC discipline: open/close tag character match, body opacity between `:body:` fences, nesting via suffix.
43
- - Whitespace rules (§11).
44
- - Hard constraint: `:OPsuffix` close tag must character-match the open tag's `OPsuffix`.
45
-
46
- **Concretely in domain — Visitor-managed (typed AST construction):**
47
-
48
- - Extracting `op`, `suffix`, `signal` (split on comma), `path` (raw),
49
- `lineMarker` (parsed `<N>` or `<N-M>` integer form), and `body` (raw)
50
- from the parse tree into a typed discriminated union.
51
- - Native-JS validation of slot contents where useful (e.g., `new URL()`
52
- for path, `new RegExp()` for regex bodies). This is preferred over
53
- ANTLR sub-grammars for URI/regex/xpath/jsonpath — Node's built-ins are
54
- authoritative, well-tested, and zero-cost to invoke.
55
-
56
- **Concretely out of domain — runtime:**
57
-
58
- - URI resolution: what `known://`, `unknown://`, `file://` actually point at; what bare paths resolve to.
59
- - Tag-matching combination (AND/OR), tag-set semantics.
60
- - Line-marker arithmetic, out-of-range handling, result-set ordering for pagination.
61
- - Status code *meanings*: any digit string is grammatically valid in `[signal]`; whether `[410]` means "Gone" or any code carries privileged semantics on any OP is runtime convention.
62
- - Empty-body semantics (e.g., empty EDIT clears the entry).
63
- - EXEC body execution: runtime selection, sandboxing, permissions.
64
- - Filter composition (how SHOW/HIDE combine path × tag × body filters).
65
- - Output shape returned to the model after a statement executes. The §4 Per-OP Output table documents convention, not grammar rules.
66
-
67
- ## 2. Canonical Statement Form
68
-
69
- ```
70
- <<OPsuffix [signal]? (path)? <L>? : body? :OPsuffix
71
- ```
72
-
73
- The `:` characters fence the body. Everything between the opening `:`
74
- and the closing `:OPsuffix` literal is body, verbatim. This is what
75
- makes plurnk solve grammatical enclosure: body content is fully opaque
76
- to OP keywords, modifier-like characters, and the protocol's own
77
- syntax.
78
-
79
- Optionality:
80
-
81
- | Element | Status |
82
- |-------------|---------------|
83
- | `<<` | required |
84
- | `OP` | required |
85
- | `suffix` | optional; used for nesting and `:OPkeyword` escape (see §8) |
86
- | `[signal]` | optional, OP-dependent contents |
87
- | `(path)` | required for all OPs except SEND |
88
- | `<L>` | optional; single position or range (see §7) |
89
- | `:` | required (header → body delimiter) |
90
- | `body` | optional, OP-dependent meaning |
91
- | `:OPsuffix` | required (close tag: `:` + open tag's OP and suffix, character-matching) |
92
-
93
- Hard constraints:
94
-
95
- - Close-tag `:OPsuffix` must character-match the open tag's `OPsuffix`.
96
- - Header elements appear in the order shown above (signal, then path, then `<L>`, then `:`).
97
-
98
- All other restrictions are runtime concerns, not grammar concerns.
99
-
100
- ## 3. Lexical Elements
101
-
102
- - `<<` — open delimiter.
103
- - `OP` — exactly one of: `FIND`, `READ`, `EDIT`, `COPY`, `MOVE`, `SHOW`, `HIDE`, `SEND`, `EXEC`.
104
- - `suffix` — `[A-Za-z0-9_]*` immediately concatenated to `OP`, no separator.
105
- - `[` … `]` — signal slot; contents are OP-dependent (see §4).
106
- - `(` … `)` — path slot; contents are a URI (see §5).
107
- - `<L>` — line marker. Shape: `<` `-?[0-9]+` (`-` `-?[0-9]+`)? `>`. A single signed integer denotes a position; two signed integers separated by `-` denote an inclusive range.
108
- - `:` — body delimiter. Appears between header and body, and (with the OP+suffix following) at the close.
109
- - `body` — opaque byte stream between the opening `:` and the matching close tag `:OPsuffix`.
110
- - `:OPsuffix` close — `:` immediately followed by the open tag's `OP` and `suffix` (character-matching, no whitespace).
111
-
112
- ## 4. Per-OP Semantics
113
-
114
- | OP     | `[signal]`        | `(path)` | `body`                  | `<L>`         |
115
- |--------|-------------------|----------|-------------------------|---------------|
116
- | FIND | tag filter (CSV) | required | pattern matcher | result-set pagination |
117
- | READ | tag filter (CSV) | required | pattern matcher | per-entry lines |
118
- | EDIT | tags (CSV) | required | content (empty body clears the entry) | entry lines |
119
- | COPY | tags to apply (CSV) | required | destination URI | entry lines |
120
- | MOVE | tags to apply (CSV) | required | destination URI | entry lines |
121
- | SHOW | tag filter (CSV) | required | optional pattern matcher | result-set pagination |
122
- | HIDE | tag filter (CSV) | required | optional pattern matcher | result-set pagination |
123
- | SEND | HTTP status code (single integer) | optional | message payload (JSON by convention for structured responses) | not applicable |
124
- | EXEC | runtime tag (single string; `sh` default, `node`, `python`, …) | required | command or code snippet | not applicable |
125
-
126
- The `<L>` slot is optional. Its referent shifts by OP (per the column
127
- above) but the syntax is uniform: a single integer denotes one
128
- position, an integer range `<N-M>` selects items at positions `N..M`
129
- (inclusive) from whatever sequence the OP operates on or produces.
130
-
131
- EDIT line-marker semantics (single source of authority):
132
-
133
- - No `<L>` + body present: replace entire entry contents with body.
134
- - No `<L>` + no body: clear entry contents (empty replacement).
135
- - `<N>` (single position) + body: replace the single line at `N` with body.
136
- - `<N-M>` (range) + body: replace lines `N..M` inclusive with body.
137
- - `<0>` + body: prepend body before line 1.
138
- - `<-1>` + body: append body after the last line.
139
-
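The EDIT rules above can be sketched as a small line-splicing function. This is an illustration only: execution is a runtime concern, and the `applyEdit` name, the pre-split `string[]` representation, and the `RangeError` on out-of-range markers are all assumptions, not part of this package.

```typescript
// Hypothetical runtime helper; the grammar package does not execute EDITs.
interface LineMarker { first: number; last: number | null }

function applyEdit(entry: string[], marker: LineMarker | null, body: string[]): string[] {
  // No <L>: the body replaces the whole entry (an empty body clears it).
  if (marker === null) return [...body];
  const { first, last } = marker;
  // <0> prepends before line 1; <-1> appends after the last line.
  if (last === null && first === 0) return [...body, ...entry];
  if (last === null && first === -1) return [...entry, ...body];
  // <N> is treated as the one-line range N..N; positions are 1-indexed.
  const end = last === null ? first : last;
  if (first < 1 || end < first || end > entry.length) {
    // Assumption: a real runtime may handle out-of-range markers differently.
    throw new RangeError(`line marker ${first}..${end} is out of range`);
  }
  return [...entry.slice(0, first - 1), ...body, ...entry.slice(end)];
}
```

For example, `applyEdit(["a", "b", "c"], { first: 2, last: null }, ["X"])` replaces only the second line.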
140
- SHOW and HIDE filters are AND-combined: an entry is selected when its
141
- path matches `(path)`, its tags satisfy `[signal]` (if present), and its
142
- content matches `body` (if present).
143
-
144
- ### Per-OP Output (what each OP produces)
145
-
146
- | OP | Produces |
147
- |------|----------|
148
- | FIND | list of matching paths |
149
- | READ | content of matched entries (or matched substrings if `body` is a pattern) |
150
- | EDIT | status; resulting entry content on success |
151
- | COPY | status; destination path on success |
152
- | MOVE | status; destination path on success |
153
- | SHOW | status; list of paths moved into the Index |
154
- | HIDE | status; list of paths moved into the Archive |
155
- | SEND | status; recipient ack if applicable |
156
- | EXEC | exit code, stdout, stderr |
157
-
158
- Output is delivered to the model in the next turn. The shape of "status"
159
- is a SEND-style status code (see §9) so that errors are uniform across
160
- all OPs.
161
-
162
- ## 5. Path Grammar
163
-
164
- Paths are URI-shaped, drawn from RFC 3986 in spirit but not strictly.
165
- Two deliberate departures from the RFC account for the relaxation:
166
-
167
- 1. RFC 3986 lists `)` as a sub-delim — a valid path character. Plurnk
168
- reserves `)` to close the path slot. Strict compliance would
169
- require an escape mechanism; plurnk does not provide one.
170
- 2. Bulk Pattern Matching extends path segments with glob
171
- metacharacters (`*`, `**`, `?`, `[…]`) that fall outside the RFC
172
- character set.
173
-
174
- Lexer-enforced shape:
175
-
176
- - Optional scheme: `[a-z][a-z0-9+.-]*` followed by `://`.
177
- - Path content: any character except `)` and newline; the two-character sequence `<<` is also rejected (see §10).
178
- - Glob metacharacters in path segments are permitted.
179
-
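As a rough illustration of the lexer-enforced shape, the check below classifies a raw path string by its scheme prefix. It is a sketch under stated assumptions: `classifyPath` and `PathShape` are hypothetical names, and the real lexer enforces these rules in `.g4` rather than in TypeScript.

```typescript
// Hypothetical helper mirroring the shape rules listed above.
type PathShape =
  | { kind: "local"; raw: string }
  | { kind: "url"; raw: string; scheme: string };

// Optional scheme: [a-z][a-z0-9+.-]* followed by "://".
const SCHEME_PREFIX = /^([a-z][a-z0-9+.-]*):\/\//;

function classifyPath(raw: string): PathShape {
  // Path content may not contain ')' or a newline.
  if (/[)\n]/.test(raw)) {
    throw new Error("path content may not contain ')' or a newline");
  }
  const scheme = SCHEME_PREFIX.exec(raw)?.[1];
  // Glob metacharacters need no special handling here; they are ordinary path characters.
  return scheme !== undefined ? { kind: "url", raw, scheme } : { kind: "local", raw };
}
```

A bare path such as `src/**/*.ts` comes back as `local`; what it resolves to is decided by the runtime, as described next.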
180
- Runtime-enforced semantics:
181
-
182
- - Bare paths (no scheme) resolve as `file://` at runtime.
183
- - Conventional schemes include `known://`, `unknown://`, `log://`,
184
- `file://`, `http://`, `https://`. Any scheme matching the lexer
185
- shape is grammatically valid; resolution is a runtime concern.
186
- - Percent-encoding, authority structure, port range, and other RFC
187
- 3986 finer points are validated by the runtime URI resolver, not
188
- the parser.
189
-
190
- ## 6. Bulk Pattern Matching
191
-
192
- For FIND, READ, SHOW, and HIDE, `body` is an optional pattern matcher.
193
- The lexer captures the body opaquely (between the `:body:` fences) —
194
- dialect dispatch is not a lexer concern. Dialect is determined by the
195
- body's leading characters, and validated by the Visitor using native
196
- JS facilities (`new RegExp()` etc.) where applicable:
197
-
198
- | Leading prefix | Dialect | Canonical form | Validation |
199
- |----------------|-----------|---------------------------|--------------------|
200
- | `//` | xpath | `//…` | runtime (xpath lib) |
201
- | `/` | regex | `/pattern/flags` (trailing `/` required, flags `[a-z]*`) | `new RegExp()` in Visitor |
202
- | `$` | jsonpath | `$…` | runtime (jsonpath lib) |
203
- | otherwise | glob | `…` (literal substring if no metacharacters) | runtime (glob library) |
204
-
205
- Dialect conventions (the Visitor uses these to construct typed AST
206
- body fields; the lexer is unaware):
207
-
208
- - XPath body begins with `//` (descendant-or-self axis). Absolute-root
209
- `/foo` is unreachable (collides with regex prefix); rework as `//foo`.
210
- - Regex body is a delimited literal: opens with `/`, ends with `/`
211
- before the close fence, with optional flag chars `[a-z]*` between
212
- the closing `/` and the close fence. Literal `/` inside the pattern
213
- must be escaped `\/`.
214
- - Regex anchors `^` and `$` go inside the slashes: `/^foo$/`.
215
- - Flag semantics (`i` case-insensitive, `m` multiline, `s` dotall,
216
- etc.) follow ECMAScript regex.
217
- - Glob is the catch-all and includes the literal-substring case when
218
- no metacharacters are present.
219
-
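The prefix table above amounts to a short dispatch on the first one or two characters of the body. A minimal sketch, where `detectDialect` is a hypothetical name and not a documented export:

```typescript
type Dialect = "xpath" | "regex" | "jsonpath" | "glob";

// Hypothetical helper: choose a dialect from the body's leading characters.
function detectDialect(body: string): Dialect {
  // "//" must be tested before "/", otherwise every xpath body would look like a regex.
  if (body.charAt(0) === "/") return body.charAt(1) === "/" ? "xpath" : "regex";
  if (body.charAt(0) === "$") return "jsonpath";
  // Everything else, including plain substrings, is treated as glob.
  return "glob";
}
```

The sketch inspects the raw body without trimming; whether leading whitespace should be skipped before dispatch is left open here.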
220
- **Implemented validation in the Visitor (Node-native):**
221
-
222
- - **Path**: the Visitor distinguishes local paths from URLs by the
223
- presence of a scheme prefix (`[a-z][a-z0-9+.-]*://`). Local paths
224
- (filesystem-style, no scheme) are stored as `{ kind: "local", raw }`
225
- without further parsing — `new URL()` is not invoked, so no URL
226
- conventions are imposed on what was clearly intended as a local
227
- reference. URLs are parsed by `new URL(raw)` and decomposed into
228
- components (`scheme`, `username`, `password`, `hostname`, `port`,
229
- `pathname`, `search`, `fragment`). Genuine URL-protocol violations
230
- (malformed authority, unterminated IPv6 brackets, invalid port, etc.)
231
- produce a `PlurnkParseError` with source `"visitor"`.
232
- - **Regex body** (matcher-body OPs only, leading `/` and not `//`):
233
- the Visitor extracts `pattern` and `flags` (respecting `\/` escapes)
234
- and calls `new RegExp(pattern, flags)`. On failure (missing closing
235
- `/`, unterminated character class, invalid flag, etc.), a
236
- `PlurnkParseError` with source `"visitor"` is emitted.
237
- - **XPath body** (matcher-body OPs only, leading `//`): the Visitor
238
- calls `xpath.parse()` from the `xpath` npm package (XPath 1.0
239
- parser-only, no DOM execution). On failure (unterminated predicate,
240
- invalid operator, etc.), a `PlurnkParseError` with source `"visitor"`
241
- is emitted.
242
- - **JsonPath body** (matcher-body OPs only, leading `$`): the Visitor
243
- calls `JSONPath({ path: body, json: {} })` from the `jsonpath-plus`
244
- npm package. The empty `{}` ensures syntax parsing happens without
245
- document evaluation. Syntax errors (unclosed parens, malformed
246
- filter expressions, etc.) throw and become `PlurnkParseError` with
247
- source `"visitor"`.
248
-
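The regex-body step can be sketched as follows: split the `/pattern/flags` literal while respecting `\/` escapes, then let `new RegExp()` decide validity. The `parseRegexBody` name and the result shape are assumptions for illustration; the package reports failures as `PlurnkParseError`, which this sketch does not reproduce.

```typescript
type RegexParse =
  | { ok: true; pattern: string; flags: string; regexp: RegExp }
  | { ok: false; message: string };

// Opening "/", then escaped pairs or any character that is neither "\" nor "/",
// then the closing "/" and optional [a-z]* flags up to the end of the body.
const REGEX_LITERAL = /^\/((?:\\[\s\S]|[^\\\/])*)\/([a-z]*)$/;

function parseRegexBody(body: string): RegexParse {
  const m = REGEX_LITERAL.exec(body);
  if (m === null) {
    return { ok: false, message: "regex body is not a /pattern/flags literal" };
  }
  const pattern = m[1] ?? "";
  const flags = m[2] ?? "";
  try {
    // The built-in constructor is the authority on pattern and flag validity.
    return { ok: true, pattern, flags, regexp: new RegExp(pattern, flags) };
  } catch (e) {
    return { ok: false, message: e instanceof Error ? e.message : String(e) };
  }
}
```

Calling `parseRegexBody("/^foo$/i")` yields pattern `^foo$` with flags `i`, while `/foo` (no closing slash) and `/[/` (unterminated character class) both come back as failures.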
249
- **Deferred validation:**
250
-
251
- - **Glob** bodies — pass through as raw; runtime applies whatever
252
- glob matcher is appropriate.
253
-
254
- **Why not ANTLR sub-grammars for any of these?** Node's `new URL()`
255
- and `new RegExp()` are authoritative, well-tested, and zero-cost to
256
- invoke; `xpath` and `jsonpath-plus` are the de facto Node parsers for
257
- their respective dialects. ANTLR sub-grammars for any of these would
258
- add hundreds of lines of generated parser code with no validation
259
- benefit over the native or library facilities.
260
-
261
- ## 7. Line Markers
262
-
263
- A line marker selects a position or range from the sequence an OP
264
- operates on or produces. The sequence type is OP-specific (see §4
265
- per-OP table): entry lines for EDIT/COPY/MOVE, matched content lines
266
- for READ, positions in the matched-paths list for FIND/SHOW/HIDE.
267
-
268
- **Token shape:** `<` `-?[0-9]+` (`-` `-?[0-9]+`)? `>`.
269
-
270
- | Form | Meaning |
271
- |----------|--------------------------------------|
272
- | `<N>` | single position N |
273
- | `<N-M>` | inclusive range N..M |
274
- | `<0>` | prepend anchor (before position 1) |
275
- | `<-1>` | append anchor (after last position) |
276
-
277
- Examples involving negative integers:
278
-
279
- - `<-1-5>` — range from -1 to 5
280
- - `<0--5>` — range from 0 to -5
281
- - `<-3--1>` — range from -3 to -1
282
-
283
- **Parsing rule:** greedy. The first signed integer consumes leading
284
- `-` and digits maximally; the optional `-` range separator follows; the
285
- optional second signed integer consumes its own optional `-` and
286
- digits. So `<-1-5>` parses as first=`-1`, separator=`-`, second=`5`.
287
- This falls out of standard ANTLR longest-match.
288
-
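The token shape and the greedy reading can be checked with an ordinary regular expression. This is a sketch, not the lexer rule itself: `parseLineMarker` is a hypothetical name, and a JavaScript regex backtracks rather than applying ANTLR longest-match, although it produces the same split for these inputs.

```typescript
interface LineMarker { first: number; last: number | null }

// "<", a signed integer, optionally "-" plus a second signed integer, then ">".
const LINE_MARKER = /^<(-?[0-9]+)(?:-(-?[0-9]+))?>$/;

function parseLineMarker(token: string): LineMarker | null {
  const m = LINE_MARKER.exec(token);
  if (m === null) return null;
  const first = m[1];
  const last = m[2];
  if (first === undefined) return null;
  // A missing second integer means a single position, stored as last = null.
  return { first: Number(first), last: last === undefined ? null : Number(last) };
}
```

With this sketch, `<-1-5>` parses as first `-1` and last `5`, and `<0--5>` as first `0` and last `-5`, matching the examples above. Whether those values are meaningful is still decided at runtime.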
289
- **Runtime concerns** (not enforced by the parser):
290
-
291
- - Positions are 1-indexed (`N ≥ 1`); the parser itself accepts any signed integer.
292
- - Validity of any specific value (out-of-range, inverted range where
293
- `N > M`, sentinel meanings beyond the canonical `0`/`-1`) is decided
294
- per-OP at runtime.
295
-
296
- **Result-set ordering** (FIND, SHOW, HIDE): the runtime must produce a
297
- deterministic order so that `<N-M>` pagination is reproducible.
298
- Lexicographic ascending order over the matched path strings is the
299
- canonical ordering. Runtime guarantee, not a parser concern.
300
-
301
- ## 8. Suffix Discipline
302
-
303
- The `:body:` fencing handles the vast majority of grammatical-enclosure
304
- concerns: body content is fully opaque to OP keywords and modifier-like
305
- characters. The suffix is reserved for the residual edge case where
306
- body content literally contains the close-tag pattern `:OPkeyword`.
307
- That happens in two scenarios:
308
-
309
- 1. **Nesting plurnk statements inside a body** (recording a plurnk
310
- transcript, storing examples, etc.). The inner statement's close
311
- `:OP` would prematurely terminate the outer's body.
312
- 2. **Body content contains `:OPkeyword` as literal text** (e.g., a
313
- stored JSON object with a value mentioning plurnk syntax).
314
-
315
- Suffix rules:
316
-
317
- - `suffix` is `[A-Za-z0-9_]*`, concatenated to `OP` with no separator, on both open and close.
318
- - Open `<<OPsuffix` and close `:OPsuffix` must character-match.
319
- - A non-empty suffix on the outer statement ensures its close tag
320
- (`:OPsuffix`) is distinct from any `:OP` substring that may appear in
321
- body content (whether as nested plurnk or as literal text).
322
- - The body of a statement cannot contain its own exact close-tag
323
- literal; choose a suffix that does not collide.
324
- - Empty suffix is the default. Most statements need no suffix.
325
-
326
- Example — nested EDIT inside an outer EDITa:
327
-
328
- ```
329
- <<EDITa(known://demo):
330
- The following is a quoted plurnk operation, preserved verbatim:
331
- <<EDIT(known://inner):hello world:EDIT
332
- :EDITa
333
- ```
334
-
335
- The inner's `:EDIT` close does not terminate the outer because the
336
- outer's close tag is `:EDITa`.
337
-
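A writer that emits statements programmatically might pick a suffix along these lines. The sketch is deliberately conservative and rests on assumptions: `chooseSuffix` is a hypothetical helper, and it treats any occurrence of `:OPsuffix` in the body as a collision, even where the lexer's exact matching might tolerate it.

```typescript
const LETTERS = "abcdefghijklmnopqrstuvwxyz";

// Hypothetical helper: return the shortest tried suffix whose close tag is absent from the body.
function chooseSuffix(op: string, body: string): string {
  const collides = (suffix: string): boolean => body.indexOf(`:${op}${suffix}`) !== -1;
  // The empty suffix is the default and is enough for most bodies.
  if (!collides("")) return "";
  for (let i = 0; i < LETTERS.length; i++) {
    const letter = LETTERS.charAt(i);
    if (!collides(letter)) return letter;
  }
  // Fallback: numbered suffixes, all within the [A-Za-z0-9_]* character class.
  for (let n = 0; ; n++) {
    const suffix = `_${n}`;
    if (!collides(suffix)) return suffix;
  }
}
```

For the nested example above, the body contains `:EDIT`, so the helper returns `a` and the outer statement closes with `:EDITa`.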
338
- ## 9. SEND Status Codes
339
-
340
- SEND status codes align with HTTP semantics so that model training
341
- transfers directly:
342
-
343
- - `1xx` Informational — continuation; `102 Processing` is the canonical loop-continuation code.
344
- - `2xx` Success — terminal delivery; `200 OK` is the canonical final-answer code.
345
- - `3xx` Redirection — handoff to another agent or address.
346
- - `4xx` Client Error — model-side failure (malformed plurnk, missing path, contract violation).
347
- - `5xx` Server Error — runtime or infrastructure failure (network, permission, tool unavailable).
348
-
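A runtime that follows this convention might bucket codes by their leading digit. The sketch below is illustrative only: the grammar accepts any digit string, and the `classifyStatus` name and the `unclassified` bucket are assumptions.

```typescript
type StatusClass =
  | "informational" | "success" | "redirection"
  | "client-error" | "server-error" | "unclassified";

// Hypothetical runtime helper; status meaning is outside the grammar's domain.
function classifyStatus(code: number): StatusClass {
  // Non-integer input has no class.
  if (Math.floor(code) !== code) return "unclassified";
  switch (Math.floor(code / 100)) {
    case 1: return "informational"; // e.g. 102 Processing (loop continuation)
    case 2: return "success";       // e.g. 200 OK (final answer)
    case 3: return "redirection";
    case 4: return "client-error";
    case 5: return "server-error";
    default: return "unclassified";
  }
}
```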
349
- SEND with no `(path)` broadcasts to the default control channel. SEND
350
- with `(path)` directs the message at a specific recipient URI.
351
-
352
- ### Response Body Convention
353
-
354
- Structured responses (errors, query results, multi-field acknowledgments)
355
- are emitted as **JSON in the SEND body**, so the model can consume them
356
- with the same jsonpath dialect it uses for matching:
357
-
358
- ```
359
- <<SEND[400](err://lex):
360
- {"reason":"unexpected token","position":{"line":47,"column":12},"expected":[")"],"got":"["}
361
- :SEND
362
- ```
363
-
364
- The model retrieves a field with `<<READ(err://lex):$.reason:READ` or
365
- similar. Plain-text bodies remain valid for simple terminal answers
366
- (`<<SEND[200]:Paris:SEND`). The JSON convention is runtime policy; the
367
- grammar treats body as opaque.
368
-
369
- ## 10. Implementation Notes
370
-
371
- - ANTLR4 split follows standard convention: `plurnkLexer.g4` defines
372
- tokens; `plurnkParser.g4` defines statement structure. Generated
373
- using `antlr-ng` targeting the `antlr4ng` runtime.
374
- - The body is fenced by `:` on the header side and `:OPsuffix` on the
375
- close side. The lexer enters body mode when it consumes the opening
376
- `:` after the last header element. In body mode, the close-tag rule
377
- uses a semantic predicate (`atColonCloseTag()`) that fires when the
378
- next characters match `:OPsuffix` exactly. The open tag (`OP +
379
- suffix`) is captured at statement start and held on the lexer
380
- instance.
381
- - The body is uniformly opaque at the lexer level (a sequence of
382
- `BODY_TEXT` tokens). The Visitor reconstructs body content as a
383
- single string and, per OP semantics, interprets it as content,
384
- destination URI, payload, command, or matcher.
385
- - Header mode hierarchy: state machine `DEFAULT → OPENED → SIGNAL → POST_SIGNAL → PATH → POST_PATH → POST_L → BODY` tracks which
386
- header elements remain valid at each position (after signal, signal
387
- is no longer valid; after path, neither signal nor path; after `<L>`,
388
- only the `:` body delimiter is valid). Each header mode requires the
389
- `:` to transition to BODY; no fallback.
390
- - PATH and SIGNAL content reject `<<` (single `<` is permitted inside
391
- them, double `<<` is the statement-opener prefix and must not appear).
392
- This prevents a malformed path or signal from silently swallowing the
393
- next statement.
394
- - Interstatement content (between statements) is captured as `TEXT`
395
- tokens. The lexer's `TEXT` rule matches any chars that aren't a
396
- recognized statement opener; a `<<` followed by a non-OP sequence is
397
- rolled into `TEXT` rather than producing an error.
398
- - Error model: the parser uses ANTLR's `DefaultErrorStrategy` for
399
- cross-statement recovery (sync to next statement opener on error).
400
- An error listener records every syntax error as a `PlurnkParseError`
401
- (line, column, source: `"lexer" | "parser"`, message). The Visitor's
402
- caller correlates errors to statement positions and emits them in
403
- the result's `items` array in order.
404
- - Boundary-destroying errors (lexer ends in a non-DEFAULT mode at EOF,
405
- typically meaning a statement was never closed) surface as
406
- `unparsedTail` on the parse result. The agent's consumer treats this
407
- as "the document past this point is unparseable; do not execute
408
- anything after the last successful item."
409
-
410
- ## 11. Whitespace and Comments
411
-
412
- Plurnk is HEREDOC-disciplined and LLM-tolerant: forgiving where
413
- forgiveness is safe, strict where laxity would corrupt content.
414
-
415
- - **Between header elements** (`OPsuffix`, `[signal]`, `(path)`,
416
- `<L>`, the body-delimiter `:`): whitespace (spaces, tabs, newlines)
417
- is optional and non-significant.
418
- - **Inside header elements** (between the brackets/parens/angles
419
- themselves — e.g., inside `[…]`, `(…)`, `<…>`, between `OP` and
420
- `suffix`): whitespace is forbidden. These are strict tokens.
421
- - **Body interior**: whitespace is preserved verbatim. Body content
422
- begins at the character immediately after the opening `:` and ends
423
- immediately before the closing `:OPsuffix`. Leading and trailing
424
- newlines in body content (common for multi-line bodies written by
425
- the model) are part of the body; runtime consumers may normalize
426
- them.
427
- - **Close tag** (`:OPsuffix`): the `:` and the `OPsuffix` must be
428
- character-adjacent — no whitespace permitted between them. Whitespace
429
- *before* the close `:` (i.e., trailing whitespace in body) is body
430
- content, preserved verbatim.
431
-
432
- Comments: plurnk has no comment syntax. The protocol is wire-shaped,
433
- not source-shaped. To leave a self-documenting breadcrumb, use
434
- `<<EDIT(known://notes/…):…:EDIT` (model-visible) or
435
- `<<SEND[1xx](…):…:SEND` (orchestrator-visible).
436
-
437
- ## 12. Public API
438
-
439
- This package exports a single entry point `parse(input: string): ParseResult` and the AST type union. The full surface area:
440
-
441
- ```typescript
442
- parse(input: string): ParseResult
443
-
444
- type ParseResult = {
445
- items: ParseItem[];
446
- unparsedTail?: { from: Position; reason: string };
447
- };
448
-
449
- type ParseItem =
450
- | { kind: "statement"; statement: PlurnkStatement }
451
- | { kind: "error"; error: PlurnkParseError }
452
- | { kind: "text"; text: string; position: Position };
453
-
454
- type Position = { line: number; column: number };
455
-
456
- type PlurnkOp = "FIND" | "READ" | "EDIT" | "COPY" | "MOVE" | "SHOW" | "HIDE" | "SEND" | "EXEC";
457
-
458
- type PlurnkStatement =
459
- | FindStatement | ReadStatement | EditStatement
460
- | CopyStatement | MoveStatement
461
- | ShowStatement | HideStatement
462
- | SendStatement | ExecStatement;
463
-
464
- interface StatementBase<S> {
465
- suffix: string; // empty string if no suffix
466
- signal: S | null; // null = no [signal] slot; type S varies per OP (see below)
467
- path: ParsedPath | null; // typed parse of (path); null if no slot or empty
468
- lineMarker: LineMarker | null;
469
- position: Position;
470
- // body type varies per OP — declared on each concrete statement (below).
471
- }
472
-
473
- interface LineMarker { first: number; last: number | null; }
474
-
475
- // Path is local (no scheme) or URL (has scheme). The Visitor decides by
476
- // matching the leading [a-z][a-z0-9+.-]*:// pattern; only URLs are passed
477
- // through `new URL()` for component breakdown.
478
- type ParsedPath = LocalPath | UrlPath;
479
-
480
- interface LocalPath {
481
- kind: "local";
482
- raw: string; // filesystem path or other non-URL identifier
483
- }
484
-
485
- interface UrlPath {
486
- kind: "url";
487
- raw: string;
488
- scheme: string; // protocol without trailing ':'
489
- username: string | null;
490
- password: string | null;
491
- hostname: string | null; // first authority segment; for custom schemes like
492
- // `known://entries/foo`, hostname = "entries"
493
- port: number | null;
494
- pathname: string; // path component, may be empty
495
- search: Record<string, string | string[]>;
496
- fragment: string | null;
497
- }
498
-
499
- // Typed body for FIND/READ/SHOW/HIDE — dialect dispatch with compiled regex.
500
- type MatcherBody =
501
- | { dialect: "xpath"; raw: string }
502
- | { dialect: "regex"; raw: string; pattern: string; flags: string; regexp: RegExp }
503
- | { dialect: "jsonpath"; raw: string }
504
- | { dialect: "glob"; raw: string };
505
-
506
- // Typed body for SEND — best-effort JSON parse alongside raw.
507
- interface SendBody {
508
- raw: string;
509
- json: unknown | null; // parsed value if body is valid JSON, else null
510
- }
511
-
512
- // Each variant declares its own body type. Tag-bearing OPs share signal=string[];
513
- // SEND uses number; EXEC uses string.
514
-
515
- // Matcher OPs — body is a typed pattern matcher.
516
- interface FindStatement extends StatementBase<string[]> { op: "FIND"; body: MatcherBody | null; }
517
- interface ReadStatement extends StatementBase<string[]> { op: "READ"; body: MatcherBody | null; }
518
- interface ShowStatement extends StatementBase<string[]> { op: "SHOW"; body: MatcherBody | null; }
519
- interface HideStatement extends StatementBase<string[]> { op: "HIDE"; body: MatcherBody | null; }
520
-
521
- // EDIT — body is arbitrary content (markdown, code, prose). Raw.
522
- interface EditStatement extends StatementBase<string[]> { op: "EDIT"; body: string | null; }
523
-
524
- // COPY/MOVE — body is the destination URI, parsed identically to the path slot.
525
- interface CopyStatement extends StatementBase<string[]> { op: "COPY"; body: ParsedPath | null; }
526
- interface MoveStatement extends StatementBase<string[]> { op: "MOVE"; body: ParsedPath | null; }
527
-
528
- // SEND — body is raw + best-effort JSON.
529
- interface SendStatement extends StatementBase<number> { op: "SEND"; body: SendBody | null; }
530
-
531
- // EXEC — body is a command or code snippet. Raw.
532
- interface ExecStatement extends StatementBase<string> { op: "EXEC"; body: string | null; }
533
- ```
534
-
535
- The `op` field is the discriminator. TypeScript narrows the statement
536
- type per-branch: `switch (s.op) { case "EDIT": /* s is EditStatement */ }`.
537
-
538
- **Items are ordered.** The agent consumer iterates in order: execute on
539
- `statement`, halt on `error`, surface or ignore `text` per policy.
540
-
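The ordering contract suggests a consumer loop like the one below. It is a sketch over trimmed-down local types so that it stands alone; a real consumer would use the package's exported types. The `consume` function, its `Outcome` shape, and the choice to collect `text` items are assumptions, as is treating a present `unparsedTail` as a halt.

```typescript
type Position = { line: number; column: number };
type Stmt = { op: string; position: Position };
type Item =
  | { kind: "statement"; statement: Stmt }
  | { kind: "error"; error: { line: number; column: number; source: string; message: string } }
  | { kind: "text"; text: string; position: Position };
type Result = { items: Item[]; unparsedTail?: { from: Position; reason: string } };
type Outcome = { executed: number; halted: boolean; texts: string[] };

function consume(result: Result, execute: (s: Stmt) => void): Outcome {
  const outcome: Outcome = { executed: 0, halted: false, texts: [] };
  for (const item of result.items) {
    switch (item.kind) {
      case "statement": // execute statements in document order
        execute(item.statement);
        outcome.executed++;
        break;
      case "text": // policy choice in this sketch: surface interstatement text
        outcome.texts.push(item.text);
        break;
      case "error": // halt at the first error; later items are not executed
        outcome.halted = true;
        return outcome;
    }
  }
  // Assumption: an unparsed tail counts as a halt even when every item succeeded.
  if (result.unparsedTail !== undefined) outcome.halted = true;
  return outcome;
}
```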
541
- **ANTLR types do not leak.** All `antlr4ng` types are internal to this
542
- package; consumers receive only the types listed above.
543
-
544
- ### CLI
545
-
546
- The package also exposes a `plurnk` CLI for local development and tooling:
547
-
548
- ```
549
- plurnk [file] Parse plurnk source from a file (or stdin if omitted or '-')
550
- and print the parse result as JSON.
551
- plurnk --help Show usage.
552
- ```
553
-
554
- Exit codes: `0` for a clean parse (no error items, no `unparsedTail`),
555
- `1` otherwise. `RegExp` values inside `MatcherBody` serialize as their
556
- `/pattern/flags` string form; `PlurnkParseError` instances serialize via
557
- their `toJSON()` method to `{ line, column, source, message }`.
558
-
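The exit-code rule and the `RegExp` serialization described above can be approximated in a few lines. This is a sketch, not the CLI's actual code: `exitCode` and `toJson` are hypothetical names, and the replacer only covers `RegExp` values (errors are assumed to serialize through their own `toJSON()`).

```typescript
type ResultLike = { items: { kind: string }[]; unparsedTail?: unknown };

// 0 for a clean parse (no error items, no unparsedTail), 1 otherwise.
function exitCode(result: ResultLike): 0 | 1 {
  const hasError = result.items.some((item) => item.kind === "error");
  return !hasError && result.unparsedTail === undefined ? 0 : 1;
}

// JSON.stringify has no native RegExp form, so emit the /pattern/flags string instead.
function toJson(value: unknown): string {
  return JSON.stringify(value, (_key: string, v: unknown) =>
    v instanceof RegExp ? v.toString() : v
  );
}
```

For instance, `toJson({ regexp: /^foo$/i })` produces `{"regexp":"/^foo$/i"}`.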
559
- ## 13. Error Format
560
-
561
- `PlurnkParseError` is a JSON-serializable Error subclass:
562
-
563
- ```typescript
564
- type ErrorSource = "lexer" | "parser" | "visitor";
565
-
566
- class PlurnkParseError extends Error {
567
- readonly line: number;
568
- readonly column: number;
569
- readonly source: ErrorSource;
570
- // .message is "Plurnk <source> error at <line>:<column> — <message>"
571
- }
572
- ```
573
-
574
- The three sources distinguish:
575
-
576
- - **`"lexer"`** — token-level failures (unrecognized character, malformed integer in `<L>`, etc.).
577
- - **`"parser"`** — structural failures at parse-tree level (missing close tag, wrong token order, etc.).
578
- - **`"visitor"`** — semantic failures during AST construction (SEND signal not an integer, EXEC signal with multiple values, etc.).
579
-
580
- Serialization convention for transmission to the model (the agent
581
- runtime constructs this; the parser provides the fields):
582
-
583
- ```json
584
- {
585
- "line": 1,
586
- "column": 12,
587
- "source": "parser",
588
- "message": "expected close tag; got end of input"
589
- }
590
- ```
591
-
592
- **Message style rules** (enforced by `PlurnkErrorStrategy` and the
593
- lexer message translator):
594
-
595
- - **Protocol vocabulary only.** Messages refer to plurnk concepts (open
596
- tag, close tag, signal, path, line marker, body, statement header,
597
- between statements) — never ANTLR or parser-internal terms (no
598
- "token recognition error," "extraneous input," "RPAREN," "no viable
599
- alternative," "<EOF>", etc.).
600
- - **Terse but complete.** One short sentence naming what was wrong.
601
- No suggestions, no recovery hints, no extra context. The model
602
- receives `line`/`column` separately and doesn't need duplication.
603
- - **Slot/feature references**, not rule references. "in path" rather
604
- than "in rule path"; "expected close tag" rather than "expected
605
- CLOSE_TAG."
606
-
607
- Examples of canonical messages:
608
-
609
- - `unrecognized character '<<' in path`
610
- - `unrecognized character ':' in signal`
611
- - `unrecognized character 'X' in statement header`
612
- - `expected close tag; got end of input`
613
- - `expected ')'; got ':'`
614
-
615
- **Per-statement semantics.** A single statement produces at most one
616
- error (fail-hard within a statement; first error wins, no cascading
617
- within-statement reports). Across statements, the parser recovers and
618
- continues — independent malformations in different statements each
619
- get their own error item in the result.
620
-
621
- **Boundary-destroying errors.** When the lexer cannot determine where
622
- a malformed statement ends (e.g., a body that never finds its close
623
- tag), the parser cannot reliably parse content after that point. The
624
- result's `unparsedTail` field marks the position from which parsing
625
- gave up. Consumers must treat anything past that point as undefined.