red_quilt 0.7.0 → 0.7.2

@@ -0,0 +1,423 @@
1
+ # How to use the Arena class
2
+
3
+ `RedQuilt::Arena` is the low-level storage class that holds the actual AST of
4
+ RedQuilt. This document describes its API and assumptions for people who touch
5
+ the Arena directly: block parsers, inline builders, renderers, custom
6
+ transformers, and any other code under `lib/red_quilt`.
7
+
8
+ > If you only need to work with the AST as an external API, the standard path is
9
+ > to go through `RedQuilt::Document` and `RedQuilt::NodeRef`. The Arena is a more
10
+ > internal layer; it is the very data structure that `NodeRef` is built on.
11
+
12
+ ---
13
+
14
+ ## 0. A quick example
15
+
16
+ The code below works if you copy and paste it as is. It should give you a feel
17
+ for how the Arena "builds a tree from a source string using various IDs".
18
+
19
+ ```ruby
20
+ require "red_quilt"
21
+
22
+ source = "Hello *world*"
23
+ arena = RedQuilt::Arena.new(source)
24
+
25
+ # (a) Create nodes. The return value is a node id (Integer).
26
+ para_id = arena.add_node(RedQuilt::NodeType::PARAGRAPH,
27
+ source_start: 0, source_len: source.bytesize)
28
+ text_id = arena.add_node(RedQuilt::NodeType::TEXT,
29
+ source_start: 0, source_len: 6) # "Hello "
30
+ em_id = arena.add_node(RedQuilt::NodeType::EMPHASIS,
31
+ source_start: 6, source_len: 7) # "*world*"
32
+ inner_id = arena.add_node(RedQuilt::NodeType::TEXT,
33
+ source_start: 7, source_len: 5) # "world"
34
+
35
+ # (b) Build parent/child relationships
36
+ arena.append_child(para_id, text_id)
37
+ arena.append_child(para_id, em_id)
38
+ arena.append_child(em_id, inner_id)
39
+
40
+ # (c) Read content back out
41
+ puts "type: #{arena.type_name(para_id)}"
42
+ puts "text: #{arena.text(text_id).inspect}"
43
+ puts "inner: #{arena.text(inner_id).inspect}"
44
+ puts "span: #{arena.source_span(em_id).inspect}"
45
+
46
+ # (d) Iterate over children (block form, no Enumerator)
47
+ puts "children of paragraph:"
48
+ arena.each_child(para_id) do |child_id|
49
+ puts " #{arena.type_name(child_id)}: #{arena.text(child_id).inspect}"
50
+ end
51
+ ```
52
+
53
+ The output looks like this:
54
+
55
+ ```
56
+ type: paragraph
57
+ text: "Hello "
58
+ inner: "world"
59
+ span: #<RedQuilt::SourceSpan:0x... @start_byte=6, @end_byte=13>
60
+ children of paragraph:
61
+ text: "Hello "
62
+ emphasis: "*world*"
63
+ ```
64
+
65
+ The AST that this sample builds looks like this:
66
+
67
+ ```
68
+ PARAGRAPH [0, 13) "Hello *world*"
69
+ ├─ TEXT [0, 6) "Hello "
70
+ └─ EMPHASIS [6, 13) "*world*"
71
+ └─ TEXT [7, 12) "world"
72
+ ```
73
+
74
+ The key points when working with the Arena are:
75
+
76
+ - `add_node` returns a node ID (Integer). Every later API call uses this ID as
77
+ the key.
78
+ - `source_start` / `source_len` specify a range over the original source string
79
+ in bytes, not characters. The Arena does not keep a copy of the string itself.
80
+ - `text(id)` returns `str1` if it is set; otherwise it returns the matching byte slice of `source`.
81
+ - `each_child(id)` is the basic traversal API and is used on the hot path.
82
+
83
+ Keep these in mind, and the later sections will read as "what does this API
84
+ actually guarantee, and how should I use it?".
85
+
86
+ ---
87
+
88
+ ## 1. Design highlights
89
+
90
+ The Arena represents the AST not as a "tree of objects" but as a set of
91
+ [parallel arrays](https://en.wikipedia.org/wiki/Parallel_array).
92
+
93
+ - Nodes are identified by an integer ID (`node_id`).
94
+ - The attributes for each ID (parent / source span / payload) are kept as
95
+ columns in separate Arrays.
96
+ - Adding a node is just an append to the end of each Array; no new Ruby object is
97
+ created at all.
98
+
99
+ As a result, the Arena has the following properties:
100
+
101
+ - On the hot path you only pass around Integers (IDs).
102
+ - Memory locality is good and GC pressure is low.
103
+ - Nodes can be treated as "lightweight" values, which makes it easy to inline the
104
+ Renderer and Builder.
105
+
106
+ #### List of columns
107
+
108
+ | Column | Purpose |
109
+ |------|------|
110
+ | `@type` | NodeType (an Integer constant) |
111
+ | `@parent` / `@first_child` / `@last_child` / `@next_sibling` / `@prev_sibling` | Parent / child / sibling links. The value is a node id (`NO_NODE` means "none"). |
112
+ | `@source_start` / `@source_len` | Byte range within the document source. `source_start < 0` means "no span". |
113
+ | `@int1` / `@int2` / `@int3` | Integer slots whose meaning depends on the NodeType (default `0`). |
114
+ | `@str1` / `@str2` | String slots whose meaning depends on the NodeType (default `nil`). |
115
+
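The column layout can be sketched with a small stand-in class. `MiniArena` below is hypothetical and is not RedQuilt's implementation; it only illustrates how parallel Arrays indexed by an Integer id can hold a tree. The column names mirror the table above, and it uses Symbols for the type column for readability, whereas the real Arena stores Integer constants.

```ruby
# Hypothetical stand-in for illustration only; not RedQuilt::Arena.
class MiniArena
  NO_NODE = -1

  attr_reader :type, :parent, :first_child, :last_child, :next_sibling, :prev_sibling

  def initialize
    @type = []
    @parent = []
    @first_child = []
    @last_child = []
    @next_sibling = []
    @prev_sibling = []
  end

  # Appends one entry to every column and returns the new id.
  def add_node(type)
    id = @type.length
    @type << type
    @parent << NO_NODE
    @first_child << NO_NODE
    @last_child << NO_NODE
    @next_sibling << NO_NODE
    @prev_sibling << NO_NODE
    id
  end

  # Links child_id as the last child of parent_id by updating Integer columns only.
  def append_child(parent_id, child_id)
    last = @last_child[parent_id]
    @parent[child_id] = parent_id
    @prev_sibling[child_id] = last
    if last == NO_NODE
      @first_child[parent_id] = child_id
    else
      @next_sibling[last] = child_id
    end
    @last_child[parent_id] = child_id
    child_id
  end
end

arena = MiniArena.new
para = arena.add_node(:paragraph)
text = arena.add_node(:text)
em   = arena.add_node(:emphasis)
arena.append_child(para, text)
arena.append_child(para, em)

p arena.parent        # => [-1, 0, 0]
p arena.next_sibling  # => [-1, 2, -1]
```

Each tree link is just an Integer stored at the node's index, which is why only ids need to be passed around while traversing.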
116
+ ---
117
+
118
+ ## 2. Invariants
119
+
120
+ These assumptions always hold when you work with the Arena.
121
+
122
+ 1. Node IDs increase monotonically.
123
+ The ID handed out by `add_node` is the current `@type.length` (the index of the
123
+ slot being appended), so it increases by 1 each time you add a node. IDs are never reused.
125
+ 2. Detached nodes stay in the columns.
126
+ `detach` only resets the parent and sibling links to `NO_NODE`; the column
127
+ record itself stays in the arena. A later `add_node` never reuses that slot.
128
+ This is a deliberate choice that keeps allocation simple.
129
+ 3. Treat `@source` as immutable.
130
+ You must not rewrite the source after the Arena is built. `source_start` /
131
+ `source_len` point directly at byte ranges, so if the source changes, the
132
+ return values of `text` / `source_span` break.
133
+ 4. `NO_NODE` = -1.
134
+ This is the sentinel meaning "no parent or sibling exists". You can reference
135
+ it as the `Arena::NO_NODE` constant.
136
+ 5. `source_start < 0` means "no span".
137
+ In this case the content of a leaf node is often held as a literal in `@str1`
138
+ (for example, a paragraph whose blockquote markers were stripped, or a TEXT node after
139
+ entity decoding). However, some NodeTypes, like container inlines, have no
140
+ span and also do not use `str1`; they build their content from child nodes.
141
+
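Invariants 1 and 2 can be illustrated with a deliberately reduced model. `TinyArena` is a hypothetical class with only two columns and a simplified `detach` that ignores sibling links; it is a sketch of the idea, not the behavior of the real methods.

```ruby
# Hypothetical stand-in for illustration only; not RedQuilt::Arena.
class TinyArena
  NO_NODE = -1

  attr_reader :type, :parent

  def initialize
    @type = []
    @parent = []
  end

  # The new id is always the next free index; freed slots are not searched for.
  def add_node(type, parent_id = NO_NODE)
    @type << type
    @parent << parent_id
    @type.length - 1
  end

  # Only the link is reset; the column entries for `id` stay in place.
  def detach(id)
    @parent[id] = NO_NODE
  end
end

arena = TinyArena.new
doc   = arena.add_node(:document)
para  = arena.add_node(:paragraph, doc)
arena.detach(para)
fresh = arena.add_node(:paragraph, doc)

p [doc, para, fresh]  # => [0, 1, 2]
p arena.type.length   # => 3
p arena.parent        # => [-1, -1, 0]
```

The detached node keeps id 1 and its column entries, and the next node receives id 2 rather than reusing the freed slot.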
142
+ ---
143
+
144
+ ## 3. API layers
145
+
146
+ The public methods of the Arena are easier to understand if you read them in
147
+ the layers below.
148
+
149
+ ### 3.1 Structure mutation (mutators)
150
+
151
+ APIs for building and editing the tree. They assume you pass a valid id and do
152
+ minimal safety checking.
153
+
154
+ | Method | Summary |
155
+ |----------|------|
156
+ | `add_node(type, **fields)` | Append a new node at the end and return its ID. It starts detached. |
157
+ | `append_child(parent_id, child_id)` | Append to the end of the parent's child list. |
158
+ | `insert_before(parent_id, ref_id, new_id)` | Insert immediately before `ref_id`. |
159
+ | `detach(child_id)` | Detach from the parent. The node itself remains. |
160
+ | `reparent(new_parent_id, first_id, last_id)` | Move the sibling range `first_id..last_id` to a new parent. |
161
+ | `update_span(id, start_byte, end_byte)` | Reset the source span. |
162
+ | `update_str1(id, value)` / `update_int3(id, value)` | Overwrite an individual slot. |
163
+
164
+ ### 3.2 Structure access (raw id accessors)
165
+
166
+ These return the raw column value, which may be `NO_NODE`. The naming convention
167
+ `raw_X_id` means "the return value is a node id and may be -1 (`NO_NODE`)".
168
+
169
+ | Method | Return value |
170
+ |----------|--------|
171
+ | `raw_parent_id(id)` | Parent id, or `NO_NODE`. |
172
+ | `raw_first_child_id(id)` / `raw_last_child_id(id)` | Child id, or `NO_NODE`. |
173
+ | `raw_next_sibling_id(id)` / `raw_prev_sibling_id(id)` | Sibling id, or `NO_NODE`. |
174
+
175
+ ### 3.3 Payload access (column accessors)
176
+
177
+ These return each column as raw data. Check the "Return value" column below to
178
+ see whether a sentinel can come back.
179
+
180
+ | Method | Return value |
181
+ |----------|--------|
182
+ | `type(id)` | NodeType constant (Integer). |
183
+ | `type_name(id)` | Symbol (for example, `:paragraph`). |
184
+ | `source_start(id)` / `source_len(id)` | Byte offset / byte length. `source_start < 0` means no span. |
185
+ | `int1(id)` / `int2(id)` / `int3(id)` | Integer (default 0). |
186
+ | `str1(id)` / `str2(id)` | String or `nil`. |
187
+
188
+ ### 3.4 Semantic accessors
189
+
190
+ These interpret the low-level columns and return an "easy to use" value. They can
191
+ return `nil` to explicitly express "none".
192
+
193
+ | Method | Return value |
194
+ |----------|--------|
195
+ | `source_span(id)` | `SourceSpan`, or `nil` if there is no span. |
196
+ | `text(id)` | str1 if present; otherwise `source.byteslice(...)`. `nil` if neither exists. |
197
+
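The fallback order described for `text(id)` can be sketched as a free-standing function. `node_text` is a hypothetical helper written for this illustration, not the library's code; the example also shows why spans must be counted in bytes once the source contains multi-byte characters.

```ruby
# Hypothetical helper for illustration only; not RedQuilt's implementation.
# Returns str1 when set, otherwise the byte range of source, otherwise nil.
def node_text(source, str1, source_start, source_len)
  return str1 unless str1.nil?
  return nil if source_start < 0

  source.byteslice(source_start, source_len)
end

# e-acute (\u00E9) and o-umlaut (\u00F6) take 2 bytes each in UTF-8.
source = "h\u00E9llo *w\u00F6rld*"

p source.length    # => 13 (characters)
p source.bytesize  # => 15 (bytes)

hello = node_text(source, nil, 0, 7)  # 6 characters, 7 bytes
world = node_text(source, nil, 8, 6)  # 5 characters, 6 bytes
p hello == "h\u00E9llo "  # => true
p world == "w\u00F6rld"   # => true

p node_text(source, "&", 0, 5)   # => "&" (the literal wins over the span)
p node_text(source, nil, -1, 0)  # => nil (no span and no literal)
```

Using character offsets (for example `6, 5` for the inner text) would slice in the wrong place here, because the byte offset of each node shifts by one for every 2-byte character before it.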
198
+ ### 3.5 Traversal
199
+
200
+ | Method | Purpose |
201
+ |----------|------|
202
+ | `each_child(id) { |child_id| ... }` | Block form. Recommended on the hot path (no Enumerator). |
203
+ | `child_ids(id)` | Returns an `Enumerator`, for chaining `map` / `select`, etc. |
204
+
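The relationship between the two traversal forms can be sketched as follows. `ChildList` is a hypothetical class that assumes the Enumerator form simply wraps the block form with `enum_for`; the source does not say how `child_ids` is implemented, so treat this as one plausible arrangement.

```ruby
# Hypothetical stand-in for illustration only; not RedQuilt::Arena.
class ChildList
  NO_NODE = -1

  def initialize(first_child, next_sibling)
    @first_child = first_child
    @next_sibling = next_sibling
  end

  # Block form: walks the sibling chain and yields Integer ids.
  def each_child(id)
    child = @first_child[id]
    while child != NO_NODE
      yield child
      child = @next_sibling[child]
    end
  end

  # Enumerator form: wraps the block form so callers can chain map / select.
  def child_ids(id)
    enum_for(:each_child, id)
  end
end

# Node 0 has the children 1 -> 2 -> 3.
list = ChildList.new([1, -1, -1, -1], [-1, 2, 3, -1])

seen = []
list.each_child(0) { |child_id| seen << child_id }
p seen                              # => [1, 2, 3]
p list.child_ids(0).select(&:odd?)  # => [1, 3]
p list.child_ids(0).class           # => Enumerator
```

The block form creates no intermediate object per call, while the Enumerator form allocates one in exchange for chainability.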
205
+ ---
206
+
207
+ ## 4. Slot usage per NodeType
208
+
209
+ Which int / str slots each NodeType uses is fixed by convention. The current
210
+ conventions are below.
211
+
212
+ #### Block nodes
213
+
214
+ | NodeType | int1 | int2 | int3 | str1 | str2 |
215
+ |----------|------|------|------|------|------|
216
+ | `DOCUMENT` | - | - | - | - | - |
217
+ | `PARAGRAPH` | - | - | - | A joined literal when needed (when transformed, or when leading indent is removed, etc.) | - |
218
+ | `HEADING` | level (1-6) | - | - | An inline literal when needed (when transformed, setext heading, etc.) | - |
219
+ | `THEMATIC_BREAK` | - | - | - | - | - |
220
+ | `BLOCKQUOTE` | - | - | - | - | - |
221
+ | `LIST` | ordered? (0/1) | start_number | tight? (1=tight) | marker (`-`/`*`/`+`/`.`/`)`) | - |
222
+ | `LIST_ITEM` | - | - | - | - | - |
223
+ | `CODE_BLOCK` | - | - | - | code content (literal) | info string (fenced only) |
224
+ | `HTML_BLOCK` | - | - | - | HTML content (literal) | - |
225
+ | `TABLE` | - | - | - | - | - |
226
+ | `TABLE_ROW` | header? (1/0) | - | - | - | - |
227
+ | `TABLE_CELL` | header? (1/0) | - | - | stripped cell text | - |
228
+ | `FOOTNOTE_DEFINITION` | - | - | - | normalized label | - |
229
+ | `FOOTNOTES_SECTION` | - | - | - | - | - |
230
+
231
+ #### Inline nodes
232
+
233
+ | NodeType | int1 | int2 | int3 | str1 | str2 |
234
+ |----------|------|------|------|------|------|
235
+ | `TEXT` | - | - | - | literal (after entity decode, etc.) or `nil` (span-based) | - |
236
+ | `SOFTBREAK` / `HARDBREAK` | - | - | - | `"\n"` | - |
237
+ | `EMPHASIS` / `STRONG` / `STRIKETHROUGH` | - | - | - | - | - |
238
+ | `CODE_SPAN` | - | - | - | normalized content (literal) | - |
239
+ | `LINK` | - | - | - | sanitized destination | title (or `nil`) |
240
+ | `IMAGE` | - | - | - | sanitized destination | title (or `nil`) |
241
+ | `HTML_INLINE` | - | - | - | matched HTML literal | - |
242
+ | `FOOTNOTE_REFERENCE` | footnote number | occurrence count (the Nth one for the same label) | - | normalized label | - |
243
+
244
+ > `-` means "not used" (left at the default `0` / `nil`).
245
+
246
+ > Footnotes are only generated when `footnotes: true`. `FOOTNOTES_SECTION` is
247
+ > placed as the last child directly under the root (span-less,
248
+ > `source_start: -1`), and holds the referenced `FOOTNOTE_DEFINITION`s in
249
+ > first-reference order. The number of backrefs is computed at render time from
250
+ > the footnote number and label.
251
+
252
+ #### Source span conventions
253
+
254
+ - `source_start` / `source_len`: bytes of the original document (absolute byte
255
+ offset).
256
+ - `source_start < 0`: no span. A leaf node often holds its content as a literal
257
+ in `str1`, but a container inline may have only child nodes.
258
+ - The span of a block node serves two different purposes depending on the node type.
259
+ - For inline targets (paragraph / heading / table cell), the span is also the
260
+ byte range that the InlinePass tokenizes, so it points at the inline body
261
+ with `#` or other prefixes removed.
262
+ - For everything else (list / blockquote / table / code / html block, etc.)
263
+ the span is not used for tokenizing and only carries
264
+ structural / line-level position information.
265
+
266
+ ---
267
+
268
+ ## 5. Typical usage
269
+
270
+ ### 5.1 Create an Arena and build a small AST
271
+
272
+ ```ruby
273
+ source = "Hello *world*"
274
+ arena = RedQuilt::Arena.new(source)
275
+
276
+ doc_id = arena.add_node(RedQuilt::NodeType::DOCUMENT,
277
+ source_start: 0, source_len: source.bytesize)
278
+
279
+ para_id = arena.add_node(RedQuilt::NodeType::PARAGRAPH,
280
+ source_start: 0, source_len: source.bytesize)
281
+ arena.append_child(doc_id, para_id)
282
+
283
+ text_id = arena.add_node(RedQuilt::NodeType::TEXT,
284
+ source_start: 0, source_len: 6) # "Hello "
285
+ arena.append_child(para_id, text_id)
286
+
287
+ em_id = arena.add_node(RedQuilt::NodeType::EMPHASIS,
288
+ source_start: 6, source_len: 7) # "*world*"
289
+ arena.append_child(para_id, em_id)
290
+
291
+ inner_id = arena.add_node(RedQuilt::NodeType::TEXT,
292
+ source_start: 7, source_len: 5) # "world"
293
+ arena.append_child(em_id, inner_id)
294
+
295
+ arena.text(text_id) # => "Hello "
296
+ arena.text(inner_id) # => "world"
297
+ arena.source_span(em_id) # => #<SourceSpan @start_byte=6 @end_byte=13>
298
+ ```
299
+
300
+ ### 5.2 Loop over siblings (hot path)
301
+
302
+ ```ruby
303
+ arena.each_child(para_id) do |child_id|
304
+ case arena.type(child_id)
305
+ when RedQuilt::NodeType::TEXT
306
+ output << arena.text(child_id)
307
+ when RedQuilt::NodeType::EMPHASIS
308
+ output << "<em>"
309
+ render_children(child_id)
310
+ output << "</em>"
311
+ end
312
+ end
313
+ ```
314
+
315
+ If you want to chain over an `Enumerator` (for example in NodeRef), do this:
316
+
317
+ ```ruby
318
+ arena.child_ids(para_id).map { |id| arena.type_name(id) }
319
+ # => [:text, :emphasis]
320
+ ```
321
+
322
+ ### 5.3 Move a node to a different parent
323
+
324
+ `reparent` replaces the children of the destination node instead of appending
325
+ to them, so the destination should normally be a newly created, empty node.
326
+
327
+ ```ruby
328
+ # Move the children of `em_id` under a new strong_id
329
+ strong_id = arena.add_node(RedQuilt::NodeType::STRONG,
330
+ source_start: arena.source_start(em_id),
331
+ source_len: arena.source_len(em_id))
332
+ arena.insert_before(arena.raw_parent_id(em_id), em_id, strong_id)
333
+
334
+ first = arena.raw_first_child_id(em_id)
335
+ last = arena.raw_last_child_id(em_id)
336
+ arena.reparent(strong_id, first, last) if first != RedQuilt::Arena::NO_NODE
337
+
338
+ # Detach em_id now that it is empty. strong_id stays where em_id was.
339
+ arena.detach(em_id)
340
+ ```
341
+
342
+ ### 5.4 Replace a node
343
+
344
+ ```ruby
345
+ # Replace em_id with strong_id (keep the contents)
346
+ strong_id = arena.add_node(RedQuilt::NodeType::STRONG,
347
+ source_start: arena.source_start(em_id),
348
+ source_len: arena.source_len(em_id))
349
+ arena.insert_before(arena.raw_parent_id(em_id), em_id, strong_id)
350
+
351
+ first = arena.raw_first_child_id(em_id)
352
+ last = arena.raw_last_child_id(em_id)
353
+ arena.reparent(strong_id, first, last) if first != RedQuilt::Arena::NO_NODE
354
+
355
+ arena.detach(em_id)
356
+ ```
357
+
358
+ ### 5.5 Update column values directly
359
+
360
+ ```ruby
361
+ # A heading's level is in int1, but there is no dedicated setter for it,
362
+ # so add one if needed. Currently only str1 / int3 / span have public setters:
363
+ arena.update_str1(text_id, "Hello, world!")
364
+ arena.update_int3(list_id, 1) # make it tight
365
+ arena.update_span(text_id, 0, 12)
366
+ ```
367
+
368
+ Note that there are currently no setters for int1 / int2 / str2. The plan is to
369
+ add `update_int1` and similar when the need arises.
370
+
371
+ ---
372
+
373
+ ## 6. Performance notes
374
+
375
+ #### Use `each_child` on the hot path
376
+
377
+ Yielding directly to a block avoids Enumerator allocation. `child_ids` is for the
378
+ external API.
379
+
380
+ #### `text(id)` prefers str1
381
+
382
+ `text(id)` returns `str1` when it is set and calls `byteslice` only otherwise.
383
+ Content that can be reconstructed from the source should still leave `str1` as
384
+ `nil`. Set `str1` only where a literal is required for correctness: TEXT after
385
+ entity decode, code/html literals, table cells, transformed/literal inline targets, and so on.
386
+
387
+ #### `source_span(id)` allocates a `SourceSpan` every time
388
+
389
+ If you use it on the hot path, it is better to read `source_start` /
390
+ `source_len` directly.
391
+
392
+ #### Detached nodes cannot be reclaimed
393
+
394
+ Repeatedly detaching many nodes keeps growing the arena's columns. That growth
395
+ is acceptable while parsing a single document, but it makes the design a poor
396
+ fit for a long-lived arena.
397
+
398
+ ---
399
+
400
+ ## 7. Pitfalls
401
+
402
+ #### When you use a `raw_*_id` return value directly as a foreign key
403
+
404
+ Do not forget the `NO_NODE` (-1) check. Ruby's `Array#[]` treats -1 as the last
405
+ element, so indexing a column with it raises no error and silently reads or writes the wrong node.
406
+
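The failure mode is plain Ruby behavior and is easy to reproduce without the Arena. The arrays and the `type_of` helper below are hypothetical and exist only for this illustration.

```ruby
NO_NODE = -1  # same value as the sentinel described in this document

types  = [:document, :paragraph, :text]
parent = [NO_NODE, 0, 1]

root_parent = parent[0]
p root_parent         # => -1
p types[root_parent]  # => :text (the last element, not an error)

# Guard before using the id as an index.
def type_of(types, id)
  return nil if id == NO_NODE

  types[id]
end

p type_of(types, root_parent)  # => nil
p type_of(types, parent[2])    # => :paragraph
```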
407
+ #### Preconditions of `reparent`
408
+
409
+ You must be able to reach `last_id` by following `next_sibling` from `first_id`.
410
+ Passing nodes with a different parent, or a `last_id` that cannot be reached
411
+ from `first_id`, can cause an infinite loop (the builder actually hit this in
412
+ the past).
413
+
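One way to defend against this is to verify the reachability precondition before calling `reparent`. The sketch below is an assumption about how a caller might do that; `sibling_range?` is a hypothetical helper that works on a plain `next_sibling` Array, and with the real Arena the same walk would use `raw_next_sibling_id`.

```ruby
NO_NODE = -1

# Hypothetical helper: true if last_id is reachable from first_id via next_sibling.
# The walk is bounded by the column length, so a broken chain cannot loop forever.
def sibling_range?(next_sibling, first_id, last_id)
  id = first_id
  next_sibling.length.times do
    return true if id == last_id
    return false if id == NO_NODE

    id = next_sibling[id]
  end
  false
end

# Node 0 has the children 1 -> 2 -> 3; node 4 is detached.
next_sibling = [NO_NODE, 2, 3, NO_NODE, NO_NODE]

p sibling_range?(next_sibling, 1, 3)  # => true
p sibling_range?(next_sibling, 3, 1)  # => false (last_id comes before first_id)
p sibling_range?(next_sibling, 1, 4)  # => false (node 4 is not in the chain)
```

The check costs one extra pass over the range, so it is better suited to debug builds or tests than to the hot path.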
414
+ #### The meaning of `source_start < 0`
415
+
416
+ It is "literal mode, with position information discarded". The user-facing APIs
417
+ (`SourceMap`, `node.source_location`, etc.) treat it as having no span. Keep
418
+ this in mind so that a node with no position information in the debugger does
419
+ not come as a surprise.
420
+
421
+ #### Do not change `@source` afterward
422
+
423
+ If you do, the return values of `text` / `source_span` break silently.