red_quilt 0.7.1 → 0.8.0

@@ -0,0 +1,316 @@
1
+ # RedQuilt CommonMark Conformance
2
+
3
+ ## 1. Scope of this document
4
+
5
+ This document describes how RedQuilt differs from the CommonMark / GFM spec.
6
+ For behavior that follows the spec, refer directly to the spec documents
7
+ (<https://spec.commonmark.org/0.31.2/>, <https://github.github.com/gfm/>); this
8
+ document does not repeat them.
9
+
10
+ #### What this document covers
11
+
12
+ - Places where the implementation **narrows** what the spec allows (interpreting
13
+ or tightening ambiguous areas).
14
+ - Features outside the spec (security, diagnostics, option flags).
15
+ - The extensions that are **enabled** (GFM, etc.) and their opt-in conditions.
16
+ - Unsupported features and known limitations.
17
+
18
+ #### What this document does not cover
19
+
20
+ - Descriptions of standard behavior that matches the spec.
21
+ - Design background or data structure choices.
22
+
23
+ ### 1.1 Target versions
24
+
25
+ - CommonMark: 0.31.2
26
+ - GitHub Flavored Markdown: 0.29-gfm
27
+
28
+ ### 1.2 Implementation assumptions
29
+
30
+ - Input is a UTF-8 string. Preprocessing such as `force_encoding(Encoding::UTF_8)`
31
+ is the caller's responsibility.
32
+ - The input handling required by spec §2.1 / §2.3 (`\r\n` / `\r` -> `\n`,
33
+ NUL -> U+FFFD) and the restriction of blank lines to space/tab only are
34
+ all implemented. These follow the spec, so this document does not list them
35
+ individually.
36
+
37
+ ### 1.3 Format of each item
38
+
39
+ ```
40
+ ### N.N <Title>
41
+
42
+ **Spec**: the relevant section and the spec rule (or ambiguity)
43
+ **RedQuilt behavior**: how the implementation behaves / where it narrows or extends
44
+ **Implementation**: file:line / main symbols
45
+ **Test**: spec file / example number
46
+ ```
47
+
48
+ ## 2. Points where the spec is tightened
49
+
50
+ Where the spec wording allows more than one interpretation, or where a "must" is
51
+ left ambiguous, the implementation chooses the stricter side.
52
+
53
+ ### 2.1 URI autolink rejects U+007F (DEL)
54
+
55
+ **Spec**: §6.5 — a URI autolink does not contain "ASCII control characters,
56
+ space, `<`, `>`". Whether the range of "ASCII control characters" is only
57
+ U+0000–U+001F or also includes U+007F is not stated.
58
+
59
+ **RedQuilt behavior**: also rejects U+007F.
60
+
61
+ **Implementation**: `lib/red_quilt/inline/lexer.rb` — `URI_AUTOLINK_RE`
62
+
63
+ **Test**: `spec/whitespace_strictness_spec.rb` — "URI autolink (CommonMark 6.5)"
64
+
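The stricter character class can be sketched as follows. This is only an illustration: `STRICT_URI_AUTOLINK_RE` and `uri_autolink?` are hypothetical names, not RedQuilt's actual `URI_AUTOLINK_RE`, and the scheme pattern assumes the spec's 2 to 32 character rule.

```ruby
# Hypothetical sketch, not RedQuilt's actual regex.
# Scheme: an ASCII letter followed by 1-31 letters, digits, "+", ".", or "-".
# Body: anything except U+0000-U+0020 (controls and space), U+007F (DEL), "<", ">".
STRICT_URI_AUTOLINK_RE = /\A<[A-Za-z][A-Za-z0-9+.\-]{1,31}:[^\x00-\x20\x7F<>]*>\z/

def uri_autolink?(text)
  STRICT_URI_AUTOLINK_RE.match?(text)
end

uri_autolink?("<https://example.com/a?b=1>") # accepted
uri_autolink?("<https://example.com/\x7F>")  # rejected: contains DEL
```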
65
+ ### 2.2 Raw HTML tag separators limited to space/tab/CR/LF
66
+
67
+ **Spec**: §6.6 — defines the separators between attributes and around `=` as
68
+ "whitespace". In the spec's terminology (§2.1), the "whitespace" set is broad and
69
+ includes space / tab / newline / line tabulation (U+000B) / form feed (U+000C) /
70
+ carriage return.
71
+
72
+ **RedQuilt behavior**: within the tag grammar, only `[ \t\r\n]` is allowed as a
73
+ separator. FF (U+000C) / VT (U+000B) are not included. The same constraint
74
+ applies to inline raw HTML and to HTML block types 1 / 6 / 7.
75
+
76
+ **Implementation**:
77
+ - Inline: `lib/red_quilt/inline/lexer.rb` — `HTML_OPEN_TAG_RE` /
78
+ `HTML_CLOSING_TAG_RE`
79
+ - Block: `lib/red_quilt/block_parser.rb` — `HTML_TYPE_7_OPEN_TAG_RE` /
80
+ `HTML_TYPE_7_CLOSING_TAG_RE` / `HTML_BLOCK_TYPE_6_RE` / type 1 regex
81
+
82
+ **Test**: `spec/whitespace_strictness_spec.rb` — "raw HTML tag whitespace
83
+ (CommonMark 6.6)"
84
+
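To show what the narrower separator class changes in practice, here is a simplified open-tag pattern. It is a sketch under assumptions: `SEP`, `ATTR`, `STRICT_OPEN_TAG_RE`, and `strict_open_tag?` are hypothetical names, only double-quoted attribute values are handled, and the real `HTML_OPEN_TAG_RE` is likely more complete.

```ruby
# Hypothetical sketch: a simplified open-tag pattern, not RedQuilt's HTML_OPEN_TAG_RE.
# Strict separator class: space, tab, CR, LF. No FF (U+000C), no VT (U+000B).
SEP = "[ \\t\\r\\n]"
# Attribute name with an optional double-quoted value (simplification).
ATTR = "[A-Za-z_:][A-Za-z0-9_.:-]*(?:#{SEP}*=#{SEP}*\"[^\"]*\")?"
STRICT_OPEN_TAG_RE = /\A<[A-Za-z][A-Za-z0-9-]*(?:#{SEP}+#{ATTR})*#{SEP}*\/?>\z/

def strict_open_tag?(text)
  STRICT_OPEN_TAG_RE.match?(text)
end

strict_open_tag?("<a\thref=\"x\">") # tab is an allowed separator
strict_open_tag?("<a\fhref=\"x\">") # form feed is not, so this is not a tag
```

Note that Ruby's `\s` also matches FF and VT, so a pattern written with `\s` would accept the second input.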
85
+ ### 2.3 Inline link tail separators limited to space/tab/at most 1 LF
86
+
87
+ **Spec**: §6.3 — the link tail (inside `(dest "title")`) is separated by "spaces,
88
+ tabs, and up to one line ending". FF / VT are not mentioned.
89
+
90
+ **RedQuilt behavior**: within the link tail, only space / tab are separators, and
91
+ a line ending is counted separately, up to one. If FF / VT appears, it does not
92
+ form a link (it is treated as normal paragraph text).
93
+
94
+ **Implementation**: `lib/red_quilt/inline/link_scanner.rb` —
95
+ `link_tail_whitespace_byte?`, `skip_link_whitespace`, `inline_link`,
96
+ `parse_link_title`
97
+
98
+ **Test**: `spec/whitespace_strictness_spec.rb` — "inline link tail whitespace
99
+ (CommonMark 6.3)"
100
+
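The separator rule above can be sketched as a small scanner. The name `skip_tail_separators` and its return convention (new position, or `nil` on a second line ending) are assumptions made for this example; they do not describe the actual `skip_link_whitespace`.

```ruby
# Hypothetical sketch of the link-tail separator rule.
# Skips spaces and tabs, and at most one line ending, starting at pos.
# Returns the index of the first non-separator character, or nil when a
# second line ending is encountered. FF / VT are never skipped.
def skip_tail_separators(text, pos)
  line_endings = 0
  while pos < text.length
    case text[pos]
    when " ", "\t"
      pos += 1
    when "\n"
      line_endings += 1
      return nil if line_endings > 1
      pos += 1
    else
      break
    end
  end
  pos
end
```

A caller that lands on FF or VT after this step sees a non-separator byte where a destination or title should start, so the candidate is left as plain text.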
101
+ ### 2.4 Reference definition raw destination validated the same as inline links
102
+
103
+ **Spec**: §6.3 — the raw form of a link destination is "a nonempty sequence of
104
+ characters that does not start with `<`, does not include ASCII control
105
+ characters or space character, and includes parentheses only if (a) they are
106
+ backslash-escaped or (b) they are part of a balanced pair of unescaped
107
+ parentheses".
108
+
109
+ **RedQuilt behavior**: applies all of the above checks to the raw destination of
110
+ a reference definition as well. Specifically, it rejects ASCII control characters
111
+ (U+0000–U+001F), U+007F (DEL), and space, and tracks the depth of unescaped
112
+ parentheses, invalidating the definition if they are unbalanced.
113
+
114
+ **Past behavior**: destinations were matched with a simple `/\A(\S+)(.*)\z/`, so
115
+ `[x]: foo(bar` and `[x]: foo\bbar` were accepted as definitions.
116
+
117
+ **Implementation**: `lib/red_quilt/reference_definition.rb` —
118
+ `parse_raw_destination`, `RAW_DEST_FORBIDDEN_RE`
119
+
120
+ **Test**: `spec/link_validation_spec.rb` — "reference definition raw destination
121
+ validation"
122
+
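The validation described in this item can be sketched roughly as below. `raw_destination_valid?` is a hypothetical helper written for illustration; it is not the body of `parse_raw_destination`, and it assumes the whole destination string has already been isolated.

```ruby
# Hypothetical sketch of raw-destination validation.
# Rejects: empty input, a leading "<", ASCII controls, DEL, space,
# and unbalanced unescaped parentheses.
def raw_destination_valid?(dest)
  return false if dest.empty? || dest.start_with?("<")

  depth = 0
  escaped = false
  dest.each_char do |ch|
    # Forbidden characters are rejected even directly after a backslash.
    return false if ch.match?(/[\x00-\x20\x7F]/)

    if escaped
      escaped = false
      next
    end

    case ch
    when "\\" then escaped = true
    when "(" then depth += 1
    when ")"
      depth -= 1
      return false if depth.negative?
    end
  end
  depth.zero?
end

raw_destination_valid?("foo(bar)")  # balanced parentheses
raw_destination_valid?("foo(bar")   # unbalanced, rejected
raw_destination_valid?("foo\bbar")  # contains U+0008, rejected
```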
123
+ ### 2.5 Apply the 999-character link label limit on all paths
124
+
125
+ **Spec**: §6.3 — "A link label can have at most 999 characters inside the square
126
+ brackets."
127
+
128
+ **RedQuilt behavior**: rejects more than 999 characters on both the reference
129
+ definition side and the reference link usage side (shortcut / collapsed / full,
130
+ all of them).
131
+
132
+ **Implementation**:
133
+ - Constant: `lib/red_quilt/reference_definition.rb` —
134
+ `LABEL_MAX_LENGTH = 999`, the `label_too_long?` helper
135
+ - Definition side: `match_label` (decides for both single-line and multi-line)
136
+ - Usage side: `lib/red_quilt/inline/builder.rb` — `try_reference_link`,
137
+ `lib/red_quilt/inline/link_scanner.rb` — `reference_label`
138
+
139
+ **Test**: `spec/link_validation_spec.rb` — "link label length limit (999
140
+ characters)"
141
+
142
+ ### 2.6 NCR digit limits and U+FFFD replacement of invalid codepoints
143
+
144
+ **Spec**: §2.5 — a decimal NCR is 1–7 digits, a hex NCR is 1–6 digits. If the
145
+ decode result is U+0000, a surrogate (U+D800–U+DFFF), or out of the Unicode range
146
+ (> U+10FFFF), it is replaced with U+FFFD.
147
+
148
+ **RedQuilt behavior**: implements all of the above.
149
+
150
+ **Past behavior**: decoding was delegated to `CGI.unescapeHTML`, so an 8-digit
151
+ decimal like `&#00000065;` decoded to "A" and a surrogate like `&#xD800;` raised
152
+ a `RangeError`.
153
+
154
+ **Implementation**: `lib/red_quilt/inline/html_entities.rb` —
155
+ `Inline.decode_entity`, `Inline::ENTITY_RE`, `decode_numeric_codepoint`. The
156
+ `SURROGATE_RANGE` and `MAX_UNICODE_CODEPOINT` constants.
157
+
158
+ **Test**: `spec/numeric_character_reference_spec.rb`
159
+
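A minimal sketch of the digit limits and the U+FFFD replacement follows. The names `NCR_RE` and `decode_ncr` are hypothetical and only handle a single, complete numeric reference; they are not the actual `Inline.decode_entity` or `decode_numeric_codepoint`.

```ruby
# Hypothetical sketch of numeric character reference decoding.
# Hex form: 1-6 hex digits. Decimal form: 1-7 decimal digits.
NCR_RE = /\A&#(?:[xX](\h{1,6})|(\d{1,7}));\z/

# Returns the decoded character, U+FFFD for invalid code points,
# or nil when the text is not a well-formed reference (left as literal text).
def decode_ncr(text)
  match = NCR_RE.match(text)
  return nil unless match

  codepoint = match[1] ? match[1].to_i(16) : match[2].to_i(10)
  invalid = codepoint.zero? ||
            (0xD800..0xDFFF).cover?(codepoint) ||
            codepoint > 0x10FFFF
  invalid ? "\uFFFD" : [codepoint].pack("U")
end

decode_ncr("&#65;")       # "A"
decode_ncr("&#00000065;") # nil: 8 digits exceeds the decimal limit
decode_ncr("&#xD800;")    # U+FFFD: surrogate
```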
160
+ ### 2.7 GFM table header / delimiter cell-count match requirement
161
+
162
+ **Spec (GFM §4.10)**: "The header row must match the delimiter row in the number
163
+ of cells. If not, a table will not be recognized."
164
+
165
+ **RedQuilt behavior**: if the cell count of the header and the delimiter do not
166
+ match, it is not recognized as a table and is treated as a paragraph.
167
+
168
+ **Implementation**: `lib/red_quilt/block_parser.rb` — `table_start?`
169
+
170
+ **Test**: `spec/red_quilt_spec.rb` — "table separator validation (GFM spec)"
171
+
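The cell-count check can be illustrated with a simplified splitter. Both `split_cells` and `header_matches_delimiter?` are hypothetical helpers; the sketch ignores pipes inside code spans and is not the logic of `table_start?`.

```ruby
# Hypothetical sketch of the header / delimiter cell-count check.
# Splits a table row on unescaped pipes after dropping the optional
# leading and trailing pipe.
def split_cells(line)
  body = line.strip.sub(/\A\|/, "").sub(/(?<!\\)\|\z/, "")
  body.split(/(?<!\\)\|/, -1).map(&:strip)
end

# True only when the delimiter row is well formed and has exactly as
# many cells as the header row.
def header_matches_delimiter?(header, delimiter)
  delimiter_cells = split_cells(delimiter)
  return false if delimiter_cells.empty?
  return false unless delimiter_cells.all? { |cell| cell.match?(/\A:?-+:?\z/) }

  split_cells(header).length == delimiter_cells.length
end

header_matches_delimiter?("| a | b |", "| --- | --- |") # same count: table
header_matches_delimiter?("| a | b |", "| --- |")       # mismatch: paragraph
```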
172
+ ### 2.8 GFM extended autolink domain underscore constraint
173
+
174
+ **Spec (GFM §6.9)**: a valid domain must contain at least one period, and no
175
+ underscores may be present in its last two segments.
176
+
177
+ **RedQuilt behavior**: when extended autolinks are enabled, a URL / email whose
178
+ domain has `_` in its last two segments is not linkified.
179
+
180
+ **Implementation**: `lib/red_quilt/extended_autolink_pass.rb` — `valid_domain?` /
181
+ `extract_domain`
182
+
183
+ **Test**: `spec/extended_autolink_spec.rb` — "domain validation (GFM spec)"
184
+
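As a rough illustration of the underscore rule, the sketch below checks only the last two dot-separated segments. `domain_tail_valid?` is a hypothetical name; the actual `valid_domain?` / `extract_domain` may apply further checks.

```ruby
# Hypothetical sketch of the GFM domain underscore rule.
# A domain needs at least two segments, and neither of the last two
# segments may contain an underscore. Earlier segments may.
def domain_tail_valid?(domain)
  segments = domain.split(".")
  return false if segments.length < 2

  segments.last(2).none? { |segment| segment.include?("_") }
end

domain_tail_valid?("a_b.example.com")  # underscore only in an earlier segment
domain_tail_valid?("example_site.com") # underscore in the last two segments
```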
185
+ ## 3. Features outside the spec
186
+
187
+ Features not defined in the spec that RedQuilt provides for safety and
188
+ convenience.
189
+
190
+ ### 3.1 Sanitizing unsafe URL schemes
191
+
192
+ **RedQuilt behavior**: if the scheme of a link / image destination is not in the
193
+ safe list below, it outputs `href` / `src` as an empty string. At the same time
194
+ it emits an `:unsafe_url` diagnostic as a warning. For CommonMark autolinks
195
+ (`<scheme:...>`), to stay spec-conformant, a denylist is used instead of a safe
196
+ list, and only schemes that could lead to script execution get an empty href.
197
+
198
+ **Safe schemes**: `http`, `https`, `mailto`, `ftp`, `tel`, `ssh`
199
+
200
+ **Schemes blocked in autolinks**: `javascript`, `vbscript`, `data`
201
+
202
+ **Implementation**: `lib/red_quilt/inline/builder.rb` — `SAFE_SCHEMES`,
203
+ `UNSAFE_AUTOLINK_SCHEMES`, `sanitize_destination`, `block_unsafe_autolink`
204
+
205
+ **Test**: `spec/red_quilt_spec.rb` — "sanitizes unsafe URL schemes"
206
+
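The allowlist side of this behavior can be sketched as follows. `SAFE_SCHEME_LIST` and `sanitize_url` are hypothetical names, and passing scheme-less (relative) destinations through unchanged is an assumption of this example rather than something stated above.

```ruby
# Hypothetical sketch of scheme allowlisting for link / image destinations.
SAFE_SCHEME_LIST = %w[http https mailto ftp tel ssh].freeze

# Returns the destination unchanged when its scheme is allowed, and an
# empty string otherwise. Scheme comparison is case-insensitive.
# Assumption: destinations without a scheme are passed through.
def sanitize_url(destination)
  scheme = destination[/\A([A-Za-z][A-Za-z0-9+.\-]*):/, 1]
  return destination if scheme.nil?

  SAFE_SCHEME_LIST.include?(scheme.downcase) ? destination : ""
end

sanitize_url("https://example.com")  # kept
sanitize_url("JavaScript:alert(1)")  # replaced with ""
```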
207
+ ### 3.2 Diagnostics
208
+
209
+ **RedQuilt behavior**: suspicious syntax, missing references, and potential
210
+ security events detected during parse / render are accumulated in
211
+ `Document#diagnostics` as `RedQuilt::Diagnostic` objects. Processing is never
212
+ interrupted (a tree and HTML are always returned).
213
+
214
+ **Rules currently emitted**:
215
+
216
+ | Rule | Severity | Description |
217
+ |---|---|---|
218
+ | `:missing_reference` | warning | A full reference link `[text][ref]` has no definition. |
219
+ | `:duplicate_reference` | warning | There were multiple reference definitions with the same label (the first one is used). |
220
+ | `:duplicate_footnote` | warning | There were multiple footnote definitions with the same label (the first one is used; only when `footnotes: true`). |
221
+ | `:unsafe_url` | warning | An unsafe URL was replaced with an empty `href` / `src`. |
222
+ | `:empty_link` | warning | The link destination is empty (only when `lint: true`). |
223
+ | `:missing_alt` | info | An image's alt text is empty (only when `lint: true`). |
224
+ | `:heading_level_skip` | info | A heading level jumped by more than one (only when `lint: true`). |
225
+
226
+ **Implementation**: `lib/red_quilt/diagnostic.rb` (value object),
227
+ `lib/red_quilt/block_parser.rb` (duplicate reference),
228
+ `lib/red_quilt/footnote_definition.rb` (duplicate footnote),
229
+ `lib/red_quilt/inline/builder.rb` (missing / unsafe),
230
+ `lib/red_quilt/lint_pass.rb` (lint rules)
231
+
232
+ ### 3.3 `allow_html` / `disallow_raw_html` flags
233
+
234
+ **RedQuilt behavior**:
235
+
236
+ | Flag | Default | Effect |
237
+ |---|---|---|
238
+ | `allow_html` | `false` | When false, raw HTML is fully escaped (`<` is turned into `&lt;`). When true, HTML blocks and inline raw HTML are output as-is. |
239
+ | `disallow_raw_html` | `false` | The GFM "Disallowed Raw HTML" extension; it takes effect only under `allow_html: true`. It rewrites `<` to `&lt;` for the specified tags. |
240
+
241
+ The disallowed tag set defined by GFM: `title`, `textarea`, `style`, `xmp`,
242
+ `iframe`, `noembed`, `noframes`, `script`, `plaintext`
243
+
244
+ **Implementation**: `lib/red_quilt/document.rb` — `allow_html?` /
245
+ `disallow_raw_html?`
246
+ **Implementation (filter)**: `lib/red_quilt/renderer/html.rb` —
247
+ `DISALLOWED_RAW_TAGS` / `DISALLOWED_RAW_TAG_RE` / `filter_disallowed_raw`
248
+
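A minimal sketch of the tag filter is shown below. `DISALLOWED_TAG_NAMES`, `DISALLOWED_OPEN_RE`, and `escape_disallowed_tags` are hypothetical names, and the rule for what may follow the tag name (separator, `/`, `>`, or end of input) is an assumption; the actual `DISALLOWED_RAW_TAG_RE` may differ.

```ruby
# Hypothetical sketch of the GFM "Disallowed Raw HTML" filter.
DISALLOWED_TAG_NAMES = %w[
  title textarea style xmp iframe noembed noframes script plaintext
].freeze

# Matches a "<" that begins an opening or closing tag with a disallowed
# name. The lookahead keeps the rest of the tag untouched.
DISALLOWED_OPEN_RE =
  /<(?=\/?(?:#{DISALLOWED_TAG_NAMES.join("|")})(?:[ \t\r\n\/>]|\z))/i

def escape_disallowed_tags(html)
  html.gsub(DISALLOWED_OPEN_RE, "&lt;")
end

escape_disallowed_tags("<script>alert(1)</script><b>ok</b>")
# => "&lt;script>alert(1)&lt;/script><b>ok</b>"
```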
249
+ ## 4. Enabled extensions
250
+
251
+ ### 4.1 GFM Table
252
+
253
+ Always enabled. The header / delimiter cell-count requirement from 2.7 is
254
+ enforced.
255
+
256
+ **Implementation**: `lib/red_quilt/block_parser.rb` — `table_start?` /
257
+ `parse_table`
258
+
259
+ ### 4.2 GFM Strikethrough
260
+
261
+ Always enabled. Only the double tilde `~~text~~` is supported (matching GFM
262
+ behavior). A single tilde `~text~` is treated as normal text.
263
+
264
+ **Implementation**: `lib/red_quilt/inline/lexer.rb` (handling of `~` in
265
+ `SPECIAL_BYTES` and `scan_delim_run`), `lib/red_quilt/inline/builder.rb`
266
+ (generating `STRIKETHROUGH` in `process_emphasis`)
267
+
268
+ ### 4.3 GFM Disallowed Raw HTML
269
+
270
+ Opt-in. It only works when `allow_html: true, disallow_raw_html: true` are used
271
+ together (under `allow_html: false` all HTML is escaped, so it has no effect).
272
+ See 3.3 for details.
273
+
274
+ ### 4.4 GFM Extended Autolink
275
+
276
+ Opt-in. Specifying `extended_autolinks: true` runs `ExtendedAutolinkPass` as a
277
+ pass that linkifies bare URLs / emails / `www.`-prefixed strings that are not
278
+ wrapped in `<...>`.
279
+
280
+ **Additional constraint**: implements the domain underscore check from 2.8.
281
+
282
+ **Implementation**: `lib/red_quilt/extended_autolink_pass.rb`
283
+
284
+ ### 4.5 GFM Footnotes
285
+
286
+ Opt-in. Specifying `footnotes: true` removes `[^label]: ...` definitions from the
287
+ body flow and converts `[^label]` references into sup links. Only the referenced
288
+ definitions are kept, ordered by first reference, and output as a
289
+ `FOOTNOTES_SECTION` at the end of the root. Unreferenced definitions are not
290
+ output.
291
+
292
+ **Implementation**: `lib/red_quilt/footnote_definition.rb`,
293
+ `lib/red_quilt/footnote_registry.rb`, `lib/red_quilt/footnote_pass.rb`
294
+
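The ordering rule (referenced definitions only, numbered by first reference) can be sketched as below. `footnotes_in_reference_order` and its hash-based inputs are hypothetical and do not reflect `FootnoteRegistry` or `FootnotePass`; labels are compared verbatim here, which is an assumption.

```ruby
# Hypothetical sketch of footnote ordering.
# definitions:      { label => body } for every [^label]: definition
# reference_labels: labels in the order their [^label] references appear
# Returns referenced definitions only, numbered by first reference.
def footnotes_in_reference_order(definitions, reference_labels)
  reference_labels
    .uniq
    .select { |label| definitions.key?(label) }
    .each_with_index
    .map { |label, index| { number: index + 1, label: label, body: definitions[label] } }
end

definitions = { "a" => "First note", "b" => "Second note", "c" => "Unused" }
footnotes_in_reference_order(definitions, %w[b a b])
# => [{ number: 1, label: "b", ... }, { number: 2, label: "a", ... }]
```

In this sketch the unreferenced definition `c` is dropped, and the repeated reference to `b` does not create a second entry.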
295
+ ## 5. Unsupported / known limitations
296
+
297
+ - GFM Task List Items (`- [ ]` / `- [x]`) are not supported. They are parsed as
298
+ normal list items.
299
+
300
+ ## 6. Correspondence with tests
301
+
302
+ This section lists the spec files that verify each of the differences above.
303
+
304
+ | Aspect | Spec file |
305
+ |---|---|
306
+ | Passing the official CommonMark examples | `spec/commonmark_compat_spec.rb` |
307
+ | Input normalization (line endings / NUL / blank line) | `spec/input_normalization_spec.rb` |
308
+ | Whitespace strictness (autolink / raw HTML / link tail) | `spec/whitespace_strictness_spec.rb` |
309
+ | Link / reference validation (label cap / raw dest) | `spec/link_validation_spec.rb` |
310
+ | NCR digit limits and invalid codepoints | `spec/numeric_character_reference_spec.rb` |
311
+ | GFM table column-count match | `spec/red_quilt_spec.rb` — "table separator validation" |
312
+ | GFM extended autolink domain validation | `spec/extended_autolink_spec.rb` |
313
+ | GFM footnotes | `spec/footnotes_spec.rb` |
314
+ | URL scheme sanitization | `spec/red_quilt_spec.rb` — "sanitizes unsafe URL schemes" |
315
+ | Diagnostics / lint diagnostics | `spec/diagnostic_spec.rb` |
316
+ | Disallowed Raw HTML | `spec/red_quilt_spec.rb` — disallow_raw_html cases |
@@ -231,6 +231,102 @@ module RedQuilt
231
231
  @str2[id]
232
232
  end
233
233
 
234
+ # --- Semantic payload accessors -------------------------------------
235
+ #
236
+ # int1..int3 / str1 / str2 are anonymous columns; their meaning is
237
+ # per-NodeType (see the class comment). These readers are the single
238
+ # source of truth for those conventions, so callers (renderers, the
239
+ # NodeRef wrapper, AST/MDAST export) never need to know which raw
240
+ # column a field lives in. Writers use add_node's keyword args,
241
+ # update_*, or a small set of typed writers (e.g.
242
+ # #resolve_footnote_definition) when the intent is worth naming.
243
+ #
244
+ # The caller is responsible for calling these only on the matching
245
+ # NodeType; on a mismatching node they return whatever the raw column
246
+ # holds (typically the 0 / nil default).
247
+
248
+ # HEADING: nesting level (1..6).
249
+ def heading_level(id)
250
+ @int1[id]
251
+ end
252
+
253
+ # LIST: ordered (`1.`) vs bullet (`-`).
254
+ def list_ordered?(id)
255
+ @int1[id] == 1
256
+ end
257
+
258
+ # LIST: the start number of an ordered list (1 unless overridden).
259
+ def list_start(id)
260
+ @int2[id]
261
+ end
262
+
263
+ # LIST: tight (no blank lines between items) vs loose.
264
+ def list_tight?(id)
265
+ @int3[id] == 1
266
+ end
267
+
268
+ # LIST: the item delimiter as authored (e.g. "-", "1.").
269
+ def list_delimiter(id)
270
+ @str1[id]
271
+ end
272
+
273
+ # TABLE_ROW: header row (rendered in <thead>) vs body row.
274
+ def table_row_header?(id)
275
+ @int1[id] == 1
276
+ end
277
+
278
+ # TABLE_CELL: header cell (<th>) vs data cell (<td>).
279
+ def table_cell_header?(id)
280
+ @int1[id] == 1
281
+ end
282
+
283
+ # CODE_BLOCK: the fence info string (e.g. 'ruby', 'vtt audio="x"').
284
+ def code_block_info(id)
285
+ @str2[id]
286
+ end
287
+
288
+ # LINK / IMAGE: destination URL.
289
+ def link_destination(id)
290
+ @str1[id]
291
+ end
292
+
293
+ # LINK / IMAGE: optional title attribute (nil/empty when absent).
294
+ def link_title(id)
295
+ @str2[id]
296
+ end
297
+
298
+ # FOOTNOTE_REFERENCE / FOOTNOTE_DEFINITION: the assigned footnote number.
299
+ def footnote_number(id)
300
+ @int1[id]
301
+ end
302
+
303
+ # FOOTNOTE_REFERENCE: which occurrence of a repeated reference this is
304
+ # (1-based), used to give each backref a unique anchor.
305
+ def footnote_occurrence(id)
306
+ @int2[id]
307
+ end
308
+
309
+ # FOOTNOTE_REFERENCE / FOOTNOTE_DEFINITION: the author-written label.
310
+ def footnote_label(id)
311
+ @str1[id]
312
+ end
313
+
314
+ # FOOTNOTE_DEFINITION: total number of references to this footnote,
315
+ # materialized by FootnotePass so renderers can read it off the node
316
+ # instead of consulting the registry. (FOOTNOTE_REFERENCE reuses int2
317
+ # for its own occurrence index; see #footnote_occurrence.)
318
+ def footnote_total_references(id)
319
+ @int2[id]
320
+ end
321
+
322
+ # FOOTNOTE_DEFINITION: records the resolved number and total reference
323
+ # count onto the node (read back via #footnote_number /
324
+ # #footnote_total_references).
325
+ def resolve_footnote_definition(id, number, total_references)
326
+ @int1[id] = number
327
+ @int2[id] = total_references
328
+ end
329
+
234
330
  # Returns a SourceSpan for the node, or nil when the node has no
235
331
  # span (source_start < 0, meaning the content is held in str1).
236
332
  def source_span(id)