canon 0.1.22 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (56)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +174 -25
  3. data/docs/INDEX.adoc +4 -0
  4. data/docs/advanced/diff-classification.adoc +3 -2
  5. data/docs/features/configuration-profiles.adoc +288 -0
  6. data/docs/features/diff-formatting/character-visualization.adoc +153 -454
  7. data/docs/features/diff-formatting/display-filtering.adoc +44 -0
  8. data/docs/features/diff-formatting/display-preprocessing.adoc +656 -0
  9. data/docs/features/diff-formatting/index.adoc +47 -0
  10. data/docs/features/diff-formatting/pretty-diff-mode.adoc +154 -0
  11. data/docs/features/environment-configuration/override-system.adoc +10 -3
  12. data/docs/features/index.adoc +9 -0
  13. data/docs/features/match-options/index.adoc +32 -42
  14. data/docs/features/match-options/pretty-printed-fixtures.adoc +270 -0
  15. data/docs/guides/choosing-configuration.adoc +22 -0
  16. data/docs/reference/environment-variables.adoc +121 -1
  17. data/docs/reference/options-across-interfaces.adoc +182 -2
  18. data/lib/canon/cli.rb +20 -0
  19. data/lib/canon/commands/diff_command.rb +7 -2
  20. data/lib/canon/commands/format_command.rb +1 -1
  21. data/lib/canon/comparison/html_comparator.rb +20 -15
  22. data/lib/canon/comparison/html_compare_profile.rb +4 -4
  23. data/lib/canon/comparison/markup_comparator.rb +12 -3
  24. data/lib/canon/comparison/match_options/base_resolver.rb +29 -7
  25. data/lib/canon/comparison/match_options/json_resolver.rb +9 -0
  26. data/lib/canon/comparison/match_options/xml_resolver.rb +16 -2
  27. data/lib/canon/comparison/match_options/yaml_resolver.rb +10 -0
  28. data/lib/canon/comparison/match_options.rb +4 -1
  29. data/lib/canon/comparison/whitespace_sensitivity.rb +189 -137
  30. data/lib/canon/comparison/xml_comparator/child_comparison.rb +21 -4
  31. data/lib/canon/comparison/xml_comparator.rb +14 -12
  32. data/lib/canon/comparison/xml_node_comparison.rb +51 -6
  33. data/lib/canon/comparison.rb +52 -9
  34. data/lib/canon/config/env_schema.rb +32 -4
  35. data/lib/canon/config/override_resolver.rb +16 -3
  36. data/lib/canon/config/profile_loader.rb +135 -0
  37. data/lib/canon/config/profiles/metanorma.yml +74 -0
  38. data/lib/canon/config/profiles/metanorma_debug.yml +8 -0
  39. data/lib/canon/config/type_converter.rb +8 -0
  40. data/lib/canon/config.rb +469 -5
  41. data/lib/canon/diff/diff_classifier.rb +41 -11
  42. data/lib/canon/diff_formatter/diff_detail_formatter/dimension_formatter.rb +48 -17
  43. data/lib/canon/diff_formatter/diff_detail_formatter/node_utils.rb +58 -0
  44. data/lib/canon/diff_formatter/diff_detail_formatter.rb +22 -7
  45. data/lib/canon/diff_formatter/theme.rb +24 -17
  46. data/lib/canon/diff_formatter.rb +493 -36
  47. data/lib/canon/pretty_printer/xml_normalized.rb +395 -0
  48. data/lib/canon/rspec_matchers.rb +36 -0
  49. data/lib/canon/tree_diff/matchers/hash_matcher.rb +26 -11
  50. data/lib/canon/version.rb +1 -1
  51. data/lib/canon/xml/nodes/namespace_node.rb +4 -0
  52. data/lib/canon/xml/nodes/processing_instruction_node.rb +4 -0
  53. data/lib/canon/xml/nodes/root_node.rb +4 -0
  54. data/lib/canon/xml/nodes/text_node.rb +4 -0
  55. data/lib/tasks/performance_helpers.rb +2 -2
  56. metadata +24 -2
@@ -168,6 +168,50 @@ When using the `spec_friendly` profile, attribute order is informative:
168
168
  The cyan color and `~` symbol indicate that this difference does not affect equivalence in the `spec_friendly` profile.
169
169
  ====
170
170
 
171
+ == Semantic Diff Report filtering
172
+
173
+ The `show_diffs` filter applies to **both** the line/tree diff output and the
174
+ Semantic Diff Report header block. When the Semantic Diff Report is emitted
175
+ (whenever `verbose: true` and differences exist), its two sections are gated
176
+ by the same flag:
177
+
178
+ [cols="1,1,1"]
179
+ |===
180
+ |`show_diffs` |NORMATIVE DIFFERENCES section |INFORMATIVE DIFFERENCES section
181
+
182
+ |`:all` (default)
183
+ |shown
184
+ |shown
185
+
186
+ |`:normative`
187
+ |shown
188
+ |**hidden**
189
+
190
+ |`:informative`
191
+ |**hidden**
192
+ |shown
193
+ |===
194
+
195
+ This means that setting `show_diffs: :normative` (or
196
+ `cfg.xml.diff.show_diffs = :normative` in config) will suppress the
197
+ `INFORMATIVE DIFFERENCES` section from the Semantic Diff Report as well as
198
+ from the line diff, giving a consistent, noise-free view of only the
199
+ differences that affect equivalence.
200
+
201
+ [example]
202
+ [source,ruby]
203
+ ----
204
+ Canon::Config.configure do |cfg|
205
+   cfg.xml.diff.show_diffs = :normative
206
+ end
207
+
208
+ # The Semantic Diff Report now omits the INFORMATIVE DIFFERENCES section.
209
+ # Only the NORMATIVE DIFFERENCES block is rendered.
210
+ expect(actual).to be_xml_equivalent_to(expected,
211
+   match: { comments: :ignore }
212
+ )
213
+ ----
214
+
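As a rough illustration of the table above, the gating can be pictured as a lookup from the `show_diffs` value to the report sections that get rendered. This is only a sketch, not Canon's implementation: `SECTION_VISIBILITY` and `visible_sections` are hypothetical names introduced here.

```ruby
# Hypothetical lookup mirroring the table above; not Canon's internal code.
SECTION_VISIBILITY = {
  all: %i[normative informative],
  normative: %i[normative],
  informative: %i[informative],
}.freeze

# Returns the Semantic Diff Report sections rendered for a show_diffs value.
def visible_sections(show_diffs = :all)
  SECTION_VISIBILITY.fetch(show_diffs) do
    raise ArgumentError, "unknown show_diffs value: #{show_diffs.inspect}"
  end
end
```

Under this reading, `visible_sections(:normative)` lists only the normative section, which matches the second row of the table.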
171
215
  == Usage examples
172
216
 
173
217
  === Ruby API
@@ -0,0 +1,656 @@
1
+ ---
2
+ title: Display Preprocessing
3
+ parent: Diff Formatting
4
+ nav_order: 6
5
+ ---
6
+ = Display preprocessing
7
+ :toc:
8
+ :toclevels: 3
9
+
10
+ == Purpose
11
+
12
+ Display preprocessing controls how documents are normalized *before the
13
+ line-by-line diff is rendered*. It is entirely separate from
14
+ link:../../features/match-options/index.adoc[comparison preprocessing]
15
+ (`FormatConfig#preprocessing`), which controls normalization for equivalence
16
+ detection.
17
+
18
+ The motivation is straightforward: line-by-line diffing only produces
19
+ meaningful output when both sides have identical structural formatting. If one
20
+ document is deeply indented and multi-line and the other is compact/inline, the
21
+ LCS algorithm will match lines by text content across structurally unrelated
22
+ regions of the document, producing misleading output. By pretty-printing both
23
+ sides through the same formatter first, the diff reflects only genuine content
24
+ differences.
25
+
26
+ == How it works
27
+
28
+ [source]
29
+ ----
30
+ COMPARISON PHASE (unchanged)
31
+ Expected ──► Comparison preprocessing ──► Semantic equivalence check
32
+ Actual ──► Semantic Diff Report
33
+
34
+ DISPLAY PHASE (new)
35
+ Expected ──► Display preprocessing ──► Format-specific by-line formatter ──► line diff
36
+ Actual ──► line diff
37
+ ----
38
+
39
+ The comparison result (semantic diff report, equivalence determination) is
40
+ never affected by display preprocessing. Only the visual line-by-line diff
41
+ section changes.
42
+
43
+ == Two-layer display pipeline
44
+
45
+ There are two independent formatting layers in the by-line display path.
46
+ Understanding both is essential to predicting the output.
47
+
48
+ === Layer 1 — Format-specific by-line formatter (always active)
49
+
50
+ Every format has its own by-line formatter that is *always* applied, regardless
51
+ of whether display preprocessing is enabled:
52
+
53
+ * **XML** — Splits the serialized string at `><` boundaries (one tag per
54
+ line), then applies character visualization (e.g. ASCII space → `░`,
55
+ tab → `⇥`, U+00A0 → `␣`) to every content character in the output.
56
+ * **HTML** — Similar to XML.
57
+ * **JSON** — Calls `Canon.format(doc, :json)` (re-serializes via
58
+ `JSON.pretty_generate`), producing consistent 2-space indented JSON.
59
+ * **YAML** — Calls `Canon.format(doc, :yaml)`.
60
+
61
+ This layer runs on the strings that arrive at the by-line formatter, whatever
62
+ their source.
63
+
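The XML behaviour described for Layer 1 can be approximated in a few lines. The sketch below is an assumption-laden illustration rather than Canon's formatter: `by_line_sketch` and `VISUALIZATION` are hypothetical names, and the map contains only the three characters mentioned above. It also visualizes every matching character on the line, including spaces inside tags, which the real formatter may treat differently.

```ruby
# Hypothetical subset of the visualization map described above.
VISUALIZATION = { " " => "░", "\t" => "⇥", "\u00A0" => "␣" }.freeze

# Splits serialized XML at "><" boundaries (one tag per line), then
# substitutes visible placeholders for whitespace characters.
def by_line_sketch(xml)
  xml.gsub("><", ">\n<")
     .lines(chomp: true)
     .map { |line| line.gsub(/[ \t\u00A0]/, VISUALIZATION) }
end
```

Because the split is purely textual, two documents with different original line breaks can still produce differently shaped output, which is the gap Layer 2 is meant to close.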
64
+ === Layer 2 — Display preprocessing (optional, configured)
65
+
66
+ Display preprocessing, controlled by `display_preprocessing`, is applied
67
+ *before* the documents reach Layer 1.
68
+
69
+ `:none` (default)::
70
+ No extra processing. The documents arrive at the by-line formatter as-is,
71
+ and Layer 1 applies its own reformatting. This preserves the existing
72
+ behaviour prior to this feature.
73
+
74
+ `:pretty_print`::
75
+ Both documents are run through a format-specific pretty-printer *before*
76
+ reaching the by-line formatter. For XML, this uses `Canon::PrettyPrinter::Xml`
77
+ (Nokogiri `to_xml` with `noblanks`). For HTML, this uses
78
+ `Canon::PrettyPrinter::Html` (Nokogiri HTML5 serializer). Both replace
79
+ Layer 1's simple `><`-split with a proper Nokogiri-based pretty-print,
80
+ producing structurally consistent indentation on both sides. For JSON and
81
+ YAML, this is a pass-through (Layer 1 handles those formats adequately).
82
+ +
83
+ NOTE: `:pretty_print` has a fundamental limitation for markup with mixed
84
+ content (elements containing both text and child elements): standard
85
+ pretty-printers keep such elements on a single line because inserting newlines
86
+ would add whitespace text nodes, changing the document's semantic content.
87
+ See <<_mixed_content_and_normalize_pretty_print>> for the solution.
88
+
89
+ `:normalize_pretty_print`::
90
+ Like `:pretty_print` but uses `Canon::PrettyPrinter::XmlNormalized`, a
91
+ custom recursive DOM serializer that guarantees one line per XML node even
92
+ for mixed-content elements. See <<_mixed_content_and_normalize_pretty_print>>
93
+ for the full rationale and algorithm.
94
+
95
+ `:c14n`::
96
+ Both documents are run through a canonical normalization pass before reaching
97
+ the by-line formatter. For XML, this uses canonical XML (C14N), which sorts
98
+ attributes alphabetically and normalizes namespace declarations. For HTML,
99
+ this uses Nokogiri HTML5 serialization, which produces a consistent canonical
100
+ form of the HTML5 DOM (normalizing attribute order and whitespace).
101
+
102
+ === Effect on character visualization
103
+
104
+ Character visualization is applied by Layer 1 (the by-line formatter) to the
105
+ output lines. Because `Canon::PrettyPrinter::Xml` introduces only ASCII
106
+ U+0020 spaces and U+000A newlines for structural indentation, and because
107
+ those characters *are* in Canon's visualization map (space → `░`,
108
+ newline → invisible), the pretty-printer's structural indentation is
109
+ visualized as `░░` (2 spaces at default indent) in the diff output.
110
+
111
+ This is correct and intentional: the `░` characters in context lines represent
112
+ the structural indentation that was introduced by the pretty-printer, making
113
+ the depth of the tree visually apparent in the diff.
114
+
115
+ NOTE: If the visualization map is extended to cover additional characters, the
116
+ two-layer architecture remains valid as long as those characters do not appear
117
+ in the pretty-printer's structural output. See the
118
+ <<_future_extensibility_constraint>> section below.
119
+
120
+ == Configuration
121
+
122
+ === `display_preprocessing` values
123
+
124
+ `:none` (default):: Documents are used as-is for the line diff (preserves
125
+ existing behaviour).
126
+
127
+ `:pretty_print`:: Both documents are run through `Canon::PrettyPrinter::Xml`
128
+ before the line diff. This produces one element per line with consistent
129
+ indentation, so the LCS algorithm matches lines by semantic position rather
130
+ than by coincidental textual similarity. Does not handle mixed content (see
131
+ `:normalize_pretty_print`).
132
+
133
+ `:normalize_pretty_print`:: Both documents are run through
134
+ `Canon::PrettyPrinter::XmlNormalized` before the line diff. Like
135
+ `:pretty_print` but guarantees one line per XML node even for mixed-content
136
+ elements, and visualizes boundary content whitespace using the character map.
137
+ Recommended for XML with inline markup (e.g. Metanorma output).
138
+
139
+ `:c14n`:: Both documents are run through XML C14N normalization before the
140
+ line diff.
141
+
142
+ === Setting in `spec_helper.rb`
143
+
144
+ [source,ruby]
145
+ ----
146
+ Canon::Config.instance.tap do |cfg|
147
+   cfg.xml.diff.display_preprocessing = :normalize_pretty_print
147
+   cfg.xml.diff.pretty_printer.indent = 2 # default: 2
148
+   cfg.xml.diff.pretty_printer.indent_type = :space # :space or :tab
149
+   cfg.xml.diff.collapse_whitespace_elements = %w[p li td th] # opt-in for mixed-content elements
151
+ end
152
+ ----
153
+
154
+ === Environment variables
155
+
156
+ [source,bash]
157
+ ----
158
+ export CANON_XML_DIFF_DISPLAY_PREPROCESSING=normalize_pretty_print
159
+ export CANON_XML_DIFF_PRETTY_PRINTER_INDENT=2
160
+ export CANON_XML_DIFF_PRETTY_PRINTER_INDENT_TYPE=space
161
+ export CANON_XML_COLLAPSE_WHITESPACE_ELEMENTS=p,li,td,th
162
+ ----
163
+
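Environment variables arrive as strings, so values such as `normalize_pretty_print`, `2`, and `p,li,td,th` have to be converted before they can stand in for the Ruby settings shown earlier. The sketch below shows one plausible conversion. `convert_env_value` and its type names are hypothetical, and Canon's own type converter may behave differently.

```ruby
# Hypothetical converter from raw environment strings to Ruby values.
def convert_env_value(raw, type)
  case type
  when :symbol  then raw.strip.to_sym
  when :integer then Integer(raw.strip, 10)
  when :boolean then %w[1 true yes].include?(raw.strip.downcase)
  when :list    then raw.split(",").map(&:strip).reject(&:empty?)
  else raw
  end
end
```

For example, `convert_env_value("p,li,td,th", :list)` yields the same array as `%w[p li td th]` in the Ruby configuration.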
164
+ == Pretty-printer format
165
+
166
+ `Canon::PrettyPrinter::Xml` uses Nokogiri's `to_xml` with `noblanks` (to strip
167
+ blank text nodes) and consistent indentation. The result is one element per
168
+ line with child elements indented relative to their parent.
169
+
170
+ .Input (mixed formatting)
171
+ [example]
172
+ ====
173
+ [source,xml]
174
+ ----
175
+ <bibitem id="ISO712"><biblio-tag>ISO&#xA0;712, </biblio-tag><formattedref><em>Cereals</em>.</formattedref></bibitem>
176
+ ----
177
+ ====
178
+
179
+ .After `:pretty_print` (indent: 2)
180
+ [example]
181
+ ====
182
+ [source,xml]
183
+ ----
184
+ <bibitem id="ISO712">
185
+   <biblio-tag>ISO&#xA0;712, </biblio-tag>
186
+   <formattedref>
187
+     <em>Cereals</em>.
188
+   </formattedref>
189
+ </bibitem>
190
+ ----
191
+ ====
192
+
193
+ With both sides formatted this way, a line diff of two versions of the same
194
+ element produces clean, contextual output:
195
+
196
+ [source]
197
+ ----
198
+ <bibitem id="ISO712">
199
+ < <biblio-tag>ISO\u00a0712, </biblio-tag>
200
+ > <biblio-tag>ISO&#xA0;712, </biblio-tag>
201
+ <formattedref>
202
+ <em>Cereals and cereal products</em>.
203
+ </formattedref>
204
+ ----
205
+
206
+ [[_mixed_content_and_normalize_pretty_print]]
207
+ == Mixed content and `:normalize_pretty_print`
208
+
209
+ === The problem
210
+
211
+ Standard XML/HTML pretty-printers — including Nokogiri's built-in
212
+ `to_xml(indent: N)` — always keep mixed-content elements (those containing
213
+ both text and child elements) on a single line. They have no choice: inserting
214
+ a newline between `<p>See ` and `<xref.../>` would create a new whitespace
215
+ text node in the document, changing its semantic content.
216
+
217
+ This causes two related problems for line-by-line diffs:
218
+
219
+ . **Structural non-uniformity**: if the expected document (e.g. a hand-written
220
+ test fixture) has formatting whitespace inside a `<p>` and the actual
221
+ document (e.g. Metanorma compact XML output) does not, the two sides
222
+ serialize differently even when the content is the same. The diff shows
223
+ many spurious lines of insertions and deletions instead of a single
224
+ meaningful change.
225
+
226
+ . **1000-character lines**: when a paragraph contains many inline elements
227
+ (formulae, cross-references, emphasis runs), the entire `<p>` element
228
+ collapses onto one line. A change anywhere inside the paragraph marks the
229
+ whole line as changed; there is no way to see which part differs. This is
230
+ the pathological case documented in
231
+ https://github.com/lutaml/canon/issues/53[issue #53].
232
+
233
+ === Why stripping whitespace does not work
234
+
235
+ The obvious workaround — strip all inter-element whitespace, then
236
+ pretty-print — creates a different problem. Stripping whitespace text nodes
237
+ changes the document. In `<p>See <strong>note</strong> for details</p>`, the
238
+ spaces before and after `<strong>` are part of the rendered output: removing
239
+ them renders the text as "Seenotefor details". Moreover, a document that had those
240
+ spaces and one that did not would serialize identically after stripping, hiding
241
+ a real content difference.
242
+
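The information loss described above is easy to reproduce with plain strings. In this sketch, `strip_boundary_whitespace` is a hypothetical helper that removes whitespace adjacent to tags. It stands in for any "strip, then pretty-print" preprocessing and is not part of Canon.

```ruby
# Hypothetical helper: removes whitespace directly before "<" and after ">".
def strip_boundary_whitespace(xml)
  xml.gsub(/\s+</, "<").gsub(/>\s+/, ">")
end

with_spaces    = "<p>See <strong>note</strong> for details</p>"
without_spaces = "<p>See<strong>note</strong>for details</p>"

# Two documents with different content become indistinguishable.
SAME_AFTER_STRIP =
  strip_boundary_whitespace(with_spaces) == strip_boundary_whitespace(without_spaces)
```

`SAME_AFTER_STRIP` is `true` even though the two inputs render differently, which is exactly the hidden content difference the paragraph warns about.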
243
+ === The solution: two categories of whitespace
244
+
245
+ `Canon::PrettyPrinter::XmlNormalized` distinguishes two kinds of whitespace at
246
+ serialization time:
247
+
248
+ **Structural whitespace**:: Indentation characters *introduced by the
249
+ serializer itself*. These do not exist in the source document. Emitted as
250
+ ordinary ASCII space and newline characters.
251
+
252
+ **Content whitespace**:: Whitespace that exists as text-node content *in the
253
+ source document* — the spaces and newlines between inline elements. Visualized
254
+ using the character map (U+0020 → `░`, U+000A → `↵`, etc.) and placed at the
255
+ *end of the preceding line* rather than the beginning of the next.
256
+
257
+ The invariant is: **every XML element starts on its own line, always.**
258
+ Content whitespace at element boundaries is visible and unambiguously distinct
259
+ from the structural indentation.
260
+
261
+ === Algorithm
262
+
263
+ For each element in a recursive DOM walk:
264
+
265
+ Element-only children (no text siblings)::
266
+ Standard indented pretty-print. One child per line.
267
+
268
+ Pure-text element (text content, no child elements)::
269
+ Single line: `<tag>content</tag>`. Internal spaces are left unchanged
270
+ (they are inside a text node with no element siblings and are unambiguously
271
+ content).
272
+
273
+ Mixed content (text nodes and element children both present)::
274
+ Processed in document order:
275
+ +
276
+ * *Pure-whitespace text node between elements* (e.g. `\n ` from a
277
+ hand-indented fixture): visualized and appended to the *end* of the current
278
+ line (the preceding element or opening tag). No new line is created.
279
+ * *Text node with printable content* (e.g. `See ` before a cross-reference):
280
+ leading whitespace visualized and appended to the current line (then the
281
+ current line is flushed); printable content placed on its own indented line;
282
+ trailing whitespace visualized and appended to the content line.
283
+ * *Element child*: the current accumulated line is flushed; element serialized
284
+ recursively at depth+1.
285
+
286
+ === Example: compact Metanorma XML vs. hand-indented fixture
287
+
288
+ Input (compact XML, from Metanorma):
289
+ [source,xml]
290
+ ----
291
+ <p id="A">See <xref target="M" id="_"/><semx element="xref" source="_">note</semx></p>
292
+ ----
293
+
294
+ After `:normalize_pretty_print`:
295
+ [source]
296
+ ----
297
+ <p id="A">
298
+   See░
299
+   <xref target="M" id="_"/>
300
+   <semx element="xref" source="_">note</semx>
301
+ </p>
302
+ ----
303
+
304
+ (`░` = U+0020 space from the document — the trailing space after "See")
305
+
306
+ Input (hand-indented fixture):
307
+ [source,xml]
308
+ ----
309
+ <p id="A">
310
+   See
311
+   <xref target="M" id="_"/>
312
+ </p>
313
+ ----
314
+
315
+ After `:normalize_pretty_print`:
316
+ [source]
317
+ ----
318
+ <p id="A">↵░░
319
+   See↵░░
320
+   <xref target="M" id="_"/>↵
321
+ </p>
322
+ ----
323
+
324
+ (`↵` = U+000A newline; `░░` = two U+0020 spaces — all from the document, all visualized)
325
+
326
+ Diff between the two (2 changed lines instead of 7+):
327
+ [source]
328
+ ----
329
+ - <p id="A">↵░░
330
+ + <p id="A">
331
+ - See↵░░
332
+ + See░
333
+ <xref target="M" id="_"/>
334
+ ----
335
+
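The algorithm and the two inputs above can be sketched over a toy DOM. This is a simplified illustration, not the actual `XmlNormalized` code. The `Node` struct, `serialize`, and the three-character map are hypothetical, text nodes are plain strings, and per-element whitespace classification is ignored.

```ruby
# Hypothetical minimal DOM: an element is a Node, a text node is a String.
Node = Struct.new(:name, :attrs, :children)

CONTENT_WS = { " " => "░", "\n" => "↵", "\t" => "⇥" }.freeze

# Content whitespace is replaced with visible placeholders.
def visualize(whitespace)
  whitespace.gsub(/[ \n\t]/, CONTENT_WS)
end

def open_tag(node, self_close: false)
  attrs = node.attrs.map { |key, value| %( #{key}="#{value}") }.join
  "<#{node.name}#{attrs}#{self_close ? '/' : ''}>"
end

# Returns an array of lines; the last entry is the "current line"
# to which boundary whitespace is appended.
def serialize(node, depth = 0, indent = 2)
  pad = " " * (indent * depth)
  kids = node.children
  return ["#{pad}#{open_tag(node, self_close: true)}"] if kids.empty?
  # Pure-text element: one line, internal spaces untouched.
  return ["#{pad}#{open_tag(node)}#{kids.join}</#{node.name}>"] if kids.all?(String)

  lines = ["#{pad}#{open_tag(node)}"]
  child_pad = " " * (indent * (depth + 1))
  kids.each do |kid|
    if kid.is_a?(Node)
      # Element child: always starts on its own line.
      lines.concat(serialize(kid, depth + 1, indent))
    elsif kid.strip.empty?
      # Whitespace-only text: visualized at the end of the current line.
      lines.last << visualize(kid)
    else
      # Printable text: leading whitespace ends the previous line,
      # trailing whitespace ends the content line.
      lines.last << visualize(kid[/\A\s*/])
      lines << "#{child_pad}#{kid.strip}#{visualize(kid[/\s*\z/])}"
    end
  end
  lines << "#{pad}</#{node.name}>"
end
```

Structural indentation is emitted as ordinary spaces, while whitespace that came from the document is visualized, so the two are never confused in the output.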
336
+ === Whitespace in practice: Metanorma compact XML
337
+
338
+ Metanorma emits compact XML with no inter-element whitespace. Every
339
+ whitespace character in a Metanorma output text node is real document content.
340
+ The serializer's boundary-whitespace visualization therefore accurately
341
+ reflects the document — there are no structural indentation artefacts to
342
+ suppress.
343
+
344
+ When comparing against hand-written fixture strings (test heredocs), the
345
+ fixture often has indentation whitespace inside mixed-content elements. That
346
+ whitespace appears as visualized characters (e.g. `↵░░`) on the expected side
347
+ and as plain `░` on the actual side. This is correct and intentional: the
348
+ whitespace difference is visible, but it does not obscure the structural diff
349
+ the way a compact single line does.
350
+
351
+ === Why a new serializer is required
352
+
353
+ Nokogiri's `to_xml(indent: N)` is implemented in C and provides no hook into
354
+ mixed-content handling. The only way to implement the boundary-visualization
355
+ algorithm is a recursive Ruby-level DOM walk that builds the output string
356
+ directly. `PrettyPrinter::XmlNormalized` is that serializer.
357
+
358
+ === When to use `:normalize_pretty_print`
359
+
360
+ Use `:normalize_pretty_print` when:
361
+
362
+ * The documents contain inline markup (cross-references, emphasis, formulae)
363
+ — typical of Metanorma, DocBook, DITA, or similar technical documentation
364
+ XML formats.
365
+ * You need to diagnose differences inside `<p>` or similar mixed elements
366
+ without wading through 1000-character compact lines.
367
+ * The expected (fixture) and actual (generated) sides may have different
368
+ inter-element whitespace conventions (hand-indented vs. compact).
369
+
370
+ Use `:pretty_print` when documents are element-only (no mixed content) and
371
+ both sides come from the same serializer.
372
+
373
+ == Character visualization and display preprocessing
374
+
375
+ === How they interact
376
+
377
+ Character visualization (substituting special characters with visible
378
+ placeholders) is applied by the by-line formatter (Layer 1) to the *output
379
+ lines*, after display preprocessing (Layer 2) has already run.
380
+
381
+ Canon's default visualization map includes ASCII U+0020 space (`░`) and
382
+ U+0009 tab (`⇥`). Because the pretty-printer introduces U+0020 spaces for
383
+ structural indentation, those indentation spaces *are* visualized as `░` in
384
+ the diff output. This is intentional: the `░` characters in context lines
385
+ reflect the structural depth of each element, making the tree structure
386
+ visually apparent.
387
+
388
+ Special content characters in text nodes — such as U+00A0 NO-BREAK SPACE
389
+ (`␣`) — are also visualized, maintaining their distinction from regular
390
+ structural indentation in the output.
391
+
392
+ === Future extensibility constraint
393
+
394
+ IMPORTANT: The current approach visualizes *all* characters in the output
395
+ lines, including both content characters and the pretty-printer's structural
396
+ indentation. This is intentional for the standard visualization map, where
397
+ structural indentation (U+0020 spaces, U+0009 tabs) provides useful visual
398
+ depth information in the diff.
399
+
400
+ However, if Canon's visualization map is extended to include characters that
401
+ should be distinct between *structural* use and *content* use — for example,
402
+ if U+0020 should be visualized differently when it appears in text content
403
+ versus as indentation — the current post-serialization approach cannot make
404
+ that distinction.
405
+
406
+ Should that become necessary, character visualization will need to be moved
407
+ to a DOM-level pass applied *before* serialization:
408
+
409
+ . Parse the document to a DOM.
410
+ . Walk all text nodes; apply character substitutions only to text node values.
411
+ . Serialize with the pretty-printer.
412
+
413
+ This would guarantee that visualization applies only to characters that were
414
+ in the original content, not to structural whitespace introduced by the
415
+ pretty-printer. This is a known future work item.
416
+
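The three steps can be sketched on a toy tree to show why the ordering matters. Everything here is hypothetical (`Node`, `TEXT_MAP`, `visualize_text_nodes`, and the naive `pretty` serializer) and is meant only to illustrate a DOM-level pass, not a planned Canon API.

```ruby
# Hypothetical toy DOM: elements are Nodes, text nodes are Strings.
Node = Struct.new(:name, :children)

TEXT_MAP = { " " => "░", "\u00A0" => "␣" }.freeze

# Step 2: substitute characters in text-node values only.
def visualize_text_nodes(node)
  return node.gsub(/[ \u00A0]/, TEXT_MAP) if node.is_a?(String)

  Node.new(node.name, node.children.map { |child| visualize_text_nodes(child) })
end

# Step 3: a naive serializer; its indentation is added after
# visualization and therefore stays as plain spaces.
def pretty(node, depth = 0)
  pad = "  " * depth
  return ["#{pad}#{node}"] if node.is_a?(String)

  ["#{pad}<#{node.name}>",
   *node.children.flat_map { |child| pretty(child, depth + 1) },
   "#{pad}</#{node.name}>"]
end
```

With this ordering, a space inside a text node becomes `░` while the serializer's own indentation remains an ordinary space, which is the distinction the post-serialization approach cannot make.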
417
+ == Format support
418
+
419
+ [cols="1,1,1,1,3"]
420
+ |===
421
+ |Format |`:pretty_print` |`:normalize_pretty_print` |`:c14n` |Notes
422
+
423
+ |XML
424
+ |✓ Full
425
+ |✓ Full
426
+ |✓ Full
427
+ |`:pretty_print` uses `Canon::PrettyPrinter::Xml`; `:normalize_pretty_print` uses `Canon::PrettyPrinter::XmlNormalized` (mixed-content aware); `:c14n` uses canonical XML (attribute-order sorting)
428
+
429
+ |HTML
430
+ |✓ Full
431
+ |✓ (via XML serializer)
432
+ |✓ Full
433
+ |`:pretty_print` uses `Canon::PrettyPrinter::Html`; `:normalize_pretty_print` falls back to `XmlNormalized` pending a dedicated `HtmlNormalized`; `:c14n` uses Nokogiri HTML5 serialization
434
+
435
+ |JSON
436
+ |Planned
437
+ |N/A
438
+ |N/A
439
+ |Will use `Canon::PrettyPrinter::Json` in a future release
440
+
441
+ |YAML
442
+ |Planned
443
+ |N/A
444
+ |N/A
445
+ |Will use `Canon::PrettyPrinter::Yaml` in a future release
446
+ |===
447
+
448
+ When a mode is requested for an unsupported format, the documents are
449
+ used as-is (same as `:none`).
450
+
451
+ == Known limitations
452
+
453
+ === HTML standalone format without DiffNodes
454
+
455
+ When `DiffFormatter#format` is called directly on HTML documents without
456
+ providing DiffNodes (i.e. using an empty differences array `[]`), the
457
+ line-by-line diff output contains only the header line — no content diff is
458
+ shown. This is a consequence of the legacy DOM path in `HtmlFormatter` (which
459
+ handles the no-DiffNodes case) having an incomplete implementation of
460
+ `LineRangeMapper` for the HTML root element.
461
+
462
+ *This limitation does not affect normal Canon usage.* The RSpec matcher
463
+ (`be_html_equivalent_to`) always supplies a `ComparisonResult` with DiffNodes,
464
+ which causes `HtmlFormatter` to take the `format_with_pipeline` path. That
465
+ path works correctly and produces full HTML content diffs.
466
+
467
+ The legacy DOM path is scheduled for removal (see
468
+ https://github.com/lutaml/canon/issues/84[issue #84]), after which
469
+ `HtmlFormatter` will exclusively use the DiffNodes pipeline.
470
+
471
+ === Trailing content-space before inline elements: mixed-content false positive
472
+
473
+ When a fixture (indented XML) contains a text node of the form `See \n <xref` —
474
+ that is, content text followed by a space and then a newline before an inline
475
+ child element — `XmlNormalized` serializes it as a single line ending in
476
+ `See ` (content text with a trailing space U+0020). The corresponding compact
477
+ XML has `See<xref`, whose text node is `See` (no trailing space). The diff
478
+ therefore reports a false positive:
479
+
480
+ ----
481
+ - See
482
+ + See░
483
+ ----
484
+
485
+ This cannot be suppressed automatically: the text node `See ` is not
486
+ whitespace-only (it contains the word "See"), so the trailing space is real
487
+ content that is left intact and visualized as `░`.
488
+
489
+ In Metanorma (and similar mixed-content XML vocabularies), a trailing space
490
+ before an inline child element and a newline+indent before the same element are
491
+ semantically equivalent. Canon has no application-specific knowledge with which
492
+ to establish that equivalence, so this display-only false positive is unavoidable.
493
+
494
+ *This false positive does not affect equivalence determination* — the semantic
495
+ comparison operates on the normalized DOM and is unaffected. It is a
496
+ presentation-only artefact of the display layer.
497
+
498
+ === Controlling whitespace visualization with element classification
499
+
500
+ `XmlNormalized` supports a three-way per-element whitespace classification
501
+ that governs how the pretty-printer renders whitespace in mixed-content
502
+ elements. Classification is *ancestor-based*: the closest matching ancestor
503
+ determines the class for all text nodes within it.
504
+
505
+ `preserve_whitespace_elements`::
506
+ Every whitespace character is significant and is visualized verbatim:
507
+ `" "` → `░`, `"\n "` → `↵░░`.
508
+ Suitable for preformatted elements such as `<pre>` or `<code>`.
509
+ XML default: none.
510
+ HTML default: `pre`, `code`, `textarea`, `script`, `style`.
511
+ Configure via `cfg.xml.diff.preserve_whitespace_elements = %w[pre code]`
512
+ or the environment variable `CANON_XML_PRESERVE_WHITESPACE_ELEMENTS=pre,code`.
513
+
514
+ `collapse_whitespace_elements`::
515
+ Whitespace presence matters but all forms are equivalent: both `" "`
516
+ and `"\n "` collapse to a single `░`.
517
+ Suitable for inline mixed-content elements where a word-boundary space
518
+ is real content (e.g. `<p>See <xref/>`) but its exact form is not.
519
+ XML default: none.
520
+ HTML default: `p`, `li`, `dt`, `dd`, `td`, `th`, `h1`–`h6`,
521
+ `caption`, `figcaption`, `label`, `legend`, `summary`,
522
+ `blockquote`, `address`, `button`.
523
+ Configure via `cfg.xml.diff.collapse_whitespace_elements = %w[p li td]`
524
+ or the environment variable `CANON_XML_COLLAPSE_WHITESPACE_ELEMENTS=p,li,td`.
525
+
526
+ Default (neither list — XML)::
527
+ All inter-element whitespace is dropped entirely. Both compact and
528
+ indented XML produce identical serialized lines — the spurious `↵░░░`
529
+ markers disappear. This is the correct mode for purely structural
530
+ containers such as `<bibdata>`.
531
+
532
+ Format defaults (HTML) can be overridden by listing an element in
533
+ `strip_whitespace_elements`, which acts as a blacklist and wins over any
534
+ sensitive ancestor:
535
+
536
+ [source,ruby]
537
+ ----
538
+ Canon::Config.configure do |cfg|
539
+   # Force <p> to be treated as insensitive despite the HTML default
540
+   cfg.html.diff.strip_whitespace_elements = %w[p]
541
+ end
542
+ ----
543
+
544
+ .Example: configure for Metanorma inline mixed-content (XML)
545
+ [source,ruby]
546
+ ----
547
+ Canon::Config.configure do |cfg|
548
+   cfg.xml.diff.display_preprocessing = :normalize_pretty_print
549
+   cfg.xml.diff.collapse_whitespace_elements = %w[p li td th]
550
+ end
551
+ ----
552
+
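One way to read the classification rules above is as a nearest-ancestor lookup followed by a per-class rendering rule. The sketch below is an interpretation under stated assumptions: `whitespace_class` and `render_whitespace` are hypothetical names, ancestors are passed as a plain array of element names, and the strip list is checked first at each level because the text says it wins.

```ruby
WS_MAP = { " " => "░", "\n" => "↵", "\t" => "⇥" }.freeze

# ancestors: element names from the root down to the text node's parent.
# The closest listed ancestor decides the class.
def whitespace_class(ancestors, preserve: [], collapse: [], strip: [])
  ancestors.reverse_each do |name|
    return :strip if strip.include?(name)
    return :preserve if preserve.include?(name)
    return :collapse if collapse.include?(name)
  end
  :strip # assumed XML default when no ancestor is listed
end

# Renders a whitespace-only text node according to its class.
def render_whitespace(text, klass)
  case klass
  when :preserve then text.gsub(/[ \n\t]/, WS_MAP)
  when :collapse then text.empty? ? "" : "░"
  else ""
  end
end
```

With `collapse: %w[p]`, both a single space and a newline plus indentation inside `<p>` render as one `░`, so compact and hand-indented inputs line up.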
553
+ == Comparison with comparison preprocessing
554
+
555
+ [cols="1,2,2"]
556
+ |===
557
+ |Aspect |Comparison preprocessing |Display preprocessing
558
+
559
+ |Setting
560
+ |`cfg.xml.preprocessing`
561
+ |`cfg.xml.diff.display_preprocessing`
562
+
563
+ |Purpose
564
+ |Normalize for equivalence detection
565
+ |Normalize for line-diff readability
566
+
567
+ |Affects equivalence?
568
+ |Yes
569
+ |No
570
+
571
+ |Default
572
+ |`:none`
573
+ |`:none`
574
+
575
+ |Format scope
576
+ |`FormatConfig`
577
+ |`DiffConfig`
578
+ |===
579
+
580
+ [[_pretty_printed_flags]]
581
+ == Pretty-printed document flags
582
+
583
+ === Purpose
584
+
585
+ The `pretty_printed_expected` and `pretty_printed_received` flags are companion
586
+ settings to `collapse_whitespace_elements` / `preserve_whitespace_elements` that
587
+ solve an asymmetric formatting problem: fixture files are typically
588
+ hand-indented heredocs, while received XML is compact.
589
+
590
+ For the full rationale, algorithm, and examples see the dedicated guide:
591
+ link:../match-options/pretty-printed-fixtures.adoc[Pretty-printed fixture support].
592
+
593
+ === Quick reference
594
+
595
+ `pretty_printed_expected`::
596
+ When `true`, whitespace-only text nodes that start with `"\n"` inside
597
+ `:collapse` elements are dropped from the **expected (fixture)** document
598
+ before it reaches the by-line diff. `:preserve` elements are always preserved.
599
+ Default: `false`.
600
+
601
+ `pretty_printed_received`::
602
+ Same treatment for the **received (actual)** document. Useful when the
603
+ received side is pretty-printed but the fixture is compact.
604
+ Default: `false`.
605
+
606
+ === Configuration
607
+
608
+ [source,ruby]
609
+ ----
610
+ Canon::Config.configure do |cfg|
611
+   cfg.xml.diff.display_preprocessing = :normalize_pretty_print
612
+   cfg.xml.diff.collapse_whitespace_elements = %w[fmt-title semx p]
613
+   cfg.xml.diff.pretty_printed_expected = true # fixture is indented
614
+   cfg.xml.diff.pretty_printed_received = false # received is compact
615
+ end
616
+ ----
617
+
618
+ [source,bash]
619
+ ----
620
+ export CANON_XML_DIFF_PRETTY_PRINTED_EXPECTED=true
621
+ export CANON_XML_DIFF_PRETTY_PRINTED_RECEIVED=false
622
+ ----
623
+
624
+ === Interaction with whitespace classification
625
+
626
+ The flags interact with the three-way whitespace classification exactly as
627
+ described in <<_controlling_whitespace_visualization_with_element_classification>>:
628
+
629
+ [cols="1,3"]
630
+ |===
631
+ |Classification |Effect when `pretty_printed_*` is `true`
632
+
633
+ |`:strip`
634
+ |No change — whitespace is already dropped for both sides
635
+
636
+ |`:collapse`
637
+ |Whitespace-only nodes starting with `"\n"` are dropped from the flagged side.
638
+ Nodes starting with a space (no initial `"\n"`) are kept and normalized.
639
+
640
+ |`:preserve`
641
+ |No change — every whitespace character is preserved regardless of flags
642
+
643
+ |===
644
+
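The table above can be restated as a small decision function for a single whitespace-only text node. This is a sketch of the documented behaviour, not Canon's code. `display_whitespace` is a hypothetical name, and "normalized" is assumed to mean "reduced to a single space".

```ruby
# Returns the whitespace that reaches the by-line diff for one
# whitespace-only text node, or nil when the node is dropped.
def display_whitespace(text, klass, pretty_printed: false)
  case klass
  when :preserve
    text # every character kept, regardless of the flag
  when :collapse
    return nil if pretty_printed && text.start_with?("\n")

    " " # assumed normalization: any remaining whitespace becomes one space
  end
  # :strip (and anything else) falls through to nil: already dropped
end
```

On the flagged side, an indentation node such as `"\n  "` inside a `:collapse` element disappears, while a genuine word-boundary space survives.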
645
+ == See also
646
+
647
+ * link:character-visualization.adoc[Character visualization] — how special
648
+ characters are rendered in diff output
649
+ * link:../match-options/pretty-printed-fixtures.adoc[Pretty-printed fixture support]
650
+ — full rationale, algorithm, and walkthrough examples
651
+ * link:../../reference/options-across-interfaces.adoc[Options across interfaces]
652
+ — Layer 1b display preprocessing table
653
+ * link:../../reference/environment-variables.adoc[Environment variables] —
654
+ `CANON_DISPLAY_PREPROCESSING` etc.
655
+ * link:../match-options/index.adoc[Match options] — comparison preprocessing
656
+ (`cfg.xml.preprocessing`)