canon 0.1.22 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (56)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +174 -25
  3. data/docs/INDEX.adoc +4 -0
  4. data/docs/advanced/diff-classification.adoc +3 -2
  5. data/docs/features/configuration-profiles.adoc +288 -0
  6. data/docs/features/diff-formatting/character-visualization.adoc +153 -454
  7. data/docs/features/diff-formatting/display-filtering.adoc +44 -0
  8. data/docs/features/diff-formatting/display-preprocessing.adoc +656 -0
  9. data/docs/features/diff-formatting/index.adoc +47 -0
  10. data/docs/features/diff-formatting/pretty-diff-mode.adoc +154 -0
  11. data/docs/features/environment-configuration/override-system.adoc +10 -3
  12. data/docs/features/index.adoc +9 -0
  13. data/docs/features/match-options/index.adoc +32 -42
  14. data/docs/features/match-options/pretty-printed-fixtures.adoc +270 -0
  15. data/docs/guides/choosing-configuration.adoc +22 -0
  16. data/docs/reference/environment-variables.adoc +121 -1
  17. data/docs/reference/options-across-interfaces.adoc +182 -2
  18. data/lib/canon/cli.rb +20 -0
  19. data/lib/canon/commands/diff_command.rb +7 -2
  20. data/lib/canon/commands/format_command.rb +1 -1
  21. data/lib/canon/comparison/html_comparator.rb +20 -15
  22. data/lib/canon/comparison/html_compare_profile.rb +4 -4
  23. data/lib/canon/comparison/markup_comparator.rb +12 -3
  24. data/lib/canon/comparison/match_options/base_resolver.rb +29 -7
  25. data/lib/canon/comparison/match_options/json_resolver.rb +9 -0
  26. data/lib/canon/comparison/match_options/xml_resolver.rb +16 -2
  27. data/lib/canon/comparison/match_options/yaml_resolver.rb +10 -0
  28. data/lib/canon/comparison/match_options.rb +4 -1
  29. data/lib/canon/comparison/whitespace_sensitivity.rb +189 -137
  30. data/lib/canon/comparison/xml_comparator/child_comparison.rb +21 -4
  31. data/lib/canon/comparison/xml_comparator.rb +14 -12
  32. data/lib/canon/comparison/xml_node_comparison.rb +51 -6
  33. data/lib/canon/comparison.rb +52 -9
  34. data/lib/canon/config/env_schema.rb +32 -4
  35. data/lib/canon/config/override_resolver.rb +16 -3
  36. data/lib/canon/config/profile_loader.rb +135 -0
  37. data/lib/canon/config/profiles/metanorma.yml +74 -0
  38. data/lib/canon/config/profiles/metanorma_debug.yml +8 -0
  39. data/lib/canon/config/type_converter.rb +8 -0
  40. data/lib/canon/config.rb +469 -5
  41. data/lib/canon/diff/diff_classifier.rb +41 -11
  42. data/lib/canon/diff_formatter/diff_detail_formatter/dimension_formatter.rb +48 -17
  43. data/lib/canon/diff_formatter/diff_detail_formatter/node_utils.rb +58 -0
  44. data/lib/canon/diff_formatter/diff_detail_formatter.rb +22 -7
  45. data/lib/canon/diff_formatter/theme.rb +24 -17
  46. data/lib/canon/diff_formatter.rb +493 -36
  47. data/lib/canon/pretty_printer/xml_normalized.rb +395 -0
  48. data/lib/canon/rspec_matchers.rb +36 -0
  49. data/lib/canon/tree_diff/matchers/hash_matcher.rb +26 -11
  50. data/lib/canon/version.rb +1 -1
  51. data/lib/canon/xml/nodes/namespace_node.rb +4 -0
  52. data/lib/canon/xml/nodes/processing_instruction_node.rb +4 -0
  53. data/lib/canon/xml/nodes/root_node.rb +4 -0
  54. data/lib/canon/xml/nodes/text_node.rb +4 -0
  55. data/lib/tasks/performance_helpers.rb +2 -2
  56. metadata +24 -2
data/docs/features/match-options/pretty-printed-fixtures.adoc ADDED
@@ -0,0 +1,270 @@
1
+ ---
2
+ title: Pretty-Printed Fixture Support
3
+ parent: Match Options
4
+ nav_order: 5
5
+ ---
6
+ = Pretty-printed fixture support
7
+ :toc:
8
+ :toclevels: 3
9
+
10
+ == Problem statement
11
+
12
+ A common pattern in Metanorma and similar XML-generating libraries is to compare
13
+ a *compact* generated document (no inter-element whitespace) against a
14
+ *hand-indented fixture* stored as an RSpec heredoc:
15
+
16
+ [source,ruby]
17
+ ----
18
+ expected_xml = <<~XML
19
+ <root>
20
+ <fmt-title>
21
+ <semx element="title" source="_">Foreword</semx>
22
+ </fmt-title>
23
+ </root>
24
+ XML
25
+
26
+ received_xml = "<root><fmt-title><semx element=\"title\" source=\"_\">Foreword</semx></fmt-title></root>"
27
+ ----
28
+
29
+ When `fmt-title` and `semx` are classified as `:collapse` whitespace elements
30
+ (their inline content matters but all whitespace forms are equivalent), the
31
+ comparison _correctly_ detects whitespace differences: the fixture has text nodes
32
+ `"\n "` and `"\n"` inside `<fmt-title>` that are absent from the compact
33
+ received document.
34
+
35
+ This is the right behaviour in general. However, for *fixture files that
36
+ cannot be written as compact single-line strings* — because doing so would
37
+ produce unreadable 1000-character lines — the indentation whitespace in the
38
+ fixture is a necessary formatting artefact, not a meaningful content difference.
39
+
40
+ === Why the whitespace cannot simply be stripped
41
+
42
+ The whitespace classification is intentionally asymmetric across elements:
43
+
44
+ * Structural container elements (e.g. `<clause>`, `<foreword>`) are
45
+ `:strip` — their inter-element whitespace is already dropped in both
46
+ compact and indented forms.
47
+ * Mixed-content elements (e.g. `<fmt-title>`, `<semx>`) are `:collapse` —
48
+ the _presence_ of a space between inline children is significant (it
49
+ represents a word-boundary), but its exact form is not.
50
+
51
+ Stripping all whitespace from the fixture would destroy the signal inside
52
+ mixed-content elements: `<p>See <em>note</em></p>` would become identical to
53
+ `<p>See<em>note</em></p>`, hiding a real content difference.
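The point can be illustrated with a small plain-Ruby sketch that contrasts two naive normalizations on the example strings. It uses string substitution only (no XML parser and no Canon API), and the helper names `strip_all` and `collapse` are hypothetical, chosen for illustration.

```ruby
# Illustrative only: plain string handling, not Canon's implementation.
WITH_SPACE    = "<p>See <em>note</em></p>"
WITHOUT_SPACE = "<p>See<em>note</em></p>"

# Hypothetical helper: remove every whitespace character.
def strip_all(xml)
  xml.gsub(/\s+/, "")
end

# Hypothetical helper: reduce each whitespace run to a single space.
def collapse(xml)
  xml.gsub(/\s+/, " ")
end

# Stripping erases the word boundary, so the two documents look identical.
puts strip_all(WITH_SPACE) == strip_all(WITHOUT_SPACE)   # true

# Collapsing keeps the presence of the space, so the difference survives.
puts collapse(WITH_SPACE) == collapse(WITHOUT_SPACE)     # false
```

This is why `:collapse` elements normalize the form of whitespace but never its presence.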
54
+
55
+ == Solution: asymmetric pretty-print flags
56
+
57
+ Canon provides two match options and two display-preprocessing flags that
58
+ instruct the relevant subsystem to treat whitespace-only text nodes that begin
59
+ with `"\n"` as *structural indentation from the pretty-printer* and drop them
60
+ from the comparison. They operate independently so that only the
61
+ actually-pretty-printed side is affected.
62
+
63
+ `pretty_printed_expected`::
64
+ When `true`, whitespace-only text nodes that start with `"\n"` in
65
+ `:collapse`-classified elements of the **expected** (first / fixture)
66
+ document are dropped before comparison and before display.
67
+ Use this when fixture files are indented heredocs but received XML is compact.
68
+
69
+ `pretty_printed_received`::
70
+ When `true`, the same treatment applies to the **received** (second / actual)
71
+ document. Use this when the received output may be pretty-printed but the
72
+ fixture is compact.
73
+
74
+ === The heuristic explained
75
+
76
+ The heuristic is:
77
+
78
+ [source]
79
+ ----
80
+ Whitespace-only text node starts with "\n" → structural indentation → drop
81
+ Whitespace-only text node does not start with "\n" → inline content space → keep
82
+ ----
83
+
84
+ The reasoning: compact Metanorma XML (and most programmatic XML serializers)
85
+ never emit a bare `"\n"` immediately after a closing or opening tag inside a
86
+ mixed-content element. If a fixture heredoc contains such a `"\n"`, it was
87
+ introduced by the human author formatting the fixture for readability — it is
88
+ not part of the original document content.
89
+
90
+ A _space_ not preceded by `"\n"` is treated as real content: in
91
+ `<p>See <em>note</em></p>`, the text node `"See "` starts with `"S"`, so its
92
+ trailing space is kept and normalized to a single `░` symbol in the display.
93
+
94
+ NOTE: This heuristic is only activated for `:collapse` elements. `:preserve`
95
+ elements always preserve every whitespace character regardless of these flags.
96
+ `:strip` elements already drop all whitespace and are unaffected.
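As a rough sketch, the heuristic can be written as a predicate over the text of a node. This is plain Ruby written from the description in this section, not Canon's actual code; `structural_indentation?` and `drop_structural_indentation` are hypothetical names, and the input is assumed to be the text nodes of a single `:collapse` element on the pretty-printed side.

```ruby
# Sketch of the heuristic as described in this section; not Canon's code.
# Hypothetical helper: true for whitespace-only text that begins with a newline.
def structural_indentation?(text)
  whitespace_only = text.match?(/\A\s+\z/)
  whitespace_only && text.start_with?("\n")
end

# Hypothetical filter: drop structural indentation from the text nodes of a
# :collapse element on the side flagged as pretty-printed.
def drop_structural_indentation(text_nodes)
  text_nodes.reject { |text| structural_indentation?(text) }
end

nodes = ["\n    ", "See ", " ", "\n  "]
p drop_structural_indentation(nodes)  # => ["See ", " "]
```

The inline space `" "` and the content node `"See "` survive because neither is a whitespace-only node starting with `"\n"`.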
97
+
98
+ == Configuration
99
+
100
+ === Comparison (equivalence detection)
101
+
102
+ Pass `pretty_printed_expected` or `pretty_printed_received` inside the `match:`
103
+ hash of `Canon::Comparison.equivalent?` or the RSpec `be_xml_equivalent_to`
104
+ matcher:
105
+
106
+ [source,ruby]
107
+ ----
108
+ # One-off: fixture is indented, received is compact
109
+ result = Canon::Comparison.equivalent?(
110
+ fixture_xml,
111
+ received_xml,
112
+ match: {
113
+ collapse_whitespace_elements: %w[fmt-title semx],
114
+ pretty_printed_expected: true,
115
+ }
116
+ )
117
+
118
+ # RSpec global config in spec_helper.rb
119
+ Canon::Config.configure do |cfg|
120
+ cfg.xml.match.options = {
121
+ collapse_whitespace_elements: %w[fmt-title semx],
122
+ pretty_printed_expected: true,
123
+ }
124
+ end
125
+ ----
126
+
127
+ === Display preprocessing (`:normalize_pretty_print`)
128
+
129
+ When using `display_preprocessing: :normalize_pretty_print`, set the
130
+ corresponding diff-config flags so that the display side also strips structural
131
+ indentation from only the flagged side:
132
+
133
+ [source,ruby]
134
+ ----
135
+ Canon::Config.configure do |cfg|
136
+ cfg.xml.diff.display_preprocessing = :normalize_pretty_print
137
+ cfg.xml.diff.collapse_whitespace_elements = %w[fmt-title semx]
138
+
139
+ # Drop structural indentation from the indented fixture side only
140
+ cfg.xml.diff.pretty_printed_expected = true
141
+ cfg.xml.diff.pretty_printed_received = false # compact received is unchanged
142
+ end
143
+ ----
144
+
145
+ === Environment variables
146
+
147
+ [source,bash]
148
+ ----
149
+ export CANON_XML_DIFF_PRETTY_PRINTED_EXPECTED=true
150
+ export CANON_XML_DIFF_PRETTY_PRINTED_RECEIVED=false
151
+ ----
152
+
153
+ == Behaviour by whitespace class
154
+
155
+ [cols="1,3,3,3"]
156
+ |===
157
+ |Element class |`pretty_printed_expected: false` (default) |`pretty_printed_expected: true`
158
+ |Note
159
+
160
+ |`:strip`
161
+ |All inter-element whitespace dropped (both sides)
162
+ |Same — no change
163
+ |Already dropped; flag has no effect
164
+
165
+ |`:collapse`
166
+ |`"\n "` nodes kept; compact vs. indented comparison detects difference
167
+ |`"\n"`-starting nodes dropped from expected; compact vs. indented now equivalent
168
+ |Inline content spaces (`" "` without `"\n"`) are always kept
169
+
170
+ |`:preserve`
171
+ |Every whitespace character compared verbatim
172
+ |Same — no change
173
+ |`:preserve` always preserves all whitespace regardless of flag
174
+
175
+ |===
176
+
177
+ == Example walkthrough
178
+
179
+ === Fixture (hand-indented heredoc)
180
+
181
+ [source,xml]
182
+ ----
183
+ <root>
184
+ <clause>
185
+ <fmt-title>
186
+ <semx element="title" source="_">Foreword</semx>
187
+ </fmt-title>
188
+ </clause>
189
+ </root>
190
+ ----
191
+
192
+ Text nodes inside `<fmt-title>` after Nokogiri parsing:
193
+
194
+ * `"\n "` between `<fmt-title>` open tag and `<semx>` — whitespace-only, starts with `"\n"`
195
+ * `"\n "` between `</semx>` and `</fmt-title>` — whitespace-only, starts with `"\n"`
196
+
197
+ === Received (compact Metanorma output)
198
+
199
+ [source,xml]
200
+ ----
201
+ <root><clause><fmt-title><semx element="title" source="_">Foreword</semx></fmt-title></clause></root>
202
+ ----
203
+
204
+ No whitespace nodes inside `<fmt-title>`.
205
+
206
+ === Without `pretty_printed_expected: true`
207
+
208
+ `clause` is `:strip`, so its inter-element whitespace is dropped.
209
+ `fmt-title` and `semx` are `:collapse`, so their whitespace nodes are kept.
210
+
211
+ The comparison finds two extra whitespace nodes in the expected side →
212
+ **not equivalent**.
213
+
214
+ === With `pretty_printed_expected: true`
215
+
216
+ The `"\n"`-starting nodes inside `:collapse` elements are dropped from the
217
+ expected side before comparison. The expected side now matches the compact
218
+ received side → **equivalent**.
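The walkthrough can be replayed with a toy model in plain Ruby. Everything here is an assumption made for illustration: each document is modelled as a hash from element name to its child list (strings are text nodes, symbols stand for child elements), the indentation widths are arbitrary, and `normalize` is a hypothetical helper rather than a Canon method. Collapsing the form of retained whitespace is left out because it does not affect this example.

```ruby
# Whitespace classes assumed for this walkthrough (taken from the text).
WHITESPACE_CLASS = {
  "clause" => :strip, "fmt-title" => :collapse, "semx" => :collapse
}.freeze

# Strings are text nodes; symbols stand for child elements.
EXPECTED = {
  "clause"    => ["\n    ", :"fmt-title", "\n  "],
  "fmt-title" => ["\n      ", :semx, "\n    "],
  "semx"      => ["Foreword"],
}.freeze
RECEIVED = {
  "clause"    => [:"fmt-title"],
  "fmt-title" => [:semx],
  "semx"      => ["Foreword"],
}.freeze

def whitespace_only?(node)
  node.is_a?(String) && node.match?(/\A\s+\z/)
end

# Hypothetical normalizer applying the per-class rules described in the text.
def normalize(children_by_element, pretty_printed:)
  children_by_element.to_h do |name, children|
    kept = children.reject do |node|
      next false unless whitespace_only?(node)

      case WHITESPACE_CLASS.fetch(name)
      when :strip    then true
      when :collapse then pretty_printed && node.start_with?("\n")
      else false # :preserve keeps every whitespace node
      end
    end
    [name, kept]
  end
end

received = normalize(RECEIVED, pretty_printed: false)
puts normalize(EXPECTED, pretty_printed: false) == received  # false
puts normalize(EXPECTED, pretty_printed: true)  == received  # true
```

Only the expected side is normalized with `pretty_printed: true`, mirroring the asymmetric design of the two flags.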
219
+
220
+ == Interaction with `:pretty_diff` mode
221
+
222
+ `:pretty_diff` mode applies `display_preprocessing` to both documents before
223
+ running a text-LCS diff on the resulting lines. When
224
+ `pretty_printed_expected: true` is set on the formatter, `XmlNormalized` is
225
+ instantiated with `pretty_printed: true` for the expected-side printer only.
226
+ This drops the `"\n"`-starting whitespace visualization from the expected
227
+ side's serialized lines, so the LCS diff sees identical lines for purely
228
+ structural indentation differences.
229
+
230
+ [source,ruby]
231
+ ----
232
+ Canon::DiffFormatter.new(
233
+ mode: :pretty_diff,
234
+ display_preprocessing: :normalize_pretty_print,
235
+ collapse_whitespace_elements: %w[fmt-title semx],
236
+ pretty_printed_expected: true, # strip structural \n nodes from fixture side
237
+ pretty_printed_received: false, # compact received side is unchanged
238
+ )
239
+ ----
240
+
241
+ == Relation to deprecated `normalize_pretty_print_ignore_structural_newlines`
242
+
243
+ The deprecated flag `normalize_pretty_print_ignore_structural_newlines` applied
244
+ the newline-stripping heuristic to **both** sides simultaneously and without
245
+ regard to whitespace classification.
246
+
247
+ The new `pretty_printed_expected` / `pretty_printed_received` flags replace it
248
+ with a more granular design:
249
+
250
+ [cols="3,3"]
251
+ |===
252
+ |Old flag |New equivalent
253
+
254
+ |`normalize_pretty_print_ignore_structural_newlines: true`
255
+ |`pretty_printed_expected: true, pretty_printed_received: true`
256
+ (plus `collapse_whitespace_elements:` to restrict to the right elements)
257
+ |===
258
+
259
+ The old flag is deprecated: it is still accepted but emits a warning. Use the new
260
+ flags in all new code.
261
+
262
+ == See also
263
+
264
+ * link:../../features/diff-formatting/display-preprocessing.adoc[Display preprocessing]
265
+ — the `display_preprocessing` option and `XmlNormalized` serializer
266
+ * link:index.adoc[Match options overview] — whitespace sensitivity classification
267
+ * link:../../reference/options-across-interfaces.adoc[Options across interfaces]
268
+ — cross-interface reference table
269
+ * link:../../reference/environment-variables.adoc[Environment variables]
270
+ — `CANON_XML_DIFF_PRETTY_PRINTED_EXPECTED`, `CANON_XML_DIFF_PRETTY_PRINTED_RECEIVED`
data/docs/guides/choosing-configuration.adoc CHANGED
@@ -679,8 +679,30 @@ When changing configuration:
679
679
  - [ ] Test with sample documents
680
680
  - [ ] Update documentation
681
681
 
682
+ == Using Configuration Profiles
683
+
684
+ If you use the same configuration across multiple gems or projects,
685
+ consider using a **configuration profile** instead of repeating settings.
686
+ A single line replaces dozens of configuration calls:
687
+
688
+ [source,ruby]
689
+ ----
690
+ # Instead of 60+ lines of per-format settings:
691
+ Canon::Config.instance.profile = :metanorma
692
+
693
+ # Or load a custom profile from a local YAML file:
694
+ Canon::Config.instance.profile = "config/my_canon_profile.yml"
695
+ ----
696
+
697
+ Profiles bundle all layers (preprocessing, match profile, diff settings,
698
+ whitespace element lists) into a named preset defined in YAML.
699
+ Custom file profiles can inherit from built-in profiles.
700
+
701
+ See link:../features/configuration-profiles.adoc[Configuration Profiles] for full documentation.
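One way to picture profile inheritance is a deep merge of a custom YAML document over a built-in preset. The sketch below is not Canon's loader: the `inherits` key, the contents of the `metanorma` preset, and the names `BUILT_IN`, `deep_merge`, and `resolve_profile` are all assumptions made for illustration. See the linked profile documentation for the real schema.

```ruby
require "yaml"

# Hypothetical built-in preset; the real metanorma.yml contents are not shown here.
BUILT_IN = {
  "metanorma" => {
    "xml" => { "diff" => { "display_preprocessing" => "normalize_pretty_print" } },
  },
}.freeze

# Recursively merge `override` on top of `base`; scalar values in `override` win.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Hypothetical loader: the `inherits` key is an assumption made for this sketch.
def resolve_profile(yaml_text)
  custom = YAML.safe_load(yaml_text) || {}
  parent = BUILT_IN.fetch(custom.delete("inherits"), {})
  deep_merge(parent, custom)
end

custom_yaml = <<~YAML
  inherits: metanorma
  xml:
    diff:
      pretty_printed_expected: true
YAML

p resolve_profile(custom_yaml)
```

The resolved hash carries both the inherited `display_preprocessing` value and the locally added `pretty_printed_expected` flag.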
702
+
682
703
  == See Also
683
704
 
705
+ * link:../features/configuration-profiles.adoc[Configuration Profiles] - Named config presets
684
706
  * link:../understanding/comparison-pipeline.adoc[Comparison Pipeline] - Understanding the 4 layers
685
707
  * link:../understanding/algorithms/[Algorithms] - Detailed algorithm documentation
686
708
  * link:../features/match-options/algorithm-specific-behavior.adoc[Algorithm-Specific Behavior] - How algorithms differ
data/docs/reference/environment-variables.adoc CHANGED
@@ -86,9 +86,129 @@ export CANON_JSON_FORMAT_PREPROCESSING=normalize
86
86
  |`CANON_MODE`
87
87
  |symbol
88
88
  |`:by_line`
89
- |Diff output mode: `by_line` or `by_object`
89
+ |Diff output mode: `by_line`, `by_object`, or `rspec`
90
90
  |All formats
91
91
 
92
+ |`CANON_DISPLAY_PREPROCESSING`
93
+ |symbol
94
+ |`:none`
95
+ |How documents are normalized _for display_ before the line diff: `none`, `pretty_print`, `normalize_pretty_print`, `c14n`. Format-specific: `CANON_{FORMAT}_DIFF_DISPLAY_PREPROCESSING`
96
+ |All formats
97
+
98
+ |`CANON_PRETTY_PRINTER_INDENT`
99
+ |integer
100
+ |`2`
101
+ |Indentation depth used by the pretty-printer when `display_preprocessing: :pretty_print`. Format-specific: `CANON_{FORMAT}_DIFF_PRETTY_PRINTER_INDENT`
102
+ |All formats (display only)
103
+
104
+ |`CANON_PRETTY_PRINTER_INDENT_TYPE`
105
+ |symbol
106
+ |`:space`
107
+ |Indentation character for the pretty-printer: `space` or `tab`. Format-specific: `CANON_{FORMAT}_DIFF_PRETTY_PRINTER_INDENT_TYPE`
108
+ |All formats (display only)
109
+
110
+ |`CANON_XML_DIFF_COLLAPSE_WHITESPACE_ELEMENTS`
111
+ |string array
112
+ |`[]`
113
+ |Comma-separated list of XML element names whose inter-element whitespace is *presence-significant but form-insensitive*: both `" "` and `"\n "` collapse to a single `░`. Suitable for inline mixed-content elements such as `<p>`, `<li>`, `<td>`. Only applies when `display_preprocessing: :normalize_pretty_print`. Format-specific; no global form.
114
+ |XML (display only)
115
+
116
+ |`CANON_XML_DIFF_PRESERVE_WHITESPACE_ELEMENTS`
117
+ |string array
118
+ |`[]`
119
+ |Comma-separated list of XML element names where every whitespace character is significant and visualized verbatim (`" "` → `░`, `"\n "` → `↵░░`). Suitable for preformatted elements such as `<pre>`, `<code>`. Only applies when `display_preprocessing: :normalize_pretty_print`. Format-specific; no global form.
120
+ |XML (display only)
121
+
122
+ |`CANON_XML_DIFF_STRIP_WHITESPACE_ELEMENTS`
123
+ |string array
124
+ |`[]`
125
+ |Comma-separated list of XML element names where whitespace-only text nodes are stripped entirely. Only applies when `display_preprocessing: :normalize_pretty_print`. Format-specific; no global form.
126
+ |XML (display only)
127
+
128
+ |`CANON_XML_DIFF_PRETTY_PRINTED_EXPECTED`
129
+ |boolean
130
+ |`false`
131
+ |When `true`, whitespace-only text nodes that start with `"\n"` inside `:collapse`-classified elements are dropped from the **expected (fixture)** document before it reaches the line diff. Solves the asymmetric case where the expected side is a hand-indented heredoc fixture but the received side is compact programmatic XML. Only applies when `display_preprocessing: :normalize_pretty_print`. Format-specific; no global form. See also link:../features/match-options/pretty-printed-fixtures.adoc[Pretty-printed fixture support].
132
+ |XML (display only)
133
+
134
+ |`CANON_XML_DIFF_PRETTY_PRINTED_RECEIVED`
135
+ |boolean
136
+ |`false`
137
+ |Same as `CANON_XML_DIFF_PRETTY_PRINTED_EXPECTED` but applied to the **received (actual)** document. Useful when the received output may be pretty-printed but the fixture is compact. Format-specific; no global form.
138
+ |XML (display only)
139
+
140
+ |`CANON_XML_DIFF_PRETTY_PRINTER_SORT_ATTRIBUTES`
141
+ |boolean
142
+ |`false`
143
+ |When `true`, attributes on each element are sorted by namespace URI then local name in the pretty-printed display. Eliminates spurious diff lines caused by differing attribute order between expected and received XML. Only applies when `display_preprocessing` is `:pretty_print` or `:normalize_pretty_print`. Format-specific; no global form.
144
+ |XML (display only)
145
+
146
+ |`CANON_SHOW_RAW_INPUTS`
147
+ |boolean
148
+ |`false`
149
+ |Show the raw (un-preprocessed) file contents of both sides before the diff output. Equivalent to enabling both `CANON_SHOW_RAW_EXPECTED` and `CANON_SHOW_RAW_RECEIVED`. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_RAW_INPUTS`
150
+ |All formats (display only)
151
+
152
+ |`CANON_SHOW_RAW_EXPECTED`
153
+ |boolean
154
+ |`false`
155
+ |Show only the EXPECTED (fixture) block in the raw-inputs section. Has no effect unless `show_raw_inputs` or `verbose_diff` is also set. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_RAW_EXPECTED`
156
+ |All formats (display only)
157
+
158
+ |`CANON_SHOW_RAW_RECEIVED`
159
+ |boolean
160
+ |`false`
161
+ |Show only the RECEIVED (actual) block in the raw-inputs section. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_RAW_RECEIVED`
162
+ |All formats (display only)
163
+
164
+ |`CANON_SHOW_PREPROCESSED_INPUTS`
165
+ |boolean
166
+ |`false`
167
+ |Show the preprocessed (post-comparison-preprocessing) contents of both sides before the diff output. Equivalent to enabling both `CANON_SHOW_PREPROCESSED_EXPECTED` and `CANON_SHOW_PREPROCESSED_RECEIVED`. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PREPROCESSED_INPUTS`
168
+ |All formats (display only)
169
+
170
+ |`CANON_SHOW_PREPROCESSED_EXPECTED`
171
+ |boolean
172
+ |`false`
173
+ |Show only the EXPECTED (fixture) block in the preprocessed-inputs section. Has no effect unless `show_preprocessed_inputs` or `verbose_diff` is also set. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PREPROCESSED_EXPECTED`
174
+ |All formats (display only)
175
+
176
+ |`CANON_SHOW_PREPROCESSED_RECEIVED`
177
+ |boolean
178
+ |`false`
179
+ |Show only the RECEIVED (actual) block in the preprocessed-inputs section. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PREPROCESSED_RECEIVED`
180
+ |All formats (display only)
181
+
182
+ |`CANON_SHOW_PRETTYPRINT_INPUTS`
183
+ |boolean
184
+ |`false`
185
+ |Show a fixture-ready pretty-printed form of **both** original input sides before the diff output. The output is formatted with one XML/HTML tag per line and proper indentation (using the configured `pretty_printer_indent` / `pretty_printer_indent_type`), but with **no character visualization** — whitespace appears as plain ASCII so the output can be copy-pasted directly into RSpec heredoc fixtures. Unlike `show_preprocessed_inputs`, this always pretty-prints the original strings regardless of the `preprocessing` or `display_preprocessing` settings. Equivalent to enabling both `CANON_SHOW_PRETTYPRINT_EXPECTED` and `CANON_SHOW_PRETTYPRINT_RECEIVED`. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PRETTYPRINT_INPUTS`
186
+ |All formats (display only)
187
+
188
+ |`CANON_SHOW_PRETTYPRINT_EXPECTED`
189
+ |boolean
190
+ |`false`
191
+ |Show only the EXPECTED (fixture) block in the fixture-ready pretty-printed section. Use this to see the current fixture re-formatted for copy-pasting when the fixture is the side that needs updating. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PRETTYPRINT_EXPECTED`
192
+ |All formats (display only)
193
+
194
+ |`CANON_SHOW_PRETTYPRINT_RECEIVED`
195
+ |boolean
196
+ |`false`
197
+ |Show only the RECEIVED (actual) block in the fixture-ready pretty-printed section. This is the most common fixture-update workflow: enable this option to get a copy-pasteable pretty-printed form of the generated output that can replace the old fixture heredoc. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PRETTYPRINT_RECEIVED`
198
+ |All formats (display only)
199
+
200
+ |`CANON_COMPACT_SEMANTIC_REPORT`
201
+ |boolean
202
+ |`false`
203
+ |Render element nodes in the Semantic Diff Report as compact inline XML (e.g. `<strong>Annex</strong>`) instead of the verbose `node_info` description string (e.g. `name: strong namespace_uri: …`). Useful for reading the actual XML markup in terminal output rather than a textual node description. Format-specific: `CANON_{FORMAT}_DIFF_COMPACT_SEMANTIC_REPORT`
204
+ |All formats (display only)
205
+
206
+ |`CANON_CHARACTER_VISUALIZATION`
207
+ |symbol
208
+ |`true`
209
+ |Replace invisible characters with visible Unicode symbols in diff output: `true`, `false`, or `content_only`. Format-specific: `CANON_{FORMAT}_DIFF_CHARACTER_VISUALIZATION`. Set to `false` to keep plain-text output.
210
+ |All formats (display only)
211
+
92
212
  |`CANON_USE_COLOR`
93
213
  |boolean
94
214
  |`true`
data/docs/reference/options-across-interfaces.adoc CHANGED
@@ -29,7 +29,7 @@ The following table shows how major Canon options are expressed in each interfac
29
29
  |===
30
30
  |Option |CLI Flag |Ruby API |RSpec Config |ENV Variable
31
31
 
32
- |Preprocessing
32
+ |Comparison Preprocessing
33
33
  |`--preprocessing normalize`
34
34
  |`preprocessing: :normalize`
35
35
  |`config.canon.xml.preprocessing = :normalize`
@@ -38,6 +38,111 @@ The following table shows how major Canon options are expressed in each interfac
38
38
 
39
39
  Preprocessing values: `none`, `c14n`, `normalize`, `format`
40
40
 
41
+ NOTE: This controls how documents are normalized _before comparison_ (equivalence detection). It is independent of display preprocessing, which controls how documents are formatted for the diff output.
42
+
43
+ === Layer 1b: Display Preprocessing Options
44
+
45
+ Display preprocessing controls how documents are normalized _before the line diff is rendered_. This is independent of comparison preprocessing. Because both sides go through the same normalization, the resulting line diff shows only genuine content differences rather than formatting differences.
46
+
47
+ [cols="2,2,3,3,3"]
48
+ |===
49
+ |Option |CLI Flag |Ruby API |RSpec Config |ENV Variable
50
+
51
+ |Display Preprocessing
52
+ |`--display-preprocessing pretty_print`
53
+ |`display_preprocessing: :pretty_print`
54
+ |`cfg.xml.diff.display_preprocessing = :pretty_print`
55
+ |`CANON_XML_DIFF_DISPLAY_PREPROCESSING=pretty_print`
56
+
57
+ |Pretty-printer Indent
58
+ |N/A
59
+ |N/A (config only)
60
+ |`cfg.xml.diff.pretty_printer.indent = 2`
61
+ |`CANON_XML_DIFF_PRETTY_PRINTER_INDENT=2`
62
+
63
+ |Pretty-printer Indent Type
64
+ |N/A
65
+ |N/A (config only)
66
+ |`cfg.xml.diff.pretty_printer.indent_type = :space`
67
+ |`CANON_XML_DIFF_PRETTY_PRINTER_INDENT_TYPE=space`
68
+
69
+ |Collapse Whitespace Elements
70
+ |N/A
71
+ |N/A (config only)
72
+ |`cfg.xml.match.collapse_whitespace_elements = %w[p li td th]`
73
+ |`CANON_XML_MATCH_COLLAPSE_WHITESPACE_ELEMENTS=p,li,td,th`
74
+
75
+ |Preserve Whitespace Elements
76
+ |N/A
77
+ |N/A (config only)
78
+ |`cfg.xml.match.preserve_whitespace_elements = %w[pre code]`
79
+ |`CANON_XML_MATCH_PRESERVE_WHITESPACE_ELEMENTS=pre,code`
80
+
81
+ |Pretty-printed Expected
82
+ |N/A
83
+ |`match: { pretty_printed_expected: true }`
84
+ |`cfg.xml.diff.pretty_printed_expected = true`
85
+ |`CANON_XML_DIFF_PRETTY_PRINTED_EXPECTED=true`
86
+
87
+ |Pretty-printed Received
88
+ |N/A
89
+ |`match: { pretty_printed_received: true }`
90
+ |`cfg.xml.diff.pretty_printed_received = true`
91
+ |`CANON_XML_DIFF_PRETTY_PRINTED_RECEIVED=true`
92
+
93
+ |Sort Attributes (pretty printer)
94
+ |N/A
95
+ |N/A (config only)
96
+ |`cfg.xml.diff.pretty_printer_sort_attributes = true`
97
+ |`CANON_XML_DIFF_PRETTY_PRINTER_SORT_ATTRIBUTES=true`
98
+
99
+ |Compact Semantic Report
100
+ |N/A
101
+ |N/A (config only)
102
+ |`cfg.xml.diff.compact_semantic_report = true`
103
+ |`CANON_XML_DIFF_COMPACT_SEMANTIC_REPORT=true`
104
+
105
+ |Expand Difference
106
+ |N/A
107
+ |N/A (config only)
108
+ |`cfg.xml.diff.expand_difference = true`
109
+ |`CANON_XML_DIFF_EXPAND_DIFFERENCE=true`
110
+ |===
111
+
112
+ Display preprocessing values: `none` (default), `pretty_print`, `normalize_pretty_print`, `c14n`
113
+
114
+ * `:none` — documents are used as-is for the line diff (existing behaviour)
115
+ * `:pretty_print` — both documents are run through a format-specific pretty-printer before the line diff: `Canon::PrettyPrinter::Xml` for XML (one tag per line, consistent indentation); `Canon::PrettyPrinter::Html` (Nokogiri HTML5 serializer) for HTML. Recommended for XML and HTML RSpec tests.
116
+ * `:normalize_pretty_print` — like `:pretty_print` but uses `Canon::PrettyPrinter::XmlNormalized`, which guarantees one line per XML node even for mixed-content elements (those containing both text and child elements). Recommended when XML contains inline markup (Metanorma, DocBook). Use `collapse_whitespace_elements` or `preserve_whitespace_elements` to control per-element whitespace visualization.
117
+ * `:c14n` — both documents are run through canonical normalization before the line diff. For XML: canonical XML (C14N, attribute-order sorting). For HTML: Nokogiri HTML5 serialization (normalizes attribute order and whitespace).
118
+
119
+ NOTE: Pretty-printer indent options only apply when `display_preprocessing` is `:pretty_print` or `:normalize_pretty_print`.
120
+
121
+ NOTE: `compact_semantic_report` controls the representation of XML element nodes in the *Semantic Diff Report* section (the structured summary that appears above the line diff). When `false` (default), element nodes are described with their `node_info` string (e.g. `name: strong namespace_uri: …`), which is unambiguous but verbose. When `true`, element nodes are serialized as compact inline XML (e.g. `<strong>Annex A</strong>`), which is much easier to read at a glance. Plain text nodes (leaf text content) are not affected — they always display their decoded string value.
122
+
123
+ === Layer 1c: Character Visualization
124
+
125
+ Character visualization controls whether invisible characters (spaces, tabs, non-breaking spaces, etc.) are replaced with visible Unicode symbols in diff output.
126
+
127
+ [cols="2,2,3,3,3"]
128
+ |===
129
+ |Option |CLI Flag |Ruby API |RSpec Config |ENV Variable
130
+
131
+ |Character Visualization
132
+ |N/A
133
+ |`character_visualization: true`
134
+ |`cfg.xml.diff.character_visualization = false`
135
+ |`CANON_XML_DIFF_CHARACTER_VISUALIZATION=false`
136
+ |===
137
+
138
+ Character visualization values:
139
+
140
+ * `true` — (default) the full default visualization map is applied; spaces appear as `░`, tabs as `⇥`, non-breaking spaces as `␣`, etc.
141
+ * `false` — visualization is disabled; all characters appear as plain text.
142
+ * `:content_only` — reserved for future use; currently behaves as `true`. Future intent: apply visualization only to text-node content, leaving structural indentation whitespace plain.
143
+
144
+ NOTE: Setting `character_visualization: false` is useful when copying diff output from a failure message, or when downstream tooling cannot handle Unicode symbol substitutions.
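The substitution itself can be sketched in a few lines of plain Ruby. The map below contains only the three symbols named in the list of values; the full default map is not reproduced here, and `VISUALIZATION_MAP` and `visualize` are hypothetical names, not part of Canon's API. Following the description, `:content_only` is treated the same as `true`.

```ruby
# Subset of a visualization map, using only the symbols named in the text.
VISUALIZATION_MAP = {
  " "      => "░", # space
  "\t"     => "⇥", # tab
  "\u00A0" => "␣", # non-breaking space
}.freeze

# Hypothetical helper: `mode` mirrors the documented values true, false,
# and :content_only (which currently behaves as true).
def visualize(text, mode = true)
  return text if mode == false

  text.gsub(Regexp.union(VISUALIZATION_MAP.keys), VISUALIZATION_MAP)
end

puts visualize("a b\tc")         # a░b⇥c
puts visualize("a b\tc", false)  # prints the text unchanged
```

Passing the hash as the second argument to `gsub` replaces each match with the value stored under that key, which keeps the mapping in one place.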
145
+
41
146
  === Layer 2: Algorithm Selection
42
147
 
43
148
  [cols="2,2,3,3,3"]
@@ -58,7 +163,7 @@ Preprocessing values: `none`, `c14n`, `normalize`, `format`
58
163
  |===
59
164
 
60
165
  Algorithm values: `dom`, `semantic` +
61
- Diff mode values: `by_line`, `by_object`
166
+ Diff mode values: `by_line`, `by_object`, `pretty_diff`
62
167
 
63
168
  === Layer 3: Match Options
64
169
 
@@ -195,6 +300,81 @@ Values: `strict`, `ignore`
195
300
  |`CANON_GROUPING_LINES=15`
196
301
  |===
197
302
 
303
+ === Debug Input Display Options
304
+
305
+ These options control whether Canon dumps the raw or preprocessed document content to the terminal alongside the diff. They are useful for debugging comparisons — e.g. inspecting what the received output actually looks like before filing an issue.
306
+
307
+ [cols="2,2,3,3,3"]
308
+ |===
309
+ |Option |CLI Flag |Ruby API |RSpec Config |ENV Variable
310
+
311
+ |Show Raw Inputs (both)
312
+ |`--show-raw-inputs`
313
+ |N/A (config only)
314
+ |`cfg.xml.diff.show_raw_inputs = true`
315
+ |`CANON_XML_DIFF_SHOW_RAW_INPUTS=true`
316
+
317
+ |Show Raw Expected only
318
+ |N/A
319
+ |N/A (config only)
320
+ |`cfg.xml.diff.show_raw_expected = true`
321
+ |`CANON_XML_DIFF_SHOW_RAW_EXPECTED=true`
322
+
323
+ |Show Raw Received only
324
+ |N/A
325
+ |N/A (config only)
326
+ |`cfg.xml.diff.show_raw_received = true`
327
+ |`CANON_XML_DIFF_SHOW_RAW_RECEIVED=true`
328
+
329
+ |Show Preprocessed Inputs (both)
330
+ |`--show-preprocessed-inputs`
331
+ |N/A (config only)
332
+ |`cfg.xml.diff.show_preprocessed_inputs = true`
333
+ |`CANON_XML_DIFF_SHOW_PREPROCESSED_INPUTS=true`
334
+
335
+ |Show Preprocessed Expected only
336
+ |`--show-preprocessed-expected`
337
+ |N/A (config only)
338
+ |`cfg.xml.diff.show_preprocessed_expected = true`
339
+ |`CANON_XML_DIFF_SHOW_PREPROCESSED_EXPECTED=true`
340
+
341
+ |Show Preprocessed Received only
342
+ |`--show-preprocessed-received`
343
+ |N/A (config only)
344
+ |`cfg.xml.diff.show_preprocessed_received = true`
345
+ |`CANON_XML_DIFF_SHOW_PREPROCESSED_RECEIVED=true`
346
+
347
+ |Show Pretty-printed Inputs (both)
348
+ |`--show-prettyprint-inputs`
349
+ |N/A (config only)
350
+ |`cfg.xml.diff.show_prettyprint_inputs = true`
351
+ |`CANON_XML_DIFF_SHOW_PRETTYPRINT_INPUTS=true`
352
+
353
+ |Show Pretty-printed Expected only
354
+ |`--show-prettyprint-expected`
355
+ |N/A (config only)
356
+ |`cfg.xml.diff.show_prettyprint_expected = true`
357
+ |`CANON_XML_DIFF_SHOW_PRETTYPRINT_EXPECTED=true`
358
+
359
+ |Show Pretty-printed Received only
360
+ |`--show-prettyprint-received`
361
+ |N/A (config only)
362
+ |`cfg.xml.diff.show_prettyprint_received = true`
363
+ |`CANON_XML_DIFF_SHOW_PRETTYPRINT_RECEIVED=true`
364
+
365
+ |Show Line-Numbered Inputs
366
+ |`--show-line-numbered-inputs`
367
+ |N/A (config only)
368
+ |`cfg.xml.diff.show_line_numbered_inputs = true`
369
+ |`CANON_XML_DIFF_SHOW_LINE_NUMBERED_INPUTS=true`
370
+ |===
371
+
372
+ NOTE: `show_raw_inputs`, `show_preprocessed_inputs`, and `show_prettyprint_inputs` are convenience shorthands for enabling both sides of their respective section simultaneously. Use the per-side variants (`*_expected` / `*_received`) when you only want to display one side — e.g. suppress the (long) expected fixture and show only what was generated.
373
+
374
+ NOTE: The *pretty-printed* section (`show_prettyprint_*`) differs from the *preprocessed* section (`show_preprocessed_*`) in two ways: (1) it always pretty-prints the **original** strings using `PrettyPrinter::Xml`/`Html`, independently of any `preprocessing` or `display_preprocessing` setting; and (2) it outputs **plain ASCII** with no character visualization (spaces remain spaces), making the output suitable for direct copy-paste into RSpec heredoc fixtures.
375
+
376
+ NOTE: `verbose_diff` enables the raw, preprocessed, and line-numbered debug sections simultaneously. It does **not** enable the pretty-printed section, which is a fixture-oriented feature independent of debugging verbosity.
377
+
198
378
  === Size Limit Options
199
379
 
200
380
  [cols="2,2,3,3,3"]
data/lib/canon/cli.rb CHANGED
@@ -238,6 +238,26 @@ module Canon
238
238
  type: :boolean,
239
239
  default: false,
240
240
  desc: "Show preprocessed contents (what was actually compared)"
241
+ method_option :show_preprocessed_expected,
242
+ type: :boolean,
243
+ default: false,
244
+ desc: "Show only the EXPECTED (fixture) block in the preprocessed-inputs section"
245
+ method_option :show_preprocessed_received,
246
+ type: :boolean,
247
+ default: false,
248
+ desc: "Show only the RECEIVED (actual) block in the preprocessed-inputs section"
249
+ method_option :show_prettyprint_inputs,
250
+ type: :boolean,
251
+ default: false,
252
+ desc: "Show fixture-ready pretty-printed form of both inputs (no whitespace visualization)"
253
+ method_option :show_prettyprint_expected,
254
+ type: :boolean,
255
+ default: false,
256
+ desc: "Show only the EXPECTED block in the fixture-ready pretty-printed section"
257
+ method_option :show_prettyprint_received,
258
+ type: :boolean,
259
+ default: false,
260
+ desc: "Show only the RECEIVED block in the fixture-ready pretty-printed section"
241
261
  method_option :show_line_numbered_inputs,
242
262
  type: :boolean,
243
263
  default: false,