canon 0.2.4 → 0.2.6

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: c24b7c541d6159a3c261d389c0d41b85f954cd4152d88d9ca4748d9a3ceb34ef
4
- data.tar.gz: 1de985c950b90c6979432b7b0bd1ed1b469240456fb7fd985a3d7f6929448b83
3
+ metadata.gz: 6437f1a8b556bb49bffbcecf47ec0eeecabdf6541bd5baa5954ac88f98f33a2c
4
+ data.tar.gz: 98eff2aa558165dc7e13c8d29da21d8d9c6589cae1d48a18d27f0420d6be7198
5
5
  SHA512:
6
- metadata.gz: 719eefd6be6d642503adb82e50609983fe9082ec8c7efe34c5e6cf27bfdc8065edc05b7ae75a959db8e5fe117f0ec67d71d81006d342a1c01f2846b4aa54b196
7
- data.tar.gz: 32a1bece85afd8265f158fdea547de08759773ba8a1e574ca72e42c79f6f59b02ed881cc4ba4bb78e54d135c9f4362100c8d409d2ee08b0eaa3561b13652296c
6
+ metadata.gz: 055614c143bca292b575755f5b4a1554a002e0d6f264ddee3e29049f89d6f9795a61069c1bdf7ebfa459ed11ac2a21203a779e115f8aee143a2aa3c77951a086
7
+ data.tar.gz: ff64c25654c1eef41dcc80b471df2958c516e69fa625723bdecfd18c5a99716e7f347c313bf72c28f3b898547d5c97aac3ebb62bcf44e726bc49e668a73e0dd9
data/README.adoc CHANGED
@@ -614,6 +614,10 @@ See link:docs/MODES[Diff modes] for details.
614
614
  * **Formatting diff detection**: Automatically detects and highlights purely cosmetic whitespace/line break differences
615
615
  * **Whitespace visualization**: Make invisible characters visible with CJK-safe
616
616
  Unicode symbols
617
+ * **Whitespace adjacency reporting**: Stray whitespace-only text nodes are
618
+ reported as a dedicated `:whitespace_adjacency` dimension with direction
619
+ wording (`before`/`after`/`adjacent to`) instead of cascading into
620
+ misleading `:text_content` mismatches
617
621
  * **Non-ASCII detection**: Warnings for unexpected Unicode characters
618
622
  * **Customizable**: Character maps, context lines, grouping options
619
623
 
@@ -266,6 +266,22 @@ match dimension but follows the same normative rule as all general dimensions:
266
266
  - Structural differences tracked but don't affect equivalence
267
267
  - Useful for content-only comparisons where wrapper elements don't matter
268
268
 
269
+ ==== Whitespace Adjacency
270
+
271
+ `:whitespace_adjacency` is a derived dimension — emitted when the
272
+ alignment walk pairs a whitespace-only text node on one side against a
273
+ content node on the other. The Reason line describes the whitespace's
274
+ direction relative to the partner content: `before`, `after`, or
275
+ `adjacent to`.
276
+
277
+ * **Always normative** — differences always affect equivalence
278
+ * **Not user-configurable** — dimension is always tracked when the
279
+ re-alignment walk encounters an asymmetric whitespace node
280
+ * **Report-only** — does not change equivalence outcomes compared to
281
+ pre-#137 behaviour; only changes the diff-report shape (see
282
+ link:../features/diff-formatting/whitespace-adjacency.adoc[Whitespace
283
+ adjacency] for details)
284
+
269
285
  .Example: Comment handling
270
286
  ====
271
287
  [source,ruby]
@@ -14,6 +14,39 @@ The Semantic Diff Report provides dimension-specific, actionable details for eac
14
14
 
15
15
  The report is automatically shown in verbose mode when differences exist, appearing before the detailed diff output.
16
16
 
17
+ == Parse errors
18
+
19
+ When Canon's underlying parser (libxml for XML, HTML5 for HTML) reports errors during input parsing, Canon surfaces them at the top of the diff report in a banner section before any per-difference output. The banner names the offending side and warns that the diff below describes the parsed tree, not the input — content the parser could not represent has been silently dropped from the comparison tree.
20
+
21
+ This is purely a transparency feature: Canon does not modify the parse to "fix" invalid input. The user is responsible for deciding whether the parse failure was expected (e.g. testing legacy fixtures during a migration) or symptomatic of an upstream bug.
22
+
23
+ .Example: Banner for a duplicate-attribute FATAL on the received side
24
+ [example]
25
+ ====
26
+ [source]
27
+ ----
28
+ ======================================================================
29
+ ⚠️ PARSE ERRORS
30
+ ======================================================================
31
+ Received side:
32
+ Attribute xml:lang redefined
33
+
34
+ ⚠️ The diff below describes the parsed tree, not the input.
35
+ Content that the parser could not represent has been
36
+ dropped and may appear as "missing" in the report.
37
+ ======================================================================
38
+ ----
39
+ ====
40
+
41
+ Common triggers in HTML / XHTML round-trips:
42
+
43
+ * Duplicate attributes (XML strict; HTML5 permissive — only XML mode triggers a banner)
44
+ * Stray processing instructions in fragment context
45
+ * Malformed namespace declarations
46
+ * DOCTYPE in unexpected positions
47
+
48
+ The banner is rendered when `Canon::Comparison::ComparisonResult#parse_errors?` is true. Programmatic callers can read `parse_errors_expected` and `parse_errors_received` directly off the result.
49
+
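As a rough illustration of the programmatic path described above, the sketch below mirrors the parse-error portion of `ComparisonResult` as it appears in this diff (`Array(...)` coercion plus `parse_errors?`). `ParseErrorCarrier` and `parse_error_lines` are hypothetical names used only for this example; the real banner formatting in Canon may differ.

```ruby
# Hypothetical stand-in for the parse-error portion of
# Canon::Comparison::ComparisonResult; this is not the gem's class.
class ParseErrorCarrier
  attr_reader :parse_errors_expected, :parse_errors_received

  def initialize(parse_errors_expected: nil, parse_errors_received: nil)
    # Array(nil) => [], so callers never have to nil-check either side.
    @parse_errors_expected = Array(parse_errors_expected)
    @parse_errors_received = Array(parse_errors_received)
  end

  # True when either side reported at least one parser error.
  def parse_errors?
    @parse_errors_expected.any? || @parse_errors_received.any?
  end
end

# Hypothetical helper: list each side that reported errors, one
# label line followed by its indented messages.
def parse_error_lines(result)
  return [] unless result.parse_errors?

  lines = []
  { "Expected side:" => result.parse_errors_expected,
    "Received side:" => result.parse_errors_received }.each do |label, errors|
    next if errors.empty?

    lines << label
    errors.each { |message| lines << "  #{message}" }
  end
  lines
end

result = ParseErrorCarrier.new(
  parse_errors_received: ["Attribute xml:lang redefined"],
)
puts parse_error_lines(result)
```

Coercing both readers to arrays means a caller can iterate them without checking for `nil`, which is the same choice the constructor in this diff makes.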
17
50
  == Key Features
18
51
 
19
52
  * XPath locations for XML/HTML elements
@@ -179,6 +212,38 @@ Reason: Text: "¬······:¬······"
179
212
 
180
213
  This fallback is implemented in `Canon::DiffFormatter::DiffDetailFormatterHelpers::DimensionFormatter.format_text_content_details` and only triggers when `TextUtils.ambiguous_text_pair?` returns `true` _and_ at least one side has a parent element to render.
181
214
 
215
+ ==== One-sided text diffs (added or removed text nodes)
216
+
217
+ A `text_content` difference can carry a text node on one side and `nil` on the other (issue #125). Fragment-length mismatches and child comparison emit this shape when a text-node child is missing. In that case the renderer mirrors `element_structure`: the missing side reads `(not present)`, and the present side shows the whitespace-visualised text-node content plus a brief hint of the parent's opening tag for context. The full ancestor subtree is *not* dumped; only the immediate parent's opening tag is shown, so a missing whitespace text node cannot make the diff look like the entire ancestor differs.
218
+
219
+ .Example: Whitespace text node missing on the received side
220
+ [example]
221
+ ====
222
+ [source]
223
+ ----
224
+ 🔍 DIFFERENCE #1/1 [NORMATIVE]
225
+ ──────────────────────────────────────────────────────────────────────
226
+ Dimension: text_content
227
+ Reason: element missing: text
228
+
229
+ ⊖ Expected (File 1):
230
+ text "¬············" in <div id="A">
231
+ ⊕ Actual (File 2):
232
+ (not present)
233
+
234
+ ✨ Changes:
235
+ Text removed: text "¬············" in <div id="A">
236
+ ----
237
+ ====
238
+
239
+ The `Changes:` line uses `Text removed:` or `Text added:` to mirror the `Element removed:` / `Element added:` phrasing of `element_structure`.
240
+
241
+ ==== Element-shaped diffs misclassified as text_content
242
+
243
+ In rare cases an upstream comparator may emit an *element*-shaped one-sided diff under `dimension: :text_content`. Without a guard, the one-sided text formatter would call `raw_text_value` on the element (which returns `""` for an empty element such as `<br/>`) and render `text "" in <parent>` -- meaningless when an element is what's actually missing.
244
+
245
+ The formatter detects element-shaped present-side nodes (Canon `ElementNode` or Nokogiri `Element`) and delegates to `format_element_structure_details`, so the rendered output reads `<br/>` and `Element removed:` rather than `text ""` and `Text removed:`. This is defence in depth -- the construction-side fix in `XmlComparatorHelpers::ChildComparison` ensures element orphans are now tagged `:element_structure` at source -- but a misclassified diff still renders meaningfully if any path slips through.
246
+
182
247
  === Structural Whitespace
183
248
 
184
249
  Shows whitespace-only differences (usually informative).
@@ -28,6 +28,11 @@ Canon's diff formatting includes:
28
28
  * **Context and grouping**: Control how much surrounding context to show
29
29
  * **Algorithm-specific output**: Different output styles for different diff
30
30
  algorithms
31
+ * **Whitespace adjacency**: Stray whitespace-only text nodes are anchored at
32
+ themselves instead of cascading into mismatches against neighbouring
33
+ content. The Reason line names the direction relative to the partner
34
+ (`before`/`after`/`adjacent to`)
35
+ (link:./whitespace-adjacency.adoc[details])
31
36
 
32
37
  == Available formatting options
33
38
 
@@ -0,0 +1,218 @@
1
+ ---
2
+ title: Whitespace adjacency in diff reports
3
+ parent: Diff Formatting
4
+ nav_order: 8
5
+ ---
6
+ = Whitespace adjacency in diff reports
7
+ :toc:
8
+ :toclevels: 2
9
+
10
+ == Purpose
11
+
12
+ Canon's diff reports anchor whitespace-only text nodes that have no
13
+ counterpart on the other side to a dedicated `:whitespace_adjacency`
14
+ dimension instead of letting them cascade into 3-4 misaligned
15
+ `:text_content` mismatches against neighbouring content nodes.
16
+
17
+ This is a *report-only* contract — equivalence verdicts are unchanged.
18
+ Inputs that were non-equivalent before this feature remain non-equivalent;
19
+ only the *shape* of the diff report changes.
20
+
21
+ == The problem
22
+
23
+ Consider an HTML fragment compared as `be_html_equivalent_to`:
24
+
25
+ [source,html]
26
+ ----
27
+ <!-- expected -->
28
+ <p>
29
+ <span>ISO </span>
30
+ <span>20483</span>
31
+ ,
32
+ <i>Cereals and pulses</i>
33
+ </p>
34
+
35
+ <!-- actual -->
36
+ <p><span>ISO </span><span>20483</span>, <i>Cereals and pulses</i></p>
37
+ ----
38
+
39
+ Both render identically in a browser — the indentation is structural HTML
40
+ formatting, not content. Before this feature, the diff report contained
41
+ four entries:
42
+
43
+ [source]
44
+ ----
45
+ DIFFERENCE #1 — element_structure: parent <p> "missing children"
46
+ DIFFERENCE #2 — text_content: "" vs "20483" (visualised: ↵░░░░)
47
+ DIFFERENCE #3 — text_content: "20483" vs ","
48
+ DIFFERENCE #4 — text_content: "," vs "Cereals and pulses"
49
+ ----
50
+
51
+ The cascade comes from positional `zip()` alignment in
52
+ `Canon::Comparison::XmlComparatorHelpers::ChildComparison`: with the
53
+ expected side carrying extra whitespace-only text nodes and the actual
54
+ side carrying none, every child slides by one slot and gets paired
55
+ against the wrong neighbour.
56
+
57
+ == The contract
58
+
59
+ When `ChildComparison` aligns child sequences and encounters a
60
+ whitespace-only text node on one side paired against a non-whitespace
61
+ node on the other, it:
62
+
63
+ 1. Treats the whitespace node as a *single-side gap* in the alignment.
64
+ 2. Emits one `:whitespace_adjacency` diff entry anchored at the
65
+ whitespace node itself (not at its mis-paired neighbour).
66
+ 3. Advances only the cursor that carries the whitespace, so the next
67
+ iteration aligns content against content.
68
+
69
+ The asymmetric whitespace still produces a non-equivalent verdict — the
70
+ `:whitespace_adjacency` dimension is classified as normative
71
+ unconditionally — so any test that previously failed on whitespace
72
+ asymmetry continues to fail.
73
+
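The three steps of the contract can be sketched as a two-cursor walk. This is a minimal illustration, not Canon's implementation: children are modelled as plain strings, `whitespace_only?` and `realign` are hypothetical names, and treating whitespace paired against an exhausted side as a gap is an assumption of this sketch.

```ruby
# Illustrative only: a "whitespace-only node" is any string that
# consists entirely of whitespace characters.
def whitespace_only?(node)
  node.is_a?(String) && node.match?(/\A\s+\z/)
end

# Walk both child lists with independent cursors.
# Returns [pairs, gaps]; each gap is [side, whitespace_node].
def realign(expected, actual)
  pairs = []
  gaps = []
  i = 0
  j = 0
  while i < expected.length || j < actual.length
    e = expected[i]
    a = actual[j]
    if whitespace_only?(e) && !whitespace_only?(a)
      # Whitespace only on the expected side: record a single-side
      # gap and advance only the expected cursor.
      gaps << [:expected, e]
      i += 1
    elsif whitespace_only?(a) && !whitespace_only?(e)
      # Mirror case for the actual side.
      gaps << [:actual, a]
      j += 1
    else
      # Content against content (or symmetric whitespace): pair
      # the two nodes and advance both cursors.
      pairs << [e, a]
      i += 1
      j += 1
    end
  end
  [pairs, gaps]
end

expected = ["<span>ISO </span>", "\n  ", "<span>712</span>"]
actual   = ["<span>ISO </span>", "<span>712</span>"]

pairs, gaps = realign(expected, actual)
p pairs
p gaps
```

With plain `expected.zip(actual)` the whitespace node would be paired against `<span>712</span>` and every later child would slide by one slot; advancing only the whitespace-bearing cursor keeps content aligned with content.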
74
+ Under the new contract, the cascade above collapses to:
75
+
76
+ [source]
77
+ ----
78
+ DIFFERENCE #1 — whitespace_adjacency: Whitespace before "20483":
79
+ present on EXPECTED ("↵░░"), absent on ACTUAL
80
+ DIFFERENCE #2 — whitespace_adjacency: Whitespace before ",":
81
+ present on EXPECTED ("↵░░"), absent on ACTUAL
82
+ DIFFERENCE #3 — text_content: "↵░░,↵░░" vs ", "
83
+ ----
84
+
85
+ == Direction relative to the partner
86
+
87
+ The Reason line names the document-order position of the whitespace
88
+ node relative to the *partner content node* it was zipped against by
89
+ the alignment walk. The partner is the next (or, at parent edge,
90
+ previous) non-whitespace sibling on the whitespace-bearing side, which
91
+ is what aligns against the corresponding content node on the other
92
+ side.
93
+
94
+ `before`:: The whitespace immediately precedes its next non-whitespace
95
+ sibling. This is the common case (e.g. indentation between two inline
96
+ spans where the asymmetric whitespace sits on the leading edge of the
97
+ partner).
98
+
99
+ `after`:: The whitespace trails the previous non-whitespace sibling
100
+ and has no non-whitespace sibling after it. Emitted at the trailing
101
+ edge of a parent.
102
+
103
+ `adjacent to`:: Degenerate fallback for a whitespace node with no
104
+ non-whitespace siblings at all. Rarely emitted.
105
+
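The three labels defined above can be sketched as a small classifier over the siblings on the whitespace-bearing side. This is an assumption-laden illustration: siblings are modelled as strings, and `whitespace_direction` is a hypothetical helper, not a Canon method.

```ruby
def whitespace_only?(text)
  text.match?(/\A\s+\z/)
end

# siblings: child nodes (as strings) on the whitespace-bearing side.
# index:    position of the whitespace-only node within siblings.
def whitespace_direction(siblings, index)
  preceding = siblings[0...index]
  following = siblings[(index + 1)..] || []

  # A non-whitespace sibling follows: the whitespace sits before it.
  return "before" if following.any? { |s| !whitespace_only?(s) }
  # Nothing but whitespace follows, yet content precedes it.
  return "after" if preceding.any? { |s| !whitespace_only?(s) }

  # No non-whitespace siblings at all: degenerate fallback.
  "adjacent to"
end

puts whitespace_direction(["<span>ISO </span>", "\n  ", "<span>712</span>"], 1)
puts whitespace_direction(["<b>A</b>", "\n"], 1)
puts whitespace_direction(["\n"], 0)
```

Checking for a following sibling first matches the definitions above, where `after` applies only when no non-whitespace sibling comes after the whitespace node.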
106
+ NOTE: An earlier wording (`Whitespace surrounding "X"`) classified the
107
+ *whitespace node's position among its own siblings* rather than its
108
+ direction relative to the partner. That label was misleading when the
109
+ whitespace sat between two element siblings but the asymmetry was
110
+ one-sided — see issue #137 follow-up.
111
+
112
+ === Examples
113
+
114
+ ==== "before" — whitespace between inline element siblings
115
+
116
+ [source,ruby]
117
+ ----
118
+ html1 = "<a><span>ISO </span>\n <span>712</span></a>"
119
+ html2 = "<a><span>ISO </span><span>712</span></a>"
120
+
121
+ result = Canon::Comparison.equivalent?(html1, html2,
122
+ format: :html5, verbose: true)
123
+ # => #<ComparisonResult equivalent=false>
124
+
125
+ # The stray "\n " between two spans is the only asymmetric node.
126
+ # It sits immediately before <span>712</span>, its next non-ws sibling.
127
+ result.differences.first.reason
128
+ # => "Whitespace before \"712\": present on EXPECTED (\"░\"), absent on ACTUAL"
129
+ ----
130
+
131
+ ==== "after" — trailing whitespace in a whitespace-preserving element
132
+
133
+ [source,ruby]
134
+ ----
135
+ # <code> preserves whitespace, so the trailing newline survives the
136
+ # upstream filter and pairs against the extra <b>B</b> on the other side.
137
+ html1 = "<code><b>A</b>\n</code>"
138
+ html2 = "<code><b>A</b><b>B</b></code>"
139
+
140
+ result = Canon::Comparison.equivalent?(html1, html2,
141
+ format: :html5, verbose: true)
142
+ result.differences.first.reason
143
+ # => "Whitespace after \"B\": present on EXPECTED (\"↵\"), absent on ACTUAL"
144
+ ----
145
+
146
+ ==== "adjacent to" — sole-child whitespace node
147
+
148
+ [source,ruby]
149
+ ----
150
+ # A whitespace-only text node as the only child of <code>, paired
151
+ # against an element on the other side. No non-ws siblings exist.
152
+ html1 = "<code>\n</code>"
153
+ html2 = "<code><b>A</b></code>"
154
+
155
+ result = Canon::Comparison.equivalent?(html1, html2,
156
+ format: :html5, verbose: true)
157
+ result.differences.first.reason
158
+ # => "Whitespace adjacent to \"A\": present on EXPECTED (\"↵\"), absent on ACTUAL"
159
+ ----
160
+
161
+ == Working with :whitespace_adjacency diffs programmatically
162
+
163
+ Use the `dimension` field on `DiffNode` to filter:
164
+
165
+ [source,ruby]
166
+ ----
167
+ result = Canon::Comparison.equivalent?(html1, html2,
168
+ format: :html5, verbose: true)
169
+
170
+ # Find all whitespace-adjacency diffs
171
+ ws_diffs = result.differences.select { |d| d.dimension == :whitespace_adjacency }
172
+
173
+ # These are always normative — they affect the equivalence verdict
174
+ ws_diffs.all?(&:normative?) # => true
175
+ ----
176
+
177
+ == What this contract does NOT do
178
+
179
+ * **Does not change equivalence outcomes.** A non-equivalent comparison
180
+ before #137 remains non-equivalent after — only the diff-report shape
181
+ changes.
182
+ * **Does not silently filter whitespace.** The asymmetric whitespace is
183
+ always reported; it is just labelled `:whitespace_adjacency` and
184
+ anchored at the whitespace node, instead of cascading as
185
+ `:text_content` against unrelated content nodes.
186
+ * **Does not affect symmetric whitespace.** When both sides carry
187
+ parallel whitespace-only nodes, those compare normally
188
+ (no `:whitespace_adjacency` entry, no cascade).
189
+
190
+ == Where it runs
191
+
192
+ The contract is implemented as a re-alignment walk inside
193
+ `Canon::Comparison::XmlComparatorHelpers::ChildComparison.use_positional_comparison`.
194
+ It activates whenever the existing positional `zip()` alignment would
195
+ pair a whitespace-only text node against a content node — that is, in
196
+ every whitespace context where the upstream filter has not already
197
+ dropped the whitespace nodes.
198
+
199
+ For elements where whitespace is preserved by configuration
200
+ (`preserve_whitespace_elements`) the upstream filter does not drop
201
+ indentation, and the re-alignment walk surfaces every asymmetric
202
+ whitespace node as a single normative `:whitespace_adjacency` diff.
203
+
204
+ == Related
205
+
206
+ * link:../../advanced/diff-classification.adoc[Diff classification] —
207
+ Normative vs informative differences.
208
+ * link:../match-options/index.adoc[Match options] — Configuring
209
+ `preserve_whitespace_elements`, `collapse_whitespace_elements`, and
210
+ `strip_whitespace_elements`.
211
+
212
+ == History
213
+
214
+ The cascade behaviour was reported in
215
+ https://github.com/lutaml/canon/issues/137[issue #137]. The fix landed
216
+ as a report-only re-alignment in PR #138. PR #141 replaced the
217
+ misleading `:surrounding`/`:preceding`/`:following` position labels
218
+ with direction-faithful wording (`before`/`after`/`adjacent to`).
@@ -194,7 +194,9 @@ export CANON_JSON_FORMAT_PREPROCESSING=normalize
194
194
  |`CANON_SHOW_PRETTYPRINT_RECEIVED`
195
195
  |boolean
196
196
  |`false`
197
- |Show only the RECEIVED (actual) block in the fixture-ready pretty-printed section. This is the most common fixture-update workflow: enable this option to get a copy-pasteable pretty-printed form of the generated output that can replace the old fixture heredoc. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PRETTYPRINT_RECEIVED`
197
+ |Show only the RECEIVED (actual) block in the fixture-ready pretty-printed section. This is the most common fixture-update workflow: enable this option to get a copy-pasteable pretty-printed form of the generated output that can replace the old fixture heredoc. Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PRETTYPRINT_RECEIVED`.
198
+
199
+ For HTML / HTML4 / HTML5 inputs, the pretty-printed output is XHTML-shaped: void elements are self-closed (`<br/>`, `<meta/>`), non-void elements are paired (`<a></a>`), and Nokogiri may add `xmlns="http://www.w3.org/1999/xhtml"` on `<html>` and an `xml:lang` mirror of `lang`. This is a display-only serialisation chosen because libxml's `FORMAT` save flag (the only path that actually indents HTML5 input) requires the XHTML save mode -- `Nokogiri::HTML5#to_html` silently ignores its `indent:` keyword. See lutaml/canon#133.
198
200
  |All formats (display only)
199
201
 
200
202
  |`CANON_COMPACT_SEMANTIC_REPORT`
@@ -6,7 +6,8 @@ module Canon
6
6
  # Provides methods to query equivalence based on normative diffs
7
7
  class ComparisonResult
8
8
  attr_reader :differences, :preprocessed_strings, :format, :html_version,
9
- :match_options, :algorithm, :original_strings
9
+ :match_options, :algorithm, :original_strings,
10
+ :parse_errors_expected, :parse_errors_received
10
11
 
11
12
  # @param differences [Array<DiffNode>] Array of difference nodes
12
13
  # @param preprocessed_strings [Array<String, String>] Pre-processed content for display
@@ -15,8 +16,11 @@ module Canon
15
16
  # @param match_options [Hash, nil] Resolved match options used for comparison
16
17
  # @param algorithm [Symbol] Diff algorithm used (:dom or :semantic)
17
18
  # @param original_strings [Array<String, String>, nil] Original unprocessed content for line diff
19
+ # @param parse_errors_expected [Array<String>, nil] Parser errors from the expected side
20
+ # @param parse_errors_received [Array<String>, nil] Parser errors from the received side
18
21
  def initialize(differences:, preprocessed_strings:, format:,
19
- html_version: nil, match_options: nil, algorithm: :dom, original_strings: nil)
22
+ html_version: nil, match_options: nil, algorithm: :dom, original_strings: nil,
23
+ parse_errors_expected: nil, parse_errors_received: nil)
20
24
  @differences = differences
21
25
  @preprocessed_strings = preprocessed_strings
22
26
  @original_strings = original_strings || preprocessed_strings
@@ -24,6 +28,16 @@ html_version: nil, match_options: nil, algorithm: :dom, original_strings: nil)
24
28
  @html_version = html_version
25
29
  @match_options = match_options
26
30
  @algorithm = algorithm
31
+ @parse_errors_expected = Array(parse_errors_expected)
32
+ @parse_errors_received = Array(parse_errors_received)
33
+ end
34
+
35
+ # Whether either side reported parse errors. Used by the diff
36
+ # formatter to decide whether to render the parse-error banner.
37
+ #
38
+ # @return [Boolean]
39
+ def parse_errors?
40
+ @parse_errors_expected.any? || @parse_errors_received.any?
27
41
  end
28
42
 
29
43
  # Check if documents are semantically equivalent (no normative diffs)
@@ -151,6 +151,8 @@ module Canon
151
151
  html_version: detect_html_version_from_node(node1),
152
152
  match_options: match_opts_hash,
153
153
  algorithm: :dom,
154
+ parse_errors_expected: Comparison.parse_errors_for(node1),
155
+ parse_errors_received: Comparison.parse_errors_for(node2),
154
156
  )
155
157
  elsif result != Comparison::EQUIVALENT && !differences.empty?
156
158
  # Non-verbose mode: check equivalence
@@ -300,6 +302,8 @@ module Canon
300
302
  html_version: html_version,
301
303
  match_options: match_opts_hash.merge(strategy.metadata),
302
304
  algorithm: :semantic,
305
+ parse_errors_expected: Comparison.parse_errors_for(node1),
306
+ parse_errors_received: Comparison.parse_errors_for(node2),
303
307
  )
304
308
  else
305
309
  # Simple boolean result - equivalent if no normative differences
@@ -1,6 +1,7 @@
1
1
  # frozen_string_literal: true
2
2
 
3
3
  require_relative "../comparison" # Load base module with constants
4
+ require_relative "node_inspector"
4
5
  require_relative "../diff/diff_node"
5
6
  require_relative "../diff/path_builder"
6
7
 
@@ -87,23 +88,20 @@ module Canon
87
88
  return nil if node.nil?
88
89
 
89
90
  # Canon::Xml::Node types
90
- if node.is_a?(Canon::Xml::Nodes::RootNode)
91
+ case node
92
+ when Canon::Xml::Nodes::RootNode
91
93
  # Serialize all children of root
92
94
  node.children.map { |child| serialize_node(child) }.join
93
- elsif node.is_a?(Canon::Xml::Nodes::ElementNode)
95
+ when Canon::Xml::Nodes::ElementNode
94
96
  serialize_element_node(node)
95
- elsif node.is_a?(Canon::Xml::Nodes::TextNode)
97
+ when Canon::Xml::Nodes::TextNode
96
98
  # Use original text (with entity references) if available,
97
99
  # otherwise fall back to value (decoded text)
98
100
  node.original || node.value
99
- elsif node.is_a?(Canon::Xml::Nodes::CommentNode)
101
+ when Canon::Xml::Nodes::CommentNode
100
102
  "<!--#{node.value}-->"
101
- elsif node.is_a?(Canon::Xml::Nodes::ProcessingInstructionNode)
103
+ when Canon::Xml::Nodes::ProcessingInstructionNode
102
104
  "<?#{node.target} #{node.data}?>"
103
- elsif node.respond_to?(:to_xml)
104
- node.to_xml
105
- elsif node.respond_to?(:to_html)
106
- node.to_html
107
105
  else
108
106
  node.to_s
109
107
  end
@@ -121,8 +119,8 @@ module Canon
121
119
  node.attribute_nodes.to_h do |attr|
122
120
  [attr.name, attr.value]
123
121
  end
124
- # Nokogiri nodes
125
- elsif node.respond_to?(:attributes)
122
+ # Nokogiri elements
123
+ elsif node.is_a?(Nokogiri::XML::Element)
126
124
  node.attributes.to_h do |_, attr|
127
125
  [attr.name, attr.value]
128
126
  end
@@ -227,8 +225,8 @@ module Canon
227
225
  def same_node_type?(node1, node2)
228
226
  return false if node1.class != node2.class
229
227
 
230
- # For Nokogiri/Canon::Xml nodes, check node type
231
- if node1.respond_to?(:node_type) && node2.respond_to?(:node_type)
228
+ case node1
229
+ when Canon::Xml::Node, Nokogiri::XML::Node
232
230
  node1.node_type == node2.node_type
233
231
  else
234
232
  true
@@ -245,20 +243,7 @@ module Canon
245
243
  # @param node [Object] Node to check
246
244
  # @return [Boolean] true if node is a comment
247
245
  def comment_node?(node)
248
- return true if node.respond_to?(:comment?) && node.comment?
249
- return true if node.respond_to?(:node_type) && node.node_type == :comment
250
-
251
- # HTML comments are parsed as TEXT nodes by Nokogiri
252
- # Check if this is a text node with HTML comment content
253
- if text_node?(node)
254
- text = node_text(node)
255
- # Strip whitespace and backslashes for comparison
256
- # Nokogiri escapes HTML comments as "<\\!-- comment -->" in full documents
257
- text_stripped = text.to_s.strip.gsub("\\", "")
258
- return true if text_stripped.start_with?("<!--") && text_stripped.end_with?("-->")
259
- end
260
-
261
- false
246
+ NodeInspector.comment_node?(node)
262
247
  end
263
248
 
264
249
  # Check if a node is a text node
@@ -266,9 +251,7 @@ module Canon
266
251
  # @param node [Object] Node to check
267
252
  # @return [Boolean] true if node is a text node
268
253
  def text_node?(node)
269
- (node.respond_to?(:text?) && node.text? &&
270
- !node.respond_to?(:element?)) ||
271
- (node.respond_to?(:node_type) && node.node_type == :text)
254
+ NodeInspector.text_node?(node)
272
255
  end
273
256
 
274
257
  # Get text content from a node
@@ -276,15 +259,7 @@ module Canon
276
259
  # @param node [Object] Node to get text from
277
260
  # @return [String] Text content
278
261
  def node_text(node)
279
- # Canon::Xml::Node TextNode uses .value
280
- if node.respond_to?(:value)
281
- node.value.to_s
282
- # Nokogiri nodes use .content
283
- elsif node.respond_to?(:content)
284
- node.content.to_s
285
- else
286
- node.to_s
287
- end
262
+ NodeInspector.text_content(node)
288
263
  end
289
264
 
290
265
  # Check if difference between two texts is only whitespace
@@ -328,7 +303,7 @@ module Canon
328
303
  if diff1 == Canon::Comparison::MISSING_NODE && diff2 == Canon::Comparison::MISSING_NODE
329
304
  "element structure mismatch (children differ)"
330
305
  else
331
- "#{diff1} vs #{diff2}"
306
+ Canon::Comparison.code_pair_label(diff1, diff2)
332
307
  end
333
308
  end
334
309
 
@@ -371,26 +346,18 @@ module Canon
371
346
  def extract_text_content_from_node(node)
372
347
  return nil if node.nil?
373
348
 
374
- # For Canon::Xml::Nodes::TextNode
375
- return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)
376
-
377
- # For XML/HTML nodes with text_content method
378
- return node.text_content if node.respond_to?(:text_content)
379
-
380
- # For nodes with text method
381
- return node.text if node.respond_to?(:text)
382
-
383
- # For nodes with content method (Moxml::Text)
384
- return node.content if node.respond_to?(:content)
385
-
386
- # For nodes with value method (other types)
387
- return node.value if node.respond_to?(:value)
388
-
389
- # For simple text nodes or strings
390
- return node.to_s if node.is_a?(String)
391
-
392
- # For other node types, try to_s
393
- node.to_s
349
+ case node
350
+ when Canon::Xml::Nodes::TextNode
351
+ node.value
352
+ when Canon::Xml::Node
353
+ node.text_content
354
+ when Nokogiri::XML::Node
355
+ node.content.to_s
356
+ when String
357
+ node
358
+ else
359
+ node.to_s
360
+ end
394
361
  rescue StandardError
395
362
  nil
396
363
  end
@@ -444,26 +411,37 @@ module Canon
444
411
 
445
412
  # Determine the appropriate dimension for a node type
446
413
  #
414
+ # Used by ChildComparison to tag per-child orphan diffs with a
415
+ # dimension that matches what the node *is*, so the formatter
416
+ # renders correctly. An element orphan tagged :text_content
417
+ # would otherwise route through PR #126's one-sided text
418
+ # formatter and render as +text ""+ instead of as the actual
419
+ # element (see lutaml/canon#125 follow-up).
420
+ #
447
421
  # @param node [Object] The node to check
448
422
  # @return [Symbol] The dimension symbol
449
423
  def determine_node_dimension(node)
450
- # Canon::Xml::Node types
451
- if node.respond_to?(:node_type) && node.node_type.is_a?(Symbol)
424
+ case node
425
+ when Canon::Xml::Node
452
426
  case node.node_type
427
+ when :element then :element_structure
453
428
  when :comment then :comments
454
429
  when :text, :cdata then :text_content
455
430
  when :processing_instruction then :processing_instructions
456
431
  else :text_content
457
432
  end
458
- # Moxml/Nokogiri types
459
- elsif node.respond_to?(:comment?) && node.comment?
460
- :comments
461
- elsif node.respond_to?(:text?) && node.text?
462
- :text_content
463
- elsif node.respond_to?(:cdata?) && node.cdata?
464
- :text_content
465
- elsif node.respond_to?(:processing_instruction?) && node.processing_instruction?
466
- :processing_instructions
433
+ when Nokogiri::XML::Node
434
+ if node.comment?
435
+ :comments
436
+ elsif node.text? || node.cdata?
437
+ :text_content
438
+ elsif node.processing_instruction?
439
+ :processing_instructions
440
+ elsif node.element?
441
+ :element_structure
442
+ else
443
+ :text_content
444
+ end
467
445
  else
468
446
  :text_content
469
447
  end