RubyGems - canon - Versions diffs - 0.2.4 → 0.2.5 - Mend

canon 0.2.4 → 0.2.5

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (27)

checksums.yaml +4 -4
data/docs/advanced/semantic-diff-report.adoc +65 -0
data/docs/features/diff-formatting/index.adoc +3 -0
data/docs/features/diff-formatting/whitespace-adjacency.adoc +140 -0
data/docs/reference/environment-variables.adoc +3 -1
data/lib/canon/comparison/comparison_result.rb +16 -2
data/lib/canon/comparison/html_comparator.rb +4 -0
data/lib/canon/comparison/markup_comparator.rb +49 -71
data/lib/canon/comparison/node_inspector.rb +103 -0
data/lib/canon/comparison/xml_comparator/child_comparison.rb +127 -55
data/lib/canon/comparison/xml_comparator/diff_node_builder.rb +24 -23
data/lib/canon/comparison/xml_comparator.rb +94 -3
data/lib/canon/comparison/xml_node_comparison.rb +37 -81
data/lib/canon/comparison.rb +59 -0
data/lib/canon/diff/diff_classifier.rb +37 -39
data/lib/canon/diff/xml_serialization_formatter.rb +27 -42
data/lib/canon/diff_formatter/diff_detail_formatter/dimension_formatter.rb +119 -9
data/lib/canon/diff_formatter/diff_detail_formatter/node_utils.rb +75 -4
data/lib/canon/diff_formatter.rb +71 -2
data/lib/canon/pretty_printer/html.rb +76 -14
data/lib/canon/pretty_printer/html_void_elements.rb +20 -0
data/lib/canon/pretty_printer/xml_normalized.rb +10 -3
data/lib/canon/version.rb +1 -1
data/lib/canon/xml/data_model.rb +13 -1
data/lib/canon/xml/node.rb +15 -0
data/lib/canon/xml/sax_builder.rb +18 -0
metadata +5 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: c24b7c541d6159a3c261d389c0d41b85f954cd4152d88d9ca4748d9a3ceb34ef
-  data.tar.gz: 1de985c950b90c6979432b7b0bd1ed1b469240456fb7fd985a3d7f6929448b83
+  metadata.gz: 615e3154c89a9850e86c39852201e5573b461ac62d52cc423523e444ace301f7
+  data.tar.gz: 37ee00969f0682dde670168fbd7888294edda612220bfbebb7c950efbcb76aa2
 SHA512:
-  metadata.gz: 719eefd6be6d642503adb82e50609983fe9082ec8c7efe34c5e6cf27bfdc8065edc05b7ae75a959db8e5fe117f0ec67d71d81006d342a1c01f2846b4aa54b196
-  data.tar.gz: 32a1bece85afd8265f158fdea547de08759773ba8a1e574ca72e42c79f6f59b02ed881cc4ba4bb78e54d135c9f4362100c8d409d2ee08b0eaa3561b13652296c
+  metadata.gz: bce4239ab6a471edd896fd3b54def4e57e21714078cb3631b55363b50646349a6923eed1e208e5706c3319d3e7a2ae75f2db698ffe853c0e03a754d76c856679
+  data.tar.gz: 1441bd5412658d9d2b975e3889fc95bfd080dec2b89b731f71e191f5ca7bbc7e0a8aa63e787916781bd5e653732c16d5c03b0d3fc3b967a3b653a2a735e62636

data/docs/advanced/semantic-diff-report.adoc CHANGED

@@ -14,6 +14,39 @@ The Semantic Diff Report provides dimension-specific, actionable details for eac
 The report is automatically shown in verbose mode when differences exist, appearing before the detailed diff output.
+== Parse errors
+When Canon's underlying parser (libxml for XML, HTML5 for HTML) reports errors during input parsing, Canon surfaces them at the top of the diff report in a banner section before any per-difference output. The banner names the offending side and warns that the diff below describes the parsed tree, not the input — content the parser could not represent has been silently dropped from the comparison tree.
+This is purely a transparency feature: Canon does not modify the parse to "fix" invalid input. The user is responsible for deciding whether the parse failure was expected (e.g. testing legacy fixtures during a migration) or symptomatic of an upstream bug.
+.Example: Banner for a duplicate-attribute FATAL on the received side
+[example]
+====
+[source]
+----
+======================================================================
+  ⚠️  PARSE ERRORS
+======================================================================
+  Received side:
+    Attribute xml:lang redefined
+  ⚠️  The diff below describes the parsed tree, not the input.
+      Content that the parser could not represent has been
+      dropped and may appear as "missing" in the report.
+======================================================================
+----
+====
+Common triggers in HTML / XHTML round-trips:
+* Duplicate attributes (XML strict; HTML5 permissive — only XML mode triggers a banner)
+* Stray processing instructions in fragment context
+* Malformed namespace declarations
+* DOCTYPE in unexpected positions
+The banner is rendered when `Canon::Comparison::ComparisonResult#parse_errors?` is true. Programmatic callers can read `parse_errors_expected` and `parse_errors_received` directly off the result.
 == Key Features
 * XPath locations for XML/HTML elements
@@ -179,6 +212,38 @@ Reason:  Text: "¬······:¬······"
 This fallback is implemented in `Canon::DiffFormatter::DiffDetailFormatterHelpers::DimensionFormatter.format_text_content_details` and only triggers when `TextUtils.ambiguous_text_pair?` returns `true` _and_ at least one side has a parent element to render.
+==== One-sided text diffs (added or removed text nodes)
+When a `text_content` difference carries a text node on one side and `nil` on the other (issue #125) -- the shape that fragment-length mismatches and child-comparison emit when a text-node child is missing -- the renderer mirrors `element_structure`: the missing side reads `(not present)`, and the present side reads the text-node content (whitespace-visualised) plus a brief parent open-tag hint for context. The full ancestor subtree is *not* dumped; only the immediate parent's opening tag is shown, so a missing whitespace text node cannot make the diff look like the entire ancestor differs.
+.Example: Whitespace text node missing on the received side
+[example]
+====
+[source]
+----
+🔍 DIFFERENCE #1/1 [NORMATIVE]
+──────────────────────────────────────────────────────────────────────
+Dimension: text_content
+Reason:  element missing: text
+⊖ Expected (File 1):
+   text "¬············" in <div id="A">
+⊕ Actual (File 2):
+   (not present)
+✨ Changes:
+   Text removed: text "¬············" in <div id="A">
+----
+====
+The `Changes:` line uses `Text removed:` or `Text added:` to mirror the `Element removed:` / `Element added:` phrasing of `element_structure`.
+==== Element-shaped diffs misclassified as text_content
+In rare cases an upstream comparator may emit an *element*-shaped one-sided diff under `dimension: :text_content`.  Without a guard, the one-sided text formatter would call `raw_text_value` on the element (which returns `""` for an empty element such as `<br/>`) and render `text "" in <parent>` -- meaningless when an element is what's actually missing.
+The formatter detects element-shaped present-side nodes (Canon `ElementNode` or Nokogiri `Element`) and delegates to `format_element_structure_details`, so the rendered output reads `<br/>` and `Element removed:` rather than `text ""` and `Text removed:`.  This is defence in depth -- the construction-side fix in `XmlComparatorHelpers::ChildComparison` ensures element orphans are now tagged `:element_structure` at source -- but a misclassified diff still renders meaningfully if any path slips through.
 === Structural Whitespace
 Shows whitespace-only differences (usually informative).
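
The parse-error banner documented in this hunk can be approximated with a short sketch. `render_parse_error_banner` is a hypothetical helper written for illustration, not Canon's formatter; it only assumes that each side supplies a list of error strings (or `nil`), as the documentation for `parse_errors_expected` and `parse_errors_received` suggests.

```ruby
# Hypothetical helper: assembles a plain-text parse-error banner from
# per-side error lists. This is an illustration, not Canon's formatter.
def render_parse_error_banner(expected_errors, received_errors)
  rule = "=" * 70
  expected = Array(expected_errors)
  received = Array(received_errors)
  # No banner at all when neither side reported errors.
  return "" if expected.empty? && received.empty?

  lines = [rule, "  PARSE ERRORS", rule]
  { "Expected" => expected, "Received" => received }.each do |side, errors|
    next if errors.empty?

    # Name the offending side, then list its messages indented beneath it.
    lines << "  #{side} side:"
    errors.each { |message| lines << "    #{message}" }
  end
  lines << "  The diff below describes the parsed tree, not the input."
  lines << rule
  lines.join("\n")
end

puts render_parse_error_banner(nil, ["Attribute xml:lang redefined"])
```

Only the side that actually reported errors is named, which matches the documented banner for a received-side failure.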

data/docs/features/diff-formatting/index.adoc CHANGED

@@ -28,6 +28,9 @@ Canon's diff formatting includes:
 * **Context and grouping**: Control how much surrounding context to show
 * **Algorithm-specific output**: Different output styles for different diff
   algorithms
+* **Whitespace adjacency**: Stray whitespace-only text nodes are anchored at
+  themselves instead of cascading into mismatches against neighbouring
+  content (link:./whitespace-adjacency.adoc[details])
 == Available formatting options

data/docs/features/diff-formatting/whitespace-adjacency.adoc ADDED

@@ -0,0 +1,140 @@
+---
+title: Whitespace adjacency in diff reports
+parent: Diff Formatting
+nav_order: 8
+---
+= Whitespace adjacency in diff reports
+:toc:
+:toclevels: 2
+== Purpose
+Canon's diff reports anchor whitespace-only text nodes that have no
+counterpart on the other side to a dedicated `:whitespace_adjacency`
+dimension instead of letting them cascade into 3-4 misaligned
+`:text_content` mismatches against neighbouring content nodes.
+This is a *report-only* contract — equivalence verdicts are unchanged.
+Inputs that were non-equivalent before this feature remain non-equivalent;
+only the *shape* of the diff report changes.
+== The problem
+Consider an HTML fragment compared as `be_html_equivalent_to`:
+[source,html]
+----
+<!-- expected -->
+<p>
+  <span>ISO </span>
+  <span>20483</span>
+  ,
+  <i>Cereals and pulses</i>
+</p>
+<!-- actual -->
+<p><span>ISO </span><span>20483</span>, <i>Cereals and pulses</i></p>
+----
+Both render identically in a browser — the indentation is structural HTML
+formatting, not content. Before this feature, the diff report contained
+four entries:
+[source]
+----
+DIFFERENCE #1 — element_structure: parent <p> "missing children"
+DIFFERENCE #2 — text_content: ""  vs  "20483"        (visualised: ↵░░░░)
+DIFFERENCE #3 — text_content: "20483"  vs  ","
+DIFFERENCE #4 — text_content: ","  vs  "Cereals and pulses"
+----
+The cascade comes from positional `zip()` alignment in
+`Canon::Comparison::XmlComparatorHelpers::ChildComparison`: with the
+expected side carrying extra whitespace-only text nodes and the actual
+side carrying none, every child slides by one slot and gets paired
+against the wrong neighbour.
+== The contract
+When `ChildComparison` aligns child sequences and encounters a
+whitespace-only text node on one side paired against a non-whitespace
+node on the other, it:
+1. Treats the whitespace node as a *single-side gap* in the alignment.
+2. Emits one `:whitespace_adjacency` diff entry anchored at the
+   whitespace node itself (not at its mis-paired neighbour).
+3. Advances only the cursor that carries the whitespace, so the next
+   iteration aligns content against content.
+The asymmetric whitespace still produces a non-equivalent verdict — the
+`:whitespace_adjacency` dimension is classified as normative
+unconditionally — so any test that previously failed on whitespace
+asymmetry continues to fail.
+After the new contract, the cascade above collapses to:
+[source]
+----
+DIFFERENCE #1 — whitespace_adjacency: Whitespace surrounding "20483":
+                  present on EXPECTED ("↵░░"), absent on ACTUAL
+DIFFERENCE #2 — whitespace_adjacency: Whitespace surrounding ",":
+                  present on EXPECTED ("↵░░"), absent on ACTUAL
+DIFFERENCE #3 — text_content: "↵░░,↵░░"  vs  ", "
+----
+== Adjacency positions
+The Reason line names the adjacency position of the whitespace node
+relative to its non-whitespace siblings:
+`:preceding`::  Whitespace at the start of its parent (no non-whitespace
+sibling before it, has one after it).
+`:following`::  Whitespace at the end of its parent (has a non-whitespace
+sibling before it, none after).
+`:surrounding`::  Sandwiched between two non-whitespace siblings.
+`:isolated`::  No non-whitespace siblings at all (degenerate; rarely
+emitted).
+== What this contract does NOT do
+* **Does not change equivalence outcomes.** A non-equivalent comparison
+  before #137 remains non-equivalent after — only the diff-report shape
+  changes.
+* **Does not silently filter whitespace.** The asymmetric whitespace is
+  always reported; it is just labelled `:whitespace_adjacency` and
+  anchored at the whitespace node, instead of cascading as
+  `:text_content` against unrelated content nodes.
+* **Does not affect symmetric whitespace.** When both sides carry
+  parallel whitespace-only nodes, those compare normally
+  (no `:whitespace_adjacency` entry, no cascade).
+== Where it runs
+The contract is implemented as a re-alignment walk inside
+`Canon::Comparison::XmlComparatorHelpers::ChildComparison.use_positional_comparison`.
+It activates whenever the existing positional `zip()` alignment would
+pair a whitespace-only text node against a content node — that is, in
+every whitespace context where the upstream filter has not already
+dropped the whitespace nodes.
+For elements where whitespace is preserved by configuration
+(`preserve_whitespace_elements`) the upstream filter does not drop
+indentation, and the re-alignment walk surfaces every asymmetric
+whitespace node as a single normative `:whitespace_adjacency` diff.
+== Related
+* link:../../advanced/diff-classification.adoc[Diff classification] —
+  Normative vs informative differences.
+* link:../match-options/index.adoc[Match options] — Configuring
+  `preserve_whitespace_elements`, `collapse_whitespace_elements`, and
+  `strip_whitespace_elements`.
+== History
+The cascade behaviour was reported in
+https://github.com/lutaml/canon/issues/137[issue #137]. The fix landed
+as a report-only re-alignment in PR #138.
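
The three-step contract described in this new document (treat the whitespace node as a single-side gap, emit one entry anchored at it, advance only that cursor) can be sketched as a two-cursor walk. This is a simplified illustration under stated assumptions: child nodes are modelled as plain strings, and `whitespace_only?` and `realign` are hypothetical names, not the code in `ChildComparison`.

```ruby
# Hypothetical stand-in: child nodes are modelled as plain strings.
# Empty strings are deliberately not treated as whitespace-only.
def whitespace_only?(text)
  !text.empty? && text.strip.empty?
end

# Walks both child sequences with independent cursors and returns the
# list of diff entries that the alignment produces.
def realign(expected, actual)
  diffs = []
  i = 0
  j = 0
  while i < expected.length || j < actual.length
    e = expected[i]
    a = actual[j]
    if e && whitespace_only?(e) && !(a && whitespace_only?(a))
      # Whitespace only on the expected side: record a gap anchored at
      # the whitespace node and advance only the expected cursor.
      diffs << [:whitespace_adjacency, :expected, e]
      i += 1
    elsif a && whitespace_only?(a) && !(e && whitespace_only?(e))
      # Mirror case for the actual side.
      diffs << [:whitespace_adjacency, :actual, a]
      j += 1
    else
      # Content against content (or symmetric whitespace): compare
      # normally and advance both cursors.
      diffs << [:text_content, e, a] if e != a
      i += 1
      j += 1
    end
  end
  diffs
end

expected = ["\n  ", "ISO ", "\n  ", "20483", "\n"]
actual = ["ISO ", "20483"]
realign(expected, actual).each { |entry| p entry }
```

With positional pairing, the same input would compare `"\n  "` against `"ISO "` and slide every later child by one slot; the walk above reports three whitespace gaps and no content mismatches.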

data/docs/reference/environment-variables.adoc CHANGED

@@ -194,7 +194,9 @@ export CANON_JSON_FORMAT_PREPROCESSING=normalize
 |`CANON_SHOW_PRETTYPRINT_RECEIVED`
 |boolean
 |`false`
-|Show only the RECEIVED (actual) block in the fixture-ready pretty-printed section.  This is the most common fixture-update workflow: enable this option to get a copy-pasteable pretty-printed form of the generated output that can replace the old fixture heredoc.  Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PRETTYPRINT_RECEIVED`
+|Show only the RECEIVED (actual) block in the fixture-ready pretty-printed section.  This is the most common fixture-update workflow: enable this option to get a copy-pasteable pretty-printed form of the generated output that can replace the old fixture heredoc.  Format-specific: `CANON_{FORMAT}_DIFF_SHOW_PRETTYPRINT_RECEIVED`.
+For HTML / HTML4 / HTML5 inputs, the pretty-printed output is XHTML-shaped: void elements are self-closed (`<br/>`, `<meta/>`), non-void elements are paired (`<a></a>`), and Nokogiri may add `xmlns="http://www.w3.org/1999/xhtml"` on `<html>` and an `xml:lang` mirror of `lang`.  This is a display-only serialisation chosen because libxml's `FORMAT` save flag (the only path that actually indents HTML5 input) requires the XHTML save mode -- `Nokogiri::HTML5#to_html` silently ignores its `indent:` keyword.  See lutaml/canon#133.
 |All formats (display only)
 |`CANON_COMPACT_SEMANTIC_REPORT`

data/lib/canon/comparison/comparison_result.rb CHANGED

@@ -6,7 +6,8 @@ module Canon
     # Provides methods to query equivalence based on normative diffs
     class ComparisonResult
       attr_reader :differences, :preprocessed_strings, :format, :html_version,
-                  :match_options, :algorithm, :original_strings
+                  :match_options, :algorithm, :original_strings,
+                  :parse_errors_expected, :parse_errors_received
       # @param differences [Array<DiffNode>] Array of difference nodes
       # @param preprocessed_strings [Array<String, String>] Pre-processed content for display
@@ -15,8 +16,11 @@ module Canon
       # @param match_options [Hash, nil] Resolved match options used for comparison
       # @param algorithm [Symbol] Diff algorithm used (:dom or :semantic)
       # @param original_strings [Array<String, String>, nil] Original unprocessed content for line diff
+      # @param parse_errors_expected [Array<String>, nil] Parser errors from the expected side
+      # @param parse_errors_received [Array<String>, nil] Parser errors from the received side
       def initialize(differences:, preprocessed_strings:, format:,
-html_version: nil, match_options: nil, algorithm: :dom, original_strings: nil)
+html_version: nil, match_options: nil, algorithm: :dom, original_strings: nil,
+parse_errors_expected: nil, parse_errors_received: nil)
         @differences = differences
         @preprocessed_strings = preprocessed_strings
         @original_strings = original_strings || preprocessed_strings
@@ -24,6 +28,16 @@ html_version: nil, match_options: nil, algorithm: :dom, original_strings: nil)
         @html_version = html_version
         @match_options = match_options
         @algorithm = algorithm
+        @parse_errors_expected = Array(parse_errors_expected)
+        @parse_errors_received = Array(parse_errors_received)
+      end
+      # Whether either side reported parse errors.  Used by the diff
+      # formatter to decide whether to render the parse-error banner.
+      #
+      # @return [Boolean]
+      def parse_errors?
+        @parse_errors_expected.any? || @parse_errors_received.any?
       end
       # Check if documents are semantically equivalent (no normative diffs)
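
The effect of wrapping both new arguments in `Array(...)` is easiest to see in isolation. The class below is a hypothetical stand-in that mirrors only the parse-error handling shown in this hunk; the real `ComparisonResult` requires further keyword arguments that are omitted here.

```ruby
# Hypothetical stand-in that mirrors only the parse-error handling from
# the hunk above; it is not the real ComparisonResult.
class ParseErrorCarrier
  attr_reader :parse_errors_expected, :parse_errors_received

  def initialize(parse_errors_expected: nil, parse_errors_received: nil)
    # Kernel#Array turns nil into [], wraps a single String in an Array,
    # and returns an existing Array unchanged.
    @parse_errors_expected = Array(parse_errors_expected)
    @parse_errors_received = Array(parse_errors_received)
  end

  # True when either side carries at least one error.
  def parse_errors?
    @parse_errors_expected.any? || @parse_errors_received.any?
  end
end

clean = ParseErrorCarrier.new
broken = ParseErrorCarrier.new(parse_errors_received: "Attribute xml:lang redefined")
puts clean.parse_errors?
puts broken.parse_errors?
p broken.parse_errors_received
```

Because of the normalisation, callers can iterate over either reader without a `nil` check, even when a comparator passes nothing for that side.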

data/lib/canon/comparison/html_comparator.rb CHANGED

@@ -151,6 +151,8 @@ module Canon
               html_version: detect_html_version_from_node(node1),
               match_options: match_opts_hash,
               algorithm: :dom,
+              parse_errors_expected: Comparison.parse_errors_for(node1),
+              parse_errors_received: Comparison.parse_errors_for(node2),
             )
           elsif result != Comparison::EQUIVALENT && !differences.empty?
             # Non-verbose mode: check equivalence
@@ -300,6 +302,8 @@ module Canon
               html_version: html_version,
               match_options: match_opts_hash.merge(strategy.metadata),
               algorithm: :semantic,
+              parse_errors_expected: Comparison.parse_errors_for(node1),
+              parse_errors_received: Comparison.parse_errors_for(node2),
             )
           else
             # Simple boolean result - equivalent if no normative differences

data/lib/canon/comparison/markup_comparator.rb CHANGED

@@ -1,6 +1,7 @@
 # frozen_string_literal: true
 require_relative "../comparison" # Load base module with constants
+require_relative "node_inspector"
 require_relative "../diff/diff_node"
 require_relative "../diff/path_builder"
@@ -87,23 +88,20 @@ module Canon
           return nil if node.nil?
           # Canon::Xml::Node types
-          if node.is_a?(Canon::Xml::Nodes::RootNode)
+          case node
+          when Canon::Xml::Nodes::RootNode
             # Serialize all children of root
             node.children.map { |child| serialize_node(child) }.join
-          elsif node.is_a?(Canon::Xml::Nodes::ElementNode)
+          when Canon::Xml::Nodes::ElementNode
             serialize_element_node(node)
-          elsif node.is_a?(Canon::Xml::Nodes::TextNode)
+          when Canon::Xml::Nodes::TextNode
             # Use original text (with entity references) if available,
             # otherwise fall back to value (decoded text)
             node.original || node.value
-          elsif node.is_a?(Canon::Xml::Nodes::CommentNode)
+          when Canon::Xml::Nodes::CommentNode
             "<!--#{node.value}-->"
-          elsif node.is_a?(Canon::Xml::Nodes::ProcessingInstructionNode)
+          when Canon::Xml::Nodes::ProcessingInstructionNode
             "<?#{node.target} #{node.data}?>"
-          elsif node.respond_to?(:to_xml)
-            node.to_xml
-          elsif node.respond_to?(:to_html)
-            node.to_html
           else
             node.to_s
           end
@@ -121,8 +119,8 @@ module Canon
             node.attribute_nodes.to_h do |attr|
               [attr.name, attr.value]
             end
-          # Nokogiri nodes
-          elsif node.respond_to?(:attributes)
+          # Nokogiri elements
+          elsif node.is_a?(Nokogiri::XML::Element)
             node.attributes.to_h do |_, attr|
               [attr.name, attr.value]
             end
@@ -227,8 +225,8 @@ module Canon
         def same_node_type?(node1, node2)
           return false if node1.class != node2.class
-          # For Nokogiri/Canon::Xml nodes, check node type
-          if node1.respond_to?(:node_type) && node2.respond_to?(:node_type)
+          case node1
+          when Canon::Xml::Node, Nokogiri::XML::Node
             node1.node_type == node2.node_type
           else
             true
@@ -245,20 +243,7 @@ module Canon
         # @param node [Object] Node to check
         # @return [Boolean] true if node is a comment
         def comment_node?(node)
-          return true if node.respond_to?(:comment?) && node.comment?
-          return true if node.respond_to?(:node_type) && node.node_type == :comment
-          # HTML comments are parsed as TEXT nodes by Nokogiri
-          # Check if this is a text node with HTML comment content
-          if text_node?(node)
-            text = node_text(node)
-            # Strip whitespace and backslashes for comparison
-            # Nokogiri escapes HTML comments as "<\\!-- comment -->" in full documents
-            text_stripped = text.to_s.strip.gsub("\\", "")
-            return true if text_stripped.start_with?("<!--") && text_stripped.end_with?("-->")
-          end
-          false
+          NodeInspector.comment_node?(node)
         end
         # Check if a node is a text node
@@ -266,9 +251,7 @@ module Canon
         # @param node [Object] Node to check
         # @return [Boolean] true if node is a text node
         def text_node?(node)
-          (node.respond_to?(:text?) && node.text? &&
-            !node.respond_to?(:element?)) ||
-            (node.respond_to?(:node_type) && node.node_type == :text)
+          NodeInspector.text_node?(node)
         end
         # Get text content from a node
@@ -276,15 +259,7 @@ module Canon
         # @param node [Object] Node to get text from
         # @return [String] Text content
         def node_text(node)
-          # Canon::Xml::Node TextNode uses .value
-          if node.respond_to?(:value)
-            node.value.to_s
-          # Nokogiri nodes use .content
-          elsif node.respond_to?(:content)
-            node.content.to_s
-          else
-            node.to_s
-          end
+          NodeInspector.text_content(node)
         end
         # Check if difference between two texts is only whitespace
@@ -328,7 +303,7 @@ module Canon
           if diff1 == Canon::Comparison::MISSING_NODE && diff2 == Canon::Comparison::MISSING_NODE
             "element structure mismatch (children differ)"
           else
-            "#{diff1} vs #{diff2}"
+            Canon::Comparison.code_pair_label(diff1, diff2)
           end
         end
@@ -371,26 +346,18 @@ module Canon
         def extract_text_content_from_node(node)
           return nil if node.nil?
-          # For Canon::Xml::Nodes::TextNode
-          return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)
-          # For XML/HTML nodes with text_content method
-          return node.text_content if node.respond_to?(:text_content)
-          # For nodes with text method
-          return node.text if node.respond_to?(:text)
-          # For nodes with content method (Moxml::Text)
-          return node.content if node.respond_to?(:content)
-          # For nodes with value method (other types)
-          return node.value if node.respond_to?(:value)
-          # For simple text nodes or strings
-          return node.to_s if node.is_a?(String)
-          # For other node types, try to_s
-          node.to_s
+          case node
+          when Canon::Xml::Nodes::TextNode
+            node.value
+          when Canon::Xml::Node
+            node.text_content
+          when Nokogiri::XML::Node
+            node.content.to_s
+          when String
+            node
+          else
+            node.to_s
+          end
         rescue StandardError
           nil
         end
@@ -444,26 +411,37 @@ module Canon
         # Determine the appropriate dimension for a node type
         #
+        # Used by ChildComparison to tag per-child orphan diffs with a
+        # dimension that matches what the node *is*, so the formatter
+        # renders correctly.  An element orphan tagged :text_content
+        # would otherwise route through PR #126's one-sided text
+        # formatter and render as +text ""+ instead of as the actual
+        # element (see lutaml/canon#125 follow-up).
+        #
         # @param node [Object] The node to check
         # @return [Symbol] The dimension symbol
         def determine_node_dimension(node)
-          # Canon::Xml::Node types
-          if node.respond_to?(:node_type) && node.node_type.is_a?(Symbol)
+          case node
+          when Canon::Xml::Node
             case node.node_type
+            when :element then :element_structure
             when :comment then :comments
             when :text, :cdata then :text_content
             when :processing_instruction then :processing_instructions
             else :text_content
             end
-          # Moxml/Nokogiri types
-          elsif node.respond_to?(:comment?) && node.comment?
-            :comments
-          elsif node.respond_to?(:text?) && node.text?
-            :text_content
-          elsif node.respond_to?(:cdata?) && node.cdata?
-            :text_content
-          elsif node.respond_to?(:processing_instruction?) && node.processing_instruction?
-            :processing_instructions
+          when Nokogiri::XML::Node
+            if node.comment?
+              :comments
+            elsif node.text? || node.cdata?
+              :text_content
+            elsif node.processing_instruction?
+              :processing_instructions
+            elsif node.element?
+              :element_structure
+            else
+              :text_content
+            end
           else
             :text_content
           end

data/lib/canon/comparison/node_inspector.rb ADDED

@@ -0,0 +1,103 @@
+# frozen_string_literal: true
+module Canon
+  module Comparison
+    # Single source of truth for cross-backend node type operations.
+    #
+    # The comparison pipeline handles nodes from two backends:
+    # * Canon::Xml::Node (+ RootNode, ElementNode, TextNode, etc.) —
+    #   custom DOM built by SAX builder and DataModel.
+    # * Nokogiri::XML::Node (+ subclasses) — native Nokogiri nodes used
+    #   by the HTML comparator and some legacy paths.
+    #
+    # Every method here dispatches on type via +case/when+ (+is_a?+).
+    # No +respond_to?+ — the types are known at every call site.
+    module NodeInspector
+      CANON_TEXT_TYPE = :text
+      NOKOGIRI_TEXT_TYPE = defined?(Nokogiri::XML::Node::TEXT_NODE) ? Nokogiri::XML::Node::TEXT_NODE : 3
+      # True when +node+ is a text node (whitespace, content, etc.).
+      def self.text_node?(node)
+        case node
+        when Canon::Xml::Node
+          node.node_type == CANON_TEXT_TYPE
+        when Nokogiri::XML::Node
+          node.node_type == NOKOGIRI_TEXT_TYPE
+        else
+          false
+        end
+      end
+      # Extract the text content of +node+ as a String.
+      def self.text_content(node)
+        case node
+        when Canon::Xml::Node
+          node.value.to_s
+        when Nokogiri::XML::Node
+          node.content.to_s
+        else
+          node.to_s
+        end
+      end
+      # True when +node+ is a text node whose content is whitespace-only.
+      # Empty-string text nodes return false — those represent genuine
+      # empty-vs-content asymmetry, not pretty-print indentation.
+      def self.whitespace_only_text?(node)
+        return false unless text_node?(node)
+        text = text_content(node)
+        !text.empty? && text.strip.empty?
+      end
+      # True when +node+ is a comment node.
+      # For HTML, also detects comments that Nokogiri parses as TEXT nodes
+      # (content like "<!-- comment -->" or escaped "<\\!-- comment -->").
+      def self.comment_node?(node)
+        case node
+        when Canon::Xml::Node
+          node.node_type == :comment
+        when Nokogiri::XML::Node
+          return true if node.comment?
+          # HTML comments are parsed as TEXT nodes by Nokogiri
+          if node.text?
+            text_stripped = text_content(node).to_s.strip.gsub("\\", "")
+            return true if text_stripped.start_with?("<!--") && text_stripped.end_with?("-->")
+          end
+          false
+        else
+          false
+        end
+      end
+      # True when +node+ is an element node.
+      def self.element_node?(node)
+        case node
+        when Canon::Xml::Node
+          node.node_type == :element
+        when Nokogiri::XML::Node
+          node.element?
+        else
+          false
+        end
+      end
+      # Extract parse-time errors carried on a node or its owning document.
+      # Returns an Array of Strings.
+      def self.parse_errors(node)
+        case node
+        when nil
+          []
+        when Canon::Xml::Node
+          errors = node.parse_errors
+          Array(errors).map(&:to_s)
+        when Nokogiri::XML::Document, Nokogiri::HTML5::Document
+          Array(node.errors).map(&:to_s)
+        else
+          []
+        end
+      end
+    end
+  end
+end
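
The comment on `whitespace_only_text?` draws a distinction between empty and whitespace-only text that is easy to miss. A minimal sketch of that predicate follows, assuming a hypothetical `FakeText` struct in place of the `Canon::Xml::Node` and `Nokogiri::XML::Node` types the real module dispatches on.

```ruby
# Hypothetical stand-in for a text node. The real module checks for
# Canon::Xml::Node and Nokogiri::XML::Node instead.
FakeText = Struct.new(:value)

def whitespace_only_text?(node)
  # Anything that is not a (stand-in) text node is rejected up front.
  return false unless node.is_a?(FakeText)

  text = node.value.to_s
  # Empty text is excluded on purpose: it signals empty-vs-content
  # asymmetry rather than pretty-print indentation.
  !text.empty? && text.strip.empty?
end

puts whitespace_only_text?(FakeText.new("\n  "))
puts whitespace_only_text?(FakeText.new(""))
puts whitespace_only_text?(FakeText.new(" a "))
puts whitespace_only_text?("   ")
```

The last call passes a bare String rather than a node, so it is rejected by the type check even though its content is all whitespace.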