RubyGems - canon - Versions diffs - 0.2.5 → 0.2.7 - Mend

canon 0.2.5 → 0.2.7

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (15)

checksums.yaml +4 -4
data/README.adoc +4 -0
data/docs/advanced/diff-classification.adoc +16 -0
data/docs/advanced/semantic-diff-report.adoc +2 -0
data/docs/features/diff-formatting/display-preprocessing.adoc +1 -1
data/docs/features/diff-formatting/index.adoc +3 -1
data/docs/features/diff-formatting/whitespace-adjacency.adoc +100 -13
data/docs/interfaces/ruby-api/index.adoc +20 -0
data/docs/understanding/formats/html.adoc +17 -0
data/lib/canon/comparison/node_inspector.rb +9 -0
data/lib/canon/comparison/xml_comparator.rb +37 -18
data/lib/canon/diff_formatter/diff_detail_formatter/dimension_formatter.rb +28 -8
data/lib/canon/pretty_printer/html.rb +34 -0
data/lib/canon/version.rb +1 -1
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 615e3154c89a9850e86c39852201e5573b461ac62d52cc423523e444ace301f7
-  data.tar.gz: 37ee00969f0682dde670168fbd7888294edda612220bfbebb7c950efbcb76aa2
+  metadata.gz: 1800400419926b8607eb146490d0bc4a0ecf5e4bfaf2b3007a87e99d440661f3
+  data.tar.gz: 0fc8298171e94fec5e9c4b650001fcc31e79ca774e1d7ad0f19fe91308199b18
 SHA512:
-  metadata.gz: bce4239ab6a471edd896fd3b54def4e57e21714078cb3631b55363b50646349a6923eed1e208e5706c3319d3e7a2ae75f2db698ffe853c0e03a754d76c856679
-  data.tar.gz: 1441bd5412658d9d2b975e3889fc95bfd080dec2b89b731f71e191f5ca7bbc7e0a8aa63e787916781bd5e653732c16d5c03b0d3fc3b967a3b653a2a735e62636
+  metadata.gz: 5c553b671df23a70814bedb7521836b01ce7d6e0ab1af99f00aa05ca6f4ef875d8e8af2e4e63c208443d8ddbab92ed7a05c3c8c6828da2608c42bf6a38f0b7c4
+  data.tar.gz: e183e77684bc3fe7c072caa904cf709a3c20e2f6c38099a42f099e3367d2b0c2336f5407a6def0147f7405056a26a639f4b9dd5075deb6b51a60bcb7921c7c44
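
Reviewer note: the digests above can be checked locally. The following is a minimal sketch, assuming the `.gem` archive has already been unpacked so that `data.tar.gz` is available as a file; the helper name `sha256_matches?` and the file path are illustrative and not part of canon or RubyGems.

```ruby
require "digest"

# Compare raw bytes against an expected hex digest, such as a value
# copied from checksums.yaml. Hypothetical helper for illustration.
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.strip.downcase
end

# In-memory demonstration using the well-known SHA-256 of the empty string.
EMPTY_SHA256 = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
puts sha256_matches?("", EMPTY_SHA256) # prints "true"

# Against the 0.2.7 payload (path is an assumption about where it was unpacked):
# bytes = File.binread("data.tar.gz")
# sha256_matches?(bytes, "0fc8298171e94fec5e9c4b650001fcc31e79ca774e1d7ad0f19fe91308199b18")
```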

data/README.adoc CHANGED

@@ -614,6 +614,10 @@ See link:docs/MODES[Diff modes] for details.
 * **Formatting diff detection**: Automatically detects and highlights purely cosmetic whitespace/line break differences
 * **Whitespace visualization**: Make invisible characters visible with CJK-safe
   Unicode symbols
+* **Whitespace adjacency reporting**: Stray whitespace-only text nodes are
+  reported as a dedicated `:whitespace_adjacency` dimension with direction
+  wording (`before`/`after`/`adjacent to`) instead of cascading into
+  misleading `:text_content` mismatches
 * **Non-ASCII detection**: Warnings for unexpected Unicode characters
 * **Customizable**: Character maps, context lines, grouping options

data/docs/advanced/diff-classification.adoc CHANGED

@@ -266,6 +266,22 @@ match dimension but follows the same normative rule as all general dimensions:
   - Structural differences tracked but don't affect equivalence
   - Useful for content-only comparisons where wrapper elements don't matter
+==== Whitespace Adjacency
+`:whitespace_adjacency` is a derived dimension — emitted when the
+alignment walk pairs a whitespace-only text node on one side against a
+content node on the other. The Reason line describes the whitespace's
+direction relative to the partner content: `before`, `after`, or
+`adjacent to`.
+* **Always normative** — differences always affect equivalence
+* **Not user-configurable** — dimension is always tracked when the
+  re-alignment walk encounters an asymmetric whitespace node
+* **Report-only** — does not change equivalence outcomes compared to
+  pre-#137 behaviour; only changes the diff-report shape (see
+  link:../features/diff-formatting/whitespace-adjacency.adoc[Whitespace
+  adjacency] for details)
 .Example: Comment handling
 ====
 [source,ruby]

data/docs/advanced/semantic-diff-report.adoc CHANGED

@@ -212,6 +212,8 @@ Reason:  Text: "¬······:¬······"
 This fallback is implemented in `Canon::DiffFormatter::DiffDetailFormatterHelpers::DimensionFormatter.format_text_content_details` and only triggers when `TextUtils.ambiguous_text_pair?` returns `true` _and_ at least one side has a parent element to render.
+The same fallback also applies to the `whitespace_adjacency` dimension (see <<whitespace-adjacency,Whitespace adjacency>>): when the alignment partner of a stray whitespace node extracts to an empty / whitespace-only string, the Reason line reads `Whitespace inside <PARENT>` (rather than `Whitespace before ""`), and the Expected/Actual block surfaces each side's parent element compactly. See `format_whitespace_adjacency_details` and `Canon::Comparison::XmlComparator#build_whitespace_adjacency_reason`.
 ==== One-sided text diffs (added or removed text nodes)
 When a `text_content` difference carries a text node on one side and `nil` on the other (issue #125) -- the shape that fragment-length mismatches and child-comparison emit when a text-node child is missing -- the renderer mirrors `element_structure`: the missing side reads `(not present)`, and the present side reads the text-node content (whitespace-visualised) plus a brief parent open-tag hint for context. The full ancestor subtree is *not* dumped; only the immediate parent's opening tag is shown, so a missing whitespace text node cannot make the diff look like the entire ancestor differs.

data/docs/features/diff-formatting/display-preprocessing.adoc CHANGED

@@ -430,7 +430,7 @@ pretty-printer. This is a known future work item.
 |✓ Full
 |✓ (via XML serializer)
 |✓ Full
-|`:pretty_print` uses `Canon::PrettyPrinter::Html`; `:normalize_pretty_print` falls back to `XmlNormalized` pending a dedicated `HtmlNormalized`; `:c14n` uses Nokogiri HTML5 serialization
+|`:pretty_print` uses `Canon::PrettyPrinter::Html` in fixture-ready mode (`FORMAT|AS_XHTML|NO_DECLARATION`); `:normalize_pretty_print` falls back to `XmlNormalized` pending a dedicated `HtmlNormalized`; `:c14n` uses Nokogiri HTML5 serialization. In fixture-ready mode, stray structural whitespace (whitespace-only text nodes between block-level siblings) is stripped before formatting so that libxml's `FORMAT` flag produces correct indentation. Whitespace inside `<pre>`, `<script>`, `<style>`, and `<textarea>` is preserved.
 |JSON
 |Planned

data/docs/features/diff-formatting/index.adoc CHANGED

@@ -30,7 +30,9 @@ Canon's diff formatting includes:
   algorithms
 * **Whitespace adjacency**: Stray whitespace-only text nodes are anchored at
   themselves instead of cascading into mismatches against neighbouring
-  content (link:./whitespace-adjacency.adoc[details])
+  content. The Reason line names the direction relative to the partner
+  (`before`/`after`/`adjacent to`)
+  (link:./whitespace-adjacency.adoc[details])
 == Available formatting options

data/docs/features/diff-formatting/whitespace-adjacency.adoc CHANGED

@@ -75,28 +75,113 @@ After the new contract, the cascade above collapses to:
 [source]
 ----
-DIFFERENCE #1 — whitespace_adjacency: Whitespace surrounding "20483":
+DIFFERENCE #1 — whitespace_adjacency: Whitespace before "20483":
                   present on EXPECTED ("↵░░"), absent on ACTUAL
-DIFFERENCE #2 — whitespace_adjacency: Whitespace surrounding ",":
+DIFFERENCE #2 — whitespace_adjacency: Whitespace before ",":
                   present on EXPECTED ("↵░░"), absent on ACTUAL
 DIFFERENCE #3 — text_content: "↵░░,↵░░"  vs  ", "
 ----
-== Adjacency positions
+== Direction relative to the partner
-The Reason line names the adjacency position of the whitespace node
-relative to its non-whitespace siblings:
+The Reason line names the document-order position of the whitespace
+node relative to the *partner content node* it was zipped against by
+the alignment walk. The partner is the next (or, at parent edge,
+previous) non-whitespace sibling on the whitespace-bearing side, which
+is what aligns against the corresponding content node on the other
+side.
-`:preceding`::  Whitespace at the start of its parent (no non-whitespace
-sibling before it, has one after it).
+`before`::  The whitespace immediately precedes its next non-whitespace
+sibling. This is the common case (e.g. indentation between two inline
+spans where the asymmetric whitespace sits on the leading edge of the
+partner).
-`:following`::  Whitespace at the end of its parent (has a non-whitespace
-sibling before it, none after).
+`after`::  The whitespace trails the previous non-whitespace sibling
+and has no non-whitespace sibling after it. Emitted at the trailing
+edge of a parent.
-`:surrounding`::  Sandwiched between two non-whitespace siblings.
+`adjacent to`::  Degenerate fallback for a whitespace node with no
+non-whitespace siblings at all. Rarely emitted.
-`:isolated`::  No non-whitespace siblings at all (degenerate; rarely
-emitted).
+When the alignment partner extracts to an empty / whitespace-only
+string (e.g. an element with no text descendants), the direction
+phrasing degenerates to `Whitespace before ""` which carries no
+information. In that case Canon falls back to naming the parent
+element instead — `Whitespace inside <PARENT>` — and the
+Expected/Actual detail block renders each side's parent element
+compactly per the contract from
+link:../../advanced/semantic-diff-report.adoc#parent-context-fallback-for-ambiguous-text-diffs[issue #112].
+NOTE: An earlier wording (`Whitespace surrounding "X"`) classified the
+*whitespace node's position among its own siblings* rather than its
+direction relative to the partner. That label was misleading when the
+whitespace sat between two element siblings but the asymmetry was
+one-sided — see issue #137 follow-up.
+=== Examples
+==== "before" — whitespace between inline element siblings
+[source,ruby]
+----
+html1 = "<a><span>ISO </span>\n   <span>712</span></a>"
+html2 = "<a><span>ISO </span><span>712</span></a>"
+result = Canon::Comparison.equivalent?(html1, html2,
+  format: :html5, verbose: true)
+# => #<ComparisonResult equivalent=false>
+# The stray "\n   " between two spans is the only asymmetric node.
+# It sits immediately before <span>712</span>, its next non-ws sibling.
+result.differences.first.reason
+# => "Whitespace before \"712\": present on EXPECTED (\"░\"), absent on ACTUAL"
+----
+==== "after" — trailing whitespace in a whitespace-preserving element
+[source,ruby]
+----
+# <code> preserves whitespace, so the trailing newline survives the
+# upstream filter and pairs against the extra <b>B</b> on the other side.
+html1 = "<code><b>A</b>\n</code>"
+html2 = "<code><b>A</b><b>B</b></code>"
+result = Canon::Comparison.equivalent?(html1, html2,
+  format: :html5, verbose: true)
+result.differences.first.reason
+# => "Whitespace after \"B\": present on EXPECTED (\"↵\"), absent on ACTUAL"
+----
+==== "adjacent to" — sole-child whitespace node
+[source,ruby]
+----
+# A whitespace-only text node as the only child of <code>, paired
+# against an element on the other side.  No non-ws siblings exist.
+html1 = "<code>\n</code>"
+html2 = "<code><b>A</b></code>"
+result = Canon::Comparison.equivalent?(html1, html2,
+  format: :html5, verbose: true)
+result.differences.first.reason
+# => "Whitespace adjacent to \"A\": present on EXPECTED (\"↵\"), absent on ACTUAL"
+----
+== Working with :whitespace_adjacency diffs programmatically
+Use the `dimension` field on `DiffNode` to filter:
+[source,ruby]
+----
+result = Canon::Comparison.equivalent?(html1, html2,
+  format: :html5, verbose: true)
+# Find all whitespace-adjacency diffs
+ws_diffs = result.differences.select { |d| d.dimension == :whitespace_adjacency }
+# These are always normative — they affect the equivalence verdict
+ws_diffs.all?(&:normative?)  # => true
+----
 == What this contract does NOT do
@@ -137,4 +222,6 @@ whitespace node as a single normative `:whitespace_adjacency` diff.
 The cascade behaviour was reported in
 https://github.com/lutaml/canon/issues/137[issue #137]. The fix landed
-as a report-only re-alignment in PR #138.
+as a report-only re-alignment in PR #138. PR #141 replaced the
+misleading `:surrounding`/`:preceding`/`:following` position labels
+with direction-faithful wording (`before`/`after`/`adjacent to`).

data/docs/interfaces/ruby-api/index.adoc CHANGED

@@ -116,6 +116,9 @@ Where:
 `{Format}`:: The format module (`Xml`, `Html`, `Json`)
 `n`:: Number of spaces (default: 2) or tabs (use 1 for tabs)
 `type`:: Indentation type: `'space'` (default) or `'tab'`
+`fixture_ready`:: (HTML only) When `true`, emit indented XHTML-shaped
+output that strips structural whitespace before formatting. Designed for
+copy-paste into RSpec heredoc fixtures. Default: `false`.
 `content`:: The input string
 .Pretty-print examples
@@ -151,6 +154,23 @@ Canon::Xml::PrettyPrinter.new(
 html_input = '<div><p>Hello</p></div>'
 Canon::Html::PrettyPrinter.new(indent: 2).format(html_input)
+# HTML fixture-ready mode: produces indented XHTML-shaped output
+# suitable for pasting into RSpec heredoc fixtures.  Strips stray
+# structural whitespace (inter-element text nodes) so libxml's FORMAT
+# flag can indent block-level siblings that would otherwise be treated
+# as mixed content.  Whitespace inside <pre>, <script>, <style>, and
+# <textarea> is preserved.
+Canon::Html::PrettyPrinter.new(indent: 2, fixture_ready: true)
+  .format('<html><body><div>a</div> <div>b</div></body></html>')
+# =>
+# <html xmlns="http://www.w3.org/1999/xhtml">
+#   <head>...</head>
+#   <body>
+#     <div>a</div>
+#     <div>b</div>
+#   </body>
+# </html>
 # JSON with 2-space indentation
 json_input = '{"z":3,"a":{"b":1}}'
 Canon::Json::PrettyPrinter.new(indent: 2).format(json_input)

data/docs/understanding/formats/html.adoc CHANGED

@@ -235,6 +235,23 @@ HTML whitespace is collapsed per CSS rendering rules. Empty text nodes between e
 Multiple spaces within text content are collapsed to single spaces when `text_content: :normalize` is used.
 ====
+==== Fixture-ready pretty-print and structural whitespace
+When using `Canon::PrettyPrinter::Html` with `fixture_ready: true` (the mode
+used by the diff pipeline's *PRETTY-PRINTED INPUTS* section), Canon strips
+stray structural whitespace before formatting.  Real-world HTML5 input from
+upstream pipelines often carries whitespace-only text nodes between block-level
+siblings (`<body>` → `<div>`, `<br>`, `<div>`, ...).  libxml's `FORMAT` flag
+treats any element with a non-whitespace-only text child as mixed content and
+refuses to indent its children — producing a single-line blob instead of a
+readable tree.
+The fixture-ready mode removes whitespace-only text nodes from parents that
+are purely structural (no real text content) and are not whitespace-preserving
+elements (`<pre>`, `<script>`, `<style>`, `<textarea>`).  Mixed-content runs
+like `<p>foo <em>bar</em> baz</p>` are left untouched so that significant
+inline whitespace is preserved.
 === Attribute order
 HTML attributes are inherently unordered per the HTML specification, so default is `:ignore`.

data/lib/canon/comparison/node_inspector.rb CHANGED

@@ -98,6 +98,15 @@ module Canon
           []
         end
       end
+      # Return the parent node of +node+, or nil when +node+ is not a
+      # recognised DOM backend type or has no parent.
+      def self.parent_of(node)
+        case node
+        when Canon::Xml::Node, Nokogiri::XML::Node
+          node.parent
+        end
+      end
     end
   end
 end

data/lib/canon/comparison/xml_comparator.rb CHANGED

@@ -873,36 +873,55 @@ differences)
             return build_text_diff_reason(text1, text2)
           end
-          position = whitespace_adjacency_position(ws_node)
           ws_vis = visualize_whitespace(ws_text)
-          content_vis = content_text ? visualize_whitespace(truncate_text(content_text)) : "(none)"
-          "Whitespace #{position} \"#{content_vis}\": " \
-            "present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
+          if content_text.nil? || content_text.strip.empty?
+            # Partner content extracts to "" / whitespace-only — naming it
+            # in the Reason ("Whitespace before \"\"") gives the reader
+            # nothing.  Fall back to the parent element name so the
+            # diff carries structural context (issue #112's contract,
+            # extended from :text_content to :whitespace_adjacency).
+            parent_label = whitespace_adjacency_parent_label(ws_node)
+            "Whitespace inside #{parent_label}: " \
+              "present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
+          else
+            direction = whitespace_partner_direction(ws_node)
+            content_vis = visualize_whitespace(truncate_text(content_text))
+            "Whitespace #{direction} \"#{content_vis}\": " \
+              "present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
+          end
         end
-        def whitespace_adjacency_position(ws_node)
-          return :isolated unless ws_node.is_a?(Canon::Xml::Node) ||
-            ws_node.is_a?(Nokogiri::XML::Node)
+        def whitespace_adjacency_parent_label(ws_node)
+          parent = NodeInspector.parent_of(ws_node)
+          return "(unknown parent)" unless parent
-          parent = ws_node.parent
-          return :isolated if parent.nil?
+          name = parent.name
+          name && !name.empty? ? "<#{name}>" : "(unknown parent)"
+        end
+        # Direction of the partner content relative to the whitespace node,
+        # phrased from the partner's point of view: "before" when the
+        # whitespace immediately precedes its next non-whitespace sibling
+        # (the alignment partner on the other side), "after" when the
+        # whitespace trails the previous non-whitespace sibling, or
+        # "adjacent to" as a degenerate fallback when neither neighbour
+        # exists.
+        def whitespace_partner_direction(ws_node)
+          parent = NodeInspector.parent_of(ws_node)
+          return "adjacent to" unless parent
           siblings = parent.children
           idx = siblings.index(ws_node)
-          return :isolated unless idx
-          before = sibling_with_content?(siblings, idx, -1)
-          after = sibling_with_content?(siblings, idx, 1)
+          return "adjacent to" unless idx
-          if before && after then :surrounding
-          elsif before then :following
-          elsif after then :preceding
-          else :isolated
+          if non_ws_sibling_exists?(siblings, idx, 1) then "before"
+          elsif non_ws_sibling_exists?(siblings, idx, -1) then "after"
+          else "adjacent to"
           end
         end
-        def sibling_with_content?(siblings, idx, direction)
+        def non_ws_sibling_exists?(siblings, idx, direction)
           i = idx + direction
           while i >= 0 && i < siblings.length
             s = siblings[i]

data/lib/canon/diff_formatter/diff_detail_formatter/dimension_formatter.rb CHANGED

@@ -525,14 +525,34 @@ expand_difference: false)
           text1 = NodeUtils.get_node_text(node1).to_s
           text2 = NodeUtils.get_node_text(node2).to_s
-          detail1 = ColorHelper.colorize(
-            "\"#{TextUtils.visualize_whitespace(text1)}\"",
-            :red, use_color
-          )
-          detail2 = ColorHelper.colorize(
-            "\"#{TextUtils.visualize_whitespace(text2)}\"",
-            :green, use_color
-          )
+          if TextUtils.ambiguous_text_pair?(text1, text2) &&
+              (NodeUtils.parent_of(node1) || NodeUtils.parent_of(node2))
+            # Both sides extract to empty / whitespace-only strings —
+            # `""` / `""` tells the reader nothing.  Fall back to a
+            # brief parent open-tag hint per #112's contract, but
+            # without dumping the full ancestor subtree (#125).
+            hint1 = NodeUtils.serialize_open_tag(NodeUtils.parent_of(node1))
+            hint2 = NodeUtils.serialize_open_tag(NodeUtils.parent_of(node2))
+            ws1 = TextUtils.visualize_whitespace(text1)
+            ws2 = TextUtils.visualize_whitespace(text2)
+            detail1 = ColorHelper.colorize(
+              "\"#{ws1}\" in #{hint1}",
+              :red, use_color
+            )
+            detail2 = ColorHelper.colorize(
+              "\"#{ws2}\" in #{hint2}",
+              :green, use_color
+            )
+          else
+            detail1 = ColorHelper.colorize(
+              "\"#{TextUtils.visualize_whitespace(text1)}\"",
+              :red, use_color
+            )
+            detail2 = ColorHelper.colorize(
+              "\"#{TextUtils.visualize_whitespace(text2)}\"",
+              :green, use_color
+            )
+          end
           reason = if diff.is_a?(Canon::Diff::DiffNode)
                      diff.reason

data/lib/canon/pretty_printer/html.rb CHANGED

@@ -29,6 +29,8 @@ module Canon
     #
     # See lutaml/canon#133, lutaml/canon#135.
     class Html
+      WHITESPACE_PRESERVING_ELEMENTS = %w[pre textarea script style].freeze
       def initialize(indent: 2, indent_type: "space", fixture_ready: false)
         @indent = indent.to_i
         @indent_type = indent_type
@@ -83,6 +85,7 @@ module Canon
       # suppresses the +<?xml ...?>+ prefix.
       def format_fixture_ready(html_string)
         doc = Nokogiri::HTML5(html_string)
+        strip_structural_whitespace!(doc)
         io = StringIO.new
         if @indent_type == "tab"
           doc.write_to(io, save_with: fixture_ready_save_options,
@@ -94,6 +97,37 @@ module Canon
         io.string
       end
+      # libxml's +FORMAT+ save flag does not insert indentation around
+      # the children of any element it sees as mixed content (any
+      # non-whitespace-only text node child).  +Nokogiri::HTML5+ does
+      # not accept the +noblanks+ option that the XML parser uses to
+      # strip these inter-sibling text nodes pre-serialisation, so we
+      # do it manually here: drop whitespace-only text nodes whose
+      # parent is structural (no real text content) and not a
+      # whitespace-preserving element.  Mixed-content runs like
+      # +<p>foo <em>bar</em> baz</p>+ are left alone.
+      def strip_structural_whitespace!(doc)
+        to_remove = []
+        doc.traverse do |node|
+          next unless node.text?
+          next unless node.content.strip.empty?
+          parent = node.parent
+          next if parent.nil?
+          next if WHITESPACE_PRESERVING_ELEMENTS.include?(parent.name)
+          next if parent_has_real_text?(parent)
+          to_remove << node
+        end
+        to_remove.each(&:remove)
+      end
+      def parent_has_real_text?(parent)
+        parent.children.any? do |c|
+          c.text? && !c.content.strip.empty?
+        end
+      end
       def fixture_ready_save_options
         Nokogiri::XML::Node::SaveOptions::FORMAT |
           Nokogiri::XML::Node::SaveOptions::AS_XHTML |

data/lib/canon/version.rb CHANGED

@@ -1,5 +1,5 @@
 # frozen_string_literal: true
 module Canon
-  VERSION = "0.2.5"
+  VERSION = "0.2.7"
 end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: canon
 version: !ruby/object:Gem::Version
-  version: 0.2.5
+  version: 0.2.7
 platform: ruby
 authors:
 - Ribose Inc.
 autorequire:
 bindir: exe
 cert_chain: []
-date: 2026-05-03 00:00:00.000000000 Z
+date: 2026-05-04 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: diff-lcs