canon 0.2.6 → 0.2.7

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 6437f1a8b556bb49bffbcecf47ec0eeecabdf6541bd5baa5954ac88f98f33a2c
-   data.tar.gz: 98eff2aa558165dc7e13c8d29da21d8d9c6589cae1d48a18d27f0420d6be7198
+   metadata.gz: 1800400419926b8607eb146490d0bc4a0ecf5e4bfaf2b3007a87e99d440661f3
+   data.tar.gz: 0fc8298171e94fec5e9c4b650001fcc31e79ca774e1d7ad0f19fe91308199b18
  SHA512:
-   metadata.gz: 055614c143bca292b575755f5b4a1554a002e0d6f264ddee3e29049f89d6f9795a61069c1bdf7ebfa459ed11ac2a21203a779e115f8aee143a2aa3c77951a086
-   data.tar.gz: ff64c25654c1eef41dcc80b471df2958c516e69fa625723bdecfd18c5a99716e7f347c313bf72c28f3b898547d5c97aac3ebb62bcf44e726bc49e668a73e0dd9
+   metadata.gz: 5c553b671df23a70814bedb7521836b01ce7d6e0ab1af99f00aa05ca6f4ef875d8e8af2e4e63c208443d8ddbab92ed7a05c3c8c6828da2608c42bf6a38f0b7c4
+   data.tar.gz: e183e77684bc3fe7c072caa904cf709a3c20e2f6c38099a42f099e3367d2b0c2336f5407a6def0147f7405056a26a639f4b9dd5075deb6b51a60bcb7921c7c44
@@ -212,6 +212,8 @@ Reason: Text: "¬······:¬······"
 
  This fallback is implemented in `Canon::DiffFormatter::DiffDetailFormatterHelpers::DimensionFormatter.format_text_content_details` and only triggers when `TextUtils.ambiguous_text_pair?` returns `true` _and_ at least one side has a parent element to render.
 
+ The same fallback also applies to the `whitespace_adjacency` dimension (see <<whitespace-adjacency,Whitespace adjacency>>): when the alignment partner of a stray whitespace node extracts to an empty / whitespace-only string, the Reason line reads `Whitespace inside <PARENT>` (rather than `Whitespace before ""`), and the Expected/Actual block surfaces each side's parent element compactly. See `format_whitespace_adjacency_details` and `Canon::Comparison::XmlComparator#build_whitespace_adjacency_reason`.
+
  ==== One-sided text diffs (added or removed text nodes)
 
  When a `text_content` difference carries a text node on one side and `nil` on the other (issue #125) -- the shape that fragment-length mismatches and child-comparison emit when a text-node child is missing -- the renderer mirrors `element_structure`: the missing side reads `(not present)`, and the present side reads the text-node content (whitespace-visualised) plus a brief parent open-tag hint for context. The full ancestor subtree is *not* dumped; only the immediate parent's opening tag is shown, so a missing whitespace text node cannot make the diff look like the entire ancestor differs.
@@ -430,7 +430,7 @@ pretty-printer. This is a known future work item.
  |✓ Full
  |✓ (via XML serializer)
  |✓ Full
- |`:pretty_print` uses `Canon::PrettyPrinter::Html`; `:normalize_pretty_print` falls back to `XmlNormalized` pending a dedicated `HtmlNormalized`; `:c14n` uses Nokogiri HTML5 serialization
+ |`:pretty_print` uses `Canon::PrettyPrinter::Html` in fixture-ready mode (`FORMAT|AS_XHTML|NO_DECLARATION`); `:normalize_pretty_print` falls back to `XmlNormalized` pending a dedicated `HtmlNormalized`; `:c14n` uses Nokogiri HTML5 serialization. In fixture-ready mode, stray structural whitespace (whitespace-only text nodes between block-level siblings) is stripped before formatting so that libxml's `FORMAT` flag produces correct indentation. Whitespace inside `<pre>`, `<script>`, `<style>`, and `<textarea>` is preserved.
 
  |JSON
  |Planned
@@ -103,6 +103,15 @@ edge of a parent.
  `adjacent to`:: Degenerate fallback for a whitespace node with no
  non-whitespace siblings at all. Rarely emitted.
 
+ When the alignment partner extracts to an empty / whitespace-only
+ string (e.g. an element with no text descendants), the direction
+ phrasing degenerates to `Whitespace before ""` which carries no
+ information. In that case Canon falls back to naming the parent
+ element instead — `Whitespace inside <PARENT>` — and the
+ Expected/Actual detail block renders each side's parent element
+ compactly per the contract from
+ link:../../advanced/semantic-diff-report.adoc#parent-context-fallback-for-ambiguous-text-diffs[issue #112].
+
  NOTE: An earlier wording (`Whitespace surrounding "X"`) classified the
  *whitespace node's position among its own siblings* rather than its
  direction relative to the partner. That label was misleading when the
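
A minimal standalone sketch of the Reason-line fallback described in this hunk may help when reading the comparator change further down. It is not Canon's code: `whitespace_reason`, its keyword arguments, and the stand-in `visualize_whitespace` (which assumes `·` for a space and `¬` for a newline) are hypothetical names chosen for illustration.

```ruby
# Illustrative sketch only; none of these method names are Canon API.

# Stand-in for Canon's whitespace visualiser (assumed glyph mapping).
def visualize_whitespace(text)
  text.gsub(" ", "·").gsub("\n", "¬")
end

# Builds the Reason line for a stray whitespace node.
# content_text: text extracted from the alignment partner (may be nil).
# parent_name:  name of the whitespace node's parent element (may be nil).
def whitespace_reason(ws_text:, content_text:, direction:, parent_name:,
                      present_side:, absent_side:)
  ws_vis = visualize_whitespace(ws_text)
  tail = "present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"

  if content_text.nil? || content_text.strip.empty?
    # Partner carries no text: name the parent element instead.
    label = parent_name.nil? || parent_name.empty? ? "(unknown parent)" : "<#{parent_name}>"
    "Whitespace inside #{label}: #{tail}"
  else
    "Whitespace #{direction} \"#{visualize_whitespace(content_text)}\": #{tail}"
  end
end

puts whitespace_reason(ws_text: "\n  ", content_text: "", direction: "before",
                       parent_name: "div", present_side: "expected",
                       absent_side: "actual")
```

The sketch passes the parent name in directly; the real change resolves it from the node via `NodeInspector.parent_of`.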
@@ -116,6 +116,9 @@ Where:
  `{Format}`:: The format module (`Xml`, `Html`, `Json`)
  `n`:: Number of spaces (default: 2) or tabs (use 1 for tabs)
  `type`:: Indentation type: `'space'` (default) or `'tab'`
+ `fixture_ready`:: (HTML only) When `true`, emit indented XHTML-shaped
+ output that strips structural whitespace before formatting. Designed for
+ copy-paste into RSpec heredoc fixtures. Default: `false`.
  `content`:: The input string
 
  .Pretty-print examples
@@ -151,6 +154,23 @@ Canon::Xml::PrettyPrinter.new(
  html_input = '<div><p>Hello</p></div>'
  Canon::Html::PrettyPrinter.new(indent: 2).format(html_input)
 
+ # HTML fixture-ready mode: produces indented XHTML-shaped output
+ # suitable for pasting into RSpec heredoc fixtures. Strips stray
+ # structural whitespace (inter-element text nodes) so libxml's FORMAT
+ # flag can indent block-level siblings that would otherwise be treated
+ # as mixed content. Whitespace inside <pre>, <script>, <style>, and
+ # <textarea> is preserved.
+ Canon::Html::PrettyPrinter.new(indent: 2, fixture_ready: true)
+   .format('<html><body><div>a</div> <div>b</div></body></html>')
+ # =>
+ # <html xmlns="http://www.w3.org/1999/xhtml">
+ #   <head>...</head>
+ #   <body>
+ #     <div>a</div>
+ #     <div>b</div>
+ #   </body>
+ # </html>
+
  # JSON with 2-space indentation
  json_input = '{"z":3,"a":{"b":1}}'
  Canon::Json::PrettyPrinter.new(indent: 2).format(json_input)
@@ -235,6 +235,23 @@ HTML whitespace is collapsed per CSS rendering rules. Empty text nodes between e
  Multiple spaces within text content are collapsed to single spaces when `text_content: :normalize` is used.
  ====
 
+ ==== Fixture-ready pretty-print and structural whitespace
+
+ When using `Canon::PrettyPrinter::Html` with `fixture_ready: true` (the mode
+ used by the diff pipeline's *PRETTY-PRINTED INPUTS* section), Canon strips
+ stray structural whitespace before formatting. Real-world HTML5 input from
+ upstream pipelines often carries whitespace-only text nodes between block-level
+ siblings (`<body>` → `<div>`, `<br>`, `<div>`, ...). libxml's `FORMAT` flag
+ treats any element with a non-whitespace-only text child as mixed content and
+ refuses to indent its children — producing a single-line blob instead of a
+ readable tree.
+
+ The fixture-ready mode removes whitespace-only text nodes from parents that
+ are purely structural (no real text content) and are not whitespace-preserving
+ elements (`<pre>`, `<script>`, `<style>`, `<textarea>`). Mixed-content runs
+ like `<p>foo <em>bar</em> baz</p>` are left untouched so that significant
+ inline whitespace is preserved.
+
  === Attribute order
 
  HTML attributes are inherently unordered per the HTML specification, so default is `:ignore`.
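
The stripping rule documented in this hunk can be sketched on a toy tree without any XML library. Everything below is an assumption made for illustration: `Node`, `txt`, `el`, and the non-mutating `strip_structural_whitespace` are hypothetical helpers, whereas the package's actual implementation (shown in a later hunk) mutates a Nokogiri document in place.

```ruby
# Toy DOM used only for this sketch; a hypothetical stand-in for Nokogiri nodes.
Node = Struct.new(:name, :content, :children, keyword_init: true) do
  def text?
    name == "#text"
  end
end

PRESERVING = %w[pre textarea script style].freeze

def txt(str)
  Node.new(name: "#text", content: str, children: [])
end

def el(name, *children)
  Node.new(name: name, content: nil, children: children)
end

# Returns a copy of the tree in which whitespace-only text nodes are dropped
# from parents that have no real text child and are not whitespace-preserving.
def strip_structural_whitespace(node)
  return node if node.text?

  kids = node.children
  has_real_text = kids.any? { |c| c.text? && !c.content.strip.empty? }
  unless PRESERVING.include?(node.name) || has_real_text
    kids = kids.reject { |c| c.text? && c.content.strip.empty? }
  end
  el(node.name, *kids.map { |c| strip_structural_whitespace(c) })
end

body = el("body", el("div", txt("a")), txt(" "), el("div", txt("b")))
para = el("p", txt("foo "), el("em", txt("bar")), txt(" baz"))

p strip_structural_whitespace(body).children.map(&:name) # => ["div", "div"]
p strip_structural_whitespace(para).children.size        # => 3
```

The `<body>` case loses its inter-sibling space because it has no real text child, while the `<p>` case keeps all three children because it is mixed content.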
@@ -98,6 +98,15 @@ module Canon
            []
          end
        end
+
+       # Return the parent node of +node+, or nil when +node+ is not a
+       # recognised DOM backend type or has no parent.
+       def self.parent_of(node)
+         case node
+         when Canon::Xml::Node, Nokogiri::XML::Node
+           node.parent
+         end
+       end
      end
    end
  end
@@ -873,12 +873,31 @@ differences)
            return build_text_diff_reason(text1, text2)
          end
 
-         direction = whitespace_partner_direction(ws_node)
          ws_vis = visualize_whitespace(ws_text)
-         content_vis = content_text ? visualize_whitespace(truncate_text(content_text)) : "(none)"
 
-         "Whitespace #{direction} \"#{content_vis}\": " \
-           "present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
+         if content_text.nil? || content_text.strip.empty?
+           # Partner content extracts to "" / whitespace-only — naming it
+           # in the Reason ("Whitespace before \"\"") gives the reader
+           # nothing. Fall back to the parent element name so the
+           # diff carries structural context (issue #112's contract,
+           # extended from :text_content to :whitespace_adjacency).
+           parent_label = whitespace_adjacency_parent_label(ws_node)
+           "Whitespace inside #{parent_label}: " \
+             "present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
+         else
+           direction = whitespace_partner_direction(ws_node)
+           content_vis = visualize_whitespace(truncate_text(content_text))
+           "Whitespace #{direction} \"#{content_vis}\": " \
+             "present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
+         end
+       end
+
+       def whitespace_adjacency_parent_label(ws_node)
+         parent = NodeInspector.parent_of(ws_node)
+         return "(unknown parent)" unless parent
+
+         name = parent.name
+         name && !name.empty? ? "<#{name}>" : "(unknown parent)"
        end
 
        # Direction of the partner content relative to the whitespace node,
  # Direction of the partner content relative to the whitespace node,
@@ -889,11 +908,8 @@ differences)
        # "adjacent to" as a degenerate fallback when neither neighbour
        # exists.
        def whitespace_partner_direction(ws_node)
-         return "adjacent to" unless ws_node.is_a?(Canon::Xml::Node) ||
-                                     ws_node.is_a?(Nokogiri::XML::Node)
-
-         parent = ws_node.parent
-         return "adjacent to" if parent.nil?
+         parent = NodeInspector.parent_of(ws_node)
+         return "adjacent to" unless parent
 
          siblings = parent.children
          idx = siblings.index(ws_node)
@@ -525,14 +525,34 @@ expand_difference: false)
            text1 = NodeUtils.get_node_text(node1).to_s
            text2 = NodeUtils.get_node_text(node2).to_s
 
-           detail1 = ColorHelper.colorize(
-             "\"#{TextUtils.visualize_whitespace(text1)}\"",
-             :red, use_color
-           )
-           detail2 = ColorHelper.colorize(
-             "\"#{TextUtils.visualize_whitespace(text2)}\"",
-             :green, use_color
-           )
+           if TextUtils.ambiguous_text_pair?(text1, text2) &&
+               (NodeUtils.parent_of(node1) || NodeUtils.parent_of(node2))
+             # Both sides extract to empty / whitespace-only strings —
+             # `""` / `""` tells the reader nothing. Fall back to a
+             # brief parent open-tag hint per #112's contract, but
+             # without dumping the full ancestor subtree (#125).
+             hint1 = NodeUtils.serialize_open_tag(NodeUtils.parent_of(node1))
+             hint2 = NodeUtils.serialize_open_tag(NodeUtils.parent_of(node2))
+             ws1 = TextUtils.visualize_whitespace(text1)
+             ws2 = TextUtils.visualize_whitespace(text2)
+             detail1 = ColorHelper.colorize(
+               "\"#{ws1}\" in #{hint1}",
+               :red, use_color
+             )
+             detail2 = ColorHelper.colorize(
+               "\"#{ws2}\" in #{hint2}",
+               :green, use_color
+             )
+           else
+             detail1 = ColorHelper.colorize(
+               "\"#{TextUtils.visualize_whitespace(text1)}\"",
+               :red, use_color
+             )
+             detail2 = ColorHelper.colorize(
+               "\"#{TextUtils.visualize_whitespace(text2)}\"",
+               :green, use_color
+             )
+           end
 
            reason = if diff.is_a?(Canon::Diff::DiffNode)
                       diff.reason
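
The Expected/Actual branch in the hunk above can be sketched without Canon's helpers. This is a rough illustration under stated assumptions: `ambiguous_text_pair?` is given the semantics implied by the hunk's comment (both sides empty or whitespace-only), `detail_pair` and its pre-serialised `open_tag` arguments are hypothetical, colouring is omitted, and omitting the hint for a side without a parent is this sketch's own choice rather than documented behaviour.

```ruby
# Illustrative sketch only; these are not Canon's TextUtils/NodeUtils methods.

# Assumed semantics: both sides are empty or whitespace-only.
def ambiguous_text_pair?(text1, text2)
  text1.strip.empty? && text2.strip.empty?
end

# Stand-in whitespace visualiser (assumed glyph mapping).
def visualize_whitespace(text)
  text.gsub(" ", "·").gsub("\n", "¬")
end

# open_tag1 / open_tag2: the parent's opening tag as a string, or nil.
# Returns [expected_detail, actual_detail].
def detail_pair(text1, text2, open_tag1, open_tag2)
  quoted = [text1, text2].map { |t| "\"#{visualize_whitespace(t)}\"" }
  return quoted unless ambiguous_text_pair?(text1, text2) && (open_tag1 || open_tag2)

  quoted.zip([open_tag1, open_tag2]).map do |q, tag|
    tag ? "#{q} in #{tag}" : q # no hint when that side has no parent
  end
end

expected, actual = detail_pair("\n  ", "", "<div class=\"note\">", "<div>")
puts expected
puts actual
```

Only the immediate parent's opening tag appears in the hint, which matches the hunk's stated goal of giving context without dumping the full ancestor subtree.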
@@ -29,6 +29,8 @@ module Canon
      #
      # See lutaml/canon#133, lutaml/canon#135.
      class Html
+       WHITESPACE_PRESERVING_ELEMENTS = %w[pre textarea script style].freeze
+
        def initialize(indent: 2, indent_type: "space", fixture_ready: false)
          @indent = indent.to_i
          @indent_type = indent_type
@@ -83,6 +85,7 @@ module Canon
        # suppresses the +<?xml ...?>+ prefix.
        def format_fixture_ready(html_string)
          doc = Nokogiri::HTML5(html_string)
+         strip_structural_whitespace!(doc)
          io = StringIO.new
          if @indent_type == "tab"
            doc.write_to(io, save_with: fixture_ready_save_options,
@@ -94,6 +97,37 @@ module Canon
          io.string
        end
 
+       # libxml's +FORMAT+ save flag does not insert indentation around
+       # the children of any element it sees as mixed content (any
+       # non-whitespace-only text node child). +Nokogiri::HTML5+ does
+       # not accept the +noblanks+ option that the XML parser uses to
+       # strip these inter-sibling text nodes pre-serialisation, so we
+       # do it manually here: drop whitespace-only text nodes whose
+       # parent is structural (no real text content) and not a
+       # whitespace-preserving element. Mixed-content runs like
+       # +<p>foo <em>bar</em> baz</p>+ are left alone.
+       def strip_structural_whitespace!(doc)
+         to_remove = []
+         doc.traverse do |node|
+           next unless node.text?
+           next unless node.content.strip.empty?
+
+           parent = node.parent
+           next if parent.nil?
+           next if WHITESPACE_PRESERVING_ELEMENTS.include?(parent.name)
+           next if parent_has_real_text?(parent)
+
+           to_remove << node
+         end
+         to_remove.each(&:remove)
+       end
+
+       def parent_has_real_text?(parent)
+         parent.children.any? do |c|
+           c.text? && !c.content.strip.empty?
+         end
+       end
+
        def fixture_ready_save_options
          Nokogiri::XML::Node::SaveOptions::FORMAT |
            Nokogiri::XML::Node::SaveOptions::AS_XHTML |
data/lib/canon/version.rb CHANGED
@@ -1,5 +1,5 @@
  # frozen_string_literal: true
 
  module Canon
-   VERSION = "0.2.6"
+   VERSION = "0.2.7"
  end
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: canon
  version: !ruby/object:Gem::Version
-   version: 0.2.6
+   version: 0.2.7
  platform: ruby
  authors:
  - Ribose Inc.
  autorequire:
  bindir: exe
  cert_chain: []
- date: 2026-05-03 00:00:00.000000000 Z
+ date: 2026-05-04 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
    name: diff-lcs