canon 0.2.5 → 0.2.7
- checksums.yaml +4 -4
- data/README.adoc +4 -0
- data/docs/advanced/diff-classification.adoc +16 -0
- data/docs/advanced/semantic-diff-report.adoc +2 -0
- data/docs/features/diff-formatting/display-preprocessing.adoc +1 -1
- data/docs/features/diff-formatting/index.adoc +3 -1
- data/docs/features/diff-formatting/whitespace-adjacency.adoc +100 -13
- data/docs/interfaces/ruby-api/index.adoc +20 -0
- data/docs/understanding/formats/html.adoc +17 -0
- data/lib/canon/comparison/node_inspector.rb +9 -0
- data/lib/canon/comparison/xml_comparator.rb +37 -18
- data/lib/canon/diff_formatter/diff_detail_formatter/dimension_formatter.rb +28 -8
- data/lib/canon/pretty_printer/html.rb +34 -0
- data/lib/canon/version.rb +1 -1
- metadata +2 -2
checksums.yaml
CHANGED
|
@@ -1,7 +1,7 @@
|
|
|
1
1
|
---
|
|
2
2
|
SHA256:
|
|
3
|
-
metadata.gz:
|
|
4
|
-
data.tar.gz:
|
|
3
|
+
metadata.gz: 1800400419926b8607eb146490d0bc4a0ecf5e4bfaf2b3007a87e99d440661f3
|
|
4
|
+
data.tar.gz: 0fc8298171e94fec5e9c4b650001fcc31e79ca774e1d7ad0f19fe91308199b18
|
|
5
5
|
SHA512:
|
|
6
|
-
metadata.gz:
|
|
7
|
-
data.tar.gz:
|
|
6
|
+
metadata.gz: 5c553b671df23a70814bedb7521836b01ce7d6e0ab1af99f00aa05ca6f4ef875d8e8af2e4e63c208443d8ddbab92ed7a05c3c8c6828da2608c42bf6a38f0b7c4
|
|
7
|
+
data.tar.gz: e183e77684bc3fe7c072caa904cf709a3c20e2f6c38099a42f099e3367d2b0c2336f5407a6def0147f7405056a26a639f4b9dd5075deb6b51a60bcb7921c7c44
|
data/README.adoc
CHANGED
|
@@ -614,6 +614,10 @@ See link:docs/MODES[Diff modes] for details.
|
|
|
614
614
|
* **Formatting diff detection**: Automatically detects and highlights purely cosmetic whitespace/line break differences
|
|
615
615
|
* **Whitespace visualization**: Make invisible characters visible with CJK-safe
|
|
616
616
|
Unicode symbols
|
|
617
|
+
* **Whitespace adjacency reporting**: Stray whitespace-only text nodes are
|
|
618
|
+
reported as a dedicated `:whitespace_adjacency` dimension with direction
|
|
619
|
+
wording (`before`/`after`/`adjacent to`) instead of cascading into
|
|
620
|
+
misleading `:text_content` mismatches
|
|
617
621
|
* **Non-ASCII detection**: Warnings for unexpected Unicode characters
|
|
618
622
|
* **Customizable**: Character maps, context lines, grouping options
|
|
619
623
|
|
data/docs/advanced/diff-classification.adoc
CHANGED
|
@@ -266,6 +266,22 @@ match dimension but follows the same normative rule as all general dimensions:
|
|
|
266
266
|
- Structural differences tracked but don't affect equivalence
|
|
267
267
|
- Useful for content-only comparisons where wrapper elements don't matter
|
|
268
268
|
|
|
269
|
+
==== Whitespace Adjacency
|
|
270
|
+
|
|
271
|
+
`:whitespace_adjacency` is a derived dimension — emitted when the
|
|
272
|
+
alignment walk pairs a whitespace-only text node on one side against a
|
|
273
|
+
content node on the other. The Reason line describes the whitespace's
|
|
274
|
+
direction relative to the partner content: `before`, `after`, or
|
|
275
|
+
`adjacent to`.
|
|
276
|
+
|
|
277
|
+
* **Always normative** — differences always affect equivalence
|
|
278
|
+
* **Not user-configurable** — dimension is always tracked when the
|
|
279
|
+
re-alignment walk encounters an asymmetric whitespace node
|
|
280
|
+
* **Report-only** — does not change equivalence outcomes compared to
|
|
281
|
+
pre-#137 behaviour; only changes the diff-report shape (see
|
|
282
|
+
link:../features/diff-formatting/whitespace-adjacency.adoc[Whitespace
|
|
283
|
+
adjacency] for details)
|
|
284
|
+
|
|
269
285
|
.Example: Comment handling
|
|
270
286
|
====
|
|
271
287
|
[source,ruby]
|
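The rule added in this hunk (a whitespace-only text node on one side aligned against a content node on the other) can be illustrated with a small sketch. This is not Canon's implementation: `SketchNode` and `derived_dimension` are hypothetical names introduced here, and the sketch only models the pairing test, not the alignment walk itself.

```ruby
# Hypothetical stand-in for a DOM node (not a Canon class).
# kind is :text or :element; text is only meaningful for :text nodes.
SketchNode = Struct.new(:kind, :text) do
  def whitespace_only_text?
    kind == :text && text.strip.empty?
  end
end

# Returns :whitespace_adjacency when exactly one side of an aligned pair
# is a whitespace-only text node, and nil otherwise.
def derived_dimension(expected, actual)
  return nil if expected.nil? || actual.nil?

  ws_expected = expected.whitespace_only_text?
  ws_actual = actual.whitespace_only_text?
  (ws_expected ^ ws_actual) ? :whitespace_adjacency : nil
end
```

Under these assumptions, two whitespace-only nodes or two content nodes fall through to `nil`, leaving them to whatever other dimension the comparator would normally apply.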
data/docs/advanced/semantic-diff-report.adoc
CHANGED
|
@@ -212,6 +212,8 @@ Reason: Text: "¬······:¬······"
|
|
|
212
212
|
|
|
213
213
|
This fallback is implemented in `Canon::DiffFormatter::DiffDetailFormatterHelpers::DimensionFormatter.format_text_content_details` and only triggers when `TextUtils.ambiguous_text_pair?` returns `true` _and_ at least one side has a parent element to render.
|
|
214
214
|
|
|
215
|
+
The same fallback also applies to the `whitespace_adjacency` dimension (see <<whitespace-adjacency,Whitespace adjacency>>): when the alignment partner of a stray whitespace node extracts to an empty / whitespace-only string, the Reason line reads `Whitespace inside <PARENT>` (rather than `Whitespace before ""`), and the Expected/Actual block surfaces each side's parent element compactly. See `format_whitespace_adjacency_details` and `Canon::Comparison::XmlComparator#build_whitespace_adjacency_reason`.
|
|
216
|
+
|
|
215
217
|
==== One-sided text diffs (added or removed text nodes)
|
|
216
218
|
|
|
217
219
|
When a `text_content` difference carries a text node on one side and `nil` on the other (issue #125) -- the shape that fragment-length mismatches and child-comparison emit when a text-node child is missing -- the renderer mirrors `element_structure`: the missing side reads `(not present)`, and the present side reads the text-node content (whitespace-visualised) plus a brief parent open-tag hint for context. The full ancestor subtree is *not* dumped; only the immediate parent's opening tag is shown, so a missing whitespace text node cannot make the diff look like the entire ancestor differs.
|
data/docs/features/diff-formatting/display-preprocessing.adoc
CHANGED
|
@@ -430,7 +430,7 @@ pretty-printer. This is a known future work item.
|
|
|
430
430
|
|✓ Full
|
|
431
431
|
|✓ (via XML serializer)
|
|
432
432
|
|✓ Full
|
|
433
|
-
|`:pretty_print` uses `Canon::PrettyPrinter::Html
|
|
433
|
+
|`:pretty_print` uses `Canon::PrettyPrinter::Html` in fixture-ready mode (`FORMAT|AS_XHTML|NO_DECLARATION`); `:normalize_pretty_print` falls back to `XmlNormalized` pending a dedicated `HtmlNormalized`; `:c14n` uses Nokogiri HTML5 serialization. In fixture-ready mode, stray structural whitespace (whitespace-only text nodes between block-level siblings) is stripped before formatting so that libxml's `FORMAT` flag produces correct indentation. Whitespace inside `<pre>`, `<script>`, `<style>`, and `<textarea>` is preserved.
|
|
434
434
|
|
|
435
435
|
|JSON
|
|
436
436
|
|Planned
|
data/docs/features/diff-formatting/index.adoc
CHANGED
|
@@ -30,7 +30,9 @@ Canon's diff formatting includes:
|
|
|
30
30
|
algorithms
|
|
31
31
|
* **Whitespace adjacency**: Stray whitespace-only text nodes are anchored at
|
|
32
32
|
themselves instead of cascading into mismatches against neighbouring
|
|
33
|
-
content
|
|
33
|
+
content. The Reason line names the direction relative to the partner
|
|
34
|
+
(`before`/`after`/`adjacent to`)
|
|
35
|
+
(link:./whitespace-adjacency.adoc[details])
|
|
34
36
|
|
|
35
37
|
== Available formatting options
|
|
36
38
|
|
data/docs/features/diff-formatting/whitespace-adjacency.adoc
CHANGED
|
@@ -75,28 +75,113 @@ After the new contract, the cascade above collapses to:
|
|
|
75
75
|
|
|
76
76
|
[source]
|
|
77
77
|
----
|
|
78
|
-
DIFFERENCE #1 — whitespace_adjacency: Whitespace
|
|
78
|
+
DIFFERENCE #1 — whitespace_adjacency: Whitespace before "20483":
|
|
79
79
|
present on EXPECTED ("↵░░"), absent on ACTUAL
|
|
80
|
-
DIFFERENCE #2 — whitespace_adjacency: Whitespace
|
|
80
|
+
DIFFERENCE #2 — whitespace_adjacency: Whitespace before ",":
|
|
81
81
|
present on EXPECTED ("↵░░"), absent on ACTUAL
|
|
82
82
|
DIFFERENCE #3 — text_content: "↵░░,↵░░" vs ", "
|
|
83
83
|
----
|
|
84
84
|
|
|
85
|
-
==
|
|
85
|
+
== Direction relative to the partner
|
|
86
86
|
|
|
87
|
-
The Reason line names the
|
|
88
|
-
relative to
|
|
87
|
+
The Reason line names the document-order position of the whitespace
|
|
88
|
+
node relative to the *partner content node* it was zipped against by
|
|
89
|
+
the alignment walk. The partner is the next (or, at parent edge,
|
|
90
|
+
previous) non-whitespace sibling on the whitespace-bearing side, which
|
|
91
|
+
is what aligns against the corresponding content node on the other
|
|
92
|
+
side.
|
|
89
93
|
|
|
90
|
-
|
|
91
|
-
sibling
|
|
94
|
+
`before`:: The whitespace immediately precedes its next non-whitespace
|
|
95
|
+
sibling. This is the common case (e.g. indentation between two inline
|
|
96
|
+
spans where the asymmetric whitespace sits on the leading edge of the
|
|
97
|
+
partner).
|
|
92
98
|
|
|
93
|
-
|
|
94
|
-
sibling
|
|
99
|
+
`after`:: The whitespace trails the previous non-whitespace sibling
|
|
100
|
+
and has no non-whitespace sibling after it. Emitted at the trailing
|
|
101
|
+
edge of a parent.
|
|
95
102
|
|
|
96
|
-
|
|
103
|
+
`adjacent to`:: Degenerate fallback for a whitespace node with no
|
|
104
|
+
non-whitespace siblings at all. Rarely emitted.
|
|
97
105
|
|
|
98
|
-
|
|
99
|
-
|
|
106
|
+
When the alignment partner extracts to an empty / whitespace-only
|
|
107
|
+
string (e.g. an element with no text descendants), the direction
|
|
108
|
+
phrasing degenerates to `Whitespace before ""` which carries no
|
|
109
|
+
information. In that case Canon falls back to naming the parent
|
|
110
|
+
element instead — `Whitespace inside <PARENT>` — and the
|
|
111
|
+
Expected/Actual detail block renders each side's parent element
|
|
112
|
+
compactly per the contract from
|
|
113
|
+
link:../../advanced/semantic-diff-report.adoc#parent-context-fallback-for-ambiguous-text-diffs[issue #112].
|
|
114
|
+
|
|
115
|
+
NOTE: An earlier wording (`Whitespace surrounding "X"`) classified the
|
|
116
|
+
*whitespace node's position among its own siblings* rather than its
|
|
117
|
+
direction relative to the partner. That label was misleading when the
|
|
118
|
+
whitespace sat between two element siblings but the asymmetry was
|
|
119
|
+
one-sided — see issue #137 follow-up.
|
|
120
|
+
|
|
121
|
+
=== Examples
|
|
122
|
+
|
|
123
|
+
==== "before" — whitespace between inline element siblings
|
|
124
|
+
|
|
125
|
+
[source,ruby]
|
|
126
|
+
----
|
|
127
|
+
html1 = "<a><span>ISO </span>\n <span>712</span></a>"
|
|
128
|
+
html2 = "<a><span>ISO </span><span>712</span></a>"
|
|
129
|
+
|
|
130
|
+
result = Canon::Comparison.equivalent?(html1, html2,
|
|
131
|
+
format: :html5, verbose: true)
|
|
132
|
+
# => #<ComparisonResult equivalent=false>
|
|
133
|
+
|
|
134
|
+
# The stray "\n " between two spans is the only asymmetric node.
|
|
135
|
+
# It sits immediately before <span>712</span>, its next non-ws sibling.
|
|
136
|
+
result.differences.first.reason
|
|
137
|
+
# => "Whitespace before \"712\": present on EXPECTED (\"░\"), absent on ACTUAL"
|
|
138
|
+
----
|
|
139
|
+
|
|
140
|
+
==== "after" — trailing whitespace in a whitespace-preserving element
|
|
141
|
+
|
|
142
|
+
[source,ruby]
|
|
143
|
+
----
|
|
144
|
+
# <code> preserves whitespace, so the trailing newline survives the
|
|
145
|
+
# upstream filter and pairs against the extra <b>B</b> on the other side.
|
|
146
|
+
html1 = "<code><b>A</b>\n</code>"
|
|
147
|
+
html2 = "<code><b>A</b><b>B</b></code>"
|
|
148
|
+
|
|
149
|
+
result = Canon::Comparison.equivalent?(html1, html2,
|
|
150
|
+
format: :html5, verbose: true)
|
|
151
|
+
result.differences.first.reason
|
|
152
|
+
# => "Whitespace after \"B\": present on EXPECTED (\"↵\"), absent on ACTUAL"
|
|
153
|
+
----
|
|
154
|
+
|
|
155
|
+
==== "adjacent to" — sole-child whitespace node
|
|
156
|
+
|
|
157
|
+
[source,ruby]
|
|
158
|
+
----
|
|
159
|
+
# A whitespace-only text node as the only child of <code>, paired
|
|
160
|
+
# against an element on the other side. No non-ws siblings exist.
|
|
161
|
+
html1 = "<code>\n</code>"
|
|
162
|
+
html2 = "<code><b>A</b></code>"
|
|
163
|
+
|
|
164
|
+
result = Canon::Comparison.equivalent?(html1, html2,
|
|
165
|
+
format: :html5, verbose: true)
|
|
166
|
+
result.differences.first.reason
|
|
167
|
+
# => "Whitespace adjacent to \"A\": present on EXPECTED (\"↵\"), absent on ACTUAL"
|
|
168
|
+
----
|
|
169
|
+
|
|
170
|
+
== Working with :whitespace_adjacency diffs programmatically
|
|
171
|
+
|
|
172
|
+
Use the `dimension` field on `DiffNode` to filter:
|
|
173
|
+
|
|
174
|
+
[source,ruby]
|
|
175
|
+
----
|
|
176
|
+
result = Canon::Comparison.equivalent?(html1, html2,
|
|
177
|
+
format: :html5, verbose: true)
|
|
178
|
+
|
|
179
|
+
# Find all whitespace-adjacency diffs
|
|
180
|
+
ws_diffs = result.differences.select { |d| d.dimension == :whitespace_adjacency }
|
|
181
|
+
|
|
182
|
+
# These are always normative — they affect the equivalence verdict
|
|
183
|
+
ws_diffs.all?(&:normative?) # => true
|
|
184
|
+
----
|
|
100
185
|
|
|
101
186
|
== What this contract does NOT do
|
|
102
187
|
|
|
@@ -137,4 +222,6 @@ whitespace node as a single normative `:whitespace_adjacency` diff.
|
|
|
137
222
|
|
|
138
223
|
The cascade behaviour was reported in
|
|
139
224
|
https://github.com/lutaml/canon/issues/137[issue #137]. The fix landed
|
|
140
|
-
as a report-only re-alignment in PR #138.
|
|
225
|
+
as a report-only re-alignment in PR #138. PR #141 replaced the
|
|
226
|
+
misleading `:surrounding`/`:preceding`/`:following` position labels
|
|
227
|
+
with direction-faithful wording (`before`/`after`/`adjacent to`).
|
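Building on the filtering snippet in this file, the sketch below groups whitespace-adjacency diffs by the wording of their Reason line. It is an illustration only: `FakeDiff` is a hypothetical stand-in for `DiffNode`, and the regular expression is inferred from the Reason strings quoted in the examples above, so it may need adjusting if the wording changes.

```ruby
# Hypothetical stand-in for Canon's DiffNode; only the two fields used here.
FakeDiff = Struct.new(:dimension, :reason)

# Leading phrase of the Reason line, as quoted in the documentation examples.
DIRECTION_PATTERN = /\AWhitespace (before|after|adjacent to|inside) /

# Groups :whitespace_adjacency diffs by direction wording.
# Diffs whose reason does not match the pattern land under "unknown".
def group_whitespace_diffs(differences)
  differences
    .select { |d| d.dimension == :whitespace_adjacency }
    .group_by { |d| d.reason[DIRECTION_PATTERN, 1] || "unknown" }
end
```

Parsing the human-readable Reason string is brittle; if the library exposes the direction as structured data, that would be the better source.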
data/docs/interfaces/ruby-api/index.adoc
CHANGED
|
@@ -116,6 +116,9 @@ Where:
|
|
|
116
116
|
`{Format}`:: The format module (`Xml`, `Html`, `Json`)
|
|
117
117
|
`n`:: Number of spaces (default: 2) or tabs (use 1 for tabs)
|
|
118
118
|
`type`:: Indentation type: `'space'` (default) or `'tab'`
|
|
119
|
+
`fixture_ready`:: (HTML only) When `true`, emit indented XHTML-shaped
|
|
120
|
+
output that strips structural whitespace before formatting. Designed for
|
|
121
|
+
copy-paste into RSpec heredoc fixtures. Default: `false`.
|
|
119
122
|
`content`:: The input string
|
|
120
123
|
|
|
121
124
|
.Pretty-print examples
|
|
@@ -151,6 +154,23 @@ Canon::Xml::PrettyPrinter.new(
|
|
|
151
154
|
html_input = '<div><p>Hello</p></div>'
|
|
152
155
|
Canon::Html::PrettyPrinter.new(indent: 2).format(html_input)
|
|
153
156
|
|
|
157
|
+
# HTML fixture-ready mode: produces indented XHTML-shaped output
|
|
158
|
+
# suitable for pasting into RSpec heredoc fixtures. Strips stray
|
|
159
|
+
# structural whitespace (inter-element text nodes) so libxml's FORMAT
|
|
160
|
+
# flag can indent block-level siblings that would otherwise be treated
|
|
161
|
+
# as mixed content. Whitespace inside <pre>, <script>, <style>, and
|
|
162
|
+
# <textarea> is preserved.
|
|
163
|
+
Canon::Html::PrettyPrinter.new(indent: 2, fixture_ready: true)
|
|
164
|
+
.format('<html><body><div>a</div> <div>b</div></body></html>')
|
|
165
|
+
# =>
|
|
166
|
+
# <html xmlns="http://www.w3.org/1999/xhtml">
|
|
167
|
+
# <head>...</head>
|
|
168
|
+
# <body>
|
|
169
|
+
# <div>a</div>
|
|
170
|
+
# <div>b</div>
|
|
171
|
+
# </body>
|
|
172
|
+
# </html>
|
|
173
|
+
|
|
154
174
|
# JSON with 2-space indentation
|
|
155
175
|
json_input = '{"z":3,"a":{"b":1}}'
|
|
156
176
|
Canon::Json::PrettyPrinter.new(indent: 2).format(json_input)
|
data/docs/understanding/formats/html.adoc
CHANGED
|
@@ -235,6 +235,23 @@ HTML whitespace is collapsed per CSS rendering rules. Empty text nodes between e
|
|
|
235
235
|
Multiple spaces within text content are collapsed to single spaces when `text_content: :normalize` is used.
|
|
236
236
|
====
|
|
237
237
|
|
|
238
|
+
==== Fixture-ready pretty-print and structural whitespace
|
|
239
|
+
|
|
240
|
+
When using `Canon::PrettyPrinter::Html` with `fixture_ready: true` (the mode
|
|
241
|
+
used by the diff pipeline's *PRETTY-PRINTED INPUTS* section), Canon strips
|
|
242
|
+
stray structural whitespace before formatting. Real-world HTML5 input from
|
|
243
|
+
upstream pipelines often carries whitespace-only text nodes between block-level
|
|
244
|
+
siblings (`<body>` → `<div>`, `<br>`, `<div>`, ...). libxml's `FORMAT` flag
|
|
245
|
+
treats any element with a non-whitespace-only text child as mixed content and
|
|
246
|
+
refuses to indent its children — producing a single-line blob instead of a
|
|
247
|
+
readable tree.
|
|
248
|
+
|
|
249
|
+
The fixture-ready mode removes whitespace-only text nodes from parents that
|
|
250
|
+
are purely structural (no real text content) and are not whitespace-preserving
|
|
251
|
+
elements (`<pre>`, `<script>`, `<style>`, `<textarea>`). Mixed-content runs
|
|
252
|
+
like `<p>foo <em>bar</em> baz</p>` are left untouched so that significant
|
|
253
|
+
inline whitespace is preserved.
|
|
254
|
+
|
|
238
255
|
=== Attribute order
|
|
239
256
|
|
|
240
257
|
HTML attributes are inherently unordered per the HTML specification, so default is `:ignore`.
|
data/lib/canon/comparison/node_inspector.rb
CHANGED
|
@@ -98,6 +98,15 @@ module Canon
|
|
|
98
98
|
[]
|
|
99
99
|
end
|
|
100
100
|
end
|
|
101
|
+
|
|
102
|
+
# Return the parent node of +node+, or nil when +node+ is not a
|
|
103
|
+
# recognised DOM backend type or has no parent.
|
|
104
|
+
def self.parent_of(node)
|
|
105
|
+
case node
|
|
106
|
+
when Canon::Xml::Node, Nokogiri::XML::Node
|
|
107
|
+
node.parent
|
|
108
|
+
end
|
|
109
|
+
end
|
|
101
110
|
end
|
|
102
111
|
end
|
|
103
112
|
end
|
data/lib/canon/comparison/xml_comparator.rb
CHANGED
|
@@ -873,36 +873,55 @@ differences)
|
|
|
873
873
|
return build_text_diff_reason(text1, text2)
|
|
874
874
|
end
|
|
875
875
|
|
|
876
|
-
position = whitespace_adjacency_position(ws_node)
|
|
877
876
|
ws_vis = visualize_whitespace(ws_text)
|
|
878
|
-
content_vis = content_text ? visualize_whitespace(truncate_text(content_text)) : "(none)"
|
|
879
877
|
|
|
880
|
-
|
|
881
|
-
|
|
878
|
+
if content_text.nil? || content_text.strip.empty?
|
|
879
|
+
# Partner content extracts to "" / whitespace-only — naming it
|
|
880
|
+
# in the Reason ("Whitespace before \"\"") gives the reader
|
|
881
|
+
# nothing. Fall back to the parent element name so the
|
|
882
|
+
# diff carries structural context (issue #112's contract,
|
|
883
|
+
# extended from :text_content to :whitespace_adjacency).
|
|
884
|
+
parent_label = whitespace_adjacency_parent_label(ws_node)
|
|
885
|
+
"Whitespace inside #{parent_label}: " \
|
|
886
|
+
"present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
|
|
887
|
+
else
|
|
888
|
+
direction = whitespace_partner_direction(ws_node)
|
|
889
|
+
content_vis = visualize_whitespace(truncate_text(content_text))
|
|
890
|
+
"Whitespace #{direction} \"#{content_vis}\": " \
|
|
891
|
+
"present on #{present_side} (\"#{ws_vis}\"), absent on #{absent_side}"
|
|
892
|
+
end
|
|
882
893
|
end
|
|
883
894
|
|
|
884
|
-
def
|
|
885
|
-
|
|
886
|
-
|
|
895
|
+
def whitespace_adjacency_parent_label(ws_node)
|
|
896
|
+
parent = NodeInspector.parent_of(ws_node)
|
|
897
|
+
return "(unknown parent)" unless parent
|
|
887
898
|
|
|
888
|
-
|
|
889
|
-
|
|
899
|
+
name = parent.name
|
|
900
|
+
name && !name.empty? ? "<#{name}>" : "(unknown parent)"
|
|
901
|
+
end
|
|
902
|
+
|
|
903
|
+
# Direction of the partner content relative to the whitespace node,
|
|
904
|
+
# phrased from the partner's point of view: "before" when the
|
|
905
|
+
# whitespace immediately precedes its next non-whitespace sibling
|
|
906
|
+
# (the alignment partner on the other side), "after" when the
|
|
907
|
+
# whitespace trails the previous non-whitespace sibling, or
|
|
908
|
+
# "adjacent to" as a degenerate fallback when neither neighbour
|
|
909
|
+
# exists.
|
|
910
|
+
def whitespace_partner_direction(ws_node)
|
|
911
|
+
parent = NodeInspector.parent_of(ws_node)
|
|
912
|
+
return "adjacent to" unless parent
|
|
890
913
|
|
|
891
914
|
siblings = parent.children
|
|
892
915
|
idx = siblings.index(ws_node)
|
|
893
|
-
return
|
|
894
|
-
|
|
895
|
-
before = sibling_with_content?(siblings, idx, -1)
|
|
896
|
-
after = sibling_with_content?(siblings, idx, 1)
|
|
916
|
+
return "adjacent to" unless idx
|
|
897
917
|
|
|
898
|
-
if
|
|
899
|
-
elsif
|
|
900
|
-
|
|
901
|
-
else :isolated
|
|
918
|
+
if non_ws_sibling_exists?(siblings, idx, 1) then "before"
|
|
919
|
+
elsif non_ws_sibling_exists?(siblings, idx, -1) then "after"
|
|
920
|
+
else "adjacent to"
|
|
902
921
|
end
|
|
903
922
|
end
|
|
904
923
|
|
|
905
|
-
def
|
|
924
|
+
def non_ws_sibling_exists?(siblings, idx, direction)
|
|
906
925
|
i = idx + direction
|
|
907
926
|
while i >= 0 && i < siblings.length
|
|
908
927
|
s = siblings[i]
|
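The direction logic in this hunk can be exercised in isolation with a simplified model. In the sketch below, siblings are a plain array in which Strings stand for text nodes and Symbols stand for elements; that modelling, and the names `whitespace_only?`, `non_ws_sibling?` and `partner_direction`, are assumptions made for illustration. The loop body of the real `non_ws_sibling_exists?` is cut off in the hunk, so the version here is a guess at its intent rather than a copy.

```ruby
# A text node is modelled as a String, an element as a Symbol.
def whitespace_only?(node)
  node.is_a?(String) && node.strip.empty?
end

# Scans from idx in the given direction (+1 or -1) and reports whether
# any sibling other than a whitespace-only text node exists.
def non_ws_sibling?(siblings, idx, step)
  i = idx + step
  while i >= 0 && i < siblings.length
    return true unless whitespace_only?(siblings[i])

    i += step
  end
  false
end

# Mirrors the precedence in the hunk: a following non-whitespace sibling
# wins ("before"), then a preceding one ("after"), else "adjacent to".
def partner_direction(siblings, idx)
  if non_ws_sibling?(siblings, idx, 1) then "before"
  elsif non_ws_sibling?(siblings, idx, -1) then "after"
  else "adjacent to"
  end
end
```

For example, `partner_direction([:span, "\n ", :span], 1)` yields `"before"`, matching the first documentation example, where the stray whitespace sits ahead of the second span.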
data/lib/canon/diff_formatter/diff_detail_formatter/dimension_formatter.rb
CHANGED
|
@@ -525,14 +525,34 @@ expand_difference: false)
|
|
|
525
525
|
text1 = NodeUtils.get_node_text(node1).to_s
|
|
526
526
|
text2 = NodeUtils.get_node_text(node2).to_s
|
|
527
527
|
|
|
528
|
-
|
|
529
|
-
|
|
530
|
-
|
|
531
|
-
|
|
532
|
-
|
|
533
|
-
|
|
534
|
-
|
|
535
|
-
|
|
528
|
+
if TextUtils.ambiguous_text_pair?(text1, text2) &&
|
|
529
|
+
(NodeUtils.parent_of(node1) || NodeUtils.parent_of(node2))
|
|
530
|
+
# Both sides extract to empty / whitespace-only strings —
|
|
531
|
+
# `""` / `""` tells the reader nothing. Fall back to a
|
|
532
|
+
# brief parent open-tag hint per #112's contract, but
|
|
533
|
+
# without dumping the full ancestor subtree (#125).
|
|
534
|
+
hint1 = NodeUtils.serialize_open_tag(NodeUtils.parent_of(node1))
|
|
535
|
+
hint2 = NodeUtils.serialize_open_tag(NodeUtils.parent_of(node2))
|
|
536
|
+
ws1 = TextUtils.visualize_whitespace(text1)
|
|
537
|
+
ws2 = TextUtils.visualize_whitespace(text2)
|
|
538
|
+
detail1 = ColorHelper.colorize(
|
|
539
|
+
"\"#{ws1}\" in #{hint1}",
|
|
540
|
+
:red, use_color
|
|
541
|
+
)
|
|
542
|
+
detail2 = ColorHelper.colorize(
|
|
543
|
+
"\"#{ws2}\" in #{hint2}",
|
|
544
|
+
:green, use_color
|
|
545
|
+
)
|
|
546
|
+
else
|
|
547
|
+
detail1 = ColorHelper.colorize(
|
|
548
|
+
"\"#{TextUtils.visualize_whitespace(text1)}\"",
|
|
549
|
+
:red, use_color
|
|
550
|
+
)
|
|
551
|
+
detail2 = ColorHelper.colorize(
|
|
552
|
+
"\"#{TextUtils.visualize_whitespace(text2)}\"",
|
|
553
|
+
:green, use_color
|
|
554
|
+
)
|
|
555
|
+
end
|
|
536
556
|
|
|
537
557
|
reason = if diff.is_a?(Canon::Diff::DiffNode)
|
|
538
558
|
diff.reason
|
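The branch added in this hunk can be approximated without Canon's helpers. The sketch below is a simplified model: `ambiguous_pair?` is inferred from the inline comment ("both sides extract to empty / whitespace-only strings"), the `visualize` mapping (space to `░`, newline to `↵`) is inferred from the documentation examples, parents are passed as plain tag names, and colouring is omitted. None of these helper names come from the library.

```ruby
# Assumed reading of TextUtils.ambiguous_text_pair?: both sides are blank.
def ambiguous_pair?(text1, text2)
  text1.strip.empty? && text2.strip.empty?
end

# Assumed whitespace visualisation, based on symbols seen in the docs.
def visualize(text)
  text.gsub(" ", "░").gsub("\n", "↵")
end

# Renders a parent tag name as a brief open-tag hint.
def parent_hint(parent_name)
  parent_name ? "<#{parent_name}>" : "(no parent)"
end

# Returns [expected_detail, actual_detail]. When both texts are blank and
# at least one parent is known, each side is annotated with its parent.
def text_details(text1, text2, parent1: nil, parent2: nil)
  if ambiguous_pair?(text1, text2) && (parent1 || parent2)
    [%("#{visualize(text1)}" in #{parent_hint(parent1)}),
     %("#{visualize(text2)}" in #{parent_hint(parent2)})]
  else
    [%("#{visualize(text1)}"), %("#{visualize(text2)}")]
  end
end
```

The point of the fallback is visible in the first branch: two blank strings alone say nothing, whereas naming the parent gives the reader somewhere to look.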
data/lib/canon/pretty_printer/html.rb
CHANGED
|
@@ -29,6 +29,8 @@ module Canon
|
|
|
29
29
|
#
|
|
30
30
|
# See lutaml/canon#133, lutaml/canon#135.
|
|
31
31
|
class Html
|
|
32
|
+
WHITESPACE_PRESERVING_ELEMENTS = %w[pre textarea script style].freeze
|
|
33
|
+
|
|
32
34
|
def initialize(indent: 2, indent_type: "space", fixture_ready: false)
|
|
33
35
|
@indent = indent.to_i
|
|
34
36
|
@indent_type = indent_type
|
|
@@ -83,6 +85,7 @@ module Canon
|
|
|
83
85
|
# suppresses the +<?xml ...?>+ prefix.
|
|
84
86
|
def format_fixture_ready(html_string)
|
|
85
87
|
doc = Nokogiri::HTML5(html_string)
|
|
88
|
+
strip_structural_whitespace!(doc)
|
|
86
89
|
io = StringIO.new
|
|
87
90
|
if @indent_type == "tab"
|
|
88
91
|
doc.write_to(io, save_with: fixture_ready_save_options,
|
|
@@ -94,6 +97,37 @@ module Canon
|
|
|
94
97
|
io.string
|
|
95
98
|
end
|
|
96
99
|
|
|
100
|
+
# libxml's +FORMAT+ save flag does not insert indentation around
|
|
101
|
+
# the children of any element it sees as mixed content (any
|
|
102
|
+
# non-whitespace-only text node child). +Nokogiri::HTML5+ does
|
|
103
|
+
# not accept the +noblanks+ option that the XML parser uses to
|
|
104
|
+
# strip these inter-sibling text nodes pre-serialisation, so we
|
|
105
|
+
# do it manually here: drop whitespace-only text nodes whose
|
|
106
|
+
# parent is structural (no real text content) and not a
|
|
107
|
+
# whitespace-preserving element. Mixed-content runs like
|
|
108
|
+
# +<p>foo <em>bar</em> baz</p>+ are left alone.
|
|
109
|
+
def strip_structural_whitespace!(doc)
|
|
110
|
+
to_remove = []
|
|
111
|
+
doc.traverse do |node|
|
|
112
|
+
next unless node.text?
|
|
113
|
+
next unless node.content.strip.empty?
|
|
114
|
+
|
|
115
|
+
parent = node.parent
|
|
116
|
+
next if parent.nil?
|
|
117
|
+
next if WHITESPACE_PRESERVING_ELEMENTS.include?(parent.name)
|
|
118
|
+
next if parent_has_real_text?(parent)
|
|
119
|
+
|
|
120
|
+
to_remove << node
|
|
121
|
+
end
|
|
122
|
+
to_remove.each(&:remove)
|
|
123
|
+
end
|
|
124
|
+
|
|
125
|
+
def parent_has_real_text?(parent)
|
|
126
|
+
parent.children.any? do |c|
|
|
127
|
+
c.text? && !c.content.strip.empty?
|
|
128
|
+
end
|
|
129
|
+
end
|
|
130
|
+
|
|
97
131
|
def fixture_ready_save_options
|
|
98
132
|
Nokogiri::XML::Node::SaveOptions::FORMAT |
|
|
99
133
|
Nokogiri::XML::Node::SaveOptions::AS_XHTML |
|
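To see the stripping rule from this hunk without a Nokogiri dependency, the sketch below applies the same three checks (whitespace-only text, parent not whitespace-preserving, parent has no real text) to a toy tree. `TreeNode`, `el`, `txt` and the non-mutating `strip_structural_whitespace` are hypothetical names for illustration; the real method mutates a `Nokogiri::HTML5` document in place.

```ruby
PRESERVING = %w[pre textarea script style].freeze

# Toy node: elements have a name and children; text nodes have name nil.
TreeNode = Struct.new(:name, :text, :children) do
  def text?
    name.nil?
  end

  def blank_text?
    text? && text.strip.empty?
  end
end

def el(name, *children)
  TreeNode.new(name, nil, children)
end

def txt(string)
  TreeNode.new(nil, string, [])
end

# Returns a copy of the tree with whitespace-only text nodes dropped from
# parents that are structural (no real text child) and not preserving.
def strip_structural_whitespace(node)
  return node if node.text?

  has_real_text = node.children.any? { |c| c.text? && !c.blank_text? }
  keep_all = PRESERVING.include?(node.name) || has_real_text
  kept = keep_all ? node.children : node.children.reject(&:blank_text?)
  TreeNode.new(node.name, nil, kept.map { |c| strip_structural_whitespace(c) })
end
```

As in the original, the decision is made per parent, so a mixed-content paragraph keeps all of its whitespace while a purely structural `<body>` loses the stray nodes between its block children.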
data/lib/canon/version.rb
CHANGED
metadata
CHANGED
|
@@ -1,14 +1,14 @@
|
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
|
2
2
|
name: canon
|
|
3
3
|
version: !ruby/object:Gem::Version
|
|
4
|
-
version: 0.2.
|
|
4
|
+
version: 0.2.7
|
|
5
5
|
platform: ruby
|
|
6
6
|
authors:
|
|
7
7
|
- Ribose Inc.
|
|
8
8
|
autorequire:
|
|
9
9
|
bindir: exe
|
|
10
10
|
cert_chain: []
|
|
11
|
-
date: 2026-05-
|
|
11
|
+
date: 2026-05-04 00:00:00.000000000 Z
|
|
12
12
|
dependencies:
|
|
13
13
|
- !ruby/object:Gem::Dependency
|
|
14
14
|
name: diff-lcs
|