canon 0.1.8 → 0.1.10
- checksums.yaml +4 -4
- data/.rubocop_todo.yml +83 -22
- data/docs/Gemfile +1 -0
- data/docs/_config.yml +90 -1
- data/docs/advanced/diff-classification.adoc +196 -24
- data/docs/features/match-options/index.adoc +239 -1
- data/lib/canon/comparison/format_detector.rb +2 -1
- data/lib/canon/comparison/html_comparator.rb +19 -8
- data/lib/canon/comparison/html_compare_profile.rb +8 -2
- data/lib/canon/comparison/markup_comparator.rb +109 -2
- data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
- data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
- data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
- data/lib/canon/comparison/xml_comparator/diff_node_builder.rb +108 -0
- data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
- data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
- data/lib/canon/comparison/xml_comparator.rb +240 -23
- data/lib/canon/comparison/xml_node_comparison.rb +25 -3
- data/lib/canon/diff/diff_classifier.rb +119 -5
- data/lib/canon/diff/formatting_detector.rb +1 -1
- data/lib/canon/diff/xml_serialization_formatter.rb +153 -0
- data/lib/canon/rspec_matchers.rb +37 -8
- data/lib/canon/version.rb +1 -1
- data/lib/canon/xml/data_model.rb +24 -13
- metadata +4 -78
- data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
- data/false_positive_analysis.txt +0 -0
- data/file1.html +0 -1
- data/file2.html +0 -1
- data/old-docs/ADVANCED_TOPICS.adoc +0 -20
- data/old-docs/BASIC_USAGE.adoc +0 -16
- data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
- data/old-docs/CLI.adoc +0 -497
- data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
- data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
- data/old-docs/DIFF_FORMATTING.adoc +0 -540
- data/old-docs/DIFF_PARAMETERS.adoc +0 -261
- data/old-docs/DOM_DIFF.adoc +0 -1017
- data/old-docs/ENV_CONFIG.adoc +0 -876
- data/old-docs/FORMATS.adoc +0 -867
- data/old-docs/INPUT_VALIDATION.adoc +0 -477
- data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
- data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
- data/old-docs/MATCH_OPTIONS.adoc +0 -912
- data/old-docs/MODES.adoc +0 -432
- data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
- data/old-docs/OPTIONS.adoc +0 -1387
- data/old-docs/PREPROCESSING.adoc +0 -491
- data/old-docs/README.old.adoc +0 -2831
- data/old-docs/RSPEC.adoc +0 -814
- data/old-docs/RUBY_API.adoc +0 -485
- data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
- data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
- data/old-docs/STRING_COMPARE.adoc +0 -345
- data/old-docs/TMP.adoc +0 -3384
- data/old-docs/TREE_DIFF.adoc +0 -1080
- data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
- data/old-docs/VERBOSE.adoc +0 -482
- data/old-docs/VISUALIZATION_MAP.adoc +0 -625
- data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
- data/scripts/analyze_current_state.rb +0 -85
- data/scripts/analyze_false_positives.rb +0 -114
- data/scripts/analyze_remaining_failures.rb +0 -105
- data/scripts/compare_current_failures.rb +0 -95
- data/scripts/compare_dom_tree_diff.rb +0 -158
- data/scripts/compare_failures.rb +0 -151
- data/scripts/debug_attribute_extraction.rb +0 -66
- data/scripts/debug_blocks_839.rb +0 -115
- data/scripts/debug_meta_matching.rb +0 -52
- data/scripts/debug_p_matching.rb +0 -192
- data/scripts/debug_signature_matching.rb +0 -118
- data/scripts/debug_sourcecode_124.rb +0 -32
- data/scripts/debug_whitespace_sensitive.rb +0 -192
- data/scripts/extract_false_positives.rb +0 -138
- data/scripts/find_actual_false_positives.rb +0 -125
- data/scripts/investigate_all_false_positives.rb +0 -161
- data/scripts/investigate_batch1.rb +0 -127
- data/scripts/investigate_classification.rb +0 -150
- data/scripts/investigate_classification_detailed.rb +0 -190
- data/scripts/investigate_common_failures.rb +0 -342
- data/scripts/investigate_false_negative.rb +0 -80
- data/scripts/investigate_false_positive.rb +0 -83
- data/scripts/investigate_false_positives.rb +0 -227
- data/scripts/investigate_false_positives_batch.rb +0 -163
- data/scripts/investigate_mixed_content.rb +0 -125
- data/scripts/investigate_remaining_16.rb +0 -214
- data/scripts/run_single_test.rb +0 -29
- data/scripts/test_all_false_positives.rb +0 -95
- data/scripts/test_attribute_details.rb +0 -61
- data/scripts/test_both_algorithms.rb +0 -49
- data/scripts/test_both_simple.rb +0 -49
- data/scripts/test_enhanced_semantic_output.rb +0 -125
- data/scripts/test_readme_examples.rb +0 -131
- data/scripts/test_semantic_tree_diff.rb +0 -99
- data/scripts/test_semantic_ux_improvements.rb +0 -135
- data/scripts/test_single_false_positive.rb +0 -119
- data/scripts/test_size_limits.rb +0 -99
- data/test_html_1.html +0 -21
- data/test_html_2.html +0 -21
- data/test_nokogiri.rb +0 -33
- data/test_normalize.rb +0 -45
data/old-docs/TREE_DIFF.adoc
DELETED
|
@@ -1,1080 +0,0 @@
|
|
|
1
|
-
= Semantic tree diff
|
|
2
|
-
:toc:
|
|
3
|
-
:toclevels: 3
|
|
4
|
-
|
|
5
|
-
[WARNING]
|
|
6
|
-
The semantic tree diff feature is currently **experimental** and under active
|
|
7
|
-
development. While it is functional and tested, the API and behavior may change
|
|
8
|
-
in future releases. Use with caution in production environments.
|
|
9
|
-
|
|
10
|
-
== General
|
|
11
|
-
|
|
12
|
-
Canon provides two complementary diff modes:
|
|
13
|
-
|
|
14
|
-
* **DOM diff** (default): Line-based comparison that matches elements by
|
|
15
|
-
position and structure
|
|
16
|
-
* **Semantic tree diff** (opt-in): Operation-based comparison that detects
|
|
17
|
-
high-level edit operations (INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT,
|
|
18
|
-
UPGRADE, DOWNGRADE)
|
|
19
|
-
|
|
20
|
-
The semantic tree diff is based on research in XML diff algorithms (XDiff 2002,
|
|
21
|
-
XyDiff/Cobena 2002, JATS-diff 2022) and provides operation-level analysis of
|
|
22
|
-
changes between document trees.
|
|
23
|
-
|
|
24
|
-
== Purpose
|
|
25
|
-
|
|
26
|
-
The semantic tree diff enables:
|
|
27
|
-
|
|
28
|
-
* **Operation detection**: Identify INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT,
|
|
29
|
-
UPGRADE, and DOWNGRADE operations
|
|
30
|
-
* **Intelligent matching**: Match similar nodes even when positions change
|
|
31
|
-
* **Format independence**: Works with XML, JSON, HTML, and YAML
|
|
32
|
-
* **Statistical analysis**: Provides matching statistics and confidence scores
|
|
33
|
-
|
|
34
|
-
Use semantic tree diff when you need to:
|
|
35
|
-
|
|
36
|
-
* Understand what operations transformed one document into another
|
|
37
|
-
* Track content that moved between positions
|
|
38
|
-
* Analyze structural changes at a high level
|
|
39
|
-
* Compare documents with significant rearrangement
|
|
40
|
-
|
|
41
|
-
== Enabling semantic tree diff
|
|
42
|
-
|
|
43
|
-
Enable semantic tree diff by setting `diff_algorithm: :semantic`:
|
|
44
|
-
|
|
45
|
-
[source,ruby]
|
|
46
|
-
----
|
|
47
|
-
require 'canon/comparison'
|
|
48
|
-
|
|
49
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
50
|
-
verbose: true,
|
|
51
|
-
diff_algorithm: :semantic
|
|
52
|
-
)
|
|
53
|
-
----
|
|
54
|
-
|
|
55
|
-
The system will use tree diff instead of the default DOM diff. Both modes
|
|
56
|
-
cannot be used simultaneously.
|
|
57
|
-
|
|
58
|
-
== Detected operations
|
|
59
|
-
|
|
60
|
-
The semantic tree diff detects these eight operations in three levels:
|
|
61
|
-
|
|
62
|
-
* **Level 1**: Basic operations (INSERT, DELETE, UPDATE)
|
|
63
|
-
* **Level 2**: Structural operations (MOVE)
|
|
64
|
-
* **Level 3**: Semantic operations (MERGE, SPLIT, UPGRADE, DOWNGRADE)
|
|
65
|
-
|
|
66
|
-
=== Detection algorithms
|
|
67
|
-
|
|
68
|
-
The operation detector uses a pattern-based approach inspired by XDiff, XyDiff,
|
|
69
|
-
and JATS-diff research:
|
|
70
|
-
|
|
71
|
-
**Level 1 - Basic operations:**
|
|
72
|
-
|
|
73
|
-
* **INSERT**: Detected by finding nodes in tree2 with no match in tree1
|
|
74
|
-
* **DELETE**: Detected by finding nodes in tree1 with no match in tree2
|
|
75
|
-
* **UPDATE**: Detected when matched nodes have different content, attributes, or labels
|
|
76
|
-
|
|
77
|
-
**Level 2 - Structural operations:**
|
|
78
|
-
|
|
79
|
-
* **MOVE**: Detected when a matched node has a different parent in tree2 than
|
|
80
|
-
its matched parent from tree1
|
|
81
|
-
|
|
82
|
-
**Level 3 - Semantic operations:**
|
|
83
|
-
|
|
84
|
-
These are detected by analyzing patterns in the basic operations:
|
|
85
|
-
|
|
86
|
-
* **MERGE**: Pattern of (n-1) DELETE operations + 1 UPDATE operation where
|
|
87
|
-
deleted content appears in the updated node (80% text similarity threshold)
|
|
88
|
-
|
|
89
|
-
* **SPLIT**: Pattern of 1 DELETE operation + n INSERT operations where the
|
|
90
|
-
deleted node's content is distributed across the inserted nodes (80% text
|
|
91
|
-
similarity threshold)
|
|
92
|
-
|
|
93
|
-
* **UPGRADE**: Pattern of DELETE + INSERT where the inserted node is at a
|
|
94
|
-
shallower depth (promoted in hierarchy) with same label and similar content
|
|
95
|
-
(90% similarity threshold)
|
|
96
|
-
|
|
97
|
-
* **DOWNGRADE**: Pattern of DELETE + INSERT where the inserted node is at a
|
|
98
|
-
deeper depth (demoted in hierarchy) with same label and similar content (90%
|
|
99
|
-
similarity threshold)
|
|
100
|
-
|
|
101
|
-
When semantic operations are detected, the component basic operations are
|
|
102
|
-
removed and replaced with the higher-level semantic operation.
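The Level 1 and Level 2 rules described above can be sketched in plain Ruby. This is a minimal illustration, not Canon's implementation: the `Node` struct, the `basic_operations` helper, and the id-to-id `matching` hash are all hypothetical, and attribute comparison is left out for brevity.

```ruby
# Hypothetical node record for illustration; not Canon's internal class.
Node = Struct.new(:id, :label, :text, :parent_id)

# Derive basic operations from a precomputed matching (tree1 id => tree2 id).
def basic_operations(tree1, tree2, matching)
  by_id2 = tree2.to_h { |n| [n.id, n] }
  matched2 = matching.values
  ops = []

  tree1.each do |n1|
    n2 = by_id2[matching[n1.id]]
    if n2.nil?
      ops << [:delete, n1.id] # no partner in tree2
      next
    end
    # UPDATE: matched nodes differ in label or text (attributes omitted here)
    ops << [:update, n1.id] if n1.label != n2.label || n1.text != n2.text
    # MOVE: the partner's parent is not the partner of n1's parent
    ops << [:move, n1.id] if matching[n1.parent_id] != n2.parent_id
  end

  # INSERT: nodes in tree2 that no tree1 node was matched to
  tree2.each { |n2| ops << [:insert, n2.id] unless matched2.include?(n2.id) }
  ops
end

before = [
  Node.new(:r, "root", nil, nil),
  Node.new(:s1, "section1", nil, :r),
  Node.new(:s2, "section2", nil, :r),
  Node.new(:i, "item", "X", :s1)
]
after = [
  Node.new(:r2, "root", nil, nil),
  Node.new(:t1, "section1", nil, :r2),
  Node.new(:t2, "section2", nil, :r2),
  Node.new(:j, "item", "X", :t2),
  Node.new(:b, "b", "2", :r2)
]
matching = { r: :r2, s1: :t1, s2: :t2, i: :j }

ops = basic_operations(before, after, matching)
```

In this sketch the `item` node keeps its content but its partner sits under a different parent, so it is reported as a MOVE, while the unmatched `b` node is reported as an INSERT.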
|
|
103
|
-
|
|
104
|
-
=== Operation types
|
|
105
|
-
|
|
106
|
-
=== INSERT
|
|
107
|
-
|
|
108
|
-
A new node was added to the tree.
|
|
109
|
-
|
|
110
|
-
[example]
|
|
111
|
-
====
|
|
112
|
-
.Before
|
|
113
|
-
[source,xml]
|
|
114
|
-
----
|
|
115
|
-
<root>
|
|
116
|
-
<a>1</a>
|
|
117
|
-
</root>
|
|
118
|
-
----
|
|
119
|
-
|
|
120
|
-
.After
|
|
121
|
-
[source,xml]
|
|
122
|
-
----
|
|
123
|
-
<root>
|
|
124
|
-
<a>1</a>
|
|
125
|
-
<b>2</b>
|
|
126
|
-
</root>
|
|
127
|
-
----
|
|
128
|
-
|
|
129
|
-
Operation: INSERT `<b>2</b>`
|
|
130
|
-
====
|
|
131
|
-
|
|
132
|
-
=== DELETE
|
|
133
|
-
|
|
134
|
-
A node was removed from the tree.
|
|
135
|
-
|
|
136
|
-
[example]
|
|
137
|
-
====
|
|
138
|
-
.Before
|
|
139
|
-
[source,xml]
|
|
140
|
-
----
|
|
141
|
-
<root>
|
|
142
|
-
<a>1</a>
|
|
143
|
-
<b>2</b>
|
|
144
|
-
</root>
|
|
145
|
-
----
|
|
146
|
-
|
|
147
|
-
.After
|
|
148
|
-
[source,xml]
|
|
149
|
-
----
|
|
150
|
-
<root>
|
|
151
|
-
<a>1</a>
|
|
152
|
-
</root>
|
|
153
|
-
----
|
|
154
|
-
|
|
155
|
-
Operation: DELETE `<b>2</b>`
|
|
156
|
-
====
|
|
157
|
-
|
|
158
|
-
=== UPDATE
|
|
159
|
-
|
|
160
|
-
A node's content or attributes changed.
|
|
161
|
-
|
|
162
|
-
[example]
|
|
163
|
-
====
|
|
164
|
-
.Before
|
|
165
|
-
[source,xml]
|
|
166
|
-
----
|
|
167
|
-
<root>
|
|
168
|
-
<a id="1">old text</a>
|
|
169
|
-
</root>
|
|
170
|
-
----
|
|
171
|
-
|
|
172
|
-
.After
|
|
173
|
-
[source,xml]
|
|
174
|
-
----
|
|
175
|
-
<root>
|
|
176
|
-
<a id="2">new text</a>
|
|
177
|
-
</root>
|
|
178
|
-
----
|
|
179
|
-
|
|
180
|
-
Operation: UPDATE `<a>` (id and text changed)
|
|
181
|
-
====
|
|
182
|
-
|
|
183
|
-
=== MOVE
|
|
184
|
-
|
|
185
|
-
A node was relocated to a different position in the tree.
|
|
186
|
-
|
|
187
|
-
[example]
|
|
188
|
-
====
|
|
189
|
-
.Before
|
|
190
|
-
[source,xml]
|
|
191
|
-
----
|
|
192
|
-
<root>
|
|
193
|
-
<section1>
|
|
194
|
-
<item>X</item>
|
|
195
|
-
</section1>
|
|
196
|
-
<section2>
|
|
197
|
-
</section2>
|
|
198
|
-
</root>
|
|
199
|
-
----
|
|
200
|
-
|
|
201
|
-
.After
|
|
202
|
-
[source,xml]
|
|
203
|
-
----
|
|
204
|
-
<root>
|
|
205
|
-
<section1>
|
|
206
|
-
</section1>
|
|
207
|
-
<section2>
|
|
208
|
-
<item>X</item>
|
|
209
|
-
</section2>
|
|
210
|
-
</root>
|
|
211
|
-
----
|
|
212
|
-
|
|
213
|
-
Operation: MOVE `<item>X</item>` from section1 to section2
|
|
214
|
-
====
|
|
215
|
-
|
|
216
|
-
=== MERGE
|
|
217
|
-
|
|
218
|
-
Multiple nodes were combined into a single node.
|
|
219
|
-
|
|
220
|
-
[example]
|
|
221
|
-
====
|
|
222
|
-
.Before
|
|
223
|
-
[source,xml]
|
|
224
|
-
----
|
|
225
|
-
<root>
|
|
226
|
-
<paragraph>First sentence.</paragraph>
|
|
227
|
-
<paragraph>Second sentence.</paragraph>
|
|
228
|
-
<paragraph>Third sentence.</paragraph>
|
|
229
|
-
</root>
|
|
230
|
-
----
|
|
231
|
-
|
|
232
|
-
.After
|
|
233
|
-
[source,xml]
|
|
234
|
-
----
|
|
235
|
-
<root>
|
|
236
|
-
<paragraph>First sentence. Second sentence. Third sentence.</paragraph>
|
|
237
|
-
</root>
|
|
238
|
-
----
|
|
239
|
-
|
|
240
|
-
Operation: MERGE 3 `<paragraph>` nodes into one
|
|
241
|
-
====
|
|
242
|
-
|
|
243
|
-
=== SPLIT
|
|
244
|
-
|
|
245
|
-
A single node was divided into multiple nodes.
|
|
246
|
-
|
|
247
|
-
[example]
|
|
248
|
-
====
|
|
249
|
-
.Before
|
|
250
|
-
[source,xml]
|
|
251
|
-
----
|
|
252
|
-
<root>
|
|
253
|
-
<section>
|
|
254
|
-
<title>Combined Content</title>
|
|
255
|
-
<paragraph>First part. Second part.</paragraph>
|
|
256
|
-
</section>
|
|
257
|
-
</root>
|
|
258
|
-
----
|
|
259
|
-
|
|
260
|
-
.After
|
|
261
|
-
[source,xml]
|
|
262
|
-
----
|
|
263
|
-
<root>
|
|
264
|
-
<section>
|
|
265
|
-
<title>First Content</title>
|
|
266
|
-
<paragraph>First part.</paragraph>
|
|
267
|
-
</section>
|
|
268
|
-
<section>
|
|
269
|
-
<title>Second Content</title>
|
|
270
|
-
<paragraph>Second part.</paragraph>
|
|
271
|
-
</section>
|
|
272
|
-
</root>
|
|
273
|
-
----
|
|
274
|
-
|
|
275
|
-
Operation: SPLIT 1 `<section>` into 2 sections
|
|
276
|
-
====
|
|
277
|
-
|
|
278
|
-
=== UPGRADE
|
|
279
|
-
|
|
280
|
-
A node was promoted to a higher level in the hierarchy (decreased depth).
|
|
281
|
-
|
|
282
|
-
[example]
|
|
283
|
-
====
|
|
284
|
-
.Before
|
|
285
|
-
[source,xml]
|
|
286
|
-
----
|
|
287
|
-
<root>
|
|
288
|
-
<chapter>
|
|
289
|
-
<section>
|
|
290
|
-
<subsection>
|
|
291
|
-
<title>Important Topic</title>
|
|
292
|
-
<content>This is important.</content>
|
|
293
|
-
</subsection>
|
|
294
|
-
</section>
|
|
295
|
-
</chapter>
|
|
296
|
-
</root>
|
|
297
|
-
----
|
|
298
|
-
|
|
299
|
-
.After
|
|
300
|
-
[source,xml]
|
|
301
|
-
----
|
|
302
|
-
<root>
|
|
303
|
-
<chapter>
|
|
304
|
-
<section>
|
|
305
|
-
<title>Important Topic</title>
|
|
306
|
-
<content>This is important.</content>
|
|
307
|
-
</section>
|
|
308
|
-
</chapter>
|
|
309
|
-
</root>
|
|
310
|
-
----
|
|
311
|
-
|
|
312
|
-
Operation: UPGRADE `<subsection>` promoted to `<section>` level
|
|
313
|
-
====
|
|
314
|
-
|
|
315
|
-
=== DOWNGRADE
|
|
316
|
-
|
|
317
|
-
A node was demoted to a lower level in the hierarchy (increased depth).
|
|
318
|
-
|
|
319
|
-
[example]
|
|
320
|
-
====
|
|
321
|
-
.Before
|
|
322
|
-
[source,json]
|
|
323
|
-
----
|
|
324
|
-
{
|
|
325
|
-
"items": [
|
|
326
|
-
{ "id": 1, "name": "Item A", "type": "primary" }
|
|
327
|
-
]
|
|
328
|
-
}
|
|
329
|
-
----
|
|
330
|
-
|
|
331
|
-
.After
|
|
332
|
-
[source,json]
|
|
333
|
-
----
|
|
334
|
-
{
|
|
335
|
-
"items": [
|
|
336
|
-
{
|
|
337
|
-
"id": 1,
|
|
338
|
-
"details": {
|
|
339
|
-
"name": "Item A",
|
|
340
|
-
"type": "primary"
|
|
341
|
-
}
|
|
342
|
-
}
|
|
343
|
-
]
|
|
344
|
-
}
|
|
345
|
-
----
|
|
346
|
-
|
|
347
|
-
Operation: DOWNGRADE `name` and `type` demoted into nested `details` object
|
|
348
|
-
====
|
|
349
|
-
|
|
350
|
-
== Matching algorithm
|
|
351
|
-
|
|
352
|
-
The semantic tree diff uses a hybrid three-phase matching pipeline:
|
|
353
|
-
|
|
354
|
-
=== Phase 1: Hash-based exact matching
|
|
355
|
-
|
|
356
|
-
Matches nodes with identical structure and content using hash signatures.
|
|
357
|
-
|
|
358
|
-
* Fast O(n) exact matching
|
|
359
|
-
* Eliminates unchanged subtrees
|
|
360
|
-
* Reduces problem size for subsequent phases
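A hash-based exact matching phase of this kind might look like the following sketch. It assumes a hypothetical `TreeNode` struct and a SHA-256 signature over label, text, and ordered child signatures; the signature scheme Canon actually uses is not stated in this document.

```ruby
require "digest/sha2"

# Hypothetical tree node for illustration only.
TreeNode = Struct.new(:label, :text, :children)

# Signature of a subtree: label, text, and the ordered child signatures.
def signature(node, cache)
  cache[node] ||= Digest::SHA256.hexdigest(
    [node.label, node.text.to_s, *node.children.map { |c| signature(c, cache) }].join("\u0000")
  )
end

# Pre-order traversal helper.
def each_node(node, &block)
  yield node
  node.children.each { |c| each_node(c, &block) }
end

# Pair every tree1 node with an unused tree2 node that has the same signature.
def exact_matches(root1, root2)
  cache = {}.compare_by_identity # key by object identity, not struct equality
  index = Hash.new { |h, k| h[k] = [] }
  each_node(root2) { |n| index[signature(n, cache)] << n }

  pairs = []
  each_node(root1) do |n|
    partner = index[signature(n, cache)].shift
    pairs << [n, partner] if partner
  end
  pairs
end

doc1 = TreeNode.new("root", nil, [TreeNode.new("a", "1", []), TreeNode.new("b", "2", [])])
doc2 = TreeNode.new("root", nil, [TreeNode.new("a", "1", []), TreeNode.new("b", "3", [])])
pairs = exact_matches(doc1, doc2)
```

Only the unchanged `a` subtree is paired here; the roots and the `b` nodes have different signatures and would be left for the similarity phase.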
|
|
361
|
-
|
|
362
|
-
=== Phase 2: Similarity-based matching
|
|
363
|
-
|
|
364
|
-
Matches similar but not identical nodes using similarity scoring.
|
|
365
|
-
|
|
366
|
-
* Compares node names, attributes, text content, and structure
|
|
367
|
-
* Configurable similarity threshold (default: 0.95)
|
|
368
|
-
* Uses weighted similarity metrics
|
|
369
|
-
|
|
370
|
-
=== Phase 3: Structural propagation
|
|
371
|
-
|
|
372
|
-
Propagates matches from parents and children to improve matching quality.
|
|
373
|
-
|
|
374
|
-
* Top-down propagation from matched parents
|
|
375
|
-
* Bottom-up propagation from matched children
|
|
376
|
-
* Resolves ambiguous matches
|
|
377
|
-
|
|
378
|
-
== Configuration options
|
|
379
|
-
|
|
380
|
-
=== similarity_threshold
|
|
381
|
-
|
|
382
|
-
Controls the minimum similarity score for matching nodes.
|
|
383
|
-
|
|
384
|
-
[source,ruby]
|
|
385
|
-
----
|
|
386
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
387
|
-
verbose: true,
|
|
388
|
-
diff_algorithm: :semantic,
|
|
389
|
-
match: {
|
|
390
|
-
similarity_threshold: 0.90 # Default: 0.95
|
|
391
|
-
}
|
|
392
|
-
)
|
|
393
|
-
----
|
|
394
|
-
|
|
395
|
-
Where,
|
|
396
|
-
|
|
397
|
-
* Higher values (e.g., 0.99): More conservative, only very similar nodes match
|
|
398
|
-
* Lower values (e.g., 0.80): More aggressive, allows less similar nodes to match
|
|
399
|
-
* Default 0.95: Balanced approach suitable for most use cases
|
|
400
|
-
|
|
401
|
-
=== hash_matching
|
|
402
|
-
|
|
403
|
-
Enable or disable hash-based exact matching phase.
|
|
404
|
-
|
|
405
|
-
[source,ruby]
|
|
406
|
-
----
|
|
407
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
408
|
-
verbose: true,
|
|
409
|
-
diff_algorithm: :semantic,
|
|
410
|
-
match: {
|
|
411
|
-
hash_matching: true # Default: true
|
|
412
|
-
}
|
|
413
|
-
)
|
|
414
|
-
----
|
|
415
|
-
|
|
416
|
-
Disable only if exact matching causes issues with your data.
|
|
417
|
-
|
|
418
|
-
=== similarity_matching
|
|
419
|
-
|
|
420
|
-
Enable or disable similarity-based matching phase.
|
|
421
|
-
|
|
422
|
-
[source,ruby]
|
|
423
|
-
----
|
|
424
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
425
|
-
verbose: true,
|
|
426
|
-
diff_algorithm: :semantic,
|
|
427
|
-
match: {
|
|
428
|
-
similarity_matching: true # Default: true
|
|
429
|
-
}
|
|
430
|
-
)
|
|
431
|
-
----
|
|
432
|
-
|
|
433
|
-
Disable for faster but less accurate matching.
|
|
434
|
-
|
|
435
|
-
=== propagation
|
|
436
|
-
|
|
437
|
-
Enable or disable structural propagation phase.
|
|
438
|
-
|
|
439
|
-
[source,ruby]
|
|
440
|
-
----
|
|
441
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
442
|
-
verbose: true,
|
|
443
|
-
diff_algorithm: :semantic,
|
|
444
|
-
match: {
|
|
445
|
-
propagation: true # Default: true
|
|
446
|
-
}
|
|
447
|
-
)
|
|
448
|
-
----
|
|
449
|
-
|
|
450
|
-
Disable for simpler but potentially less accurate results.
|
|
451
|
-
|
|
452
|
-
== Result format
|
|
453
|
-
|
|
454
|
-
When `verbose: true` is set, the result includes operation details:
|
|
455
|
-
|
|
456
|
-
[source,ruby]
|
|
457
|
-
----
|
|
458
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
459
|
-
verbose: true,
|
|
460
|
-
diff_algorithm: :semantic
|
|
461
|
-
)
|
|
462
|
-
|
|
463
|
-
# Access operations
|
|
464
|
-
result.operations.each do |op|
|
|
465
|
-
puts "#{op.type}: #{op.node1_path} -> #{op.node2_path}"
|
|
466
|
-
end
|
|
467
|
-
|
|
468
|
-
# Access matching statistics
|
|
469
|
-
stats = result.match_options[:tree_diff_statistics]
|
|
470
|
-
puts "Total nodes tree1: #{stats[:tree1_node_count]}"
|
|
471
|
-
puts "Total nodes tree2: #{stats[:tree2_node_count]}"
|
|
472
|
-
puts "Matched nodes: #{stats[:matched_count]}"
|
|
473
|
-
puts "Match rate: #{stats[:match_rate]}"
|
|
474
|
-
|
|
475
|
-
# Access matching details
|
|
476
|
-
matching = result.match_options[:tree_diff_matching]
|
|
477
|
-
matching.each_pair do |node1, node2|
|
|
478
|
-
puts "Matched: #{node1.path} <-> #{node2.path}"
|
|
479
|
-
end
|
|
480
|
-
----
|
|
481
|
-
|
|
482
|
-
== Examples
|
|
483
|
-
|
|
484
|
-
=== Basic comparison with operations
|
|
485
|
-
|
|
486
|
-
[example]
|
|
487
|
-
====
|
|
488
|
-
[source,ruby]
|
|
489
|
-
----
|
|
490
|
-
require 'canon/comparison'
|
|
491
|
-
|
|
492
|
-
xml1 = <<~XML
|
|
493
|
-
<article>
|
|
494
|
-
<title>Old Title</title>
|
|
495
|
-
<section id="1">
|
|
496
|
-
<p>Content A</p>
|
|
497
|
-
</section>
|
|
498
|
-
</article>
|
|
499
|
-
XML
|
|
500
|
-
|
|
501
|
-
xml2 = <<~XML
|
|
502
|
-
<article>
|
|
503
|
-
<title>New Title</title>
|
|
504
|
-
<section id="2">
|
|
505
|
-
<p>Content B</p>
|
|
506
|
-
</section>
|
|
507
|
-
<section id="1">
|
|
508
|
-
<p>Content A</p>
|
|
509
|
-
</section>
|
|
510
|
-
</article>
|
|
511
|
-
XML
|
|
512
|
-
|
|
513
|
-
result = Canon::Comparison.equivalent?(xml1, xml2,
|
|
514
|
-
verbose: true,
|
|
515
|
-
diff_algorithm: :semantic
|
|
516
|
-
)
|
|
517
|
-
|
|
518
|
-
result.operations.each do |op|
|
|
519
|
-
case op.type
|
|
520
|
-
when :insert
|
|
521
|
-
puts "Added: #{op.node2.name} at #{op.node2_path}"
|
|
522
|
-
when :update
|
|
523
|
-
puts "Changed: #{op.node1_path}"
|
|
524
|
-
when :move
|
|
525
|
-
puts "Moved: #{op.node1_path} -> #{op.node2_path}"
|
|
526
|
-
end
|
|
527
|
-
end
|
|
528
|
-
----
|
|
529
|
-
|
|
530
|
-
Output:
|
|
531
|
-
```
|
|
532
|
-
Changed: /article/title
|
|
533
|
-
Added: section at /article/section[1]
|
|
534
|
-
```
|
|
535
|
-
====
|
|
536
|
-
|
|
537
|
-
=== Detecting moves
|
|
538
|
-
|
|
539
|
-
[example]
|
|
540
|
-
====
|
|
541
|
-
[source,ruby]
|
|
542
|
-
----
|
|
543
|
-
json1 = <<~JSON
|
|
544
|
-
{
|
|
545
|
-
"sections": [
|
|
546
|
-
{ "id": "intro", "content": "Introduction text" },
|
|
547
|
-
{ "id": "body", "content": "Body text" }
|
|
548
|
-
]
|
|
549
|
-
}
|
|
550
|
-
JSON
|
|
551
|
-
|
|
552
|
-
json2 = <<~JSON
|
|
553
|
-
{
|
|
554
|
-
"sections": [
|
|
555
|
-
{ "id": "body", "content": "Body text" },
|
|
556
|
-
{ "id": "intro", "content": "Introduction text" }
|
|
557
|
-
]
|
|
558
|
-
}
|
|
559
|
-
JSON
|
|
560
|
-
|
|
561
|
-
result = Canon::Comparison.equivalent?(json1, json2,
|
|
562
|
-
verbose: true,
|
|
563
|
-
diff_algorithm: :semantic
|
|
564
|
-
)
|
|
565
|
-
|
|
566
|
-
moves = result.operations.select { |op| op.type == :move }
|
|
567
|
-
puts "Detected #{moves.size} move operations"
|
|
568
|
-
----
|
|
569
|
-
====
|
|
570
|
-
|
|
571
|
-
=== Adjusting similarity threshold
|
|
572
|
-
|
|
573
|
-
[example]
|
|
574
|
-
====
|
|
575
|
-
[source,ruby]
|
|
576
|
-
----
|
|
577
|
-
# Strict matching - only very similar nodes match
|
|
578
|
-
result_strict = Canon::Comparison.equivalent?(doc1, doc2,
|
|
579
|
-
verbose: true,
|
|
580
|
-
diff_algorithm: :semantic,
|
|
581
|
-
match: {
|
|
582
|
-
similarity_threshold: 0.99
|
|
583
|
-
}
|
|
584
|
-
)
|
|
585
|
-
|
|
586
|
-
# Lenient matching - allow more variation
|
|
587
|
-
result_lenient = Canon::Comparison.equivalent?(doc1, doc2,
|
|
588
|
-
verbose: true,
|
|
589
|
-
diff_algorithm: :semantic,
|
|
590
|
-
match: {
|
|
591
|
-
similarity_threshold: 0.85
|
|
592
|
-
}
|
|
593
|
-
)
|
|
594
|
-
|
|
595
|
-
puts "Strict: #{result_strict.operations.size} operations"
|
|
596
|
-
puts "Lenient: #{result_lenient.operations.size} operations"
|
|
597
|
-
----
|
|
598
|
-
====
|
|
599
|
-
|
|
600
|
-
== Format support
|
|
601
|
-
|
|
602
|
-
Semantic tree diff works with all Canon-supported formats:
|
|
603
|
-
|
|
604
|
-
=== XML
|
|
605
|
-
|
|
606
|
-
Full support including namespaces, attributes, and mixed content.
|
|
607
|
-
|
|
608
|
-
[source,ruby]
|
|
609
|
-
----
|
|
610
|
-
result = Canon::Comparison.equivalent?(xml1, xml2,
|
|
611
|
-
verbose: true,
|
|
612
|
-
diff_algorithm: :semantic
|
|
613
|
-
)
|
|
614
|
-
----
|
|
615
|
-
|
|
616
|
-
=== JSON
|
|
617
|
-
|
|
618
|
-
Supports objects, arrays, and primitive values.
|
|
619
|
-
|
|
620
|
-
[source,ruby]
|
|
621
|
-
----
|
|
622
|
-
result = Canon::Comparison.equivalent?(json1, json2,
|
|
623
|
-
verbose: true,
|
|
624
|
-
diff_algorithm: :semantic
|
|
625
|
-
)
|
|
626
|
-
----
|
|
627
|
-
|
|
628
|
-
=== HTML
|
|
629
|
-
|
|
630
|
-
Handles HTML 4/5 and XHTML documents.
|
|
631
|
-
|
|
632
|
-
[source,ruby]
|
|
633
|
-
----
|
|
634
|
-
result = Canon::Comparison.equivalent?(html1, html2,
|
|
635
|
-
verbose: true,
|
|
636
|
-
diff_algorithm: :semantic
|
|
637
|
-
)
|
|
638
|
-
----
|
|
639
|
-
|
|
640
|
-
=== YAML
|
|
641
|
-
|
|
642
|
-
Processes YAML documents with nested structures.
|
|
643
|
-
|
|
644
|
-
[source,ruby]
|
|
645
|
-
----
|
|
646
|
-
result = Canon::Comparison.equivalent?(yaml1, yaml2,
|
|
647
|
-
verbose: true,
|
|
648
|
-
diff_algorithm: :semantic
|
|
649
|
-
)
|
|
650
|
-
----
|
|
651
|
-
|
|
652
|
-
== Interaction with other options
|
|
653
|
-
|
|
654
|
-
=== Preprocessing
|
|
655
|
-
|
|
656
|
-
Preprocessing applies before tree diff:
|
|
657
|
-
|
|
658
|
-
[source,ruby]
|
|
659
|
-
----
|
|
660
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
661
|
-
verbose: true,
|
|
662
|
-
preprocessing: :c14n, # Applied first
|
|
663
|
-
diff_algorithm: :semantic
|
|
664
|
-
)
|
|
665
|
-
----
|
|
666
|
-
|
|
667
|
-
The preprocessing option normalizes documents before tree diff runs, ensuring
|
|
668
|
-
consistent comparison.
|
|
669
|
-
|
|
670
|
-
=== Match dimensions
|
|
671
|
-
|
|
672
|
-
Match dimensions DO apply with semantic diff and control whether detected
|
|
673
|
-
operations are classified as **normative** (must-fix) or **informative**
|
|
674
|
-
(can-ignore) differences.
|
|
675
|
-
|
|
676
|
-
When a match dimension is set to `:ignore`, operations related to that
|
|
677
|
-
dimension are marked as informative. When set to `:strict` or `:normalize`,
|
|
678
|
-
they are marked as normative.
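The classification rule above reduces to a single lookup. The sketch below is illustrative only: the `Difference` struct and `classify` helper are hypothetical, and treating an unset dimension as `:strict` is an assumption rather than documented behaviour.

```ruby
# Hypothetical difference record; Canon's own classes may differ.
Difference = Struct.new(:dimension, :reason, :normative) do
  def normative?
    normative
  end
end

# A dimension set to :ignore yields an informative difference;
# :strict and :normalize yield a normative one.
def classify(dimension, reason, match_options)
  behaviour = match_options.fetch(dimension, :strict) # assumed default
  Difference.new(dimension, reason, behaviour != :ignore)
end

options = { text_content: :ignore, element_position: :normalize }
text_diff = classify(:text_content, "text changed", options)
move_diff = classify(:element_position, "node moved", options)
```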
|
|
679
|
-
|
|
680
|
-
==== Dimension mapping to operations
|
|
681
|
-
|
|
682
|
-
[cols="1,1,2"]
|
|
683
|
-
|===
|
|
684
|
-
|Match Dimension |Operation Types |Effect when `:ignore`
|
|
685
|
-
|
|
686
|
-
|`text_content`
|
|
687
|
-
|UPDATE (text changes)
|
|
688
|
-
|Text content changes → informative
|
|
689
|
-
|
|
690
|
-
|`attribute_values`
|
|
691
|
-
|UPDATE (attribute changes)
|
|
692
|
-
|Attribute value changes → informative
|
|
693
|
-
|
|
694
|
-
|`attribute_order`
|
|
695
|
-
|UPDATE (attribute reordering)
|
|
696
|
-
|Attribute order changes → informative
|
|
697
|
-
|
|
698
|
-
|`element_hierarchy`
|
|
699
|
-
|UPGRADE, DOWNGRADE
|
|
700
|
-
|Hierarchy depth changes → informative
|
|
701
|
-
|
|
702
|
-
|`element_position`
|
|
703
|
-
|MOVE
|
|
704
|
-
|Element position changes → informative
|
|
705
|
-
|
|
706
|
-
|`element_hierarchy`
|
|
707
|
-
|MOVE (between parents)
|
|
708
|
-
|Hierarchy changes → informative
|
|
709
|
-
|
|
710
|
-
|`comments`
|
|
711
|
-
|UPDATE, INSERT, DELETE (comments)
|
|
712
|
-
|Comment changes → informative
|
|
713
|
-
|
|
714
|
-
|`structural_whitespace`
|
|
715
|
-
|UPDATE (whitespace)
|
|
716
|
-
|Whitespace changes → informative
|
|
717
|
-
|===
|
|
718
|
-
|
|
719
|
-
==== Examples
|
|
720
|
-
|
|
721
|
-
.Ignoring text content changes
|
|
722
|
-
[example]
|
|
723
|
-
====
|
|
724
|
-
[source,ruby]
|
|
725
|
-
----
|
|
726
|
-
# Text UPDATE operations are marked as informative
|
|
727
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
728
|
-
diff_algorithm: :semantic,
|
|
729
|
-
verbose: true,
|
|
730
|
-
match: {
|
|
731
|
-
text_content: :ignore # Text changes → informative
|
|
732
|
-
}
|
|
733
|
-
)
|
|
734
|
-
|
|
735
|
-
# Check if there are any normative (structural) differences
|
|
736
|
-
has_normative = result.differences.any?(&:normative?)
|
|
737
|
-
----
|
|
738
|
-
====
|
|
739
|
-
|
|
740
|
-
.Ignoring element position changes
|
|
741
|
-
[example]
|
|
742
|
-
====
|
|
743
|
-
[source,ruby]
|
|
744
|
-
----
|
|
745
|
-
# MOVE operations are marked as informative
|
|
746
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
747
|
-
diff_algorithm: :semantic,
|
|
748
|
-
verbose: true,
|
|
749
|
-
match: {
|
|
750
|
-
element_position: :ignore # Moves → informative
|
|
751
|
-
}
|
|
752
|
-
)
|
|
753
|
-
|
|
754
|
-
# Only structural changes are normative
|
|
755
|
-
normative_ops = result.differences.select(&:normative?)
|
|
756
|
-
puts "Normative changes: #{normative_ops.map(&:dimension).uniq}"
|
|
757
|
-
----
|
|
758
|
-
====
|
|
759
|
-
|
|
760
|
-
.Combining multiple dimension settings
|
|
761
|
-
[example]
|
|
762
|
-
====
|
|
763
|
-
[source,ruby]
|
|
764
|
-
----
|
|
765
|
-
# Complex filtering: care about structure but not formatting
|
|
766
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
767
|
-
diff_algorithm: :semantic,
|
|
768
|
-
verbose: true,
|
|
769
|
-
match: {
|
|
770
|
-
# Normative dimensions (must match)
|
|
771
|
-
element_structure: :strict,
|
|
772
|
-
element_hierarchy: :strict,
|
|
773
|
-
|
|
774
|
-
# Informative dimensions (can differ)
|
|
775
|
-
text_content: :ignore,
|
|
776
|
-
attribute_order: :ignore,
|
|
777
|
-
structural_whitespace: :ignore,
|
|
778
|
-
comments: :ignore
|
|
779
|
-
}
|
|
780
|
-
)
|
|
781
|
-
|
|
782
|
-
# Report only normative differences
|
|
783
|
-
if result.differences.any?(&:normative?)
|
|
784
|
-
puts "Structural differences found!"
|
|
785
|
-
result.differences.select(&:normative?).each do |diff|
|
|
786
|
-
puts " #{diff.dimension}: #{diff.reason}"
|
|
787
|
-
end
|
|
788
|
-
end
|
|
789
|
-
----
|
|
790
|
-
====
|
|
791
|
-
|
|
792
|
-
==== Semantic-specific dimensions
|
|
793
|
-
|
|
794
|
-
The following semantic-specific dimensions are only meaningful with semantic diff:
|
|
795
|
-
|
|
796
|
-
* `element_hierarchy`: Controls UPGRADE/DOWNGRADE operations (depth changes)
|
|
797
|
-
* `element_position`: Controls MOVE operations (position changes)
|
|
798
|
-
* `element_hierarchy`: Controls MOVE operations (parent changes)
|
|
799
|
-
|
|
800
|
-
With DOM diff, these dimensions have no effect since DOM diff doesn't detect
|
|
801
|
-
these operation types.
|
|
802
|
-
|
|
803
|
-
== Performance considerations
|
|
804
|
-
|
|
805
|
-
=== Time complexity
|
|
806
|
-
|
|
807
|
-
* Hash matching: O(n) where n is node count
|
|
808
|
-
* Similarity matching: O(n²) worst case, O(n log n) typical
|
|
809
|
-
* Propagation: O(n)
|
|
810
|
-
|
|
811
|
-
For large documents (>10,000 nodes), consider:
|
|
812
|
-
|
|
813
|
-
* Disabling similarity_matching if exact matches suffice
|
|
814
|
-
* Increasing similarity_threshold to reduce candidate matches
|
|
815
|
-
* Using preprocessing to reduce document size
|
|
816
|
-
|
|
817
|
-
=== Memory usage
|
|
818
|
-
|
|
819
|
-
The tree diff maintains:
|
|
820
|
-
|
|
821
|
-
* Full tree representations of both documents
|
|
822
|
-
* Hash signatures for all nodes
|
|
823
|
-
* Matching state and operations
|
|
824
|
-
|
|
825
|
-
For very large documents, monitor memory usage and consider processing in
|
|
826
|
-
chunks if needed.
|
|
827
|
-
|
|
828
|
-
== Troubleshooting
|
|
829
|
-
|
|
830
|
-
=== Too many/too few matches
|
|
831
|
-
|
|
832
|
-
Adjust `similarity_threshold`:
|
|
833
|
-
|
|
834
|
-
[source,ruby]
|
|
835
|
-
----
|
|
836
|
-
# Too many false matches? Increase threshold
|
|
837
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
838
|
-
diff_algorithm: :semantic,
|
|
839
|
-
match: { similarity_threshold: 0.98 }
|
|
840
|
-
)
|
|
841
|
-
|
|
842
|
-
# Too few matches? Decrease threshold
|
|
843
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
844
|
-
diff_algorithm: :semantic,
|
|
845
|
-
match: { similarity_threshold: 0.90 }
|
|
846
|
-
)
|
|
847
|
-
----
|
|
848
|
-
|
|
849
|
-
=== Incorrect move detection
|
|
850
|
-
|
|
851
|
-
Moves are detected when nodes match but positions change. If move detection is
|
|
852
|
-
incorrect:
|
|
853
|
-
|
|
854
|
-
* Verify nodes are truly similar (check attributes and content)
|
|
855
|
-
* Adjust similarity_threshold
|
|
856
|
-
* Check if preprocessing is needed to normalize content
|
|
857
|
-
|
|
858
|
-
=== Performance issues
|
|
859
|
-
|
|
860
|
-
For slow comparisons:
|
|
861
|
-
|
|
862
|
-
[source,ruby]
|
|
863
|
-
----
|
|
864
|
-
# Disable expensive phases
|
|
865
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
866
|
-
  diff_algorithm: :semantic,
|
|
867
|
-
  match: {
|
|
868
|
-
    similarity_matching: false,  # Skip if exact matches suffice
|
|
869
|
-
    propagation: false           # Skip if not needed
|
|
870
|
-
  }
|
|
871
|
-
)
|
|
872
|
-
----
|
|
873
|
-
|
|
874
|
-
== Metadata Element Classification
|
|
875
|
-
|
|
876
|
-
Metadata elements are presentation/formatting elements that don't affect semantic equivalence. The semantic tree diff automatically classifies operations on these elements as **informative** (non-normative).
|
|
877
|
-
|
|
878
|
-
=== Metadata Elements
|
|
879
|
-
|
|
880
|
-
The following elements are treated as metadata:
|
|
881
|
-
|
|
882
|
-
* `semx` - Semantic markup
|
|
883
|
-
* `fmt-concept` - Formatted concept
|
|
884
|
-
* `fmt-name` - Formatted name
|
|
885
|
-
* `fmt-title` - Formatted title
|
|
886
|
-
* `fmt-xref` - Formatted cross-reference
|
|
887
|
-
* `fmt-eref` - Formatted external reference
|
|
888
|
-
* `fmt-termref` - Formatted term reference
|
|
889
|
-
* `fmt-element-name` - Formatted element name
|
|
890
|
-
* `fmt-link` - Formatted link
|
|
891
|
-
* `autonum` - Automatic numbering
|
|
892
|
-
|
|
893
|
-
=== Classification Rules
|
|
894
|
-
|
|
895
|
-
When an operation involves a metadata element:
|
|
896
|
-
|
|
897
|
-
* **INSERT of metadata element**: Marked as informative
|
|
898
|
-
* **DELETE of metadata element**: Marked as informative
|
|
899
|
-
* **UPDATE of metadata element**: Marked as informative
|
|
900
|
-
* **MOVE of metadata element**: Marked as informative
|
|
901
|
-
|
|
902
|
-
This ensures that presentation-layer changes don't cause false positives in semantic comparison.
|
|
903
|
-
|
|
904
|
-
.Example: Metadata element changes are informative
|
|
905
|
-
[example]
|
|
906
|
-
====
|
|
907
|
-
[source,xml]
|
|
908
|
-
----
|
|
909
|
-
<!-- Document 1 -->
|
|
910
|
-
<section>
|
|
911
|
-
  <title>Introduction</title>
|
|
912
|
-
  <autonum>1.1</autonum>
|
|
913
|
-
</section>
|
|
914
|
-
|
|
915
|
-
<!-- Document 2 -->
|
|
916
|
-
<section>
|
|
917
|
-
  <title>Introduction</title>
|
|
918
|
-
  <autonum>1.2</autonum>  <!-- Changed -->
|
|
919
|
-
  <semx element="title">Introduction</semx>  <!-- Added -->
|
|
920
|
-
</section>
|
|
921
|
-
----
|
|
922
|
-
|
|
923
|
-
Operations detected:
|
|
924
|
-
|
|
925
|
-
* UPDATE `<autonum>` (1.1 → 1.2): **Informative** (metadata element)
|
|
926
|
-
* INSERT `<semx>`: **Informative** (metadata element)
|
|
927
|
-
|
|
928
|
-
The documents are considered semantically equivalent despite these presentation changes.
|
|
929
|
-
====
|
|
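The classification rules above can be sketched in a few lines of Ruby. This is a simplified illustration, not Canon's `OperationConverter`. The `Operation` struct and both helper methods are hypothetical, and every operation on a non-metadata element is treated as normative, which ignores the effect of match options.

```ruby
require "set"

# Element names taken from the metadata list in this section.
METADATA_ELEMENTS = Set.new(%w[
  semx fmt-concept fmt-name fmt-title fmt-xref fmt-eref
  fmt-termref fmt-element-name fmt-link autonum
]).freeze

# Hypothetical operation record: type is :insert, :delete, :update or :move.
Operation = Struct.new(:type, :element_name)

# Any operation on a metadata element is informative, whatever its type.
# Assumption: operations on other elements are all normative.
def classify(operation)
  METADATA_ELEMENTS.include?(operation.element_name) ? :informative : :normative
end

# Documents are equivalent when no normative operation remains.
def semantically_equivalent?(operations)
  operations.all? { |op| classify(op) == :informative }
end

ops = [
  Operation.new(:update, "autonum"), # 1.1 -> 1.2
  Operation.new(:insert, "semx"),
]

puts semantically_equivalent?(ops)
puts semantically_equivalent?(ops + [Operation.new(:update, "title")])
```

The first call corresponds to the example above, where only `<autonum>` and `<semx>` change. The second adds a change to `<title>`, which is not in the metadata list.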
930
|
-
|
|
931
|
-
=== Configuring Metadata Elements
|
|
932
|
-
|
|
933
|
-
The metadata element list is defined in link:../lib/canon/tree_diff/operation_converter.rb[`OperationConverter::METADATA_ELEMENTS`] (line 33).
|
|
934
|
-
|
|
935
|
-
To treat additional elements as metadata, modify the `METADATA_ELEMENTS` constant. Alternatively, use match dimensions to ignore specific changes.
|
|
936
|
-
|
|
937
|
-
== Whitespace-Sensitive Elements
|
|
938
|
-
|
|
939
|
-
In certain elements, whitespace is semantically significant. For these elements, whitespace is **not normalized** even when `text_content: :normalize` is set.
|
|
940
|
-
|
|
941
|
-
=== Whitespace-Sensitive Element List
|
|
942
|
-
|
|
943
|
-
* `pre` - Preformatted text
|
|
944
|
-
* `code` - Code blocks
|
|
945
|
-
* `textarea` - Text input areas
|
|
946
|
-
* `script` - JavaScript code
|
|
947
|
-
* `style` - CSS styles
|
|
948
|
-
|
|
949
|
-
=== Whitespace Handling
|
|
950
|
-
|
|
951
|
-
For whitespace-sensitive elements:
|
|
952
|
-
|
|
953
|
-
* Text content is preserved exactly as-is during tree conversion
|
|
954
|
-
* Normalization does NOT apply to text content
|
|
955
|
-
* Whitespace differences are **normative** (must match exactly)
|
|
956
|
-
* Leading/trailing spaces are significant
|
|
957
|
-
|
|
958
|
-
.Example: Whitespace in `<pre>` is significant
|
|
959
|
-
[example]
|
|
960
|
-
====
|
|
961
|
-
[source,html]
|
|
962
|
-
----
|
|
963
|
-
<!-- Document 1 -->
|
|
964
|
-
<pre>
|
|
965
|
-
  Line 1
|
|
966
|
-
  Line 2
|
|
967
|
-
</pre>
|
|
968
|
-
|
|
969
|
-
<!-- Document 2 -->
|
|
970
|
-
<pre>
|
|
971
|
-
Line 1
|
|
972
|
-
Line 2
|
|
973
|
-
</pre>
|
|
974
|
-
----
|
|
975
|
-
|
|
976
|
-
Operation detected:
|
|
977
|
-
|
|
978
|
-
* UPDATE `<pre>` text content: **Normative** (whitespace differs)
|
|
979
|
-
|
|
980
|
-
These documents are NOT equivalent because `<pre>` preserves whitespace. The leading spaces in Document 1 are semantically significant.
|
|
981
|
-
====
|
|
982
|
-
|
|
983
|
-
.Example: Whitespace in `<p>` is normalized
|
|
984
|
-
[example]
|
|
985
|
-
====
|
|
986
|
-
[source,html]
|
|
987
|
-
----
|
|
988
|
-
<!-- Document 1 -->
|
|
989
|
-
<p>
|
|
990
|
-
Line 1
|
|
991
|
-
Line 2
|
|
992
|
-
</p>
|
|
993
|
-
|
|
994
|
-
<!-- Document 2 -->
|
|
995
|
-
<p>
|
|
996
|
-
Line 1
|
|
997
|
-
Line 2
|
|
998
|
-
</p>
|
|
999
|
-
----
|
|
1000
|
-
|
|
1001
|
-
With `text_content: :normalize`:
|
|
1002
|
-
|
|
1003
|
-
* No UPDATE detected (whitespace normalized in `<p>`)
|
|
1004
|
-
|
|
1005
|
-
These documents ARE equivalent because whitespace inside `<p>` is normalized when `text_content: :normalize` is set.
|
|
1006
|
-
====
|
|
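The two examples above can be reproduced with a short sketch. It assumes that normalization means trimming the text and collapsing whitespace runs to a single space. The helper names are hypothetical and are not part of Canon's API.

```ruby
# Element names taken from the whitespace-sensitive list in this section.
WHITESPACE_SENSITIVE = %w[pre code textarea script style].freeze

# Return the text as it would be compared. Whitespace-sensitive elements
# keep their text verbatim, regardless of the text_content option.
def comparable_text(element_name, text, text_content: :normalize)
  return text if WHITESPACE_SENSITIVE.include?(element_name.downcase)
  return text unless text_content == :normalize

  # Assumed normalization: trim, then collapse whitespace runs.
  text.strip.gsub(/\s+/, " ")
end

def text_equal?(element_name, left, right, text_content: :normalize)
  comparable_text(element_name, left, text_content: text_content) ==
    comparable_text(element_name, right, text_content: text_content)
end

indented = "\n  Line 1\n  Line 2\n"
flush    = "\nLine 1\nLine 2\n"

puts text_equal?("pre", indented, flush) # whitespace is compared exactly
puts text_equal?("p", indented, flush)   # whitespace is normalized first
```

With these inputs the `<pre>` comparison reports a difference, while the `<p>` comparison does not.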
1007
|
-
|
|
1008
|
-
=== Implementation
|
|
1009
|
-
|
|
1010
|
-
Whitespace-sensitive elements are detected automatically based on element name in the format adapters:
|
|
1011
|
-
|
|
1012
|
-
* link:../lib/canon/tree_diff/adapters/xml_adapter.rb[`XMLAdapter`] - Preserves original text
|
|
1013
|
-
* link:../lib/canon/tree_diff/adapters/html_adapter.rb[`HTMLAdapter`] - Preserves original text
|
|
1014
|
-
|
|
1015
|
-
The normalization decision happens in link:../lib/canon/tree_diff/operations/operation_detector.rb[`OperationDetector`] based on element name and match options.
|
|
1016
|
-
|
|
1017
|
-
== Comparison with DOM diff
|
|
1018
|
-
|
|
1019
|
-
[cols="1,2,2"]
|
|
1020
|
-
|===
|
|
1021
|
-
|Aspect |DOM diff (default) |Semantic tree diff
|
|
1022
|
-
|
|
1023
|
-
|Matching
|
|
1024
|
-
|Position-based with DOM structure
|
|
1025
|
-
|Similarity-based with tree operations
|
|
1026
|
-
|
|
1027
|
-
|Operations
|
|
1028
|
-
|Line-by-line changes
|
|
1029
|
-
|INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE
|
|
1030
|
-
|
|
1031
|
-
|Use case
|
|
1032
|
-
|Traditional diff output
|
|
1033
|
-
|Operation-level analysis
|
|
1034
|
-
|
|
1035
|
-
|Performance
|
|
1036
|
-
|Faster for large docs
|
|
1037
|
-
|Slower, but detects moves and other structural operations
|
|
1038
|
-
|
|
1039
|
-
|Move detection
|
|
1040
|
-
|No
|
|
1041
|
-
|Yes
|
|
1042
|
-
|
|
1043
|
-
|Rearrangement
|
|
1044
|
-
|Shows as delete + insert
|
|
1045
|
-
|Shows as MOVE
|
|
1046
|
-
|
|
1047
|
-
|Format support
|
|
1048
|
-
|XML, HTML, JSON, YAML
|
|
1049
|
-
|XML, HTML, JSON, YAML
|
|
1050
|
-
|
|
1051
|
-
|Output mode
|
|
1052
|
-
|Line-based or tree-based
|
|
1053
|
-
|Operation list
|
|
1054
|
-
|
|
1055
|
-
|Verbose mode
|
|
1056
|
-
|Required for detailed output
|
|
1057
|
-
|Required for operations
|
|
1058
|
-
|===
|
|
1059
|
-
|
|
1060
|
-
Both modes are first-class citizens in Canon. Choose based on your needs:
|
|
1061
|
-
|
|
1062
|
-
* Use **DOM diff** for traditional comparison and readable output
|
|
1063
|
-
* Use **tree diff** for operation analysis and move detection
|
|
1064
|
-
|
|
1065
|
-
== Research background
|
|
1066
|
-
|
|
1067
|
-
The semantic tree diff implementation is based on:
|
|
1068
|
-
|
|
1069
|
-
* **XDiff (2002)**: Minimum-cost edit distance with unordered tree model
|
|
1070
|
-
* **XyDiff/Cobena (2002)**: BULD algorithm with hash signatures and weights
|
|
1071
|
-
* **JATS-diff (2022)**: Semantic operations for text-centric XML
|
|
1072
|
-
|
|
1073
|
-
Key innovations in Canon's implementation:
|
|
1074
|
-
|
|
1075
|
-
* Hybrid matching pipeline combining hash, similarity, and propagation
|
|
1076
|
-
* Format adapters for XML, JSON, HTML, YAML
|
|
1077
|
-
* Integration with Canon's existing comparison architecture
|
|
1078
|
-
* Configurable similarity thresholds and matching phases
|
|
1079
|
-
|
|
1080
|
-
See `xmldiff-resources.md` in the repository for research paper details.
|