canon 0.1.8 → 0.1.9
- checksums.yaml +4 -4
- data/.rubocop_todo.yml +112 -25
- data/docs/Gemfile +1 -0
- data/docs/_config.yml +90 -1
- data/docs/advanced/diff-classification.adoc +82 -2
- data/docs/features/match-options/index.adoc +239 -1
- data/lib/canon/comparison/format_detector.rb +2 -1
- data/lib/canon/comparison/html_comparator.rb +19 -8
- data/lib/canon/comparison/html_compare_profile.rb +8 -2
- data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
- data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
- data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
- data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
- data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
- data/lib/canon/comparison/xml_comparator.rb +48 -23
- data/lib/canon/comparison/xml_node_comparison.rb +25 -3
- data/lib/canon/diff/diff_classifier.rb +101 -2
- data/lib/canon/diff/formatting_detector.rb +1 -1
- data/lib/canon/rspec_matchers.rb +37 -8
- data/lib/canon/version.rb +1 -1
- data/lib/canon/xml/data_model.rb +24 -13
- metadata +3 -78
- data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
- data/false_positive_analysis.txt +0 -0
- data/file1.html +0 -1
- data/file2.html +0 -1
- data/old-docs/ADVANCED_TOPICS.adoc +0 -20
- data/old-docs/BASIC_USAGE.adoc +0 -16
- data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
- data/old-docs/CLI.adoc +0 -497
- data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
- data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
- data/old-docs/DIFF_FORMATTING.adoc +0 -540
- data/old-docs/DIFF_PARAMETERS.adoc +0 -261
- data/old-docs/DOM_DIFF.adoc +0 -1017
- data/old-docs/ENV_CONFIG.adoc +0 -876
- data/old-docs/FORMATS.adoc +0 -867
- data/old-docs/INPUT_VALIDATION.adoc +0 -477
- data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
- data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
- data/old-docs/MATCH_OPTIONS.adoc +0 -912
- data/old-docs/MODES.adoc +0 -432
- data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
- data/old-docs/OPTIONS.adoc +0 -1387
- data/old-docs/PREPROCESSING.adoc +0 -491
- data/old-docs/README.old.adoc +0 -2831
- data/old-docs/RSPEC.adoc +0 -814
- data/old-docs/RUBY_API.adoc +0 -485
- data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
- data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
- data/old-docs/STRING_COMPARE.adoc +0 -345
- data/old-docs/TMP.adoc +0 -3384
- data/old-docs/TREE_DIFF.adoc +0 -1080
- data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
- data/old-docs/VERBOSE.adoc +0 -482
- data/old-docs/VISUALIZATION_MAP.adoc +0 -625
- data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
- data/scripts/analyze_current_state.rb +0 -85
- data/scripts/analyze_false_positives.rb +0 -114
- data/scripts/analyze_remaining_failures.rb +0 -105
- data/scripts/compare_current_failures.rb +0 -95
- data/scripts/compare_dom_tree_diff.rb +0 -158
- data/scripts/compare_failures.rb +0 -151
- data/scripts/debug_attribute_extraction.rb +0 -66
- data/scripts/debug_blocks_839.rb +0 -115
- data/scripts/debug_meta_matching.rb +0 -52
- data/scripts/debug_p_matching.rb +0 -192
- data/scripts/debug_signature_matching.rb +0 -118
- data/scripts/debug_sourcecode_124.rb +0 -32
- data/scripts/debug_whitespace_sensitive.rb +0 -192
- data/scripts/extract_false_positives.rb +0 -138
- data/scripts/find_actual_false_positives.rb +0 -125
- data/scripts/investigate_all_false_positives.rb +0 -161
- data/scripts/investigate_batch1.rb +0 -127
- data/scripts/investigate_classification.rb +0 -150
- data/scripts/investigate_classification_detailed.rb +0 -190
- data/scripts/investigate_common_failures.rb +0 -342
- data/scripts/investigate_false_negative.rb +0 -80
- data/scripts/investigate_false_positive.rb +0 -83
- data/scripts/investigate_false_positives.rb +0 -227
- data/scripts/investigate_false_positives_batch.rb +0 -163
- data/scripts/investigate_mixed_content.rb +0 -125
- data/scripts/investigate_remaining_16.rb +0 -214
- data/scripts/run_single_test.rb +0 -29
- data/scripts/test_all_false_positives.rb +0 -95
- data/scripts/test_attribute_details.rb +0 -61
- data/scripts/test_both_algorithms.rb +0 -49
- data/scripts/test_both_simple.rb +0 -49
- data/scripts/test_enhanced_semantic_output.rb +0 -125
- data/scripts/test_readme_examples.rb +0 -131
- data/scripts/test_semantic_tree_diff.rb +0 -99
- data/scripts/test_semantic_ux_improvements.rb +0 -135
- data/scripts/test_single_false_positive.rb +0 -119
- data/scripts/test_size_limits.rb +0 -99
- data/test_html_1.html +0 -21
- data/test_html_2.html +0 -21
- data/test_nokogiri.rb +0 -33
- data/test_normalize.rb +0 -45
|
@@ -1,765 +0,0 @@
|
|
|
1
|
-
= Semantic Tree Diff Algorithm
|
|
2
|
-
:toc:
|
|
3
|
-
:toclevels: 3
|
|
4
|
-
|
|
5
|
-
[WARNING]
|
|
6
|
-
The semantic tree diff algorithm is currently **experimental** and under active development. While functional and tested, the API and behavior may change in future releases. Use with caution in production environments.
|
|
7
|
-
|
|
8
|
-
== General
|
|
9
|
-
|
|
10
|
-
Canon provides two complementary diff algorithms:
|
|
11
|
-
|
|
12
|
-
* **DOM diff** (default): Stable, position-based comparison for traditional diff output
|
|
13
|
-
* **Semantic tree diff** (experimental): Advanced, similarity-based comparison with operation detection
|
|
14
|
-
|
|
15
|
-
This document provides a comprehensive guide to the semantic tree diff algorithm, including when to use it, how it works, and how to migrate from DOM diff.
|
|
16
|
-
|
|
17
|
-
== When to Use Semantic Tree Diff
|
|
18
|
-
|
|
19
|
-
=== Use Semantic Tree Diff When
|
|
20
|
-
|
|
21
|
-
* You need to detect high-level operations (INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE)
|
|
22
|
-
* Documents have significant rearrangement and you want to track moved content
|
|
23
|
-
* You need statistical analysis of changes (matching rates, confidence scores)
|
|
24
|
-
* You want operation-level analysis of document transformations
|
|
25
|
-
* You're comparing structured documents where content moves between positions
|
|
26
|
-
|
|
27
|
-
=== Use DOM Diff When
|
|
28
|
-
|
|
29
|
-
* You need stable, well-tested comparison for production use
|
|
30
|
-
* You want traditional line-by-line diff output
|
|
31
|
-
* Documents are similar in structure with minimal rearrangement
|
|
32
|
-
* You need maximum performance for large documents (>10,000 nodes)
|
|
33
|
-
* You want consistent, predictable behavior
|
|
34
|
-
|
|
35
|
-
=== Size Limits
|
|
36
|
-
|
|
37
|
-
Both algorithms have configurable size limits to prevent hangs on pathologically large files:
|
|
38
|
-
|
|
39
|
-
* **File size limit**: Default 5MB (configurable via `CANON_MAX_FILE_SIZE`)
|
|
40
|
-
* **Node count limit**: Default 10,000 nodes (configurable via `CANON_MAX_NODE_COUNT`)
|
|
41
|
-
* **Diff output limit**: Default 10,000 lines (configurable via `CANON_MAX_DIFF_LINES`)
|
|
42
|
-
|
|
43
|
-
See link:ENV_CONFIG.adoc#size-limits[ENV_CONFIG.adoc] for details on configuring these limits.
|
|
44
|
-
|
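The limits above could be resolved from the environment roughly as follows. This is a minimal sketch, not Canon's actual loader: the `size_limits` helper and `DEFAULT_LIMITS` constant are hypothetical names, and the file size is assumed to be expressed in bytes.

```ruby
# Hypothetical helper (not part of Canon): resolves the documented limits
# from the environment, falling back to the defaults listed above.
DEFAULT_LIMITS = {
  "CANON_MAX_FILE_SIZE"  => 5 * 1024 * 1024, # 5MB, assumed to be in bytes
  "CANON_MAX_NODE_COUNT" => 10_000,
  "CANON_MAX_DIFF_LINES" => 10_000,
}.freeze

# `env` defaults to the process environment; a plain Hash works for testing.
def size_limits(env = ENV)
  DEFAULT_LIMITS.to_h do |name, default|
    [name, Integer(env.fetch(name, default))]
  end
end
```

Passing a plain hash instead of `ENV` makes the fallback behaviour easy to check without touching the real environment.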
|
45
|
-
== Key Differences from DOM Diff
|
|
46
|
-
|
|
47
|
-
[cols="1,2,2"]
|
|
48
|
-
|===
|
|
49
|
-
|Aspect |DOM Diff |Semantic Tree Diff
|
|
50
|
-
|
|
51
|
-
|**Matching Strategy**
|
|
52
|
-
|Position-based with DOM structure
|
|
53
|
-
|Similarity-based with intelligent matching
|
|
54
|
-
|
|
55
|
-
|**Operation Detection**
|
|
56
|
-
|Line-level changes only
|
|
57
|
-
|INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE
|
|
58
|
-
|
|
59
|
-
|**Move Detection**
|
|
60
|
-
|No (shows as delete + insert)
|
|
61
|
-
|Yes (tracks content movement)
|
|
62
|
-
|
|
63
|
-
|**Rearrangement Handling**
|
|
64
|
-
|Poor (many false positives)
|
|
65
|
-
|Excellent (intelligent matching)
|
|
66
|
-
|
|
67
|
-
|**Performance**
|
|
68
|
-
|O(n) - faster for large docs
|
|
69
|
-
|O(n²) worst case - slower but smarter
|
|
70
|
-
|
|
71
|
-
|**Maturity**
|
|
72
|
-
|Stable, production-ready
|
|
73
|
-
|Experimental, under development
|
|
74
|
-
|
|
75
|
-
|**Output Format**
|
|
76
|
-
|Line-by-line or tree diff
|
|
77
|
-
|Operation list with metadata
|
|
78
|
-
|
|
79
|
-
|**Best For**
|
|
80
|
-
|Traditional diff needs
|
|
81
|
-
|Operation analysis, restructuring
|
|
82
|
-
|===
|
|
83
|
-
|
|
84
|
-
== How Semantic Tree Diff Works
|
|
85
|
-
|
|
86
|
-
=== Three-Phase Matching Pipeline
|
|
87
|
-
|
|
88
|
-
The semantic tree diff uses a hybrid matching algorithm:
|
|
89
|
-
|
|
90
|
-
==== Phase 1: Hash-Based Exact Matching
|
|
91
|
-
|
|
92
|
-
Matches nodes with identical structure and content using hash signatures.
|
|
93
|
-
|
|
94
|
-
* Fast O(n) exact matching
|
|
95
|
-
* Eliminates unchanged subtrees
|
|
96
|
-
* Reduces problem size for subsequent phases
|
|
97
|
-
|
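The idea behind Phase 1 can be sketched in plain Ruby. The `Node` struct, the SHA-256 signature, and the `exact_matches` helper are assumptions made for illustration; the source does not describe how Canon computes its hash signatures.

```ruby
require "digest"

# Hypothetical node type for illustration; Canon's real tree nodes differ.
Node = Struct.new(:name, :attributes, :text, :children) do
  # The signature covers the name, sorted attributes, text, and the child
  # signatures, so equal signatures mean identical subtrees (barring hash
  # collisions). It is recomputed on every call here for brevity; a real
  # implementation would cache it to keep the pass linear.
  def signature
    Digest::SHA256.hexdigest(
      [name, attributes.sort, text, children.map(&:signature)].inspect
    )
  end
end

# Depth-first walk over a subtree.
def each_node(node, &block)
  yield node
  node.children.each { |child| each_node(child, &block) }
end

# Index tree1 by signature, then look up every node of tree2.
def exact_matches(tree1, tree2)
  index = Hash.new { |hash, key| hash[key] = [] }
  each_node(tree1) { |node| index[node.signature] << node }

  matches = []
  each_node(tree2) do |node|
    candidate = index[node.signature].shift
    matches << [candidate, node] if candidate
  end
  matches
end
```

Nodes left unmatched after this pass are the only ones the later phases need to consider.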
|
98
|
-
==== Phase 2: Similarity-Based Matching
|
|
99
|
-
|
|
100
|
-
Matches similar but not identical nodes using weighted similarity scoring.
|
|
101
|
-
|
|
102
|
-
* Compares node names, attributes, text content, and structure
|
|
103
|
-
* Configurable similarity threshold (default: 0.95)
|
|
104
|
-
* Uses weighted similarity metrics
|
|
105
|
-
|
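A weighted similarity score of the kind described in Phase 2 might look like the sketch below. The weights, the use of Jaccard overlap, and the hash-based node shape are all assumptions for illustration; only the default threshold of 0.95 comes from the text above.

```ruby
require "set"

# Illustrative weights only; the source does not state Canon's actual values.
WEIGHTS = { name: 0.3, attributes: 0.2, text: 0.4, structure: 0.1 }.freeze

# Jaccard overlap of two collections; two empty collections count as identical.
def jaccard(a, b)
  a = a.to_set
  b = b.to_set
  return 1.0 if a.empty? && b.empty?

  (a & b).size.to_f / (a | b).size
end

# Nodes are plain hashes here:
#   { name: String, attributes: Hash, text: String, children: [child names] }
def similarity(n1, n2)
  scores = {
    name:       n1[:name] == n2[:name] ? 1.0 : 0.0,
    attributes: jaccard(n1[:attributes].to_a, n2[:attributes].to_a),
    text:       jaccard(n1[:text].split, n2[:text].split),
    structure:  jaccard(n1[:children], n2[:children]),
  }
  WEIGHTS.sum { |key, weight| weight * scores[key] }
end

# Two nodes match when their weighted score reaches the threshold.
def match?(n1, n2, threshold: 0.95)
  similarity(n1, n2) >= threshold
end
```

Lowering `threshold` lets nodes with larger text or attribute differences pair up, which is the trade-off the `similarity_threshold` option exposes.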
|
106
|
-
==== Phase 3: Structural Propagation
|
|
107
|
-
|
|
108
|
-
Propagates matches from parents and children to improve quality.
|
|
109
|
-
|
|
110
|
-
* Top-down propagation from matched parents
|
|
111
|
-
* Bottom-up propagation from matched children
|
|
112
|
-
* Resolves ambiguous matches
|
|
113
|
-
|
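The top-down half of Phase 3 can be illustrated with a small sketch. The `TreeNode` struct and `propagate_down` helper are hypothetical, and pairing children by element name in document order is an assumed heuristic; bottom-up propagation and ambiguity resolution are not shown.

```ruby
require "set"

# Hypothetical minimal node: an element name plus child nodes.
TreeNode = Struct.new(:name, :children)

# For every matched parent pair, pair up their still-unmatched children that
# share an element name, in document order. Newly matched pairs are queued so
# the matches keep propagating down the tree.
def propagate_down(matches)
  # Identity-based sets: Struct equality compares members, which would
  # confuse two distinct nodes that happen to look the same.
  used_old = Set.new.compare_by_identity
  used_new = Set.new.compare_by_identity
  matches.each do |old, new|
    used_old << old
    used_new << new
  end

  queue = matches.dup
  result = matches.dup
  until queue.empty?
    old_parent, new_parent = queue.shift
    old_parent.children.each do |old_child|
      next if used_old.include?(old_child)

      partner = new_parent.children.find do |new_child|
        !used_new.include?(new_child) && new_child.name == old_child.name
      end
      next unless partner

      used_old << old_child
      used_new << partner
      result << [old_child, partner]
      queue << [old_child, partner]
    end
  end
  result
end
```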
|
114
|
-
=== Operation Detection
|
|
115
|
-
|
|
116
|
-
After matching, the algorithm detects eight operation types:
|
|
117
|
-
|
|
118
|
-
**Level 1 - Basic Operations:**
|
|
119
|
-
|
|
120
|
-
* `INSERT`: New node added
|
|
121
|
-
* `DELETE`: Node removed
|
|
122
|
-
* `UPDATE`: Node content/attributes changed
|
|
123
|
-
|
|
124
|
-
**Level 2 - Structural Operations:**
|
|
125
|
-
|
|
126
|
-
* `MOVE`: Node relocated to different position
|
|
127
|
-
|
|
128
|
-
**Level 3 - Semantic Operations:**
|
|
129
|
-
|
|
130
|
-
* `MERGE`: Multiple nodes combined into one
|
|
131
|
-
* `SPLIT`: One node divided into multiple
|
|
132
|
-
* `UPGRADE`: Node promoted to higher level (decreased depth)
|
|
133
|
-
* `DOWNGRADE`: Node demoted to lower level (increased depth)
|
|
134
|
-
|
|
135
|
-
See link:TREE_DIFF.adoc#detected-operations[TREE_DIFF.adoc] for detailed examples of each operation.
|
|
136
|
-
|
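How a matched pair might be turned into operations is sketched below. The `NodeInfo` struct and the precedence between depth changes and position changes are assumptions for illustration; MERGE and SPLIT need many-to-one matches and are left out.

```ruby
# Hypothetical record for one side of a matched pair: a content signature,
# the depth in the tree, and the path of child indexes from the root.
NodeInfo = Struct.new(:signature, :depth, :path)

# Classify a matched pair. Unmatched nodes would be DELETE (old side only)
# or INSERT (new side only) and never reach this method.
def classify(old, new)
  ops = []
  ops << :UPDATE if old.signature != new.signature
  if new.depth < old.depth
    ops << :UPGRADE   # promoted: decreased depth
  elsif new.depth > old.depth
    ops << :DOWNGRADE # demoted: increased depth
  elsif old.path != new.path
    ops << :MOVE      # same depth, different position
  end
  ops
end
```

An empty result means the pair is unchanged and produces no operation.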
|
137
|
-
== Metadata Elements
|
|
138
|
-
|
|
139
|
-
Metadata elements are presentation/formatting elements that don't affect semantic equivalence. The semantic tree diff automatically treats these as **informative** (non-normative) differences.
|
|
140
|
-
|
|
141
|
-
=== Metadata Element List
|
|
142
|
-
|
|
143
|
-
The following elements are classified as metadata:
|
|
144
|
-
|
|
145
|
-
* `semx` - Semantic markup
|
|
146
|
-
* `fmt-concept` - Formatted concept
|
|
147
|
-
* `fmt-name` - Formatted name
|
|
148
|
-
* `fmt-title` - Formatted title
|
|
149
|
-
* `fmt-xref` - Formatted cross-reference
|
|
150
|
-
* `fmt-eref` - Formatted external reference
|
|
151
|
-
* `fmt-termref` - Formatted term reference
|
|
152
|
-
* `fmt-element-name` - Formatted element name
|
|
153
|
-
* `fmt-link` - Formatted link
|
|
154
|
-
* `autonum` - Automatic numbering
|
|
155
|
-
|
|
156
|
-
=== Classification Behavior
|
|
157
|
-
|
|
158
|
-
When a difference involves a metadata element:
|
|
159
|
-
|
|
160
|
-
* **INSERT/DELETE of metadata element**: Marked as informative
|
|
161
|
-
* **UPDATE of metadata element**: Marked as informative
|
|
162
|
-
* **MOVE of metadata element**: Marked as informative
|
|
163
|
-
|
|
164
|
-
.Example: Metadata element differences are informative
|
|
165
|
-
[example]
|
|
166
|
-
====
|
|
167
|
-
[source,xml]
|
|
168
|
-
----
|
|
169
|
-
<!-- Document 1 -->
|
|
170
|
-
<section>
|
|
171
|
-
<title>Introduction</title>
|
|
172
|
-
<autonum>1.1</autonum>
|
|
173
|
-
</section>
|
|
174
|
-
|
|
175
|
-
<!-- Document 2 -->
|
|
176
|
-
<section>
|
|
177
|
-
<title>Introduction</title>
|
|
178
|
-
<autonum>1.2</autonum> <!-- Changed autonum -->
|
|
179
|
-
</section>
|
|
180
|
-
----
|
|
181
|
-
|
|
182
|
-
The change to `<autonum>` is detected as an UPDATE but marked as **informative** because `autonum` is a metadata element. The documents are considered semantically equivalent.
|
|
183
|
-
====
|
|
184
|
-
|
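The classification rule can be expressed as a simple lookup. The element names are copied from the list above; the helper names are hypothetical and this is a standalone sketch rather than Canon's `OperationConverter`. It also ignores match dimensions, which can make non-metadata differences informative as well.

```ruby
# Element names copied from the metadata element list above.
METADATA_ELEMENTS = %w[
  semx fmt-concept fmt-name fmt-title fmt-xref fmt-eref
  fmt-termref fmt-element-name fmt-link autonum
].freeze

def metadata_element?(element_name)
  METADATA_ELEMENTS.include?(element_name)
end

# Any operation on a metadata element is informative; everything else is
# treated as normative in this simplified sketch.
def classify_difference(operation, element_name)
  status = metadata_element?(element_name) ? :informative : :normative
  { operation: operation, element: element_name, status: status }
end
```

For the example above, `classify_difference(:UPDATE, "autonum")` yields an informative difference, while the same change on `title` would be normative.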
|
185
|
-
=== Configuring Metadata Elements
|
|
186
|
-
|
|
187
|
-
The metadata element list is defined in `OperationConverter::METADATA_ELEMENTS` (link:lib/canon/tree_diff/operation_converter.rb[lib/canon/tree_diff/operation_converter.rb], line 33).
|
|
188
|
-
|
|
189
|
-
To add custom metadata elements for your domain, you would need to:
|
|
190
|
-
|
|
191
|
-
1. Fork Canon and modify `METADATA_ELEMENTS`
|
|
192
|
-
2. Or set the relevant match dimension to `:ignore` so that those element changes are classified as informative
|
|
193
|
-
|
|
194
|
-
== Whitespace-Sensitive Elements
|
|
195
|
-
|
|
196
|
-
Certain elements preserve whitespace as semantically significant. For these elements, whitespace is **not normalized** even when `text_content: :normalize` is set.
|
|
197
|
-
|
|
198
|
-
=== Whitespace-Sensitive Element List
|
|
199
|
-
|
|
200
|
-
* `pre` - Preformatted text
|
|
201
|
-
* `code` - Code blocks
|
|
202
|
-
* `textarea` - Text input areas
|
|
203
|
-
* `script` - JavaScript code
|
|
204
|
-
* `style` - CSS styles
|
|
205
|
-
|
|
206
|
-
=== Whitespace Handling
|
|
207
|
-
|
|
208
|
-
For whitespace-sensitive elements:
|
|
209
|
-
|
|
210
|
-
* Text content is preserved exactly as-is
|
|
211
|
-
* Normalization does NOT apply
|
|
212
|
-
* Whitespace differences are **normative** (must match exactly)
|
|
213
|
-
|
|
214
|
-
.Example: Whitespace in `<pre>` is significant
|
|
215
|
-
[example]
|
|
216
|
-
====
|
|
217
|
-
[source,html]
|
|
218
|
-
----
|
|
219
|
-
<!-- Document 1 -->
|
|
220
|
-
<pre>
|
|
221
|
-
Line 1
|
|
222
|
-
Line 2
|
|
223
|
-
</pre>
|
|
224
|
-
|
|
225
|
-
<!-- Document 2 -->
|
|
226
|
-
<pre>
|
|
227
|
-
Line 1
|
|
228
|
-
Line 2
|
|
229
|
-
</pre>
|
|
230
|
-
----
|
|
231
|
-
|
|
232
|
-
These are NOT equivalent because `<pre>` preserves whitespace. The leading spaces in Document 1 are semantically significant.
|
|
233
|
-
====
|
|
234
|
-
|
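The effect of the whitespace-sensitive list on text comparison can be sketched as follows. The helper names are hypothetical, and collapsing runs of whitespace is an assumed stand-in for what `text_content: :normalize` does.

```ruby
# Element names copied from the whitespace-sensitive element list above.
WHITESPACE_SENSITIVE = %w[pre code textarea script style].freeze

# Collapse runs of whitespace and trim, unless the element preserves
# whitespace, in which case the text is compared exactly as written.
def comparable_text(element_name, text)
  return text if WHITESPACE_SENSITIVE.include?(element_name)

  text.gsub(/\s+/, " ").strip
end

def text_equivalent?(element_name, text1, text2)
  comparable_text(element_name, text1) == comparable_text(element_name, text2)
end
```

With this rule, leading spaces are ignored inside a `p` element but make two `pre` elements differ.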
|
235
|
-
=== Configuration
|
|
236
|
-
|
|
237
|
-
Whitespace-sensitive elements are detected automatically based on element name. No configuration is required.
|
|
238
|
-
|
|
239
|
-
If you need to treat other elements as whitespace-sensitive, you would need to modify the adapter code.
|
|
240
|
-
|
|
241
|
-
== Normative vs Informative Classification
|
|
242
|
-
|
|
243
|
-
The semantic tree diff integrates with Canon's normative/informative diff architecture. Operations are classified based on match dimensions:
|
|
244
|
-
|
|
245
|
-
=== Dimension Mapping
|
|
246
|
-
|
|
247
|
-
[cols="1,1,2"]
|
|
248
|
-
|===
|
|
249
|
-
|Match Dimension |Operation Types |Effect When `:ignore`
|
|
250
|
-
|
|
251
|
-
|`text_content`
|
|
252
|
-
|UPDATE (text changes)
|
|
253
|
-
|Text differences → informative
|
|
254
|
-
|
|
255
|
-
|`attribute_values`
|
|
256
|
-
|UPDATE (attribute changes)
|
|
257
|
-
|Attribute value differences → informative
|
|
258
|
-
|
|
259
|
-
|`attribute_order`
|
|
260
|
-
|UPDATE (attribute reordering)
|
|
261
|
-
|Attribute order differences → informative
|
|
262
|
-
|
|
263
|
-
|`element_position`
|
|
264
|
-
|MOVE
|
|
265
|
-
|Position changes → informative
|
|
266
|
-
|
|
267
|
-
|`element_hierarchy`
|
|
268
|
-
|UPGRADE, DOWNGRADE
|
|
269
|
-
|Hierarchy changes → informative
|
|
270
|
-
|
|
271
|
-
|`element_structure`
|
|
272
|
-
|INSERT, DELETE, MERGE, SPLIT
|
|
273
|
-
|Always normative (structural changes)
|
|
274
|
-
|===
|
|
275
|
-
|
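The table can be read as a lookup from operation to match dimension. The sketch below transcribes it; the symbol names for the three kinds of UPDATE and the `normative?` helper are hypothetical.

```ruby
# Operation -> match dimension, transcribed from the table above.
# UPDATE is split by what changed, since it maps to three dimensions.
DIMENSION_FOR = {
  update_text:            :text_content,
  update_attribute_value: :attribute_values,
  update_attribute_order: :attribute_order,
  move:                   :element_position,
  upgrade:                :element_hierarchy,
  downgrade:              :element_hierarchy,
  insert:                 :element_structure,
  delete:                 :element_structure,
  merge:                  :element_structure,
  split:                  :element_structure,
}.freeze

# An operation is informative only when its dimension is set to :ignore;
# structural operations stay normative regardless of the options.
def normative?(operation, match_options)
  dimension = DIMENSION_FOR.fetch(operation)
  return true if dimension == :element_structure

  match_options[dimension] != :ignore
end
```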
|
276
|
-
=== Example: Ignoring Position Changes
|
|
277
|
-
|
|
278
|
-
[source,ruby]
|
|
279
|
-
----
|
|
280
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
281
|
-
verbose: true,
|
|
282
|
-
diff_algorithm: :semantic,
|
|
283
|
-
match: {
|
|
284
|
-
element_position: :ignore # MOVE operations → informative
|
|
285
|
-
}
|
|
286
|
-
)
|
|
287
|
-
|
|
288
|
-
# MOVE operations are now informative; check whether any normative differences remain
|
|
289
|
-
has_structural_changes = result.differences.any?(&:normative?)
|
|
290
|
-
----
|
|
291
|
-
|
|
292
|
-
See link:NORMATIVE_INFORMATIVE_DIFFS.adoc[NORMATIVE_INFORMATIVE_DIFFS.adoc] for details on the normative/informative architecture.
|
|
293
|
-
|
|
294
|
-
== Configuration
|
|
295
|
-
|
|
296
|
-
=== Enabling Semantic Tree Diff
|
|
297
|
-
|
|
298
|
-
Set `diff_algorithm: :semantic`:
|
|
299
|
-
|
|
300
|
-
[source,ruby]
|
|
301
|
-
----
|
|
302
|
-
require 'canon/comparison'
|
|
303
|
-
|
|
304
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
305
|
-
verbose: true,
|
|
306
|
-
diff_algorithm: :semantic
|
|
307
|
-
)
|
|
308
|
-
----
|
|
309
|
-
|
|
310
|
-
=== Via Environment Variables
|
|
311
|
-
|
|
312
|
-
[source,bash]
|
|
313
|
-
----
|
|
314
|
-
# Set globally
|
|
315
|
-
export CANON_ALGORITHM=semantic
|
|
316
|
-
|
|
317
|
-
# Or format-specific
|
|
318
|
-
export CANON_XML_DIFF_ALGORITHM=semantic
|
|
319
|
-
export CANON_HTML_DIFF_ALGORITHM=semantic
|
|
320
|
-
----
|
|
321
|
-
|
|
322
|
-
See link:ENV_CONFIG.adoc#algorithm-selection[ENV_CONFIG.adoc] for details.
|
|
323
|
-
|
|
324
|
-
=== Configuration Options
|
|
325
|
-
|
|
326
|
-
==== similarity_threshold
|
|
327
|
-
|
|
328
|
-
Minimum similarity score for matching nodes (default: 0.95).
|
|
329
|
-
|
|
330
|
-
[source,ruby]
|
|
331
|
-
----
|
|
332
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
333
|
-
diff_algorithm: :semantic,
|
|
334
|
-
match: {
|
|
335
|
-
similarity_threshold: 0.90 # More lenient matching
|
|
336
|
-
}
|
|
337
|
-
)
|
|
338
|
-
----
|
|
339
|
-
|
|
340
|
-
* Higher values (0.99): Very conservative, only nearly identical nodes match
|
|
341
|
-
* Lower values (0.80): More aggressive, allows less similar nodes to match
|
|
342
|
-
* Default (0.95): Balanced for most use cases
|
|
343
|
-
|
|
344
|
-
==== hash_matching
|
|
345
|
-
|
|
346
|
-
Enable/disable exact hash matching phase (default: true).
|
|
347
|
-
|
|
348
|
-
[source,ruby]
|
|
349
|
-
----
|
|
350
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
351
|
-
diff_algorithm: :semantic,
|
|
352
|
-
match: {
|
|
353
|
-
hash_matching: false # Disable exact matching
|
|
354
|
-
}
|
|
355
|
-
)
|
|
356
|
-
----
|
|
357
|
-
|
|
358
|
-
Disable only if exact matching causes issues.
|
|
359
|
-
|
|
360
|
-
==== similarity_matching
|
|
361
|
-
|
|
362
|
-
Enable/disable similarity-based matching phase (default: true).
|
|
363
|
-
|
|
364
|
-
[source,ruby]
|
|
365
|
-
----
|
|
366
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
367
|
-
diff_algorithm: :semantic,
|
|
368
|
-
match: {
|
|
369
|
-
similarity_matching: false # Use only exact matching
|
|
370
|
-
}
|
|
371
|
-
)
|
|
372
|
-
----
|
|
373
|
-
|
|
374
|
-
Disable for faster but less accurate matching.
|
|
375
|
-
|
|
376
|
-
==== propagation
|
|
377
|
-
|
|
378
|
-
Enable/disable structural propagation phase (default: true).
|
|
379
|
-
|
|
380
|
-
[source,ruby]
|
|
381
|
-
----
|
|
382
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
383
|
-
diff_algorithm: :semantic,
|
|
384
|
-
match: {
|
|
385
|
-
propagation: false # Disable propagation
|
|
386
|
-
}
|
|
387
|
-
)
|
|
388
|
-
----
|
|
389
|
-
|
|
390
|
-
Disable for simpler but potentially less accurate results.
|
|
391
|
-
|
|
392
|
-
== Known Limitations
|
|
393
|
-
|
|
394
|
-
=== Performance on Large Documents
|
|
395
|
-
|
|
396
|
-
The semantic tree diff has O(n²) worst-case complexity in the similarity matching phase.
|
|
397
|
-
|
|
398
|
-
**Workarounds:**
|
|
399
|
-
|
|
400
|
-
* Use size limits to prevent hangs (see link:ENV_CONFIG.adoc#size-limits[ENV_CONFIG.adoc])
|
|
401
|
-
* Disable `similarity_matching` if exact matches suffice
|
|
402
|
-
* Increase `similarity_threshold` to reduce candidate matches
|
|
403
|
-
* Use DOM diff for documents >10,000 nodes
|
|
404
|
-
|
|
405
|
-
=== Attribute Order Detection
|
|
406
|
-
|
|
407
|
-
INSERT/DELETE pairs that differ only in attribute order are detected and reclassified, but this detection has limitations:
|
|
408
|
-
|
|
409
|
-
* Only works for single-element differences
|
|
410
|
-
* Requires exact content match
|
|
411
|
-
* May miss complex reorderings
|
|
412
|
-
|
|
413
|
-
**Workaround:**
|
|
414
|
-
|
|
415
|
-
Use `attribute_order: :ignore` in match options to treat all attribute order differences as informative.
|
|
416
|
-
|
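The reclassification check described above could be approximated like this. The `Element` struct and the helper are hypothetical; the sketch covers only the single-element, exact-content case that the limitations mention.

```ruby
# Hypothetical element record; attributes are [name, value] pairs kept in
# document order so that ordering differences are visible.
Element = Struct.new(:name, :attributes, :text)

# True when a DELETE/INSERT pair differs only in the order of its attributes:
# same element name, same text, same attribute set, different sequence.
def attribute_order_only?(deleted, inserted)
  deleted.name == inserted.name &&
    deleted.text == inserted.text &&
    deleted.attributes != inserted.attributes &&
    deleted.attributes.sort == inserted.attributes.sort
end
```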
|
417
|
-
=== Deep Hierarchy Changes
|
|
418
|
-
|
|
419
|
-
UPGRADE/DOWNGRADE detection requires similar content and element names. Complex restructuring may be reported as DELETE + INSERT instead.
|
|
420
|
-
|
|
421
|
-
**Workaround:**
|
|
422
|
-
|
|
423
|
-
Adjust `similarity_threshold` to allow more lenient matching.
|
|
424
|
-
|
|
425
|
-
=== Format-Specific Limitations
|
|
426
|
-
|
|
427
|
-
==== XML/HTML
|
|
428
|
-
|
|
429
|
-
* Namespace changes may not be detected correctly
|
|
430
|
-
* Mixed content (text + elements) may cause false positives
|
|
431
|
-
* Comment handling depends on `comments` match dimension
|
|
432
|
-
|
|
433
|
-
==== JSON/YAML
|
|
434
|
-
|
|
435
|
-
* Array reordering is detected as MOVE but may be noisy
|
|
436
|
-
* Type changes (string → number) are always normative
|
|
437
|
-
* Null vs missing key differences are structural
|
|
438
|
-
|
|
439
|
-
== Troubleshooting
|
|
440
|
-
|
|
441
|
-
=== Too Many False Positives (Over-Matching)
|
|
442
|
-
|
|
443
|
-
**Symptoms:**
|
|
444
|
-
|
|
445
|
-
* Unrelated nodes are matched
|
|
446
|
-
* Many UPDATE operations instead of INSERT/DELETE
|
|
447
|
-
* Operations seem incorrect
|
|
448
|
-
|
|
449
|
-
**Solutions:**
|
|
450
|
-
|
|
451
|
-
[source,ruby]
|
|
452
|
-
----
|
|
453
|
-
# Increase threshold for stricter matching
|
|
454
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
455
|
-
diff_algorithm: :semantic,
|
|
456
|
-
match: {
|
|
457
|
-
similarity_threshold: 0.98 # Was 0.95
|
|
458
|
-
}
|
|
459
|
-
)
|
|
460
|
-
|
|
461
|
-
# Or disable similarity matching entirely
|
|
462
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
463
|
-
diff_algorithm: :semantic,
|
|
464
|
-
match: {
|
|
465
|
-
similarity_matching: false # Use only exact matches
|
|
466
|
-
}
|
|
467
|
-
)
|
|
468
|
-
----
|
|
469
|
-
|
|
470
|
-
=== Too Few Matches (Under-Matching)
|
|
471
|
-
|
|
472
|
-
**Symptoms:**
|
|
473
|
-
|
|
474
|
-
* Similar content shows as DELETE + INSERT
|
|
475
|
-
* No MOVE operations detected
|
|
476
|
-
* Low match rate in statistics
|
|
477
|
-
|
|
478
|
-
**Solutions:**
|
|
479
|
-
|
|
480
|
-
[source,ruby]
|
|
481
|
-
----
|
|
482
|
-
# Decrease threshold for more lenient matching
|
|
483
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
484
|
-
diff_algorithm: :semantic,
|
|
485
|
-
match: {
|
|
486
|
-
similarity_threshold: 0.85 # Was 0.95
|
|
487
|
-
}
|
|
488
|
-
)
|
|
489
|
-
|
|
490
|
-
# Ensure all matching phases are enabled
|
|
491
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
492
|
-
diff_algorithm: :semantic,
|
|
493
|
-
match: {
|
|
494
|
-
hash_matching: true,
|
|
495
|
-
similarity_matching: true,
|
|
496
|
-
propagation: true
|
|
497
|
-
}
|
|
498
|
-
)
|
|
499
|
-
----
|
|
500
|
-
|
|
501
|
-
=== Performance Issues
|
|
502
|
-
|
|
503
|
-
**Symptoms:**
|
|
504
|
-
|
|
505
|
-
* Comparison hangs or is very slow
|
|
506
|
-
* High memory usage
|
|
507
|
-
* CPU pegs at 100%
|
|
508
|
-
|
|
509
|
-
**Solutions:**
|
|
510
|
-
|
|
511
|
-
[source,bash]
|
|
512
|
-
----
|
|
513
|
-
# Lower the file size and node count limits for this run
|
|
514
|
-
export CANON_XML_DIFF_MAX_FILE_SIZE=1048576 # 1MB limit
|
|
515
|
-
export CANON_XML_DIFF_MAX_NODE_COUNT=5000 # 5,000 nodes
|
|
516
|
-
bundle exec rspec
|
|
517
|
-
----
|
|
518
|
-
|
|
519
|
-
Or disable expensive phases:
|
|
520
|
-
|
|
521
|
-
[source,ruby]
|
|
522
|
-
----
|
|
523
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
524
|
-
diff_algorithm: :semantic,
|
|
525
|
-
match: {
|
|
526
|
-
similarity_matching: false, # Skip if exact matches suffice
|
|
527
|
-
propagation: false # Skip if not needed
|
|
528
|
-
}
|
|
529
|
-
)
|
|
530
|
-
----
|
|
531
|
-
|
|
532
|
-
Or switch to DOM diff for large files:
|
|
533
|
-
|
|
534
|
-
[source,ruby]
|
|
535
|
-
----
|
|
536
|
-
# Conditionally use semantic diff only for smaller files
|
|
537
|
-
algorithm = doc1.size > 100_000 ? :dom : :semantic
|
|
538
|
-
|
|
539
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
540
|
-
diff_algorithm: algorithm
|
|
541
|
-
)
|
|
542
|
-
----
|
|
543
|
-
|
|
544
|
-
=== Incorrect MOVE Detection
|
|
545
|
-
|
|
546
|
-
**Symptoms:**
|
|
547
|
-
|
|
548
|
-
* Elements marked as MOVE that didn't actually move
|
|
549
|
-
* Missing MOVE operations for elements that did move
|
|
550
|
-
|
|
551
|
-
**Solutions:**
|
|
552
|
-
|
|
553
|
-
MOVE is detected when matched nodes have different positions. Verify:
|
|
554
|
-
|
|
555
|
-
1. Nodes are truly similar (check attributes and content)
|
|
556
|
-
2. `similarity_threshold` is appropriate
|
|
557
|
-
3. Preprocessing normalizes content consistently
|
|
558
|
-
|
|
559
|
-
[source,ruby]
|
|
560
|
-
----
|
|
561
|
-
# Use preprocessing to normalize before comparison
|
|
562
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
563
|
-
diff_algorithm: :semantic,
|
|
564
|
-
preprocessing: :c14n # Canonicalize first
|
|
565
|
-
)
|
|
566
|
-
----
|
|
567
|
-
|
|
568
|
-
=== Metadata Elements Not Classified Correctly
|
|
569
|
-
|
|
570
|
-
**Symptoms:**
|
|
571
|
-
|
|
572
|
-
* Metadata changes marked as normative
|
|
573
|
-
* Expected informative diffs show as must-fix
|
|
574
|
-
|
|
575
|
-
**Solutions:**
|
|
576
|
-
|
|
577
|
-
Check whether the element is in the `METADATA_ELEMENTS` list (link:lib/canon/tree_diff/operation_converter.rb[lib/canon/tree_diff/operation_converter.rb], line 33). If not, either:
|
|
578
|
-
|
|
579
|
-
1. Add it to the list (requires code change)
|
|
580
|
-
2. Set a match dimension to `:ignore` so the change is classified as informative:
|
|
581
|
-
|
|
582
|
-
[source,ruby]
|
|
583
|
-
----
|
|
584
|
-
# Treat all text content differences as informative (applies to every element, not only metadata)
|
|
585
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
586
|
-
diff_algorithm: :semantic,
|
|
587
|
-
match: {
|
|
588
|
-
text_content: :ignore # All text differences → informative
|
|
589
|
-
}
|
|
590
|
-
)
|
|
591
|
-
----
|
|
592
|
-
|
|
593
|
-
== Migration Guide
|
|
594
|
-
|
|
595
|
-
=== From DOM Diff to Semantic Tree Diff
|
|
596
|
-
|
|
597
|
-
==== Step 1: Enable Semantic Diff in Tests
|
|
598
|
-
|
|
599
|
-
Start with a small subset of tests:
|
|
600
|
-
|
|
601
|
-
[source,ruby]
|
|
602
|
-
----
|
|
603
|
-
# spec/spec_helper.rb
|
|
604
|
-
RSpec.configure do |config|
|
|
605
|
-
config.around(:each, semantic_diff: true) do |example|
|
|
606
|
-
# Temporarily enable semantic diff for tagged tests
|
|
607
|
-
old_algo = Canon::Config.instance.xml.diff.algorithm
|
|
608
|
-
Canon::Config.instance.xml.diff.algorithm = :semantic
|
|
609
|
-
|
|
610
|
-
example.run
|
|
611
|
-
|
|
612
|
-
Canon::Config.instance.xml.diff.algorithm = old_algo
|
|
613
|
-
end
|
|
614
|
-
end
|
|
615
|
-
|
|
616
|
-
# In test file
|
|
617
|
-
RSpec.describe 'XML comparison', :semantic_diff do
|
|
618
|
-
it 'detects moves correctly' do
|
|
619
|
-
expect(actual).to be_xml_equivalent_to(expected)
|
|
620
|
-
end
|
|
621
|
-
end
|
|
622
|
-
----
|
|
623
|
-
|
|
624
|
-
==== Step 2: Compare Outputs
|
|
625
|
-
|
|
626
|
-
Run tests with both algorithms to compare:
|
|
627
|
-
|
|
628
|
-
[source,bash]
|
|
629
|
-
----
|
|
630
|
-
# Run with DOM diff (baseline)
|
|
631
|
-
CANON_ALGORITHM=dom bundle exec rspec > dom_output.txt
|
|
632
|
-
|
|
633
|
-
# Run with semantic diff
|
|
634
|
-
CANON_ALGORITHM=semantic bundle exec rspec > semantic_output.txt
|
|
635
|
-
|
|
636
|
-
# Compare outputs
|
|
637
|
-
diff dom_output.txt semantic_output.txt
|
|
638
|
-
----
|
|
639
|
-
|
|
640
|
-
==== Step 3: Adjust Match Options
|
|
641
|
-
|
|
642
|
-
Tune `similarity_threshold` and match dimensions for your use case:
|
|
643
|
-
|
|
644
|
-
[source,ruby]
|
|
645
|
-
----
|
|
646
|
-
# Start conservative, gradually relax
|
|
647
|
-
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
648
|
-
diff_algorithm: :semantic,
|
|
649
|
-
match: {
|
|
650
|
-
similarity_threshold: 0.98, # Start high
|
|
651
|
-
element_position: :ignore, # Ignore moves initially
|
|
652
|
-
}
|
|
653
|
-
)
|
|
654
|
-
|
|
655
|
-
# Review results, adjust as needed
|
|
656
|
-
----
|
|
657
|
-
|
|
658
|
-
==== Step 4: Handle Format-Specific Issues
|
|
659
|
-
|
|
660
|
-
**For XML:**
|
|
661
|
-
|
|
662
|
-
[source,ruby]
|
|
663
|
-
----
|
|
664
|
-
# Normalize namespaces
|
|
665
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
666
|
-
diff_algorithm: :semantic,
|
|
667
|
-
preprocessing: :c14n
|
|
668
|
-
)
|
|
669
|
-
----
|
|
670
|
-
|
|
671
|
-
**For HTML:**
|
|
672
|
-
|
|
673
|
-
[source,ruby]
|
|
674
|
-
----
|
|
675
|
-
# Ignore presentation differences
|
|
676
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
677
|
-
diff_algorithm: :semantic,
|
|
678
|
-
match: {
|
|
679
|
-
attribute_order: :ignore,
|
|
680
|
-
structural_whitespace: :ignore
|
|
681
|
-
}
|
|
682
|
-
)
|
|
683
|
-
----
|
|
684
|
-
|
|
685
|
-
**For JSON/YAML:**
|
|
686
|
-
|
|
687
|
-
[source,ruby]
|
|
688
|
-
----
|
|
689
|
-
# Ignore key order
|
|
690
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
691
|
-
diff_algorithm: :semantic,
|
|
692
|
-
match: {
|
|
693
|
-
key_order: :ignore
|
|
694
|
-
}
|
|
695
|
-
)
|
|
696
|
-
----
|
|
697
|
-
|
|
698
|
-
==== Step 5: Gradual Rollout
|
|
699
|
-
|
|
700
|
-
1. Enable for new tests first
|
|
701
|
-
2. Migrate stable test suites
|
|
702
|
-
3. Monitor for regressions
|
|
703
|
-
4. Keep DOM diff as fallback for edge cases
|
|
704
|
-
|
|
705
|
-
=== Using Both Algorithms
|
|
706
|
-
|
|
707
|
-
You can use both algorithms in the same codebase:
|
|
708
|
-
|
|
709
|
-
[source,ruby]
|
|
710
|
-
----
|
|
711
|
-
# Use semantic diff for operation analysis
|
|
712
|
-
semantic_result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
713
|
-
verbose: true,
|
|
714
|
-
diff_algorithm: :semantic
|
|
715
|
-
)
|
|
716
|
-
|
|
717
|
-
# Use DOM diff for traditional output
|
|
718
|
-
dom_result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
719
|
-
verbose: true,
|
|
720
|
-
diff_algorithm: :dom
|
|
721
|
-
)
|
|
722
|
-
|
|
723
|
-
# Compare results
|
|
724
|
-
puts "Semantic detected #{semantic_result.operations.size} operations"
|
|
725
|
-
puts "DOM detected #{dom_result.differences.size} differences"
|
|
726
|
-
----
|
|
727
|
-
|
|
728
|
-
Or conditionally:
|
|
729
|
-
|
|
730
|
-
[source,ruby]
|
|
731
|
-
----
|
|
732
|
-
# Use semantic for small docs, DOM for large
|
|
733
|
-
algorithm = node_count < 5000 ? :semantic : :dom
|
|
734
|
-
|
|
735
|
-
Canon::Comparison.equivalent?(doc1, doc2,
|
|
736
|
-
diff_algorithm: algorithm
|
|
737
|
-
)
|
|
738
|
-
----
|
|
739
|
-
|
|
740
|
-
== Research Background
|
|
741
|
-
|
|
742
|
-
The semantic tree diff is based on academic research:
|
|
743
|
-
|
|
744
|
-
* **XDiff (2002)**: Minimum-cost edit distance with unordered tree model
|
|
745
|
-
* **XyDiff/Cobena (2002)**: BULD algorithm with hash signatures and weights
|
|
746
|
-
* **JATS-diff (2022)**: Semantic operations for text-centric XML
|
|
747
|
-
|
|
748
|
-
Key innovations in Canon's implementation:
|
|
749
|
-
|
|
750
|
-
* Hybrid matching pipeline (hash + similarity + propagation)
|
|
751
|
-
* Format adapters for XML, JSON, HTML, YAML
|
|
752
|
-
* Integration with Canon's diff architecture
|
|
753
|
-
* Configurable thresholds and matching phases
|
|
754
|
-
* Metadata element classification
|
|
755
|
-
* Whitespace-sensitive element handling
|
|
756
|
-
|
|
757
|
-
See `xmldiff-resources.md` in the repository for research paper details.
|
|
758
|
-
|
|
759
|
-
== See Also
|
|
760
|
-
|
|
761
|
-
* link:TREE_DIFF.adoc[TREE_DIFF.adoc] - Operation types and examples
|
|
762
|
-
* link:ENV_CONFIG.adoc[ENV_CONFIG.adoc] - Environment variable configuration
|
|
763
|
-
* link:NORMATIVE_INFORMATIVE_DIFFS.adoc[NORMATIVE_INFORMATIVE_DIFFS.adoc] - Diff classification
|
|
764
|
-
* link:MATCH_OPTIONS.adoc[MATCH_OPTIONS.adoc] - Match dimensions and behaviors
|
|
765
|
-
* link:DOM_DIFF.adoc[DOM_DIFF.adoc] - DOM diff algorithm details
|