canon 0.1.8 → 0.1.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (98)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +112 -25
  3. data/docs/Gemfile +1 -0
  4. data/docs/_config.yml +90 -1
  5. data/docs/advanced/diff-classification.adoc +82 -2
  6. data/docs/features/match-options/index.adoc +239 -1
  7. data/lib/canon/comparison/format_detector.rb +2 -1
  8. data/lib/canon/comparison/html_comparator.rb +19 -8
  9. data/lib/canon/comparison/html_compare_profile.rb +8 -2
  10. data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
  11. data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
  12. data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
  13. data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
  14. data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
  15. data/lib/canon/comparison/xml_comparator.rb +48 -23
  16. data/lib/canon/comparison/xml_node_comparison.rb +25 -3
  17. data/lib/canon/diff/diff_classifier.rb +101 -2
  18. data/lib/canon/diff/formatting_detector.rb +1 -1
  19. data/lib/canon/rspec_matchers.rb +37 -8
  20. data/lib/canon/version.rb +1 -1
  21. data/lib/canon/xml/data_model.rb +24 -13
  22. metadata +3 -78
  23. data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
  24. data/false_positive_analysis.txt +0 -0
  25. data/file1.html +0 -1
  26. data/file2.html +0 -1
  27. data/old-docs/ADVANCED_TOPICS.adoc +0 -20
  28. data/old-docs/BASIC_USAGE.adoc +0 -16
  29. data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
  30. data/old-docs/CLI.adoc +0 -497
  31. data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
  32. data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
  33. data/old-docs/DIFF_FORMATTING.adoc +0 -540
  34. data/old-docs/DIFF_PARAMETERS.adoc +0 -261
  35. data/old-docs/DOM_DIFF.adoc +0 -1017
  36. data/old-docs/ENV_CONFIG.adoc +0 -876
  37. data/old-docs/FORMATS.adoc +0 -867
  38. data/old-docs/INPUT_VALIDATION.adoc +0 -477
  39. data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
  40. data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
  41. data/old-docs/MATCH_OPTIONS.adoc +0 -912
  42. data/old-docs/MODES.adoc +0 -432
  43. data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
  44. data/old-docs/OPTIONS.adoc +0 -1387
  45. data/old-docs/PREPROCESSING.adoc +0 -491
  46. data/old-docs/README.old.adoc +0 -2831
  47. data/old-docs/RSPEC.adoc +0 -814
  48. data/old-docs/RUBY_API.adoc +0 -485
  49. data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
  50. data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
  51. data/old-docs/STRING_COMPARE.adoc +0 -345
  52. data/old-docs/TMP.adoc +0 -3384
  53. data/old-docs/TREE_DIFF.adoc +0 -1080
  54. data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
  55. data/old-docs/VERBOSE.adoc +0 -482
  56. data/old-docs/VISUALIZATION_MAP.adoc +0 -625
  57. data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
  58. data/scripts/analyze_current_state.rb +0 -85
  59. data/scripts/analyze_false_positives.rb +0 -114
  60. data/scripts/analyze_remaining_failures.rb +0 -105
  61. data/scripts/compare_current_failures.rb +0 -95
  62. data/scripts/compare_dom_tree_diff.rb +0 -158
  63. data/scripts/compare_failures.rb +0 -151
  64. data/scripts/debug_attribute_extraction.rb +0 -66
  65. data/scripts/debug_blocks_839.rb +0 -115
  66. data/scripts/debug_meta_matching.rb +0 -52
  67. data/scripts/debug_p_matching.rb +0 -192
  68. data/scripts/debug_signature_matching.rb +0 -118
  69. data/scripts/debug_sourcecode_124.rb +0 -32
  70. data/scripts/debug_whitespace_sensitive.rb +0 -192
  71. data/scripts/extract_false_positives.rb +0 -138
  72. data/scripts/find_actual_false_positives.rb +0 -125
  73. data/scripts/investigate_all_false_positives.rb +0 -161
  74. data/scripts/investigate_batch1.rb +0 -127
  75. data/scripts/investigate_classification.rb +0 -150
  76. data/scripts/investigate_classification_detailed.rb +0 -190
  77. data/scripts/investigate_common_failures.rb +0 -342
  78. data/scripts/investigate_false_negative.rb +0 -80
  79. data/scripts/investigate_false_positive.rb +0 -83
  80. data/scripts/investigate_false_positives.rb +0 -227
  81. data/scripts/investigate_false_positives_batch.rb +0 -163
  82. data/scripts/investigate_mixed_content.rb +0 -125
  83. data/scripts/investigate_remaining_16.rb +0 -214
  84. data/scripts/run_single_test.rb +0 -29
  85. data/scripts/test_all_false_positives.rb +0 -95
  86. data/scripts/test_attribute_details.rb +0 -61
  87. data/scripts/test_both_algorithms.rb +0 -49
  88. data/scripts/test_both_simple.rb +0 -49
  89. data/scripts/test_enhanced_semantic_output.rb +0 -125
  90. data/scripts/test_readme_examples.rb +0 -131
  91. data/scripts/test_semantic_tree_diff.rb +0 -99
  92. data/scripts/test_semantic_ux_improvements.rb +0 -135
  93. data/scripts/test_single_false_positive.rb +0 -119
  94. data/scripts/test_size_limits.rb +0 -99
  95. data/test_html_1.html +0 -21
  96. data/test_html_2.html +0 -21
  97. data/test_nokogiri.rb +0 -33
  98. data/test_normalize.rb +0 -45
@@ -1,765 +0,0 @@
1
- = Semantic Tree Diff Algorithm
2
- :toc:
3
- :toclevels: 3
4
-
5
- [WARNING]
6
- The semantic tree diff algorithm is currently **experimental** and under active development. While functional and tested, the API and behavior may change in future releases. Use with caution in production environments.
7
-
8
- == General
9
-
10
- Canon provides two complementary diff algorithms:
11
-
12
- * **DOM diff** (default): Stable, position-based comparison for traditional diff output
13
- * **Semantic tree diff** (experimental): Advanced, similarity-based comparison with operation detection
14
-
15
- This document provides a comprehensive guide to the semantic tree diff algorithm, including when to use it, how it works, and how to migrate from DOM diff.
16
-
17
- == When to Use Semantic Tree Diff
18
-
19
- === Use Semantic Tree Diff When
20
-
21
- * You need to detect high-level operations (INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE)
22
- * Documents have significant rearrangement and you want to track moved content
23
- * You need statistical analysis of changes (matching rates, confidence scores)
24
- * You want operation-level analysis of document transformations
25
- * You're comparing structured documents where content moves between positions
26
-
27
- === Use DOM Diff When
28
-
29
- * You need stable, well-tested comparison for production use
30
- * You want traditional line-by-line diff output
31
- * Documents are similar in structure with minimal rearrangement
32
- * You need maximum performance for large documents (>10,000 nodes)
33
- * You want consistent, predictable behavior
34
-
35
- === Size Limits
36
-
37
- Both algorithms have configurable size limits to prevent hangs on pathologically large files:
38
-
39
- * **File size limit**: Default 5MB (configurable via `CANON_MAX_FILE_SIZE`)
40
- * **Node count limit**: Default 10,000 nodes (configurable via `CANON_MAX_NODE_COUNT`)
41
- * **Diff output limit**: Default 10,000 lines (configurable via `CANON_MAX_DIFF_LINES`)
42
-
43
- See link:ENV_CONFIG.adoc#size-limits[ENV_CONFIG.adoc] for details on configuring these limits.
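The size limits listed above could be read from the environment roughly as follows. This is a minimal sketch, not Canon's implementation: `DEFAULT_LIMITS`, `limit_for`, and `within_limits?` are hypothetical names, and treating "5MB" as 5 × 1024 × 1024 bytes is an assumption. Only the variable names and default values come from the list above.

```ruby
# Defaults copied from the list above; the byte value for "5MB" is an assumption.
DEFAULT_LIMITS = {
  "CANON_MAX_FILE_SIZE" => 5 * 1024 * 1024,
  "CANON_MAX_NODE_COUNT" => 10_000,
  "CANON_MAX_DIFF_LINES" => 10_000
}.freeze

# Hypothetical helper: returns the configured limit, falling back to the default.
# `env` is injectable so the lookup can be exercised without touching ENV.
def limit_for(name, env = ENV)
  Integer(env.fetch(name, DEFAULT_LIMITS.fetch(name)))
end

# Hypothetical guard: true when a document is small enough to compare.
def within_limits?(byte_size:, node_count:, env: ENV)
  byte_size <= limit_for("CANON_MAX_FILE_SIZE", env) &&
    node_count <= limit_for("CANON_MAX_NODE_COUNT", env)
end
```

A caller would run such a guard before starting a comparison and skip or downgrade the diff when it returns false.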
44
-
45
- == Key Differences from DOM Diff
46
-
47
- [cols="1,2,2"]
48
- |===
49
- |Aspect |DOM Diff |Semantic Tree Diff
50
-
51
- |**Matching Strategy**
52
- |Position-based with DOM structure
53
- |Similarity-based with intelligent matching
54
-
55
- |**Operation Detection**
56
- |Line-level changes only
57
- |INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE
58
-
59
- |**Move Detection**
60
- |No (shows as delete + insert)
61
- |Yes (tracks content movement)
62
-
63
- |**Rearrangement Handling**
64
- |Poor (many false positives)
65
- |Excellent (intelligent matching)
66
-
67
- |**Performance**
68
- |O(n) - faster for large docs
69
- |O(n²) worst case - slower but smarter
70
-
71
- |**Maturity**
72
- |Stable, production-ready
73
- |Experimental, under development
74
-
75
- |**Output Format**
76
- |Line-by-line or tree diff
77
- |Operation list with metadata
78
-
79
- |**Best For**
80
- |Traditional diff needs
81
- |Operation analysis, restructuring
82
- |===
83
-
84
- == How Semantic Tree Diff Works
85
-
86
- === Three-Phase Matching Pipeline
87
-
88
- The semantic tree diff uses a hybrid matching algorithm:
89
-
90
- ==== Phase 1: Hash-Based Exact Matching
91
-
92
- Matches nodes with identical structure and content using hash signatures.
93
-
94
- * Fast O(n) exact matching
95
- * Eliminates unchanged subtrees
96
- * Reduces problem size for subsequent phases
97
-
98
- ==== Phase 2: Similarity-Based Matching
99
-
100
- Matches similar but not identical nodes using weighted similarity scoring.
101
-
102
- * Compares node names, attributes, text content, and structure
103
- * Configurable similarity threshold (default: 0.95)
104
- * Uses weighted similarity metrics
105
-
106
- ==== Phase 3: Structural Propagation
107
-
108
- Propagates matches from parents and children to improve quality.
109
-
110
- * Top-down propagation from matched parents
111
- * Bottom-up propagation from matched children
112
- * Resolves ambiguous matches
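The first two phases described above can be sketched in plain Ruby. This is an illustration under stated assumptions, not Canon's code: the `Node` struct, the `signature` and `similarity` helpers, and the 0.4/0.2/0.4 weights are all hypothetical, and Phase 3 (structural propagation) is left out because it needs parent/child links.

```ruby
require "digest"

# Hypothetical flat node shape for illustration; not Canon's internal model.
Node = Struct.new(:name, :attrs, :text, keyword_init: true)

# Phase 1 helper: identical name, attributes, and text give an identical signature.
def signature(node)
  Digest::SHA256.hexdigest([node.name, node.attrs.sort, node.text].inspect)
end

# Phase 2 helper: weighted score in 0.0..1.0 (the weights are made up).
def similarity(a, b)
  name_score = a.name == b.name ? 1.0 : 0.0
  keys = a.attrs.keys | b.attrs.keys
  attr_score = keys.empty? ? 1.0 : keys.count { |k| a.attrs[k] == b.attrs[k] }.fdiv(keys.size)
  words_a = a.text.split
  words_b = b.text.split
  union = words_a | words_b
  text_score = union.empty? ? 1.0 : (words_a & words_b).size.fdiv(union.size)
  0.4 * name_score + 0.2 * attr_score + 0.4 * text_score
end

def match_nodes(old_nodes, new_nodes, threshold: 0.95)
  matches = []
  remaining_new = new_nodes.dup
  by_sig = remaining_new.group_by { |n| signature(n) }

  # Phase 1: pair nodes whose signatures are equal.
  unmatched_old = []
  old_nodes.each do |old|
    partner = by_sig.fetch(signature(old), []).shift
    if partner
      remaining_new.delete_at(remaining_new.index { |n| n.equal?(partner) })
      matches << [old, partner, :exact]
    else
      unmatched_old << old
    end
  end

  # Phase 2: pair each leftover node with its best candidate at or above the threshold.
  unmatched_old.each do |old|
    best = remaining_new.max_by { |candidate| similarity(old, candidate) }
    next if best.nil? || similarity(old, best) < threshold
    remaining_new.delete_at(remaining_new.index { |n| n.equal?(best) })
    matches << [old, best, :similar]
  end

  matches
end
```

Phase 1 shrinks the candidate set before the pairwise scoring in Phase 2, which is where the quadratic worst case comes from.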
113
-
114
- === Operation Detection
115
-
116
- After matching, the algorithm detects eight operation types:
117
-
118
- **Level 1 - Basic Operations:**
119
-
120
- * `INSERT`: New node added
121
- * `DELETE`: Node removed
122
- * `UPDATE`: Node content/attributes changed
123
-
124
- **Level 2 - Structural Operations:**
125
-
126
- * `MOVE`: Node relocated to different position
127
-
128
- **Level 3 - Semantic Operations:**
129
-
130
- * `MERGE`: Multiple nodes combined into one
131
- * `SPLIT`: One node divided into multiple
132
- * `UPGRADE`: Node promoted to higher level (decreased depth)
133
- * `DOWNGRADE`: Node demoted to lower level (increased depth)
134
-
135
- See link:TREE_DIFF.adoc#detected-operations[TREE_DIFF.adoc] for detailed examples of each operation.
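For a pair of nodes that has already been matched, the MOVE, UPGRADE, DOWNGRADE, and UPDATE definitions above reduce to comparing locations. The sketch below assumes a hypothetical `Location` struct and a hypothetical precedence (depth changes win over position changes, which win over content changes); the original text does not say how Canon orders these checks.

```ruby
# Hypothetical location record: depth in the tree and index among siblings.
Location = Struct.new(:depth, :index, keyword_init: true)

# Returns one operation symbol for a matched pair, or nil when nothing changed.
# The precedence of the checks is an assumption made for this sketch.
def classify_matched_pair(old_loc, new_loc, content_changed:)
  return :upgrade if new_loc.depth < old_loc.depth    # promoted: decreased depth
  return :downgrade if new_loc.depth > old_loc.depth  # demoted: increased depth
  return :move if new_loc.index != old_loc.index      # same depth, new position
  content_changed ? :update : nil
end
```

Unmatched nodes fall outside this helper: those left only in the old tree would be DELETE candidates, and those left only in the new tree would be INSERT candidates.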
136
-
137
- == Metadata Elements
138
-
139
- Metadata elements are presentation/formatting elements that don't affect semantic equivalence. The semantic tree diff automatically treats these as **informative** (non-normative) differences.
140
-
141
- === Metadata Element List
142
-
143
- The following elements are classified as metadata:
144
-
145
- * `semx` - Semantic markup
146
- * `fmt-concept` - Formatted concept
147
- * `fmt-name` - Formatted name
148
- * `fmt-title` - Formatted title
149
- * `fmt-xref` - Formatted cross-reference
150
- * `fmt-eref` - Formatted external reference
151
- * `fmt-termref` - Formatted term reference
152
- * `fmt-element-name` - Formatted element name
153
- * `fmt-link` - Formatted link
154
- * `autonum` - Automatic numbering
155
-
156
- === Classification Behavior
157
-
158
- When a difference involves a metadata element:
159
-
160
- * **INSERT/DELETE of metadata element**: Marked as informative
161
- * **UPDATE of metadata element**: Marked as informative
162
- * **MOVE of metadata element**: Marked as informative
163
-
164
- .Example: Metadata element differences are informative
165
- [example]
166
- ====
167
- [source,xml]
168
- ----
169
- <!-- Document 1 -->
170
- <section>
171
- <title>Introduction</title>
172
- <autonum>1.1</autonum>
173
- </section>
174
-
175
- <!-- Document 2 -->
176
- <section>
177
- <title>Introduction</title>
178
- <autonum>1.2</autonum> <!-- Changed autonum -->
179
- </section>
180
- ----
181
-
182
- The change to `<autonum>` is detected as an UPDATE but marked as **informative** because `autonum` is a metadata element. The documents are considered semantically equivalent.
183
- ====
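The classification rule above amounts to a set lookup on the element name. The following is a minimal sketch assuming a hypothetical `Operation` struct and hypothetical helper names; only the element names are taken from the list above, and the real constant lives in Canon's `OperationConverter`.

```ruby
require "set"

# Element names copied from the list above; the constant name is hypothetical.
SKETCH_METADATA_ELEMENTS = Set.new(%w[
  semx fmt-concept fmt-name fmt-title fmt-xref fmt-eref
  fmt-termref fmt-element-name fmt-link autonum
]).freeze

# Hypothetical operation record for illustration.
Operation = Struct.new(:type, :element_name, keyword_init: true)

# An operation on a metadata element is informative regardless of its type.
def informative?(operation)
  SKETCH_METADATA_ELEMENTS.include?(operation.element_name)
end

# Documents count as equivalent when no normative operation remains.
def equivalent_by_operations?(operations)
  operations.all? { |op| informative?(op) }
end
```

With the `<autonum>` example above, the single UPDATE is informative, so the two documents would still be reported as equivalent.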
184
-
185
- === Configuring Metadata Elements
186
-
187
- The metadata element list is defined in `OperationConverter::METADATA_ELEMENTS` (`lib/canon/tree_diff/operation_converter.rb`, line 33).
188
-
189
- To add custom metadata elements for your domain, you would need to:
190
-
191
- 1. Fork Canon and modify `METADATA_ELEMENTS`
192
- 2. Or use match dimension `:ignore` to ignore specific element changes
193
-
194
- == Whitespace-Sensitive Elements
195
-
196
- Certain elements preserve whitespace as semantically significant. For these elements, whitespace is **not normalized** even when `text_content: :normalize` is set.
197
-
198
- === Whitespace-Sensitive Element List
199
-
200
- * `pre` - Preformatted text
201
- * `code` - Code blocks
202
- * `textarea` - Text input areas
203
- * `script` - JavaScript code
204
- * `style` - CSS styles
205
-
206
- === Whitespace Handling
207
-
208
- For whitespace-sensitive elements:
209
-
210
- * Text content is preserved exactly as-is
211
- * Normalization does NOT apply
212
- * Whitespace differences are **normative** (must match exactly)
213
-
214
- .Example: Whitespace in `<pre>` is significant
215
- [example]
216
- ====
217
- [source,html]
218
- ----
219
- <!-- Document 1 -->
220
- <pre>
221
- Line 1
222
- Line 2
223
- </pre>
224
-
225
- <!-- Document 2 -->
226
- <pre>
227
- Line 1
228
- Line 2
229
- </pre>
230
- ----
231
-
232
- These are NOT equivalent because `<pre>` preserves whitespace. The leading spaces in Document 1 are semantically significant.
233
- ====
234
-
235
- === Configuration
236
-
237
- Whitespace-sensitive elements are detected automatically based on element name. No configuration is required.
238
-
239
- If you need to treat other elements as whitespace-sensitive, you would need to modify the adapter code.
240
-
241
- == Normative vs Informative Classification
242
-
243
- The semantic tree diff integrates with Canon's normative/informative diff architecture. Operations are classified based on match dimensions:
244
-
245
- === Dimension Mapping
246
-
247
- [cols="1,1,2"]
248
- |===
249
- |Match Dimension |Operation Types |Effect When `:ignore`
250
-
251
- |`text_content`
252
- |UPDATE (text changes)
253
- |Text differences → informative
254
-
255
- |`attribute_values`
256
- |UPDATE (attribute changes)
257
- |Attribute value differences → informative
258
-
259
- |`attribute_order`
260
- |UPDATE (attribute reordering)
261
- |Attribute order differences → informative
262
-
263
- |`element_position`
264
- |MOVE
265
- |Position changes → informative
266
-
267
- |`element_hierarchy`
268
- |UPGRADE, DOWNGRADE
269
- |Hierarchy changes → informative
270
-
271
- |`element_structure`
272
- |INSERT, DELETE, MERGE, SPLIT
273
- |Always normative (structural changes)
274
- |===
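The mapping in the table above can be expressed as a lookup followed by a check of the match options. This is a sketch under assumptions: the constant and method names are hypothetical, and because UPDATE maps to several dimensions, the caller has to say which one applies through the hypothetical `update_dimension:` argument.

```ruby
# Operation type => match dimension, following the table above.
DIMENSION_FOR_OPERATION = {
  move: :element_position,
  upgrade: :element_hierarchy,
  downgrade: :element_hierarchy,
  insert: :element_structure,
  delete: :element_structure,
  merge: :element_structure,
  split: :element_structure
}.freeze

# Structural changes stay normative even when the dimension is set to :ignore.
ALWAYS_NORMATIVE = [:element_structure].freeze

# Returns true when the operation must be treated as a real difference.
def normative?(operation_type, match_options, update_dimension: :text_content)
  dimension =
    if operation_type == :update
      update_dimension
    else
      DIMENSION_FOR_OPERATION.fetch(operation_type)
    end
  return true if ALWAYS_NORMATIVE.include?(dimension)

  match_options[dimension] != :ignore
end
```

For example, `normative?(:move, { element_position: :ignore })` is false, while an INSERT remains normative whatever the options say.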
275
-
276
- === Example: Ignoring Position Changes
277
-
278
- [source,ruby]
279
- ----
280
- result = Canon::Comparison.equivalent?(doc1, doc2,
281
- verbose: true,
282
- diff_algorithm: :semantic,
283
- match: {
284
- element_position: :ignore # MOVE operations → informative
285
- }
286
- )
287
-
288
- # Only structural changes are normative
289
- has_structural_changes = result.differences.any?(&:normative?)
290
- ----
291
-
292
- See link:NORMATIVE_INFORMATIVE_DIFFS.adoc[NORMATIVE_INFORMATIVE_DIFFS.adoc] for details on the normative/informative architecture.
293
-
294
- == Configuration
295
-
296
- === Enabling Semantic Tree Diff
297
-
298
- Set `diff_algorithm: :semantic`:
299
-
300
- [source,ruby]
301
- ----
302
- require 'canon/comparison'
303
-
304
- result = Canon::Comparison.equivalent?(doc1, doc2,
305
- verbose: true,
306
- diff_algorithm: :semantic
307
- )
308
- ----
309
-
310
- === Via Environment Variables
311
-
312
- [source,bash]
313
- ----
314
- # Set globally
315
- export CANON_ALGORITHM=semantic
316
-
317
- # Or format-specific
318
- export CANON_XML_DIFF_ALGORITHM=semantic
319
- export CANON_HTML_DIFF_ALGORITHM=semantic
320
- ----
321
-
322
- See link:ENV_CONFIG.adoc#algorithm-selection[ENV_CONFIG.adoc] for details.
323
-
324
- === Configuration Options
325
-
326
- ==== similarity_threshold
327
-
328
- Minimum similarity score for matching nodes (default: 0.95).
329
-
330
- [source,ruby]
331
- ----
332
- result = Canon::Comparison.equivalent?(doc1, doc2,
333
- diff_algorithm: :semantic,
334
- match: {
335
- similarity_threshold: 0.90 # More lenient matching
336
- }
337
- )
338
- ----
339
-
340
- * Higher values (0.99): Very conservative, only nearly identical nodes match
341
- * Lower values (0.80): More aggressive, allows less similar nodes to match
342
- * Default (0.95): Balanced for most use cases
343
-
344
- ==== hash_matching
345
-
346
- Enable/disable exact hash matching phase (default: true).
347
-
348
- [source,ruby]
349
- ----
350
- result = Canon::Comparison.equivalent?(doc1, doc2,
351
- diff_algorithm: :semantic,
352
- match: {
353
- hash_matching: false # Disable exact matching
354
- }
355
- )
356
- ----
357
-
358
- Disable only if exact matching causes issues.
359
-
360
- ==== similarity_matching
361
-
362
- Enable/disable similarity-based matching phase (default: true).
363
-
364
- [source,ruby]
365
- ----
366
- result = Canon::Comparison.equivalent?(doc1, doc2,
367
- diff_algorithm: :semantic,
368
- match: {
369
- similarity_matching: false # Use only exact matching
370
- }
371
- )
372
- ----
373
-
374
- Disable for faster but less accurate matching.
375
-
376
- ==== propagation
377
-
378
- Enable/disable structural propagation phase (default: true).
379
-
380
- [source,ruby]
381
- ----
382
- result = Canon::Comparison.equivalent?(doc1, doc2,
383
- diff_algorithm: :semantic,
384
- match: {
385
- propagation: false # Disable propagation
386
- }
387
- )
388
- ----
389
-
390
- Disable for simpler but potentially less accurate results.
391
-
392
- == Known Limitations
393
-
394
- === Performance on Large Documents
395
-
396
- The semantic tree diff has O(n²) worst-case complexity in the similarity matching phase.
397
-
398
- **Workarounds:**
399
-
400
- * Use size limits to prevent hangs (see link:ENV_CONFIG.adoc#size-limits[ENV_CONFIG.adoc])
401
- * Disable `similarity_matching` if exact matches suffice
402
- * Increase `similarity_threshold` to reduce candidate matches
403
- * Use DOM diff for documents >10,000 nodes
404
-
405
- === Attribute Order Detection
406
-
407
- INSERT/DELETE pairs that differ only in attribute order are detected and reclassified, but this detection has limitations:
408
-
409
- * Only works for single-element differences
410
- * Requires exact content match
411
- * May miss complex reorderings
412
-
413
- **Workaround:**
414
-
415
- Use `attribute_order: :ignore` in match options to treat all attribute order differences as informative.
416
-
417
- === Deep Hierarchy Changes
418
-
419
- UPGRADE/DOWNGRADE detection requires similar content and element names. Complex restructuring may be reported as DELETE + INSERT instead.
420
-
421
- **Workaround:**
422
-
423
- Adjust `similarity_threshold` to allow more lenient matching.
424
-
425
- === Format-Specific Limitations
426
-
427
- ==== XML/HTML
428
-
429
- * Namespace changes may not be detected correctly
430
- * Mixed content (text + elements) may cause false positives
431
- * Comment handling depends on `comments` match dimension
432
-
433
- ==== JSON/YAML
434
-
435
- * Array reordering is detected as MOVE but may be noisy
436
- * Type changes (string → number) are always normative
437
- * Null vs missing key differences are structural
438
-
439
- == Troubleshooting
440
-
441
- === Too Many False Positives (Over-Matching)
442
-
443
- **Symptoms:**
444
-
445
- * Unrelated nodes are matched
446
- * Many UPDATE operations instead of INSERT/DELETE
447
- * Operations seem incorrect
448
-
449
- **Solutions:**
450
-
451
- [source,ruby]
452
- ----
453
- # Increase threshold for stricter matching
454
- Canon::Comparison.equivalent?(doc1, doc2,
455
- diff_algorithm: :semantic,
456
- match: {
457
- similarity_threshold: 0.98 # Was 0.95
458
- }
459
- )
460
-
461
- # Or disable similarity matching entirely
462
- Canon::Comparison.equivalent?(doc1, doc2,
463
- diff_algorithm: :semantic,
464
- match: {
465
- similarity_matching: false # Use only exact matches
466
- }
467
- )
468
- ----
469
-
470
- === Too Few Matches (Under-Matching)
471
-
472
- **Symptoms:**
473
-
474
- * Similar content shows as DELETE + INSERT
475
- * No MOVE operations detected
476
- * Low match rate in statistics
477
-
478
- **Solutions:**
479
-
480
- [source,ruby]
481
- ----
482
- # Decrease threshold for more lenient matching
483
- Canon::Comparison.equivalent?(doc1, doc2,
484
- diff_algorithm: :semantic,
485
- match: {
486
- similarity_threshold: 0.85 # Was 0.95
487
- }
488
- )
489
-
490
- # Ensure all matching phases are enabled
491
- Canon::Comparison.equivalent?(doc1, doc2,
492
- diff_algorithm: :semantic,
493
- match: {
494
- hash_matching: true,
495
- similarity_matching: true,
496
- propagation: true
497
- }
498
- )
499
- ----
500
-
501
- === Performance Issues
502
-
503
- **Symptoms:**
504
-
505
- * Comparison hangs or is very slow
506
- * High memory usage
507
- * CPU pegs at 100%
508
-
509
- **Solutions:**
510
-
511
- [source,bash]
512
- ----
513
- # Tighten the file/node size limits (exported so the rspec process sees them)
514
- export CANON_XML_DIFF_MAX_FILE_SIZE=1048576 # 1MB limit
515
- export CANON_XML_DIFF_MAX_NODE_COUNT=5000 # 5,000 nodes
516
- bundle exec rspec
517
- ----
518
-
519
- Or disable expensive phases:
520
-
521
- [source,ruby]
522
- ----
523
- Canon::Comparison.equivalent?(doc1, doc2,
524
- diff_algorithm: :semantic,
525
- match: {
526
- similarity_matching: false, # Skip if exact matches suffice
527
- propagation: false # Skip if not needed
528
- }
529
- )
530
- ----
531
-
532
- Or switch to DOM diff for large files:
533
-
534
- [source,ruby]
535
- ----
536
- # Conditionally use semantic diff only for smaller files
537
- algorithm = doc1.size > 100_000 ? :dom : :semantic
538
-
539
- Canon::Comparison.equivalent?(doc1, doc2,
540
- diff_algorithm: algorithm
541
- )
542
- ----
543
-
544
- === Incorrect MOVE Detection
545
-
546
- **Symptoms:**
547
-
548
- * Elements marked as MOVE that didn't actually move
549
- * Missing MOVE operations for elements that did move
550
-
551
- **Solutions:**
552
-
553
- MOVE is detected when matched nodes have different positions. Verify:
554
-
555
- 1. Nodes are truly similar (check attributes and content)
556
- 2. `similarity_threshold` is appropriate
557
- 3. Preprocessing normalizes content consistently
558
-
559
- [source,ruby]
560
- ----
561
- # Use preprocessing to normalize before comparison
562
- Canon::Comparison.equivalent?(doc1, doc2,
563
- diff_algorithm: :semantic,
564
- preprocessing: :c14n # Canonicalize first
565
- )
566
- ----
567
-
568
- === Metadata Elements Not Classified Correctly
569
-
570
- **Symptoms:**
571
-
572
- * Metadata changes marked as normative
573
- * Expected informative diffs show as must-fix
574
-
575
- **Solutions:**
576
-
577
- Check whether the element is in the `METADATA_ELEMENTS` list (`lib/canon/tree_diff/operation_converter.rb`, line 33). If not, either:
578
-
579
- 1. Add it to the list (requires code change)
580
- 2. Use match dimension to ignore it:
581
-
582
- [source,ruby]
583
- ----
584
- # Treat text content differences as informative (applies to all elements, not only metadata)
585
- Canon::Comparison.equivalent?(doc1, doc2,
586
- diff_algorithm: :semantic,
587
- match: {
588
- text_content: :ignore # All text differences → informative
589
- }
590
- )
591
- ----
592
-
593
- == Migration Guide
594
-
595
- === From DOM Diff to Semantic Tree Diff
596
-
597
- ==== Step 1: Enable Semantic Diff in Tests
598
-
599
- Start with a small subset of tests:
600
-
601
- [source,ruby]
602
- ----
603
- # spec/spec_helper.rb
604
- RSpec.configure do |config|
605
- config.around(:each, semantic_diff: true) do |example|
606
- # Temporarily enable semantic diff for tagged tests
607
- old_algo = Canon::Config.instance.xml.diff.algorithm
608
- Canon::Config.instance.xml.diff.algorithm = :semantic
609
-
610
- example.run
611
-
612
- Canon::Config.instance.xml.diff.algorithm = old_algo
613
- end
614
- end
615
-
616
- # In test file
617
- RSpec.describe 'XML comparison', :semantic_diff do
618
- it 'detects moves correctly' do
619
- expect(actual).to be_xml_equivalent_to(expected)
620
- end
621
- end
622
- ----
623
-
624
- ==== Step 2: Compare Outputs
625
-
626
- Run tests with both algorithms to compare:
627
-
628
- [source,bash]
629
- ----
630
- # Run with DOM diff (baseline)
631
- CANON_ALGORITHM=dom bundle exec rspec > dom_output.txt
632
-
633
- # Run with semantic diff
634
- CANON_ALGORITHM=semantic bundle exec rspec > semantic_output.txt
635
-
636
- # Compare outputs
637
- diff dom_output.txt semantic_output.txt
638
- ----
639
-
640
- ==== Step 3: Adjust Match Options
641
-
642
- Tune `similarity_threshold` and match dimensions for your use case:
643
-
644
- [source,ruby]
645
- ----
646
- # Start conservative, gradually relax
647
- result = Canon::Comparison.equivalent?(doc1, doc2,
648
- diff_algorithm: :semantic,
649
- match: {
650
- similarity_threshold: 0.98, # Start high
651
- element_position: :ignore, # Ignore moves initially
652
- }
653
- )
654
-
655
- # Review results, adjust as needed
656
- ----
657
-
658
- ==== Step 4: Handle Format-Specific Issues
659
-
660
- **For XML:**
661
-
662
- [source,ruby]
663
- ----
664
- # Normalize namespaces
665
- Canon::Comparison.equivalent?(doc1, doc2,
666
- diff_algorithm: :semantic,
667
- preprocessing: :c14n
668
- )
669
- ----
670
-
671
- **For HTML:**
672
-
673
- [source,ruby]
674
- ----
675
- # Ignore presentation differences
676
- Canon::Comparison.equivalent?(doc1, doc2,
677
- diff_algorithm: :semantic,
678
- match: {
679
- attribute_order: :ignore,
680
- structural_whitespace: :ignore
681
- }
682
- )
683
- ----
684
-
685
- **For JSON/YAML:**
686
-
687
- [source,ruby]
688
- ----
689
- # Ignore key order
690
- Canon::Comparison.equivalent?(doc1, doc2,
691
- diff_algorithm: :semantic,
692
- match: {
693
- key_order: :ignore
694
- }
695
- )
696
- ----
697
-
698
- ==== Step 5: Gradual Rollout
699
-
700
- 1. Enable for new tests first
701
- 2. Migrate stable test suites
702
- 3. Monitor for regressions
703
- 4. Keep DOM diff as fallback for edge cases
704
-
705
- === Using Both Algorithms
706
-
707
- You can use both algorithms in the same codebase:
708
-
709
- [source,ruby]
710
- ----
711
- # Use semantic diff for operation analysis
712
- semantic_result = Canon::Comparison.equivalent?(doc1, doc2,
713
- verbose: true,
714
- diff_algorithm: :semantic
715
- )
716
-
717
- # Use DOM diff for traditional output
718
- dom_result = Canon::Comparison.equivalent?(doc1, doc2,
719
- verbose: true,
720
- diff_algorithm: :dom
721
- )
722
-
723
- # Compare results
724
- puts "Semantic detected #{semantic_result.operations.size} operations"
725
- puts "DOM detected #{dom_result.differences.size} differences"
726
- ----
727
-
728
- Or conditionally:
729
-
730
- [source,ruby]
731
- ----
732
- # Use semantic for small docs, DOM for large
733
- algorithm = node_count < 5000 ? :semantic : :dom
734
-
735
- Canon::Comparison.equivalent?(doc1, doc2,
736
- diff_algorithm: algorithm
737
- )
738
- ----
739
-
740
- == Research Background
741
-
742
- The semantic tree diff is based on academic research:
743
-
744
- * **XDiff (2002)**: Minimum-cost edit distance with unordered tree model
745
- * **XyDiff/Cobena (2002)**: BULD algorithm with hash signatures and weights
746
- * **JATS-diff (2022)**: Semantic operations for text-centric XML
747
-
748
- Key innovations in Canon's implementation:
749
-
750
- * Hybrid matching pipeline (hash + similarity + propagation)
751
- * Format adapters for XML, JSON, HTML, YAML
752
- * Integration with Canon's diff architecture
753
- * Configurable thresholds and matching phases
754
- * Metadata element classification
755
- * Whitespace-sensitive element handling
756
-
757
- See `xmldiff-resources.md` in the repository for research paper details.
758
-
759
- == See Also
760
-
761
- * link:TREE_DIFF.adoc[TREE_DIFF.adoc] - Operation types and examples
762
- * link:ENV_CONFIG.adoc[ENV_CONFIG.adoc] - Environment variable configuration
763
- * link:NORMATIVE_INFORMATIVE_DIFFS.adoc[NORMATIVE_INFORMATIVE_DIFFS.adoc] - Diff classification
764
- * link:MATCH_OPTIONS.adoc[MATCH_OPTIONS.adoc] - Match dimensions and behaviors
765
- * link:DOM_DIFF.adoc[DOM_DIFF.adoc] - DOM diff algorithm details