canon 0.1.8 → 0.1.10

Files changed (101)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +83 -22
  3. data/docs/Gemfile +1 -0
  4. data/docs/_config.yml +90 -1
  5. data/docs/advanced/diff-classification.adoc +196 -24
  6. data/docs/features/match-options/index.adoc +239 -1
  7. data/lib/canon/comparison/format_detector.rb +2 -1
  8. data/lib/canon/comparison/html_comparator.rb +19 -8
  9. data/lib/canon/comparison/html_compare_profile.rb +8 -2
  10. data/lib/canon/comparison/markup_comparator.rb +109 -2
  11. data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
  12. data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
  13. data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
  14. data/lib/canon/comparison/xml_comparator/diff_node_builder.rb +108 -0
  15. data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
  16. data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
  17. data/lib/canon/comparison/xml_comparator.rb +240 -23
  18. data/lib/canon/comparison/xml_node_comparison.rb +25 -3
  19. data/lib/canon/diff/diff_classifier.rb +119 -5
  20. data/lib/canon/diff/formatting_detector.rb +1 -1
  21. data/lib/canon/diff/xml_serialization_formatter.rb +153 -0
  22. data/lib/canon/rspec_matchers.rb +37 -8
  23. data/lib/canon/version.rb +1 -1
  24. data/lib/canon/xml/data_model.rb +24 -13
  25. metadata +4 -78
  26. data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
  27. data/false_positive_analysis.txt +0 -0
  28. data/file1.html +0 -1
  29. data/file2.html +0 -1
  30. data/old-docs/ADVANCED_TOPICS.adoc +0 -20
  31. data/old-docs/BASIC_USAGE.adoc +0 -16
  32. data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
  33. data/old-docs/CLI.adoc +0 -497
  34. data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
  35. data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
  36. data/old-docs/DIFF_FORMATTING.adoc +0 -540
  37. data/old-docs/DIFF_PARAMETERS.adoc +0 -261
  38. data/old-docs/DOM_DIFF.adoc +0 -1017
  39. data/old-docs/ENV_CONFIG.adoc +0 -876
  40. data/old-docs/FORMATS.adoc +0 -867
  41. data/old-docs/INPUT_VALIDATION.adoc +0 -477
  42. data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
  43. data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
  44. data/old-docs/MATCH_OPTIONS.adoc +0 -912
  45. data/old-docs/MODES.adoc +0 -432
  46. data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
  47. data/old-docs/OPTIONS.adoc +0 -1387
  48. data/old-docs/PREPROCESSING.adoc +0 -491
  49. data/old-docs/README.old.adoc +0 -2831
  50. data/old-docs/RSPEC.adoc +0 -814
  51. data/old-docs/RUBY_API.adoc +0 -485
  52. data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
  53. data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
  54. data/old-docs/STRING_COMPARE.adoc +0 -345
  55. data/old-docs/TMP.adoc +0 -3384
  56. data/old-docs/TREE_DIFF.adoc +0 -1080
  57. data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
  58. data/old-docs/VERBOSE.adoc +0 -482
  59. data/old-docs/VISUALIZATION_MAP.adoc +0 -625
  60. data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
  61. data/scripts/analyze_current_state.rb +0 -85
  62. data/scripts/analyze_false_positives.rb +0 -114
  63. data/scripts/analyze_remaining_failures.rb +0 -105
  64. data/scripts/compare_current_failures.rb +0 -95
  65. data/scripts/compare_dom_tree_diff.rb +0 -158
  66. data/scripts/compare_failures.rb +0 -151
  67. data/scripts/debug_attribute_extraction.rb +0 -66
  68. data/scripts/debug_blocks_839.rb +0 -115
  69. data/scripts/debug_meta_matching.rb +0 -52
  70. data/scripts/debug_p_matching.rb +0 -192
  71. data/scripts/debug_signature_matching.rb +0 -118
  72. data/scripts/debug_sourcecode_124.rb +0 -32
  73. data/scripts/debug_whitespace_sensitive.rb +0 -192
  74. data/scripts/extract_false_positives.rb +0 -138
  75. data/scripts/find_actual_false_positives.rb +0 -125
  76. data/scripts/investigate_all_false_positives.rb +0 -161
  77. data/scripts/investigate_batch1.rb +0 -127
  78. data/scripts/investigate_classification.rb +0 -150
  79. data/scripts/investigate_classification_detailed.rb +0 -190
  80. data/scripts/investigate_common_failures.rb +0 -342
  81. data/scripts/investigate_false_negative.rb +0 -80
  82. data/scripts/investigate_false_positive.rb +0 -83
  83. data/scripts/investigate_false_positives.rb +0 -227
  84. data/scripts/investigate_false_positives_batch.rb +0 -163
  85. data/scripts/investigate_mixed_content.rb +0 -125
  86. data/scripts/investigate_remaining_16.rb +0 -214
  87. data/scripts/run_single_test.rb +0 -29
  88. data/scripts/test_all_false_positives.rb +0 -95
  89. data/scripts/test_attribute_details.rb +0 -61
  90. data/scripts/test_both_algorithms.rb +0 -49
  91. data/scripts/test_both_simple.rb +0 -49
  92. data/scripts/test_enhanced_semantic_output.rb +0 -125
  93. data/scripts/test_readme_examples.rb +0 -131
  94. data/scripts/test_semantic_tree_diff.rb +0 -99
  95. data/scripts/test_semantic_ux_improvements.rb +0 -135
  96. data/scripts/test_single_false_positive.rb +0 -119
  97. data/scripts/test_size_limits.rb +0 -99
  98. data/test_html_1.html +0 -21
  99. data/test_html_2.html +0 -21
  100. data/test_nokogiri.rb +0 -33
  101. data/test_normalize.rb +0 -45
@@ -1,1080 +0,0 @@
1
- = Semantic tree diff
2
- :toc:
3
- :toclevels: 3
4
-
5
- [WARNING]
6
- The semantic tree diff feature is currently **experimental** and under active
7
- development. While it is functional and tested, the API and behavior may change
8
- in future releases. Use with caution in production environments.
9
-
10
- == General
11
-
12
- Canon provides two complementary diff modes:
13
-
14
- * **DOM diff** (default): Line-based comparison that matches elements by
15
- position and structure
16
- * **Semantic tree diff** (opt-in): Operation-based comparison that detects
17
- high-level edit operations (INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT,
18
- UPGRADE, DOWNGRADE)
19
-
20
- The semantic tree diff is based on research in XML diff algorithms (XDiff 2002,
21
- XyDiff/Cobena 2002, JATS-diff 2022) and provides operation-level analysis of
22
- changes between document trees.
23
-
24
- == Purpose
25
-
26
- The semantic tree diff enables:
27
-
28
- * **Operation detection**: Identify INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT,
29
- UPGRADE, and DOWNGRADE operations
30
- * **Intelligent matching**: Match similar nodes even when positions change
31
- * **Format independence**: Works with XML, JSON, HTML, and YAML
32
- * **Statistical analysis**: Provides matching statistics and confidence scores
33
-
34
- Use semantic tree diff when you need to:
35
-
36
- * Understand what operations transformed one document into another
37
- * Track content that moved between positions
38
- * Analyze structural changes at a high level
39
- * Compare documents with significant rearrangement
40
-
41
- == Enabling semantic tree diff
42
-
43
- Enable semantic tree diff by setting `diff_algorithm: :semantic`:
44
-
45
- [source,ruby]
46
- ----
47
- require 'canon/comparison'
48
-
49
- result = Canon::Comparison.equivalent?(doc1, doc2,
50
- verbose: true,
51
- diff_algorithm: :semantic
52
- )
53
- ----
54
-
55
- Canon then uses the semantic tree diff instead of the default DOM diff. The two
56
- modes cannot be used simultaneously.
57
-
58
- == Detected operations
59
-
60
- The semantic tree diff detects these eight operations in three levels:
61
-
62
- * **Level 1**: Basic operations (INSERT, DELETE, UPDATE)
63
- * **Level 2**: Structural operations (MOVE)
64
- * **Level 3**: Semantic operations (MERGE, SPLIT, UPGRADE, DOWNGRADE)
65
-
66
- === Detection algorithms
67
-
68
- The operation detector uses a pattern-based approach inspired by XDiff, XyDiff,
69
- and JATS-diff research:
70
-
71
- **Level 1 - Basic operations:**
72
-
73
- * **INSERT**: Detected by finding nodes in tree2 with no match in tree1
74
- * **DELETE**: Detected by finding nodes in tree1 with no match in tree2
75
- * **UPDATE**: Detected when matched nodes have different content, attributes, or labels
76
-
77
- **Level 2 - Structural operations:**
78
-
79
- * **MOVE**: Detected when a matched node has a different parent in tree2 than
80
- its matched parent from tree1
81
-
82
- **Level 3 - Semantic operations:**
83
-
84
- These are detected by analyzing patterns in the basic operations:
85
-
86
- * **MERGE**: Pattern of (n-1) DELETE operations + 1 UPDATE operation where
87
- deleted content appears in the updated node (80% text similarity threshold)
88
-
89
- * **SPLIT**: Pattern of 1 DELETE operation + n INSERT operations where the
90
- deleted node's content is distributed across the inserted nodes (80% text
91
- similarity threshold)
92
-
93
- * **UPGRADE**: Pattern of DELETE + INSERT where the inserted node is at a
94
- shallower depth (promoted in hierarchy) with same label and similar content
95
- (90% similarity threshold)
96
-
97
- * **DOWNGRADE**: Pattern of DELETE + INSERT where the inserted node is at a
98
- deeper depth (demoted in hierarchy) with same label and similar content (90%
99
- similarity threshold)
100
-
101
- When semantic operations are detected, the component basic operations are
102
- removed and replaced with the higher-level semantic operation.
103
-
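As a rough illustration of the Level 3 pattern analysis described above, the following sketch collapses (n-1) DELETE operations plus one UPDATE into a single MERGE when the deleted text reappears in the updated node. The `Op` struct and the `text_overlap` and `detect_merge` helpers are hypothetical names introduced here, not Canon's API, and the word-overlap measure is an assumed stand-in for whatever text similarity Canon actually computes.

```ruby
# Hypothetical sketch of MERGE detection; not Canon's actual implementation.
Op = Struct.new(:type, :old_text, :new_text, keyword_init: true)

MERGE_THRESHOLD = 0.8 # the 80% text similarity threshold quoted above

# Assumed similarity: fraction of words in +source+ that occur in +target+.
def text_overlap(source, target)
  words = source.split
  return 0.0 if words.empty?

  target_words = target.split
  words.count { |w| target_words.include?(w) }.to_f / words.size
end

# Replace DELETEs whose text reappears in an UPDATE with one :merge operation.
def detect_merge(ops)
  update = ops.find { |op| op.type == :update }
  return ops unless update

  absorbed = ops.select do |op|
    op.type == :delete &&
      text_overlap(op.old_text, update.new_text) >= MERGE_THRESHOLD
  end
  return ops if absorbed.empty?

  merged = Op.new(
    type: :merge,
    old_text: ([update.old_text] + absorbed.map(&:old_text)).join(" "),
    new_text: update.new_text
  )
  # The component basic operations are removed in favour of the MERGE.
  (ops - absorbed - [update]) + [merged]
end

ops = [
  Op.new(type: :update, old_text: "First sentence.",
         new_text: "First sentence. Second sentence. Third sentence."),
  Op.new(type: :delete, old_text: "Second sentence."),
  Op.new(type: :delete, old_text: "Third sentence.")
]
result = detect_merge(ops)
puts result.map(&:type).inspect
```

SPLIT could be sketched symmetrically (one DELETE whose text is distributed across several INSERTs), again under the same assumed similarity measure.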
104
- === Operation types
105
-
106
- === INSERT
107
-
108
- A new node was added to the tree.
109
-
110
- [example]
111
- ====
112
- .Before
113
- [source,xml]
114
- ----
115
- <root>
116
- <a>1</a>
117
- </root>
118
- ----
119
-
120
- .After
121
- [source,xml]
122
- ----
123
- <root>
124
- <a>1</a>
125
- <b>2</b>
126
- </root>
127
- ----
128
-
129
- Operation: INSERT `<b>2</b>`
130
- ====
131
-
132
- === DELETE
133
-
134
- A node was removed from the tree.
135
-
136
- [example]
137
- ====
138
- .Before
139
- [source,xml]
140
- ----
141
- <root>
142
- <a>1</a>
143
- <b>2</b>
144
- </root>
145
- ----
146
-
147
- .After
148
- [source,xml]
149
- ----
150
- <root>
151
- <a>1</a>
152
- </root>
153
- ----
154
-
155
- Operation: DELETE `<b>2</b>`
156
- ====
157
-
158
- === UPDATE
159
-
160
- A node's content or attributes changed.
161
-
162
- [example]
163
- ====
164
- .Before
165
- [source,xml]
166
- ----
167
- <root>
168
- <a id="1">old text</a>
169
- </root>
170
- ----
171
-
172
- .After
173
- [source,xml]
174
- ----
175
- <root>
176
- <a id="2">new text</a>
177
- </root>
178
- ----
179
-
180
- Operation: UPDATE `<a>` (id and text changed)
181
- ====
182
-
183
- === MOVE
184
-
185
- A node was relocated to a different position in the tree.
186
-
187
- [example]
188
- ====
189
- .Before
190
- [source,xml]
191
- ----
192
- <root>
193
- <section1>
194
- <item>X</item>
195
- </section1>
196
- <section2>
197
- </section2>
198
- </root>
199
- ----
200
-
201
- .After
202
- [source,xml]
203
- ----
204
- <root>
205
- <section1>
206
- </section1>
207
- <section2>
208
- <item>X</item>
209
- </section2>
210
- </root>
211
- ----
212
-
213
- Operation: MOVE `<item>X</item>` from section1 to section2
214
- ====
215
-
216
- === MERGE
217
-
218
- Multiple nodes were combined into a single node.
219
-
220
- [example]
221
- ====
222
- .Before
223
- [source,xml]
224
- ----
225
- <root>
226
- <paragraph>First sentence.</paragraph>
227
- <paragraph>Second sentence.</paragraph>
228
- <paragraph>Third sentence.</paragraph>
229
- </root>
230
- ----
231
-
232
- .After
233
- [source,xml]
234
- ----
235
- <root>
236
- <paragraph>First sentence. Second sentence. Third sentence.</paragraph>
237
- </root>
238
- ----
239
-
240
- Operation: MERGE 3 `<paragraph>` nodes into one
241
- ====
242
-
243
- === SPLIT
244
-
245
- A single node was divided into multiple nodes.
246
-
247
- [example]
248
- ====
249
- .Before
250
- [source,xml]
251
- ----
252
- <root>
253
- <section>
254
- <title>Combined Content</title>
255
- <paragraph>First part. Second part.</paragraph>
256
- </section>
257
- </root>
258
- ----
259
-
260
- .After
261
- [source,xml]
262
- ----
263
- <root>
264
- <section>
265
- <title>First Content</title>
266
- <paragraph>First part.</paragraph>
267
- </section>
268
- <section>
269
- <title>Second Content</title>
270
- <paragraph>Second part.</paragraph>
271
- </section>
272
- </root>
273
- ----
274
-
275
- Operation: SPLIT 1 `<section>` into 2 sections
276
- ====
277
-
278
- === UPGRADE
279
-
280
- A node was promoted to a higher level in the hierarchy (decreased depth).
281
-
282
- [example]
283
- ====
284
- .Before
285
- [source,xml]
286
- ----
287
- <root>
288
- <chapter>
289
- <section>
290
- <subsection>
291
- <title>Important Topic</title>
292
- <content>This is important.</content>
293
- </subsection>
294
- </section>
295
- </chapter>
296
- </root>
297
- ----
298
-
299
- .After
300
- [source,xml]
301
- ----
302
- <root>
303
- <chapter>
304
- <section>
305
- <title>Important Topic</title>
306
- <content>This is important.</content>
307
- </section>
308
- </chapter>
309
- </root>
310
- ----
311
-
312
- Operation: UPGRADE `<subsection>` promoted to `<section>` level
313
- ====
314
-
315
- === DOWNGRADE
316
-
317
- A node was demoted to a lower level in the hierarchy (increased depth).
318
-
319
- [example]
320
- ====
321
- .Before
322
- [source,json]
323
- ----
324
- {
325
- "items": [
326
- { "id": 1, "name": "Item A", "type": "primary" }
327
- ]
328
- }
329
- ----
330
-
331
- .After
332
- [source,json]
333
- ----
334
- {
335
- "items": [
336
- {
337
- "id": 1,
338
- "details": {
339
- "name": "Item A",
340
- "type": "primary"
341
- }
342
- }
343
- ]
344
- }
345
- ----
346
-
347
- Operation: DOWNGRADE `name` and `type` demoted into nested `details` object
348
- ====
349
-
350
- == Matching algorithm
351
-
352
- The semantic tree diff uses a hybrid three-phase matching pipeline:
353
-
354
- === Phase 1: Hash-based exact matching
355
-
356
- Matches nodes with identical structure and content using hash signatures.
357
-
358
- * Fast O(n) exact matching
359
- * Eliminates unchanged subtrees
360
- * Reduces problem size for subsequent phases
361
-
362
- === Phase 2: Similarity-based matching
363
-
364
- Matches similar but not identical nodes using similarity scoring.
365
-
366
- * Compares node names, attributes, text content, and structure
367
- * Configurable similarity threshold (default: 0.95)
368
- * Uses weighted similarity metrics
369
-
370
- === Phase 3: Structural propagation
371
-
372
- Propagates matches from parents and children to improve matching quality.
373
-
374
- * Top-down propagation from matched parents
375
- * Bottom-up propagation from matched children
376
- * Resolves ambiguous matches
377
-
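The first two phases can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Canon's implementation: the `Node` struct, the `hash_match` and `similarity_match` helpers, and the word-set (Jaccard) similarity are all hypothetical, and Phase 3 propagation is omitted.

```ruby
require "digest"
require "set"

# Hypothetical node model for illustration only.
Node = Struct.new(:name, :text, :children, keyword_init: true) do
  # Hash signature over the label, text, and child signatures.
  def signature
    parts = [name, text] + Array(children).map(&:signature)
    Digest::SHA256.hexdigest(parts.join("|"))
  end
end

# Phase 1: pair nodes whose subtree signatures are identical.
def hash_match(nodes1, nodes2)
  by_signature = nodes2.group_by(&:signature)
  nodes1.filter_map do |n1|
    partner = by_signature[n1.signature]&.shift
    [n1, partner] if partner
  end
end

# Assumed similarity: Jaccard overlap of words, only for same-named nodes.
def similarity(a, b)
  return 0.0 unless a.name == b.name

  wa = a.text.to_s.split.to_set
  wb = b.text.to_s.split.to_set
  union = wa | wb
  union.empty? ? 1.0 : (wa & wb).size.to_f / union.size
end

# Phase 2: pair leftover nodes whose similarity meets the threshold.
def similarity_match(nodes1, nodes2, threshold: 0.95)
  remaining = nodes2.dup
  nodes1.filter_map do |n1|
    best = remaining.max_by { |n2| similarity(n1, n2) }
    next unless best && similarity(n1, best) >= threshold

    remaining.delete_at(remaining.index(best))
    [n1, best]
  end
end

old_nodes = [Node.new(name: "p", text: "Content A"),
             Node.new(name: "title", text: "Old Title draft")]
new_nodes = [Node.new(name: "title", text: "New Title draft"),
             Node.new(name: "p", text: "Content A")]

exact = hash_match(old_nodes, new_nodes)
left1 = old_nodes - exact.map(&:first)
left2 = new_nodes - exact.map(&:last)
fuzzy = similarity_match(left1, left2, threshold: 0.5)
puts "exact: #{exact.size}, fuzzy: #{fuzzy.size}"
```

Running the exact phase first shrinks the candidate set for the more expensive similarity phase, which is the point of ordering the pipeline this way.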
378
- == Configuration options
379
-
380
- === similarity_threshold
381
-
382
- Controls the minimum similarity score for matching nodes.
383
-
384
- [source,ruby]
385
- ----
386
- result = Canon::Comparison.equivalent?(doc1, doc2,
387
- verbose: true,
388
- diff_algorithm: :semantic,
389
- match: {
390
- similarity_threshold: 0.90 # Default: 0.95
391
- }
392
- )
393
- ----
394
-
395
- Guidance for choosing a value:
396
-
397
- * Higher values (e.g., 0.99): More conservative, only very similar nodes match
398
- * Lower values (e.g., 0.80): More aggressive, allows less similar nodes to match
399
- * Default 0.95: Balanced approach suitable for most use cases
400
-
401
- === hash_matching
402
-
403
- Enable or disable hash-based exact matching phase.
404
-
405
- [source,ruby]
406
- ----
407
- result = Canon::Comparison.equivalent?(doc1, doc2,
408
- verbose: true,
409
- diff_algorithm: :semantic,
410
- match: {
411
- hash_matching: true # Default: true
412
- }
413
- )
414
- ----
415
-
416
- Disable only if exact matching causes issues with your data.
417
-
418
- === similarity_matching
419
-
420
- Enable or disable similarity-based matching phase.
421
-
422
- [source,ruby]
423
- ----
424
- result = Canon::Comparison.equivalent?(doc1, doc2,
425
- verbose: true,
426
- diff_algorithm: :semantic,
427
- match: {
428
- similarity_matching: true # Default: true
429
- }
430
- )
431
- ----
432
-
433
- Disable for faster but less accurate matching.
434
-
435
- === propagation
436
-
437
- Enable or disable structural propagation phase.
438
-
439
- [source,ruby]
440
- ----
441
- result = Canon::Comparison.equivalent?(doc1, doc2,
442
- verbose: true,
443
- diff_algorithm: :semantic,
444
- match: {
445
- propagation: true # Default: true
446
- }
447
- )
448
- ----
449
-
450
- Disable for simpler but potentially less accurate results.
451
-
452
- == Result format
453
-
454
- When `verbose: true` is set, the result includes operation details:
455
-
456
- [source,ruby]
457
- ----
458
- result = Canon::Comparison.equivalent?(doc1, doc2,
459
- verbose: true,
460
- diff_algorithm: :semantic
461
- )
462
-
463
- # Access operations
464
- result.operations.each do |op|
465
- puts "#{op.type}: #{op.node1_path} -> #{op.node2_path}"
466
- end
467
-
468
- # Access matching statistics
469
- stats = result.match_options[:tree_diff_statistics]
470
- puts "Total nodes tree1: #{stats[:tree1_node_count]}"
471
- puts "Total nodes tree2: #{stats[:tree2_node_count]}"
472
- puts "Matched nodes: #{stats[:matched_count]}"
473
- puts "Match rate: #{stats[:match_rate]}"
474
-
475
- # Access matching details
476
- matching = result.match_options[:tree_diff_matching]
477
- matching.each_pair do |node1, node2|
478
- puts "Matched: #{node1.path} <-> #{node2.path}"
479
- end
480
- ----
481
-
482
- == Examples
483
-
484
- === Basic comparison with operations
485
-
486
- [example]
487
- ====
488
- [source,ruby]
489
- ----
490
- require 'canon/comparison'
491
-
492
- xml1 = <<~XML
493
- <article>
494
- <title>Old Title</title>
495
- <section id="1">
496
- <p>Content A</p>
497
- </section>
498
- </article>
499
- XML
500
-
501
- xml2 = <<~XML
502
- <article>
503
- <title>New Title</title>
504
- <section id="2">
505
- <p>Content B</p>
506
- </section>
507
- <section id="1">
508
- <p>Content A</p>
509
- </section>
510
- </article>
511
- XML
512
-
513
- result = Canon::Comparison.equivalent?(xml1, xml2,
514
- verbose: true,
515
- diff_algorithm: :semantic
516
- )
517
-
518
- result.operations.each do |op|
519
- case op.type
520
- when :insert
521
- puts "Added: #{op.node2.name} at #{op.node2_path}"
522
- when :update
523
- puts "Changed: #{op.node1_path}"
524
- when :move
525
- puts "Moved: #{op.node1_path} -> #{op.node2_path}"
526
- end
527
- end
528
- ----
529
-
530
- Output:
531
- ----
532
- Changed: /article/title
533
- Added: section at /article/section[1]
534
- ----
535
- ====
536
-
537
- === Detecting moves
538
-
539
- [example]
540
- ====
541
- [source,ruby]
542
- ----
543
- json1 = <<~JSON
544
- {
545
- "sections": [
546
- { "id": "intro", "content": "Introduction text" },
547
- { "id": "body", "content": "Body text" }
548
- ]
549
- }
550
- JSON
551
-
552
- json2 = <<~JSON
553
- {
554
- "sections": [
555
- { "id": "body", "content": "Body text" },
556
- { "id": "intro", "content": "Introduction text" }
557
- ]
558
- }
559
- JSON
560
-
561
- result = Canon::Comparison.equivalent?(json1, json2,
562
- verbose: true,
563
- diff_algorithm: :semantic
564
- )
565
-
566
- moves = result.operations.select { |op| op.type == :move }
567
- puts "Detected #{moves.size} move operations"
568
- ----
569
- ====
570
-
571
- === Adjusting similarity threshold
572
-
573
- [example]
574
- ====
575
- [source,ruby]
576
- ----
577
- # Strict matching - only very similar nodes match
578
- result_strict = Canon::Comparison.equivalent?(doc1, doc2,
579
- verbose: true,
580
- diff_algorithm: :semantic,
581
- match: {
582
- similarity_threshold: 0.99
583
- }
584
- )
585
-
586
- # Lenient matching - allow more variation
587
- result_lenient = Canon::Comparison.equivalent?(doc1, doc2,
588
- verbose: true,
589
- diff_algorithm: :semantic,
590
- match: {
591
- similarity_threshold: 0.85
592
- }
593
- )
594
-
595
- puts "Strict: #{result_strict.operations.size} operations"
596
- puts "Lenient: #{result_lenient.operations.size} operations"
597
- ----
598
- ====
599
-
600
- == Format support
601
-
602
- Semantic tree diff works with all Canon-supported formats:
603
-
604
- === XML
605
-
606
- Full support including namespaces, attributes, and mixed content.
607
-
608
- [source,ruby]
609
- ----
610
- result = Canon::Comparison.equivalent?(xml1, xml2,
611
- verbose: true,
612
- diff_algorithm: :semantic
613
- )
614
- ----
615
-
616
- === JSON
617
-
618
- Supports objects, arrays, and primitive values.
619
-
620
- [source,ruby]
621
- ----
622
- result = Canon::Comparison.equivalent?(json1, json2,
623
- verbose: true,
624
- diff_algorithm: :semantic
625
- )
626
- ----
627
-
628
- === HTML
629
-
630
- Handles HTML 4/5 and XHTML documents.
631
-
632
- [source,ruby]
633
- ----
634
- result = Canon::Comparison.equivalent?(html1, html2,
635
- verbose: true,
636
- diff_algorithm: :semantic
637
- )
638
- ----
639
-
640
- === YAML
641
-
642
- Processes YAML documents with nested structures.
643
-
644
- [source,ruby]
645
- ----
646
- result = Canon::Comparison.equivalent?(yaml1, yaml2,
647
- verbose: true,
648
- diff_algorithm: :semantic
649
- )
650
- ----
651
-
652
- == Interaction with other options
653
-
654
- === Preprocessing
655
-
656
- Preprocessing applies before tree diff:
657
-
658
- [source,ruby]
659
- ----
660
- result = Canon::Comparison.equivalent?(doc1, doc2,
661
- verbose: true,
662
- preprocessing: :c14n, # Applied first
663
- diff_algorithm: :semantic
664
- )
665
- ----
666
-
667
- The preprocessing option normalizes documents before tree diff runs, ensuring
668
- consistent comparison.
669
-
670
- === Match dimensions
671
-
672
- Match dimensions DO apply with semantic diff and control whether detected
673
- operations are classified as **normative** (must-fix) or **informative**
674
- (can-ignore) differences.
675
-
676
- When a match dimension is set to `:ignore`, operations related to that
677
- dimension are marked as informative. When set to `:strict` or `:normalize`,
678
- they are marked as normative.
679
-
680
- ==== Dimension mapping to operations
681
-
682
- [cols="1,1,2"]
683
- |===
684
- |Match Dimension |Operation Types |Effect when `:ignore`
685
-
686
- |`text_content`
687
- |UPDATE (text changes)
688
- |Text content changes → informative
689
-
690
- |`attribute_values`
691
- |UPDATE (attribute changes)
692
- |Attribute value changes → informative
693
-
694
- |`attribute_order`
695
- |UPDATE (attribute reordering)
696
- |Attribute order changes → informative
697
-
698
- |`element_hierarchy`
699
- |UPGRADE, DOWNGRADE
700
- |Hierarchy depth changes → informative
701
-
702
- |`element_position`
703
- |MOVE
704
- |Element position changes → informative
705
-
706
- |`element_hierarchy`
707
- |MOVE (between parents)
708
- |Hierarchy changes → informative
709
-
710
- |`comments`
711
- |UPDATE, INSERT, DELETE (comments)
712
- |Comment changes → informative
713
-
714
- |`structural_whitespace`
715
- |UPDATE (whitespace)
716
- |Whitespace changes → informative
717
- |===
718
-
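The mapping in the table above can be pictured as a simple lookup. The sketch below is illustrative only: the operation-kind symbols, the `DIMENSION_FOR` table, and the `classify` helper are hypothetical names, and treating an unset dimension as normative is an assumption rather than documented behaviour.

```ruby
# Hypothetical mapping from operation kinds to match dimensions,
# following the table above; not Canon's internal representation.
DIMENSION_FOR = {
  text_update: :text_content,
  attribute_update: :attribute_values,
  attribute_reorder: :attribute_order,
  upgrade: :element_hierarchy,
  downgrade: :element_hierarchy,
  move: :element_position,
  comment_change: :comments,
  whitespace_update: :structural_whitespace
}.freeze

# :ignore marks the operation informative; :strict and :normalize
# (and, by assumption, an unset dimension) leave it normative.
def classify(operation_kind, match_options)
  dimension = DIMENSION_FOR.fetch(operation_kind)
  match_options[dimension] == :ignore ? :informative : :normative
end

options = { text_content: :ignore, element_position: :strict }
puts classify(:text_update, options)
puts classify(:move, options)
```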
719
- ==== Examples
720
-
721
- .Ignoring text content changes
722
- [example]
723
- ====
724
- [source,ruby]
725
- ----
726
- # Text UPDATE operations are marked as informative
727
- result = Canon::Comparison.equivalent?(doc1, doc2,
728
- diff_algorithm: :semantic,
729
- verbose: true,
730
- match: {
731
- text_content: :ignore # Text changes → informative
732
- }
733
- )
734
-
735
- # Check if there are any normative (structural) differences
736
- has_normative = result.differences.any?(&:normative?)
737
- ----
738
- ====
739
-
740
- .Ignoring element position changes
741
- [example]
742
- ====
743
- [source,ruby]
744
- ----
745
- # MOVE operations are marked as informative
746
- result = Canon::Comparison.equivalent?(doc1, doc2,
747
- diff_algorithm: :semantic,
748
- verbose: true,
749
- match: {
750
- element_position: :ignore # Moves → informative
751
- }
752
- )
753
-
754
- # Only structural changes are normative
755
- normative_ops = result.differences.select(&:normative?)
756
- puts "Normative changes: #{normative_ops.map(&:dimension).uniq}"
757
- ----
758
- ====
759
-
760
- .Combining multiple dimension settings
761
- [example]
762
- ====
763
- [source,ruby]
764
- ----
765
- # Complex filtering: care about structure but not formatting
766
- result = Canon::Comparison.equivalent?(doc1, doc2,
767
- diff_algorithm: :semantic,
768
- verbose: true,
769
- match: {
770
- # Normative dimensions (must match)
771
- element_structure: :strict,
772
- element_hierarchy: :strict,
773
-
774
- # Informative dimensions (can differ)
775
- text_content: :ignore,
776
- attribute_order: :ignore,
777
- structural_whitespace: :ignore,
778
- comments: :ignore
779
- }
780
- )
781
-
782
- # Report only normative differences
783
- if result.differences.any?(&:normative?)
784
- puts "Structural differences found!"
785
- result.differences.select(&:normative?).each do |diff|
786
- puts " #{diff.dimension}: #{diff.reason}"
787
- end
788
- end
789
- ----
790
- ====
791
-
792
- ==== Semantic-specific dimensions
793
-
794
- The following dimensions are only meaningful with semantic diff:
795
-
796
- * `element_hierarchy`: Controls UPGRADE/DOWNGRADE operations (depth changes)
797
- and MOVE operations between parents (parent changes)
798
- * `element_position`: Controls MOVE operations (position changes)
799
-
800
- With DOM diff, these dimensions have no effect since DOM diff doesn't detect
801
- these operation types.
802
-
803
- == Performance considerations
804
-
805
- === Time complexity
806
-
807
- * Hash matching: O(n) where n is node count
808
- * Similarity matching: O(n²) worst case, O(n log n) typical
809
- * Propagation: O(n)
810
-
811
- For large documents (>10,000 nodes), consider:
812
-
813
- * Disabling similarity_matching if exact matches suffice
814
- * Increasing similarity_threshold to reduce candidate matches
815
- * Using preprocessing to reduce document size
816
-
817
- === Memory usage
818
-
819
- The tree diff maintains:
820
-
821
- * Full tree representations of both documents
822
- * Hash signatures for all nodes
823
- * Matching state and operations
824
-
825
- For very large documents, monitor memory usage and consider processing in
826
- chunks if needed.
827
-
828
- == Troubleshooting
829
-
830
- === Too many/too few matches
831
-
832
- Adjust `similarity_threshold`:
833
-
834
- [source,ruby]
835
- ----
836
- # Too many false matches? Increase threshold
837
- Canon::Comparison.equivalent?(doc1, doc2,
838
- diff_algorithm: :semantic,
839
- match: { similarity_threshold: 0.98 }
840
- )
841
-
842
- # Too few matches? Decrease threshold
843
- Canon::Comparison.equivalent?(doc1, doc2,
844
- diff_algorithm: :semantic,
845
- match: { similarity_threshold: 0.90 }
846
- )
847
- ----
848
-
849
- === Incorrect move detection
850
-
851
- Moves are detected when nodes match but positions change. If move detection is
852
- incorrect:
853
-
854
- * Verify nodes are truly similar (check attributes and content)
855
- * Adjust similarity_threshold
856
- * Check if preprocessing is needed to normalize content
857
-
858
- === Performance issues
859
-
860
- For slow comparisons:
861
-
862
- [source,ruby]
863
- ----
864
- # Disable expensive phases
865
- Canon::Comparison.equivalent?(doc1, doc2,
866
- diff_algorithm: :semantic,
867
- match: {
868
- similarity_matching: false, # Skip if exact matches suffice
869
- propagation: false # Skip if not needed
870
- }
871
- )
872
- ----
873
-
874
- == Metadata Element Classification
875
-
876
- Metadata elements are presentation/formatting elements that don't affect semantic equivalence. The semantic tree diff automatically classifies operations on these elements as **informative** (non-normative).
877
-
878
- === Metadata Elements
879
-
880
- The following elements are treated as metadata:
881
-
882
- * `semx` - Semantic markup
883
- * `fmt-concept` - Formatted concept
884
- * `fmt-name` - Formatted name
885
- * `fmt-title` - Formatted title
886
- * `fmt-xref` - Formatted cross-reference
887
- * `fmt-eref` - Formatted external reference
888
- * `fmt-termref` - Formatted term reference
889
- * `fmt-element-name` - Formatted element name
890
- * `fmt-link` - Formatted link
891
- * `autonum` - Automatic numbering
892
-
893
- === Classification Rules
894
-
895
- When an operation involves a metadata element:
896
-
897
- * **INSERT of metadata element**: Marked as informative
898
- * **DELETE of metadata element**: Marked as informative
899
- * **UPDATE of metadata element**: Marked as informative
900
- * **MOVE of metadata element**: Marked as informative
901
-
902
- This ensures that presentation-layer changes don't cause false positives in semantic comparison.
903
-
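A minimal sketch of these classification rules, assuming operations are represented as plain hashes with an `:element` key (a hypothetical shape chosen for illustration, as are the helper names; the constant here is a local copy of the list above, not a reference to Canon's own constant):

```ruby
require "set"

# Element names listed above as metadata.
METADATA_ELEMENTS = Set.new(%w[
  semx fmt-concept fmt-name fmt-title fmt-xref fmt-eref
  fmt-termref fmt-element-name fmt-link autonum
]).freeze

# Any INSERT, DELETE, UPDATE, or MOVE on a metadata element is informative.
def informative?(operation)
  METADATA_ELEMENTS.include?(operation[:element])
end

# Documents count as equivalent when no normative operation remains.
def semantically_equivalent?(operations)
  operations.all? { |op| informative?(op) }
end

ops = [
  { type: :update, element: "autonum" },
  { type: :insert, element: "semx" }
]
puts semantically_equivalent?(ops)
```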
904
- .Example: Metadata element changes are informative
905
- [example]
906
- ====
907
- [source,xml]
908
- ----
909
- <!-- Document 1 -->
910
- <section>
911
- <title>Introduction</title>
912
- <autonum>1.1</autonum>
913
- </section>
914
-
915
- <!-- Document 2 -->
916
- <section>
917
- <title>Introduction</title>
918
- <autonum>1.2</autonum> <!-- Changed -->
919
- <semx element="title">Introduction</semx> <!-- Added -->
920
- </section>
921
- ----
922
-
923
- Operations detected:
924
-
925
- * UPDATE `<autonum>` (1.1 → 1.2): **Informative** (metadata element)
926
- * INSERT `<semx>`: **Informative** (metadata element)
927
-
928
- The documents are considered semantically equivalent despite these presentation changes.
929
- ====
930
-
931
- === Configuring Metadata Elements
932
-
933
- The metadata element list is defined in link:../lib/canon/tree_diff/operation_converter.rb[`OperationConverter::METADATA_ELEMENTS`] (line 33).
934
-
935
- To treat additional elements as metadata, modify the `METADATA_ELEMENTS` constant. Alternatively, use match dimensions to ignore specific changes.
936
-
937
- == Whitespace-Sensitive Elements
938
-
939
- In certain elements, whitespace is semantically significant. For these elements, whitespace is **not normalized** even when `text_content: :normalize` is set.
940
-
941
- === Whitespace-Sensitive Element List
942
-
943
- * `pre` - Preformatted text
944
- * `code` - Code blocks
945
- * `textarea` - Text input areas
946
- * `script` - JavaScript code
947
- * `style` - CSS styles
948
-
949
- === Whitespace Handling
950
-
951
- For whitespace-sensitive elements:
952
-
953
- * Text content is preserved exactly as-is during tree conversion
954
- * Normalization does NOT apply to text content
955
- * Whitespace differences are **normative** (must match exactly)
956
- * Leading/trailing spaces are significant
957
-
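The handling described above might look roughly like this. It is a sketch under assumptions: `comparable_text` is a hypothetical helper, and "normalization" is taken to mean collapsing whitespace runs and trimming the ends, which may differ from what Canon does.

```ruby
# Elements listed above as whitespace-sensitive.
WHITESPACE_SENSITIVE = %w[pre code textarea script style].freeze

# Return the text that would be compared for a given element.
# Sensitive elements keep their text exactly as-is; other elements are
# normalized only when text_content is :normalize.
def comparable_text(element_name, text, text_content: :strict)
  return text if WHITESPACE_SENSITIVE.include?(element_name)
  return text unless text_content == :normalize

  text.split.join(" ") # assumed normalization: collapse and trim whitespace
end

indented = "\n  Line 1\n  Line 2\n"
flush    = "\nLine 1\nLine 2\n"

pre_equal = comparable_text("pre", indented, text_content: :normalize) ==
            comparable_text("pre", flush, text_content: :normalize)
p_equal   = comparable_text("p", indented, text_content: :normalize) ==
            comparable_text("p", flush, text_content: :normalize)
puts "pre equal: #{pre_equal}, p equal: #{p_equal}"
```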
958
- .Example: Whitespace in `<pre>` is significant
959
- [example]
960
- ====
961
- [source,html]
962
- ----
963
- <!-- Document 1 -->
964
- <pre>
965
-   Line 1
966
-   Line 2
967
- </pre>
968
-
969
- <!-- Document 2 -->
970
- <pre>
971
- Line 1
972
- Line 2
973
- </pre>
974
- ----
975
-
976
- Operation detected:
977
-
978
- * UPDATE `<pre>` text content: **Normative** (whitespace differs)
979
-
980
- These documents are NOT equivalent because `<pre>` preserves whitespace. The leading spaces in Document 1 are semantically significant.
981
- ====
982
-
983
- .Example: Whitespace in `<p>` is normalized
984
- [example]
985
- ====
986
- [source,html]
987
- ----
988
- <!-- Document 1 -->
989
- <p>
990
- Line 1
991
- Line 2
992
- </p>
993
-
994
- <!-- Document 2 -->
995
- <p>
996
- Line 1
997
- Line 2
998
- </p>
999
- ----
1000
-
1001
- With `text_content: :normalize`:
1002
-
1003
- * No UPDATE detected (whitespace normalized in `<p>`)
1004
-
1005
- These documents ARE equivalent because `<p>` normalizes whitespace when `text_content: :normalize` is set.
1006
- ====
1007
-
1008
- === Implementation
1009
-
1010
- Whitespace-sensitive elements are detected automatically based on element name in the format adapters:
1011
-
1012
- * link:../lib/canon/tree_diff/adapters/xml_adapter.rb[`XMLAdapter`] - Preserves original text
1013
- * link:../lib/canon/tree_diff/adapters/html_adapter.rb[`HTMLAdapter`] - Preserves original text
1014
-
1015
- The normalization decision happens in link:../lib/canon/tree_diff/operations/operation_detector.rb[`OperationDetector`] based on element name and match options.
1016
-
1017
- == Comparison with DOM diff
1018
-
1019
- [cols="1,2,2"]
1020
- |===
1021
- |Aspect |DOM diff (default) |Semantic tree diff
1022
-
1023
- |Matching
1024
- |Position-based with DOM structure
1025
- |Similarity-based with tree operations
1026
-
1027
- |Operations
1028
- |Line-by-line changes
1029
- |INSERT, DELETE, UPDATE, MOVE, MERGE, SPLIT, UPGRADE, DOWNGRADE
1030
-
1031
- |Use case
1032
- |Traditional diff output
1033
- |Operation-level analysis
1034
-
1035
- |Performance
1036
- |Faster for large docs
1037
- |Slower but more intelligent
1038
-
1039
- |Move detection
1040
- |No
1041
- |Yes
1042
-
1043
- |Rearrangement
1044
- |Shows as delete + insert
1045
- |Shows as MOVE
1046
-
1047
- |Format support
1048
- |XML, HTML, JSON, YAML
1049
- |XML, HTML, JSON, YAML
1050
-
1051
- |Output mode
1052
- |Line-based or tree-based
1053
- |Operation list
1054
-
1055
- |Verbose mode
1056
- |Required for detailed output
1057
- |Required for operations
1058
- |===
1059
-
1060
- Both modes are first-class citizens in Canon. Choose based on your needs:
1061
-
1062
- * Use **DOM diff** for traditional comparison and readable output
1063
- * Use **tree diff** for operation analysis and move detection
1064
-
1065
- == Research background
1066
-
1067
- The semantic tree diff implementation is based on:
1068
-
1069
- * **XDiff (2002)**: Minimum-cost edit distance with unordered tree model
1070
- * **XyDiff/Cobena (2002)**: BULD algorithm with hash signatures and weights
1071
- * **JATS-diff (2022)**: Semantic operations for text-centric XML
1072
-
1073
- Key innovations in Canon's implementation:
1074
-
1075
- * Hybrid matching pipeline combining hash, similarity, and propagation
1076
- * Format adapters for XML, JSON, HTML, YAML
1077
- * Integration with Canon's existing comparison architecture
1078
- * Configurable similarity thresholds and matching phases
1079
-
1080
- See `xmldiff-resources.md` in the repository for research paper details.