canon 0.1.6 → 0.1.7
- checksums.yaml +4 -4
- data/.rubocop_todo.yml +163 -67
- data/README.adoc +400 -7
- data/docs/Gemfile +9 -0
- data/docs/INDEX.adoc +99 -182
- data/docs/_config.yml +100 -0
- data/docs/advanced/diff-classification.adoc +547 -0
- data/docs/advanced/diff-pipeline.adoc +358 -0
- data/docs/advanced/index.adoc +214 -0
- data/docs/advanced/semantic-diff-report.adoc +390 -0
- data/docs/{VERBOSE.adoc → advanced/verbose-mode-architecture.adoc} +51 -53
- data/docs/features/diff-formatting/algorithm-specific-output.adoc +533 -0
- data/docs/{CHARACTER_VISUALIZATION.adoc → features/diff-formatting/character-visualization.adoc} +23 -62
- data/docs/features/diff-formatting/colors-and-symbols.adoc +606 -0
- data/docs/features/diff-formatting/context-and-grouping.adoc +490 -0
- data/docs/features/diff-formatting/display-filtering.adoc +472 -0
- data/docs/features/diff-formatting/index.adoc +140 -0
- data/docs/features/environment-configuration/index.adoc +327 -0
- data/docs/features/environment-configuration/override-system.adoc +436 -0
- data/docs/features/environment-configuration/size-limits.adoc +273 -0
- data/docs/features/index.adoc +173 -0
- data/docs/features/input-validation/index.adoc +521 -0
- data/docs/features/match-options/algorithm-specific-behavior.adoc +365 -0
- data/docs/features/match-options/html-policies.adoc +312 -0
- data/docs/features/match-options/index.adoc +621 -0
- data/docs/getting-started/index.adoc +83 -0
- data/docs/getting-started/quick-start.adoc +76 -0
- data/docs/guides/choosing-configuration.adoc +689 -0
- data/docs/guides/index.adoc +181 -0
- data/docs/{CLI.adoc → interfaces/cli/index.adoc} +18 -13
- data/docs/interfaces/index.adoc +101 -0
- data/docs/{RSPEC.adoc → interfaces/rspec/index.adoc} +242 -31
- data/docs/{RUBY_API.adoc → interfaces/ruby-api/index.adoc} +118 -16
- data/docs/lychee.toml +65 -0
- data/docs/reference/cli-options.adoc +418 -0
- data/docs/reference/environment-variables.adoc +375 -0
- data/docs/reference/index.adoc +204 -0
- data/docs/reference/options-across-interfaces.adoc +417 -0
- data/docs/understanding/algorithms/dom-diff.adoc +389 -0
- data/docs/understanding/algorithms/index.adoc +314 -0
- data/docs/understanding/algorithms/semantic-tree-diff.adoc +533 -0
- data/docs/understanding/architecture.adoc +447 -0
- data/docs/understanding/comparison-pipeline.adoc +317 -0
- data/docs/understanding/formats/html.adoc +380 -0
- data/docs/understanding/formats/index.adoc +261 -0
- data/docs/understanding/formats/json.adoc +390 -0
- data/docs/understanding/formats/xml.adoc +366 -0
- data/docs/understanding/formats/yaml.adoc +504 -0
- data/docs/understanding/index.adoc +130 -0
- data/lib/canon/cli.rb +42 -1
- data/lib/canon/commands/diff_command.rb +108 -23
- data/lib/canon/comparison/compare_profile.rb +101 -0
- data/lib/canon/comparison/comparison_result.rb +41 -2
- data/lib/canon/comparison/html_comparator.rb +292 -71
- data/lib/canon/comparison/html_compare_profile.rb +117 -0
- data/lib/canon/comparison/match_options.rb +42 -4
- data/lib/canon/comparison/strategies/base_match_strategy.rb +99 -0
- data/lib/canon/comparison/strategies/match_strategy_factory.rb +74 -0
- data/lib/canon/comparison/strategies/semantic_tree_match_strategy.rb +220 -0
- data/lib/canon/comparison/xml_comparator.rb +695 -91
- data/lib/canon/comparison.rb +207 -2
- data/lib/canon/config/env_provider.rb +71 -0
- data/lib/canon/config/env_schema.rb +58 -0
- data/lib/canon/config/override_resolver.rb +55 -0
- data/lib/canon/config/type_converter.rb +59 -0
- data/lib/canon/config.rb +158 -29
- data/lib/canon/data_model.rb +29 -0
- data/lib/canon/diff/diff_classifier.rb +74 -14
- data/lib/canon/diff/diff_context_builder.rb +41 -0
- data/lib/canon/diff/diff_line.rb +18 -2
- data/lib/canon/diff/diff_node.rb +18 -3
- data/lib/canon/diff/diff_node_mapper.rb +71 -12
- data/lib/canon/diff/formatting_detector.rb +53 -0
- data/lib/canon/diff_formatter/by_line/base_formatter.rb +60 -5
- data/lib/canon/diff_formatter/by_line/html_formatter.rb +68 -16
- data/lib/canon/diff_formatter/by_line/json_formatter.rb +0 -37
- data/lib/canon/diff_formatter/by_line/simple_formatter.rb +0 -42
- data/lib/canon/diff_formatter/by_line/xml_formatter.rb +116 -31
- data/lib/canon/diff_formatter/by_line/yaml_formatter.rb +0 -37
- data/lib/canon/diff_formatter/by_object/base_formatter.rb +126 -19
- data/lib/canon/diff_formatter/by_object/xml_formatter.rb +30 -1
- data/lib/canon/diff_formatter/debug_output.rb +7 -1
- data/lib/canon/diff_formatter/diff_detail_formatter.rb +674 -57
- data/lib/canon/diff_formatter/legend.rb +42 -0
- data/lib/canon/diff_formatter.rb +78 -9
- data/lib/canon/errors.rb +56 -0
- data/lib/canon/formatters/html_formatter_base.rb +35 -1
- data/lib/canon/formatters/json_formatter.rb +3 -0
- data/lib/canon/formatters/yaml_formatter.rb +3 -0
- data/lib/canon/html/data_model.rb +229 -0
- data/lib/canon/html.rb +9 -0
- data/lib/canon/options/cli_generator.rb +70 -0
- data/lib/canon/options/registry.rb +234 -0
- data/lib/canon/rspec_matchers.rb +34 -13
- data/lib/canon/tree_diff/adapters/html_adapter.rb +316 -0
- data/lib/canon/tree_diff/adapters/json_adapter.rb +204 -0
- data/lib/canon/tree_diff/adapters/xml_adapter.rb +285 -0
- data/lib/canon/tree_diff/adapters/yaml_adapter.rb +213 -0
- data/lib/canon/tree_diff/core/attribute_comparator.rb +84 -0
- data/lib/canon/tree_diff/core/matching.rb +241 -0
- data/lib/canon/tree_diff/core/node_signature.rb +164 -0
- data/lib/canon/tree_diff/core/node_weight.rb +135 -0
- data/lib/canon/tree_diff/core/tree_node.rb +450 -0
- data/lib/canon/tree_diff/matchers/hash_matcher.rb +258 -0
- data/lib/canon/tree_diff/matchers/similarity_matcher.rb +168 -0
- data/lib/canon/tree_diff/matchers/structural_propagator.rb +242 -0
- data/lib/canon/tree_diff/matchers/universal_matcher.rb +220 -0
- data/lib/canon/tree_diff/operation_converter.rb +631 -0
- data/lib/canon/tree_diff/operations/operation.rb +92 -0
- data/lib/canon/tree_diff/operations/operation_detector.rb +626 -0
- data/lib/canon/tree_diff/tree_diff_integrator.rb +140 -0
- data/lib/canon/tree_diff.rb +33 -0
- data/lib/canon/validators/json_validator.rb +3 -1
- data/lib/canon/validators/yaml_validator.rb +3 -1
- data/lib/canon/version.rb +1 -1
- data/lib/canon/xml/data_model.rb +22 -23
- data/lib/canon/xml/element_matcher.rb +128 -20
- data/lib/canon/xml/namespace_helper.rb +110 -0
- data/lib/canon.rb +3 -0
- metadata +81 -23
- data/_config.yml +0 -116
- data/docs/ADVANCED_TOPICS.adoc +0 -20
- data/docs/BASIC_USAGE.adoc +0 -16
- data/docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
- data/docs/DIFF_ARCHITECTURE.adoc +0 -435
- data/docs/DIFF_FORMATTING.adoc +0 -540
- data/docs/FORMATS.adoc +0 -447
- data/docs/INPUT_VALIDATION.adoc +0 -477
- data/docs/MATCH_ARCHITECTURE.adoc +0 -463
- data/docs/MATCH_OPTIONS.adoc +0 -719
- data/docs/MODES.adoc +0 -432
- data/docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
- data/docs/OPTIONS.adoc +0 -1387
- data/docs/PREPROCESSING.adoc +0 -491
- data/docs/SEMANTIC_DIFF_REPORT.adoc +0 -528
- data/docs/UNDERSTANDING_CANON.adoc +0 -17
|
@@ -0,0 +1,533 @@
|
|
|
1
|
+
---
|
|
2
|
+
title: Semantic Algorithm
|
|
3
|
+
parent: Algorithms
|
|
4
|
+
grand_parent: Understanding
|
|
5
|
+
nav_order: 2
|
|
6
|
+
---
|
|
7
|
+
= Semantic algorithm
|
|
8
|
+
:toc:
|
|
9
|
+
:toclevels: 3
|
|
10
|
+
|
|
11
|
+
WARNING: The semantic tree diff algorithm is currently **experimental** and under active development. While functional and tested, the API and behavior may change in future releases. Use with caution in production environments.
|
|
12
|
+
|
|
13
|
+
== Purpose
|
|
14
|
+
|
|
15
|
+
The Semantic algorithm is Canon's **intelligent, experimental algorithm** for document comparison. It provides signature-based matching with operation detection (INSERT, DELETE, UPDATE, MOVE).
|
|
16
|
+
|
|
17
|
+
This page explains when to use the Semantic algorithm, how it differs from DOM, and how to configure it effectively.
|
|
18
|
+
|
|
19
|
+
== When to Use
|
|
20
|
+
|
|
21
|
+
The Semantic algorithm is for **advanced use cases** where move detection and operation-level analysis are worth the performance cost.
|
|
22
|
+
|
|
23
|
+
=== Use Semantic Algorithm When
|
|
24
|
+
|
|
25
|
+
* ✓ You need to **detect element moves and reordering**
|
|
26
|
+
* ✓ Documents have **significant restructuring**
|
|
27
|
+
* ✓ You need **operation-level analysis** (INSERT, DELETE, UPDATE, MOVE)
|
|
28
|
+
* ✓ You want **statistical analysis** of changes
|
|
29
|
+
* ✓ You're **analyzing document evolution**
|
|
30
|
+
* ✓ You're willing to **accept experimental status**
|
|
31
|
+
* ✓ You're working with **smaller documents** (< 10 KB)
|
|
32
|
+
|
|
33
|
+
=== Characteristics
|
|
34
|
+
|
|
35
|
+
[cols="2,3"]
|
|
36
|
+
|===
|
|
37
|
+
|Feature |Semantic Algorithm
|
|
38
|
+
|
|
39
|
+
|**Status**
|
|
40
|
+
|Experimental, under development
|
|
41
|
+
|
|
42
|
+
|**Performance**
|
|
43
|
+
|Slower - O(n²) worst case
|
|
44
|
+
|
|
45
|
+
|**Memory Usage**
|
|
46
|
+
|Higher - builds tree structures
|
|
47
|
+
|
|
48
|
+
|**Matching Strategy**
|
|
49
|
+
|Signature-based similarity matching
|
|
50
|
+
|
|
51
|
+
|**Move Detection**
|
|
52
|
+
|Yes - detects MOVE operations
|
|
53
|
+
|
|
54
|
+
|**Output Format**
|
|
55
|
+
|Operation-based (INSERT, DELETE, UPDATE, MOVE)
|
|
56
|
+
|
|
57
|
+
|**Best For**
|
|
58
|
+
|Restructured documents, operation analysis
|
|
59
|
+
|
|
60
|
+
|**Document Size**
|
|
61
|
+
|Best for smaller documents (< 10KB)
|
|
62
|
+
|===
|
|
63
|
+
|
|
64
|
+
== How It Works
|
|
65
|
+
|
|
66
|
+
The Semantic algorithm uses a sophisticated three-phase matching process:
|
|
67
|
+
|
|
68
|
+
=== Phase 1: Hash-Based Exact Matching
|
|
69
|
+
|
|
70
|
+
Matches nodes with identical structure and content:
|
|
71
|
+
|
|
72
|
+
* **Fast** - O(n) performance
|
|
73
|
+
* **Eliminates** unchanged subtrees
|
|
74
|
+
* **Reduces** problem size for later phases
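The exact-matching phase above can be illustrated with a small Ruby sketch. It is not Canon's implementation: the `Node` struct, `subtree_hash`, and `exact_matches` are hypothetical names introduced here, and SHA-256 is simply one convenient digest. The idea is that two subtrees with the same digest are treated as unchanged and paired up through a hash-table lookup.

```ruby
require "digest"

# Hypothetical minimal tree node; Canon's own node classes are not shown on this page.
Node = Struct.new(:name, :attrs, :text, :children) do
  # Digest covering the node and its whole subtree.
  # Attributes are sorted so attribute order does not change the digest.
  # (A real implementation would memoize this value per node.)
  def subtree_hash
    parts = [name, attrs.sort.inspect, text.to_s, children.map(&:subtree_hash).join(",")]
    Digest::SHA256.hexdigest(parts.join("|"))
  end

  # The node followed by all of its descendants, in document order.
  def each_node
    [self] + children.flat_map(&:each_node)
  end
end

# Pair nodes whose subtree digests are identical, using one lookup table
# built from the second tree.
def exact_matches(tree1, tree2)
  index = Hash.new { |h, k| h[k] = [] }
  tree2.each_node.each { |n| index[n.subtree_hash] << n }
  tree1.each_node.filter_map do |n|
    candidate = index[n.subtree_hash].shift
    [n, candidate] if candidate
  end
end
```

Nodes left unpaired after this pass are the only ones the later, more expensive phases need to examine.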
|
|
75
|
+
|
|
76
|
+
=== Phase 2: Similarity-Based Matching
|
|
77
|
+
|
|
78
|
+
Matches similar but not identical nodes:
|
|
79
|
+
|
|
80
|
+
* **Compares** node names, attributes, text, structure
|
|
81
|
+
* **Scores** similarity using weighted metrics
|
|
82
|
+
* **Threshold** - Default 0.95 (95% similar)
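As a rough illustration of weighted similarity scoring, the sketch below combines four component scores into one value and compares it with the threshold. The weights, the Jaccard overlap measure, and the hash-based node shape are all assumptions made for this example; the page does not document the metrics Canon actually uses.

```ruby
# Hypothetical weights; they sum to 1.0 so the score stays in the 0..1 range.
SIMILARITY_WEIGHTS = { name: 0.3, attributes: 0.2, text: 0.4, structure: 0.1 }.freeze

# Jaccard overlap of two arrays: |A & B| / |A | B| (1.0 when both are empty).
def jaccard(a, b)
  union = a | b
  return 1.0 if union.empty?

  (a & b).size.to_f / union.size
end

# Nodes are plain hashes here: { name:, attrs: {}, text: "", children: [child names] }.
def similarity(n1, n2)
  scores = {
    name: n1[:name] == n2[:name] ? 1.0 : 0.0,
    attributes: jaccard(n1[:attrs].to_a, n2[:attrs].to_a),
    text: jaccard(n1[:text].split, n2[:text].split),
    structure: jaccard(n1[:children], n2[:children])
  }
  SIMILARITY_WEIGHTS.sum { |key, weight| weight * scores[key] }
end

# Two nodes match when their score reaches the threshold (0.95 by default).
def similar?(n1, n2, threshold: 0.95)
  similarity(n1, n2) >= threshold
end
```

With these assumed weights, two `<p>` nodes that differ in one word out of four score 0.84: they do not match at the default threshold but would match at 0.80.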
|
|
83
|
+
|
|
84
|
+
=== Phase 3: Structural Propagation
|
|
85
|
+
|
|
86
|
+
Improves match quality using context:
|
|
87
|
+
|
|
88
|
+
* **Top-down** - Propagate from matched parents
|
|
89
|
+
* **Bottom-up** - Propagate from matched children
|
|
90
|
+
* **Resolves** ambiguous matches
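A minimal sketch of the top-down direction, under the assumption that propagation pairs children of already-matched parents when the pairing is unambiguous. `PNode` and `propagate_top_down` are hypothetical names; bottom-up propagation and Canon's actual tie-breaking rules are not covered.

```ruby
# Hypothetical node type. A plain class is used so that hash lookups rely on
# object identity rather than on structural equality.
class PNode
  attr_reader :name, :children

  def initialize(name, children = [])
    @name = name
    @children = children
  end
end

# For every matched pair of parents, pair up still-unmatched children when
# exactly one child on each side carries a given name. Ambiguous groups are
# left alone. `matches` maps tree-1 nodes to tree-2 nodes and is extended in place.
def propagate_top_down(matches)
  queue = matches.to_a
  until queue.empty?
    parent1, parent2 = queue.shift
    free1 = parent1.children.reject { |c| matches.key?(c) }
    free2 = parent2.children.reject { |c| matches.value?(c) }
    free1.group_by(&:name).each do |name, group1|
      group2 = free2.select { |c| c.name == name }
      next unless group1.size == 1 && group2.size == 1

      matches[group1.first] = group2.first
      queue << [group1.first, group2.first] # newly matched pair may have children too
    end
  end
  matches
end
```

Starting from a single matched root, this pairs a lone `<title>` child on each side but leaves two competing `<para>` children unmatched for another phase to decide.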
|
|
91
|
+
|
|
92
|
+
=== Signature-Based Matching
|
|
93
|
+
|
|
94
|
+
Unlike DOM's position-based comparison, Semantic uses **signatures**:
|
|
95
|
+
|
|
96
|
+
.Signature-based comparison example
|
|
97
|
+
[example]
|
|
98
|
+
====
|
|
99
|
+
[source,xml]
|
|
100
|
+
----
|
|
101
|
+
<!-- Document 1 -->
|
|
102
|
+
<book>
|
|
103
|
+
<title>Canon Guide</title>
|
|
104
|
+
<author>Alice</author>
|
|
105
|
+
</book>
|
|
106
|
+
|
|
107
|
+
<!-- Document 2 -->
|
|
108
|
+
<book>
|
|
109
|
+
<author>Alice</author>
|
|
110
|
+
<title>Canon Guide</title>
|
|
111
|
+
</book>
|
|
112
|
+
----
|
|
113
|
+
|
|
114
|
+
Semantic algorithm:
|
|
115
|
+
1. Calculates signature for each element
|
|
116
|
+
2. `<author>Alice</author>` has same signature in both documents
|
|
117
|
+
3. Detects this as a **MOVE** operation (moved from position 2 to position 1)
|
|
118
|
+
|
|
119
|
+
Result: 1 MOVE operation detected (author element moved)
|
|
120
|
+
====
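The book example can be reproduced with a short sketch. The signature format (path plus text) and the greedy order check are assumptions for illustration only: elements that keep their relative order are treated as stable, and the rest are reported as moves. The greedy pass is not guaranteed to find the smallest possible set of moves.

```ruby
# Build one signature per child element: its path plus its text content.
# This composition is assumed for the example.
def signatures(parent, children)
  children.map { |name, text| "/#{parent}/#{name}:#{text}" }
end

# Walk the old order; an element whose new index is not beyond the last
# stable element has broken relative order and is reported as a move
# (positions are 1-based, as in the example above).
def detect_moves(old_sigs, new_sigs)
  last_kept = -1
  old_sigs.each_with_index.filter_map do |sig, old_index|
    new_index = new_sigs.index(sig)
    next if new_index.nil? # absent from the new document: not a move

    if new_index > last_kept
      last_kept = new_index # still in relative order: stable
      next
    end
    { signature: sig, from: old_index + 1, to: new_index + 1 }
  end
end

old_doc = signatures("book", [["title", "Canon Guide"], ["author", "Alice"]])
new_doc = signatures("book", [["author", "Alice"], ["title", "Canon Guide"]])
p detect_moves(old_doc, new_doc)
```

For the two documents above this reports a single move for the `author` element, from position 2 to position 1, while `title` is treated as stable.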
|
|
121
|
+
|
|
122
|
+
== Operation Detection
|
|
123
|
+
|
|
124
|
+
The Semantic algorithm detects eight operation types:
|
|
125
|
+
|
|
126
|
+
=== Basic Operations (Level 1)
|
|
127
|
+
|
|
128
|
+
**INSERT**:: New node added
|
|
129
|
+
[source]
|
|
130
|
+
----
|
|
131
|
+
+ <chapter id="3">New Chapter</chapter>
|
|
132
|
+
----
|
|
133
|
+
|
|
134
|
+
**DELETE**:: Node removed
|
|
135
|
+
[source]
|
|
136
|
+
----
|
|
137
|
+
- <chapter id="old">Removed Chapter</chapter>
|
|
138
|
+
----
|
|
139
|
+
|
|
140
|
+
**UPDATE**:: Node content/attributes changed
|
|
141
|
+
[source]
|
|
142
|
+
----
|
|
143
|
+
~ <title>Old → New</title>
|
|
144
|
+
----
|
|
145
|
+
|
|
146
|
+
=== Structural Operations (Level 2)
|
|
147
|
+
|
|
148
|
+
**MOVE**:: Node relocated to different position
|
|
149
|
+
[source]
|
|
150
|
+
----
|
|
151
|
+
→ <author>Alice</author> (moved from position 2 to 1)
|
|
152
|
+
----
|
|
153
|
+
|
|
154
|
+
=== Semantic Operations (Level 3)
|
|
155
|
+
|
|
156
|
+
**MERGE**:: Multiple nodes combined into one
|
|
157
|
+
[source]
|
|
158
|
+
----
|
|
159
|
+
⊕ <section> (merged from 2 separate sections)
|
|
160
|
+
----
|
|
161
|
+
|
|
162
|
+
**SPLIT**:: One node divided into multiple
|
|
163
|
+
[source]
|
|
164
|
+
----
|
|
165
|
+
⊖ <section> (split into 2 separate sections)
|
|
166
|
+
----
|
|
167
|
+
|
|
168
|
+
**UPGRADE**:: Node promoted to higher level
|
|
169
|
+
[source]
|
|
170
|
+
----
|
|
171
|
+
↑ <section> (promoted from depth 3 to depth 2)
|
|
172
|
+
----
|
|
173
|
+
|
|
174
|
+
**DOWNGRADE**:: Node demoted to lower level
|
|
175
|
+
[source]
|
|
176
|
+
----
|
|
177
|
+
↓ <section> (demoted from depth 2 to depth 3)
|
|
178
|
+
----
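One way to picture the difference between MOVE, UPGRADE, and DOWNGRADE is to compare the depth and sibling position of a matched node in both trees. The function below is a hypothetical sketch of that reading; how Canon actually distinguishes these operations is not specified on this page.

```ruby
# Classify a matched node from its depth and sibling position in each tree.
# A smaller depth means the node is closer to the root.
def classify_relocation(old_depth:, new_depth:, old_position:, new_position:)
  if new_depth < old_depth
    :upgrade
  elsif new_depth > old_depth
    :downgrade
  elsif new_position != old_position
    :move
  else
    :unchanged
  end
end
```

Under this reading, a section going from depth 3 to depth 2 is an UPGRADE, the reverse is a DOWNGRADE, and a change of position at the same depth is a MOVE.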
|
|
179
|
+
|
|
180
|
+
== Configuration
|
|
181
|
+
|
|
182
|
+
=== Basic Usage
|
|
183
|
+
|
|
184
|
+
**Ruby API**:
|
|
185
|
+
[source,ruby]
|
|
186
|
+
----
|
|
187
|
+
# Explicitly specify semantic algorithm
|
|
188
|
+
Canon::Comparison.equivalent?(doc1, doc2,
|
|
189
|
+
diff_algorithm: :semantic
|
|
190
|
+
)
|
|
191
|
+
----
|
|
192
|
+
|
|
193
|
+
**CLI**:
|
|
194
|
+
[source,bash]
|
|
195
|
+
----
|
|
196
|
+
canon diff file1.xml file2.xml --diff-algorithm semantic
|
|
197
|
+
----
|
|
198
|
+
|
|
199
|
+
=== With Similarity Threshold
|
|
200
|
+
|
|
201
|
+
Control how strict matching is:
|
|
202
|
+
|
|
203
|
+
[source,ruby]
|
|
204
|
+
----
|
|
205
|
+
Canon::Comparison.equivalent?(doc1, doc2,
|
|
206
|
+
diff_algorithm: :semantic,
|
|
207
|
+
match: {
|
|
208
|
+
similarity_threshold: 0.90 # More lenient (default: 0.95)
|
|
209
|
+
}
|
|
210
|
+
)
|
|
211
|
+
----
|
|
212
|
+
|
|
213
|
+
* **Higher** (0.99) - Very conservative, only nearly identical nodes match
|
|
214
|
+
* **Lower** (0.80) - More aggressive, allows less similar nodes to match
|
|
215
|
+
* **Default** (0.95) - Balanced for most use cases
|
|
216
|
+
|
|
217
|
+
=== With Match Options
|
|
218
|
+
|
|
219
|
+
The Semantic algorithm uses match options when calculating signatures:
|
|
220
|
+
|
|
221
|
+
[source,ruby]
|
|
222
|
+
----
|
|
223
|
+
Canon::Comparison.equivalent?(doc1, doc2,
|
|
224
|
+
diff_algorithm: :semantic,
|
|
225
|
+
match: {
|
|
226
|
+
text_content: :normalize, # Affects text signatures
|
|
227
|
+
attribute_order: :ignore, # Always ignored (unordered in signatures)
|
|
228
|
+
element_position: :ignore # MOVEs become informative
|
|
229
|
+
}
|
|
230
|
+
)
|
|
231
|
+
----
|
|
232
|
+
|
|
233
|
+
See link:../../features/match-options/algorithm-specific-behavior.adoc[Algorithm-Specific Behavior] for details.
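To make the effect of `text_content` on signatures concrete, here is a hedged sketch of how a text value might be reduced before it enters a signature. `text_signature` is a hypothetical helper, the `:strict` mode name is assumed alongside the `:normalize` and `:ignore` values shown above, and the whitespace-collapsing rule is a guess at what normalization means.

```ruby
# Reduce a text value according to the match mode before it is folded into
# a node signature.
def text_signature(text, mode)
  case mode
  when :strict    then text                          # compare text exactly
  when :normalize then text.strip.gsub(/\s+/, " ")   # collapse whitespace runs
  when :ignore    then ""                            # text plays no part
  else raise ArgumentError, "unknown mode: #{mode.inspect}"
  end
end
```

With `:normalize`, `"Canon   Guide\n"` and `"Canon Guide"` yield the same signature text, so the two nodes can be matched exactly instead of falling through to similarity matching.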
|
|
234
|
+
|
|
235
|
+
=== With Diff Formatting
|
|
236
|
+
|
|
237
|
+
Semantic works best with `by_object` mode:
|
|
238
|
+
|
|
239
|
+
[source,ruby]
|
|
240
|
+
----
|
|
241
|
+
# Operation-based output (natural fit for Semantic)
|
|
242
|
+
Canon::Comparison.equivalent?(doc1, doc2,
|
|
243
|
+
diff_algorithm: :semantic,
|
|
244
|
+
diff_mode: :by_object, # Shows operations
|
|
245
|
+
verbose: true
|
|
246
|
+
)
|
|
247
|
+
|
|
248
|
+
# Traditional output (also works)
|
|
249
|
+
Canon::Comparison.equivalent?(doc1, doc2,
|
|
250
|
+
diff_algorithm: :semantic,
|
|
251
|
+
diff_mode: :by_line, # Traditional format
|
|
252
|
+
verbose: true
|
|
253
|
+
)
|
|
254
|
+
----
|
|
255
|
+
|
|
256
|
+
== Output Format
|
|
257
|
+
|
|
258
|
+
=== Operation-Based Output (Default)
|
|
259
|
+
|
|
260
|
+
The Semantic algorithm naturally produces operation-based output:
|
|
261
|
+
|
|
262
|
+
.Operation-based diff example
|
|
263
|
+
[example]
|
|
264
|
+
====
|
|
265
|
+
```
|
|
266
|
+
UPDATE: book/title: "Old Title" → "New Title"
|
|
267
|
+
MOVE: book/author → book/author (position 2 → 1)
|
|
268
|
+
|
|
269
|
+
Statistics:
|
|
270
|
+
INSERT: 0
|
|
271
|
+
DELETE: 0
|
|
272
|
+
UPDATE: 1
|
|
273
|
+
MOVE: 1
|
|
274
|
+
```
|
|
275
|
+
|
|
276
|
+
* Shows **what changed** (operation type)
|
|
277
|
+
* Shows **where it changed** (element path)
|
|
278
|
+
* Provides **statistics** (operation counts)
|
|
279
|
+
====
|
|
280
|
+
|
|
281
|
+
=== Accessing Operations
|
|
282
|
+
|
|
283
|
+
[source,ruby]
|
|
284
|
+
----
|
|
285
|
+
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
286
|
+
diff_algorithm: :semantic,
|
|
287
|
+
verbose: true
|
|
288
|
+
)
|
|
289
|
+
|
|
290
|
+
# Access operations
|
|
291
|
+
result.operations.each do |op|
|
|
292
|
+
puts "Type: #{op.type}" # :insert, :delete, :update, :move
|
|
293
|
+
puts "Path: #{op.path}" # Element path
|
|
294
|
+
puts "Details: #{op.details}" # Operation-specific info
|
|
295
|
+
end
|
|
296
|
+
|
|
297
|
+
# Access statistics
|
|
298
|
+
puts "Moves: #{result.statistics.moves}"
|
|
299
|
+
puts "Updates: #{result.statistics.updates}"
|
|
300
|
+
----
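If only a list of operations is at hand, the per-type counts can be derived with a few lines of plain Ruby. The `Op` struct below is a hypothetical stand-in for whatever objects the operations list contains; only a `type` reader is assumed.

```ruby
# Hypothetical stand-in for an operation object.
Op = Struct.new(:type, :path)

REPORTED_TYPES = %i[insert delete update move].freeze

# Count operations per type; types that never occur read as 0.
def operation_counts(operations)
  counts = Hash.new(0)
  operations.each { |op| counts[op.type] += 1 }
  counts
end

# Render the counts in the same layout as the statistics example above.
def format_statistics(counts)
  REPORTED_TYPES.map { |type| "  #{type.to_s.upcase}: #{counts[type]}" }.join("\n")
end

ops = [Op.new(:update, "book/title"), Op.new(:move, "book/author")]
puts "Statistics:"
puts format_statistics(operation_counts(ops))
```

Counting from the list keeps the report independent of any particular statistics object, which may be useful while the API is still experimental.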
|
|
301
|
+
|
|
302
|
+
== Advantages
|
|
303
|
+
|
|
304
|
+
=== Intelligent Matching
|
|
305
|
+
|
|
306
|
+
* **Detects moves** - Tracks content relocation
|
|
307
|
+
* **Handles restructuring** - Works with heavily modified documents
|
|
308
|
+
* **Signature-based** - Matches similar content anywhere
|
|
309
|
+
|
|
310
|
+
.Move detection example
|
|
311
|
+
[cols="2,2,2"]
|
|
312
|
+
|===
|
|
313
|
+
|DOM Algorithm |Semantic Algorithm |Advantage
|
|
314
|
+
|
|
315
|
+
|Shows as DELETE + INSERT
|
|
316
|
+
|Shows as MOVE
|
|
317
|
+
|Clearer understanding
|
|
318
|
+
|
|
319
|
+
|Many false positives
|
|
320
|
+
|Accurate detection
|
|
321
|
+
|Better analysis
|
|
322
|
+
|
|
323
|
+
|Position-dependent
|
|
324
|
+
|Position-independent
|
|
325
|
+
|Handles reordering
|
|
326
|
+
|===
|
|
327
|
+
|
|
328
|
+
=== Rich Analysis
|
|
329
|
+
|
|
330
|
+
* **Operation counts** - Statistical view of changes
|
|
331
|
+
* **Operation paths** - Precise location information
|
|
332
|
+
* **Confidence scores** - Match quality indicators
|
|
333
|
+
|
|
334
|
+
== Limitations
|
|
335
|
+
|
|
336
|
+
=== Performance
|
|
337
|
+
|
|
338
|
+
The Semantic algorithm is significantly slower:
|
|
339
|
+
|
|
340
|
+
.Performance comparison
|
|
341
|
+
[cols="1,1,1,1"]
|
|
342
|
+
|===
|
|
343
|
+
|Document Size |DOM Time |Semantic Time |Ratio
|
|
344
|
+
|
|
345
|
+
|1 KB
|
|
346
|
+
|~1 ms
|
|
347
|
+
|~10 ms
|
|
348
|
+
|10x slower
|
|
349
|
+
|
|
350
|
+
|10 KB
|
|
351
|
+
|~10 ms
|
|
352
|
+
|~150 ms
|
|
353
|
+
|15x slower
|
|
354
|
+
|
|
355
|
+
|100 KB
|
|
356
|
+
|~100 ms
|
|
357
|
+
|~3000 ms
|
|
358
|
+
|30x slower
|
|
359
|
+
|===
|
|
360
|
+
|
|
361
|
+
**Workaround**: Use the DOM algorithm for large documents and the Semantic algorithm for smaller ones.
|
|
362
|
+
|
|
363
|
+
=== Experimental Status
|
|
364
|
+
|
|
365
|
+
* **API may change** - Not stable yet
|
|
366
|
+
* **Behavior may change** - Under active development
|
|
367
|
+
* **Edge cases** - May have unexpected results
|
|
368
|
+
|
|
369
|
+
**Workaround**: Test thoroughly before relying on Semantic in production
|
|
370
|
+
|
|
371
|
+
=== Complex Matching
|
|
372
|
+
|
|
373
|
+
* **False matches** - May match unrelated but similar content
|
|
374
|
+
* **Ambiguity** - Multiple similar candidates can confuse matching
|
|
375
|
+
* **Tuning needed** - May require similarity threshold adjustment
|
|
376
|
+
|
|
377
|
+
**Workaround**: Adjust `similarity_threshold` or use DOM algorithm
|
|
378
|
+
|
|
379
|
+
== Common Use Cases
|
|
380
|
+
|
|
381
|
+
=== Use Case 1: Detecting Document Reorganization
|
|
382
|
+
|
|
383
|
+
[source,ruby]
|
|
384
|
+
----
|
|
385
|
+
# Analyze how document was restructured
|
|
386
|
+
result = Canon::Comparison.equivalent?(old_doc, new_doc,
|
|
387
|
+
diff_algorithm: :semantic,
|
|
388
|
+
verbose: true,
|
|
389
|
+
diff_mode: :by_object
|
|
390
|
+
)
|
|
391
|
+
|
|
392
|
+
# Analyze operations
|
|
393
|
+
puts "Content moved: #{result.statistics.moves} times"
|
|
394
|
+
puts "Sections merged: #{result.statistics.merges}"
|
|
395
|
+
puts "Sections split: #{result.statistics.splits}"
|
|
396
|
+
----
|
|
397
|
+
|
|
398
|
+
=== Use Case 2: Content Evolution Tracking
|
|
399
|
+
|
|
400
|
+
[source,ruby]
|
|
401
|
+
----
|
|
402
|
+
# Track how content evolved over time
|
|
403
|
+
versions = [v1, v2, v3, v4]
|
|
404
|
+
|
|
405
|
+
versions.each_cons(2) do |old, new|
|
|
406
|
+
result = Canon::Comparison.equivalent?(old, new,
|
|
407
|
+
diff_algorithm: :semantic,
|
|
408
|
+
verbose: true
|
|
409
|
+
)
|
|
410
|
+
|
|
411
|
+
log_operations(result.operations)
|
|
412
|
+
end
|
|
413
|
+
----
|
|
414
|
+
|
|
415
|
+
=== Use Case 3: Intelligent Test Assertions
|
|
416
|
+
|
|
417
|
+
[source,ruby]
|
|
418
|
+
----
|
|
419
|
+
# Allow reordering in tests
|
|
420
|
+
RSpec.describe "Content generation" do
|
|
421
|
+
it "generates correct content regardless of order" do
|
|
422
|
+
actual = generate_content
|
|
423
|
+
|
|
424
|
+
expect(actual).to be_xml_equivalent_to(expected)
|
|
425
|
+
.with_options(
|
|
426
|
+
diff_algorithm: :semantic,
|
|
427
|
+
element_position: :ignore # Ignores moves
|
|
428
|
+
)
|
|
429
|
+
end
|
|
430
|
+
end
|
|
431
|
+
----
|
|
432
|
+
|
|
433
|
+
== Best Practices
|
|
434
|
+
|
|
435
|
+
=== Start with DOM, Upgrade to Semantic When Needed
|
|
436
|
+
|
|
437
|
+
Use the DOM algorithm by default; switch to Semantic only when move detection is required.
|
|
438
|
+
|
|
439
|
+
=== Adjust Similarity Threshold
|
|
440
|
+
|
|
441
|
+
Start conservatively (0.95 or higher) and lower the threshold gradually if too few nodes are matched:
|
|
442
|
+
|
|
443
|
+
[source,ruby]
|
|
444
|
+
----
|
|
445
|
+
# Try different thresholds to find sweet spot
|
|
446
|
+
[0.95, 0.90, 0.85].each do |threshold|
|
|
447
|
+
result = Canon::Comparison.equivalent?(doc1, doc2,
|
|
448
|
+
diff_algorithm: :semantic,
|
|
449
|
+
match: { similarity_threshold: threshold }
|
|
450
|
+
)
|
|
451
|
+
puts "Threshold #{threshold}: #{result.statistics.total} operations"
|
|
452
|
+
end
|
|
453
|
+
----
|
|
454
|
+
|
|
455
|
+
=== Use Appropriate Match Options
|
|
456
|
+
|
|
457
|
+
Configure dimensions to match your needs:
|
|
458
|
+
|
|
459
|
+
[source,ruby]
|
|
460
|
+
----
|
|
461
|
+
# Ignore cosmetic differences
|
|
462
|
+
Canon::Comparison.equivalent?(doc1, doc2,
|
|
463
|
+
diff_algorithm: :semantic,
|
|
464
|
+
match: {
|
|
465
|
+
structural_whitespace: :ignore,
|
|
466
|
+
element_position: :ignore
|
|
467
|
+
}
|
|
468
|
+
)
|
|
469
|
+
----
|
|
470
|
+
|
|
471
|
+
== Troubleshooting
|
|
472
|
+
|
|
473
|
+
=== Too Many Operations Detected
|
|
474
|
+
|
|
475
|
+
**Problem**: Everything shows as changed
|
|
476
|
+
|
|
477
|
+
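**Solution**: Increase the similarity threshold so that only nearly identical nodes are matched, which reduces spurious operations caused by false matches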
|
|
478
|
+
[source,ruby]
|
|
479
|
+
----
|
|
480
|
+
match: { similarity_threshold: 0.98 } # Was 0.95
|
|
481
|
+
----
|
|
482
|
+
|
|
483
|
+
=== Too Few Matches
|
|
484
|
+
|
|
485
|
+
**Problem**: Similar content shows as DELETE + INSERT
|
|
486
|
+
|
|
487
|
+
**Solution**: Decrease similarity threshold
|
|
488
|
+
[source,ruby]
|
|
489
|
+
----
|
|
490
|
+
match: { similarity_threshold: 0.85 } # Was 0.95
|
|
491
|
+
----
|
|
492
|
+
|
|
493
|
+
=== Performance Issues
|
|
494
|
+
|
|
495
|
+
**Problem**: Comparison is very slow
|
|
496
|
+
|
|
497
|
+
**Solution**: Use DOM algorithm or limit document size
|
|
498
|
+
[source,ruby]
|
|
499
|
+
----
|
|
500
|
+
# Conditionally use Semantic only for small docs
|
|
501
|
+
doc_size = [doc1, doc2].map(&:bytesize).max # assumes both inputs are Strings
algorithm = doc_size < 10_000 ? :semantic : :dom
|
|
502
|
+
Canon::Comparison.equivalent?(doc1, doc2,
|
|
503
|
+
diff_algorithm: algorithm
|
|
504
|
+
)
|
|
505
|
+
----
|
|
506
|
+
|
|
507
|
+
== Migration from DOM
|
|
508
|
+
|
|
509
|
+
=== Expected Changes
|
|
510
|
+
|
|
511
|
+
When switching from DOM to Semantic:
|
|
512
|
+
|
|
513
|
+
1. **MOVEs detected** - Reordered content shows as MOVE instead of DELETE+INSERT
|
|
514
|
+
2. **Different output** - Operations instead of line-based diff
|
|
515
|
+
3. **Slower performance** - Accept longer comparison time
|
|
516
|
+
4. **New capabilities** - Access to rich operation analysis
|
|
517
|
+
|
|
518
|
+
=== Migration Steps
|
|
519
|
+
|
|
520
|
+
1. **Test on small subset** - Verify behavior on sample documents
|
|
521
|
+
2. **Compare outputs** - Review DOM vs Semantic results side-by-side
|
|
522
|
+
3. **Adjust threshold** - Tune `similarity_threshold` for your needs
|
|
523
|
+
4. **Update assertions** - Adapt tests to operation-based output
|
|
524
|
+
5. **Monitor performance** - Ensure acceptable speed
|
|
525
|
+
|
|
526
|
+
== See Also
|
|
527
|
+
|
|
528
|
+
* link:index.adoc[Algorithms Overview] - Comparison of DOM vs Semantic
|
|
529
|
+
* link:dom-diff.adoc[DOM Algorithm] - Standard algorithm
|
|
530
|
+
* link:../../features/match-options/algorithm-specific-behavior.adoc[Algorithm-Specific Behavior] - How Semantic interprets options
|
|
531
|
+
* link:../../features/diff-formatting/algorithm-specific-output.adoc[Algorithm-Specific Output] - Output format details
|
|
532
|
+
* link:../../guides/choosing-configuration.adoc[Choosing Configuration] - Complete decision guide
|
|
533
|
+
* link:../../advanced/semantic-tree-diff-internals.adoc[Semantic Tree Diff Internals] - Advanced details (if available)
|