canon 0.1.9 → 0.1.11

@@ -80,14 +80,20 @@ Classification depends on `attribute_order` setting:
80
80
  │ │
81
81
  │ DiffClassifier examines each DiffNode: │
82
82
  │ │
83
- For each dimension:
84
- behavior = match_options.behavior_for(dimension)
83
+ 1. Serialization-level formatting (XmlSerializationFormatter)
84
+ XML syntax differences: <tag/> vs <tag></tag>
85
+ │ → ALWAYS formatting-only (non-normative) │
85
86
  │ │
86
- if behavior == :ignore
87
- INFORMATIVE (difference doesn't matter)
88
- else # :strict or :normalize
89
- │ → NORMATIVE (difference matters) │
87
+ 2. Content-level formatting (text_content: :normalize)
88
+ Whitespace differences in content
89
+ Formatting-only when normalized content matches
90
90
  │ │
91
+ │ 3. CompareProfile policy (normative vs informative) │
92
+ │ → behavior == :ignore → INFORMATIVE │
93
+ │ → behavior == :strict → NORMATIVE │
94
+ │ → behavior == :normalize → Check content normalization │
95
+ │ │
96
+ │ Sets diff_node.formatting = true/false │
91
97
  │ Sets diff_node.normative = true/false │
92
98
  └───────────────────────────────────┬───────────────────────────────┘
93
99
 
@@ -102,6 +108,27 @@ Classification depends on `attribute_order` setting:
102
108
  └──────────────────────────────────────────────────────────────────┘
103
109
  ----
104
110
 
111
+ === Three-Level Classification System
112
+
113
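+ Canon distinguishes between **three levels of differences** (formatting, informative, and normative), with formatting split into serialization-level and content-level kinds: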
114
+
115
+ | Kind | `formatting:` | `normative:` | Meaning | Examples |
116
+ |------|---------------|--------------|---------|----------|
117
+ | **Serialization formatting** | `true` | `false` | XML syntax differences | `<tag/>` vs `<tag></tag>` |
118
+ | **Content formatting** | `true` | `false` | Whitespace in content | `Hello  world` vs `Hello world` |
119
+ | **Informative** | `false` | `false` | Tracked but doesn't affect equivalence | Attribute order (when `:ignore`) |
120
+ | **Normative** | `false` | `true` | Affects equivalence | Different words, missing elements |
121
+
122
+ **Key distinction**:
123
+
124
+ * **Serialization-level formatting**: XML syntax differences that are ALWAYS non-normative regardless of match options, because they represent different valid serializations of the same semantic content. Detected by `XmlSerializationFormatter`.
125
+
126
+ * **Content-level formatting**: Whitespace differences in document content. These are formatting-only (non-normative) when normalized content matches (using `text_content: :normalize`).
127
+
128
+ * **Informative**: Differences that are tracked for reference but do not affect equivalence (when behavior is `:ignore`).
129
+
130
+ * **Normative**: Semantic content differences that affect equivalence (when behavior is `:strict` or when normalized content differs).
131
+
105
132
  == CompareProfile-Based Classification
106
133
 
107
134
  === Overview
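
The table in the hunk above can be read as a mapping from the two flags to a kind of difference. The following is a minimal sketch of that mapping, not Canon's own code: `difference_kind` is a hypothetical helper introduced here for illustration. Note that the flags alone cannot tell serialization formatting from content formatting, since both rows carry `formatting: true, normative: false`.

```ruby
# Hypothetical helper (not part of Canon): maps the two DiffNode flags
# described in the table to a label.
def difference_kind(formatting:, normative:)
  if formatting && !normative
    :formatting_only # serialization-level or content-level formatting
  elsif !formatting && !normative
    :informative     # tracked, but does not affect equivalence
  elsif !formatting && normative
    :normative       # affects equivalence
  else
    :undefined       # formatting: true with normative: true is not in the table
  end
end

puts difference_kind(formatting: true,  normative: false)
puts difference_kind(formatting: false, normative: false)
puts difference_kind(formatting: false, normative: true)
```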
@@ -120,22 +147,42 @@ DiffNode → DiffClassifier → CompareProfile → normative?
120
147
 
121
148
  === Classification Hierarchy
122
149
 
123
- Canon uses a three-level hierarchy for classifying differences:
150
+ Canon uses a **multi-level hierarchy** for classifying differences:
124
151
 
125
- 1. **Formatting-only** (lowest priority)
126
- - Pure whitespace/formatting differences
127
- - Normalized content is identical
128
- - Markers: `[` and `]` in diff output
152
+ [source]
153
+ ----
154
+ DiffNode → DiffClassifier → XmlSerializationFormatter → serialization formatting?
155
+
156
+ CompareProfile → normative dimension?
157
+
158
+ FormattingDetector → formatting-only?
159
+
160
+ Final classification
161
+ ----
162
+
163
+ **Classification priority (from highest to lowest specificity)**:
164
+
165
+ 1. **Serialization-level formatting** (highest priority)
166
+ - XML syntax differences: `<tag/>` vs `<tag></tag>`
167
+ - Detected by `XmlSerializationFormatter`
168
+ - **ALWAYS** `formatting: true, normative: false`
169
+ - Bypasses all other classification logic
129
170
 
130
- 2. **Informative** (medium priority)
171
+ 2. **Content-level formatting**
172
+ - Whitespace differences in document content
173
+ - Detected by `FormattingDetector` when `text_content: :normalize`
174
+ - `formatting: true, normative: false` when normalized content matches
175
+ - Respects element-level whitespace sensitivity
176
+
177
+ 3. **Informative** (based on `:ignore` behavior)
131
178
  - Tracked but doesn't affect equivalence
132
- - Based on behavior `:ignore`
133
- - Markers: `<` and `>` in diff output
179
+ - `formatting: false, normative: false`
180
+ - Example: Attribute order when `attribute_order: :ignore`
134
181
 
135
- 3. **Normative** (highest priority)
182
+ 4. **Normative** (based on `:strict` behavior or content mismatch)
136
183
  - Affects equivalence
137
- - Based on behavior `:strict`
138
- - Markers: `-` and `+` in diff output
184
+ - `formatting: false, normative: true`
185
+ - Example: Different words, missing elements
139
186
 
140
187
  === Format-Specific Policies
141
188
 
@@ -292,6 +339,34 @@ Canon::Comparison.equivalent?(html1, html2, format: :html)
292
339
  ----
293
340
  ====
294
341
 
342
+ .Self-closing vs explicit closing tags
343
+ ====
344
+ Per the XML specification, `<tag/>` and `<tag></tag>` are semantically equivalent (both represent empty elements). Canon classifies differences in serialization format as **formatting-only** (non-normative):
345
+
346
+ [source,ruby]
347
+ ----
348
+ # Self-closing vs explicit closing - always equivalent
349
+ xml1 = '<svg><rect x="10" y="10"/></svg>'
350
+ xml2 = '<svg><rect x="10" y="10"></rect></svg>'
351
+
352
+ Canon::Comparison.equivalent?(xml1, xml2, format: :xml)
353
+ # => true
354
+
355
+ # Empty/whitespace-only text nodes from serialization are formatting-only
356
+ result = Canon::Comparison.equivalent?(xml1, xml2, format: :xml, verbose: true)
357
+ result.differences.each do |diff|
358
+ if diff.dimension == :text_content
359
+ puts "Normative: #{diff.normative?}" # => false
360
+ puts "Formatting: #{diff.formatting?}" # => true
361
+ end
362
+ end
363
+ ----
364
+
365
+ This applies regardless of the `text_content` behavior setting, because these differences are purely serialization format variations (similar to attribute order).
366
+
367
+ The key insight: empty or whitespace-only text nodes created by different serialization styles (`<tag/>` vs `<tag></tag>`) are always classified as **formatting-only**, not normative.
368
+ ====
369
+
295
370
  === FormattingDetector Integration
296
371
 
297
372
  For dimensions that support it (`:text_content`, `:structural_whitespace`),
@@ -319,19 +394,35 @@ With `:normalize` mode:
319
394
 
320
395
  === Implementation Details
321
396
 
322
- The [`CompareProfile`](../../lib/canon/comparison/compare_profile.rb) class provides:
397
+ The classification system uses three main classes:
323
398
 
324
- * `normative_dimension?(dimension)` - Is this dimension normative?
325
- * `affects_equivalence?(dimension)` - Does this dimension affect equivalence?
326
- * `supports_formatting_detection?(dimension)` - Can this dimension have formatting-only diffs?
399
+ * **`XmlSerializationFormatter`** - Detects XML serialization-level formatting differences
400
+ - Self-closing vs explicit closing tags: `<tag/>` vs `<tag></tag>`
401
+ - Always returns `formatting: true, normative: false`
402
+ - These differences are ALWAYS non-normative regardless of match options
327
403
 
328
- The [`DiffClassifier`](../../lib/canon/diff/diff_classifier.rb) uses CompareProfile to classify differences, with special handling for `text_content: :normalize`:
404
+ * **`CompareProfile`** - Determines dimension behavior and policy
405
+ - `normative_dimension?(dimension)` - Is this dimension normative?
406
+ - `affects_equivalence?(dimension)` - Does this dimension affect equivalence?
407
+ - `supports_formatting_detection?(dimension)` - Can this dimension have formatting-only diffs?
408
+
409
+ * **`DiffClassifier`** - Orchestrates classification using the above
410
+ - First checks `XmlSerializationFormatter` for serialization formatting
411
+ - Then handles content-level formatting (text_content: :normalize)
412
+ - Finally applies `CompareProfile` policy for normative vs informative
329
413
 
330
414
  [source,ruby]
331
415
  ----
332
416
  def classify(diff_node)
333
- # SPECIAL CASE: text_content with :normalize behavior
334
- # Formatting-only differences (whitespace-only) are marked as non-normative
417
+ # FIRST: Check for XML serialization-level formatting differences
418
+ # These are ALWAYS non-normative (formatting-only) regardless of match options
419
+ if XmlSerializationFormatter.serialization_formatting?(diff_node)
420
+ diff_node.formatting = true
421
+ diff_node.normative = false
422
+ return diff_node
423
+ end
424
+
425
+ # SECOND: Handle content-level formatting for text_content with :normalize
335
426
  if diff_node.dimension == :text_content &&
336
427
  profile.send(:behavior_for, :text_content) == :normalize &&
337
428
  !inside_whitespace_sensitive_element?(diff_node) &&
@@ -341,10 +432,10 @@ def classify(diff_node)
341
432
  return diff_node
342
433
  end
343
434
 
344
- # Standard classification flow
435
+ # THIRD: Apply CompareProfile policy
345
436
  is_normative = profile.normative_dimension?(diff_node.dimension)
346
437
 
347
- # Only check formatting for non-normative dimensions
438
+ # FOURTH: Check FormattingDetector for non-normative dimensions
348
439
  if !is_normative && profile.supports_formatting_detection?(diff_node.dimension)
349
440
  if formatting_only_diff?(diff_node)
350
441
  diff_node.formatting = true
@@ -353,6 +444,7 @@ def classify(diff_node)
353
444
  end
354
445
  end
355
446
 
447
+ # FIFTH: Apply normative determination
356
448
  diff_node.normative = is_normative
357
449
  diff_node
358
450
  end
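
The ordering in `classify` above (serialization check first, then content-level formatting, then profile policy) can be illustrated with a simplified, self-contained sketch. Everything here is an assumption made for illustration: `SketchNode`, its `serialization_only` and `whitespace_only` fields, and the `behaviors` hash stand in for Canon's `DiffNode`, `XmlSerializationFormatter`, `FormattingDetector`, and `CompareProfile`. The separate `FormattingDetector` step for non-normative dimensions is left out.

```ruby
# Hypothetical stand-in for a diff node; the two *_only fields replace
# the real detector classes with precomputed booleans.
SketchNode = Struct.new(:dimension, :serialization_only, :whitespace_only,
                        :formatting, :normative, keyword_init: true)

# behaviors: hypothetical hash of dimension => :strict, :normalize or :ignore
def classify_sketch(node, behaviors)
  # 1. Serialization-level formatting short-circuits everything else.
  if node.serialization_only
    node.formatting = true
    node.normative = false
    return node
  end

  behavior = behaviors.fetch(node.dimension, :strict)

  # 2. Content-level formatting: whitespace-only text under :normalize.
  if node.dimension == :text_content && behavior == :normalize && node.whitespace_only
    node.formatting = true
    node.normative = false
    return node
  end

  # 3. Policy: :ignore is informative, anything else is treated as normative.
  node.formatting = false
  node.normative = behavior != :ignore
  node
end

behaviors = { text_content: :normalize, attribute_order: :ignore }
nodes = [
  SketchNode.new(dimension: :text_content, serialization_only: true),
  SketchNode.new(dimension: :text_content, whitespace_only: true),
  SketchNode.new(dimension: :attribute_order),
  SketchNode.new(dimension: :text_content)
]
nodes.each do |node|
  classify_sketch(node, behaviors)
  puts "#{node.dimension}: formatting=#{node.formatting} normative=#{node.normative}"
end
```

Only the last node ends up normative: it is a text difference that is neither a serialization artifact nor whitespace-only.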
data/lib/canon/cli.rb CHANGED
@@ -126,6 +126,24 @@ module Canon
126
126
 
127
127
  # Disable color output
128
128
  $ canon diff file1.xml file2.xml --no-color
129
+
130
+ # Show raw file contents (for copying to specs)
131
+ $ canon diff file1.xml file2.xml --show-raw-inputs
132
+
133
+ # Show preprocessed contents (what was actually compared)
134
+ $ canon diff file1.xml file2.xml --show-preprocessed-inputs
135
+
136
+ # Show both raw and preprocessed (full trace)
137
+ $ canon diff file1.xml file2.xml --show-raw-inputs --show-preprocessed-inputs
138
+
139
+ # Preprocess with normalization and show what was compared
140
+ $ canon diff file1.xml file2.xml --preprocessing normalize --show-preprocessed-inputs
141
+
142
+ # Show raw inputs with line numbers (RSpec-style)
143
+ $ canon diff file1.xml file2.xml --show-line-numbered-inputs
144
+
145
+ # Verbose mode (shows all three input displays)
146
+ $ canon diff file1.xml file2.xml --verbose
129
147
  DESC
130
148
  method_option :format,
131
149
  aliases: "-f",
@@ -213,6 +231,18 @@ module Canon
213
231
  method_option :diff_grouping_lines,
214
232
  type: :numeric,
215
233
  desc: "Group diffs within N lines into context blocks (default: no grouping)"
234
+ method_option :show_raw_inputs,
235
+ type: :boolean,
236
+ default: false,
237
+ desc: "Show raw/original file contents before diff"
238
+ method_option :show_preprocessed_inputs,
239
+ type: :boolean,
240
+ default: false,
241
+ desc: "Show preprocessed contents (what was actually compared)"
242
+ method_option :show_line_numbered_inputs,
243
+ type: :boolean,
244
+ default: false,
245
+ desc: "Show raw inputs with line numbers (RSpec-style)"
216
246
  def diff(file1, file2)
217
247
  Commands::DiffCommand.new(options).run(file1, file2)
218
248
  end
@@ -53,6 +53,9 @@ module Canon
53
53
  context_lines: @options.fetch(:context_lines, 3),
54
54
  diff_grouping_lines: @options[:diff_grouping_lines],
55
55
  show_diffs: @options[:show_diffs]&.to_sym || :all,
56
+ show_raw_inputs: @options[:show_raw_inputs] || false,
57
+ show_preprocessed_inputs: @options[:show_preprocessed_inputs] || false,
58
+ show_line_numbered_inputs: @options[:show_line_numbered_inputs] || false,
56
59
  )
57
60
 
58
61
  # Show configuration in verbose mode using shared DebugOutput
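
The `|| false` fallback in the hunk above means each of the three display flags resolves to a plain boolean even when the option is absent. A minimal sketch of that behavior, assuming a plain options hash; `display_flags` and `DISPLAY_FLAGS` are hypothetical names, not part of Canon:

```ruby
# The three flag names come from the diff; the helper itself is hypothetical.
DISPLAY_FLAGS = %i[
  show_raw_inputs
  show_preprocessed_inputs
  show_line_numbered_inputs
].freeze

# Resolve every display flag to true/false, defaulting missing keys to false.
def display_flags(options)
  DISPLAY_FLAGS.to_h { |flag| [flag, options[flag] || false] }
end

flags = display_flags({ show_raw_inputs: true })
flags.each { |name, value| puts "#{name}: #{value}" }
```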
@@ -239,9 +239,116 @@ module Canon
239
239
  # @param diff2 [Symbol] Difference type for node2
240
240
  # @param dimension [Symbol] The dimension of the difference
241
241
  # @return [String] Human-readable reason
242
- def build_difference_reason(_node1, _node2, diff1, diff2, dimension)
242
+ def build_difference_reason(node1, node2, diff1, diff2, dimension)
243
+ # For attribute presence differences, show what attributes differ
244
+ if dimension == :attribute_presence
245
+ attrs1 = extract_attributes(node1)
246
+ attrs2 = extract_attributes(node2)
247
+ return build_attribute_difference_reason(attrs1, attrs2)
248
+ end
249
+
250
+ # For text content differences, show the actual text (truncated if needed)
251
+ if dimension == :text_content
252
+ text1 = extract_text_content_from_node(node1)
253
+ text2 = extract_text_content_from_node(node2)
254
+ return build_text_difference_reason(text1, text2)
255
+ end
256
+
243
257
  # Default reason - can be overridden in subclasses
244
- "Difference in #{dimension}: #{diff1} vs #{diff2}"
258
+ "#{diff1} vs #{diff2}"
259
+ end
260
+
261
+ # Build a clear reason message for attribute presence differences
262
+ # Shows which attributes are only in node1, only in node2, or different values
263
+ #
264
+ # @param attrs1 [Hash, nil] First node's attributes
265
+ # @param attrs2 [Hash, nil] Second node's attributes
266
+ # @return [String] Clear explanation of the attribute difference
267
+ def build_attribute_difference_reason(attrs1, attrs2)
268
+ return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2
269
+
270
+ require "set"
271
+ keys1 = attrs1.keys.to_set
272
+ keys2 = attrs2.keys.to_set
273
+
274
+ only_in_1 = keys1 - keys2
275
+ only_in_2 = keys2 - keys1
276
+ common = keys1 & keys2
277
+
278
+ # Check if values differ for common keys
279
+ different_values = common.reject { |k| attrs1[k] == attrs2[k] }
280
+
281
+ parts = []
282
+ parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
283
+ parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
284
+ parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?
285
+
286
+ if parts.empty?
287
+ "#{keys1.size} vs #{keys2.size} attributes (same names)"
288
+ else
289
+ parts.join("; ")
290
+ end
291
+ end
292
+
293
+ # Extract text content from a node for diff reason
294
+ #
295
+ # @param node [Object, nil] Node to extract text from
296
+ # @return [String, nil] Text content or nil
297
+ def extract_text_content_from_node(node)
298
+ return nil if node.nil?
299
+
300
+ # For Canon::Xml::Nodes::TextNode
301
+ return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)
302
+
303
+ # For XML/HTML nodes with text_content method
304
+ return node.text_content if node.respond_to?(:text_content)
305
+
306
+ # For nodes with text method
307
+ return node.text if node.respond_to?(:text)
308
+
309
+ # For nodes with content method (Moxml::Text)
310
+ return node.content if node.respond_to?(:content)
311
+
312
+ # For nodes with value method (other types)
313
+ return node.value if node.respond_to?(:value)
314
+
315
+ # For simple text nodes or strings
316
+ return node.to_s if node.is_a?(String)
317
+
318
+ # For other node types, try to_s
319
+ node.to_s
320
+ rescue StandardError
321
+ nil
322
+ end
323
+
324
+ # Build a clear reason message for text content differences
325
+ # Shows the actual text content (truncated if too long)
326
+ #
327
+ # @param text1 [String, nil] First text content
328
+ # @param text2 [String, nil] Second text content
329
+ # @return [String] Clear explanation of the text difference
330
+ def build_text_difference_reason(text1, text2)
331
+ # Handle nil cases
332
+ return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
333
+ return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
334
+ return "both missing" if text1.nil? && text2.nil?
335
+
336
+ # Both have content - show truncated versions
337
+ "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
338
+ end
339
+
340
+ # Truncate text for display in reason messages
341
+ #
342
+ # @param text [String] Text to truncate
343
+ # @param max_length [Integer] Maximum length
344
+ # @return [String] Truncated text
345
+ def truncate_text(text, max_length = 40)
346
+ return "" if text.nil?
347
+
348
+ text = text.to_s
349
+ return text if text.length <= max_length
350
+
351
+ "#{text[0...max_length]}..."
245
352
  end
246
353
 
247
354
  # Serialize an element node to string
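
To see what reason strings the text helpers in the hunk above produce, here is a standalone sketch that mirrors their logic outside the class. The method names `truncate` and `text_reason` are chosen for this illustration only.

```ruby
# Mirrors the truncation rule from the diff: keep at most max_length
# characters and append "..." when the text was cut.
def truncate(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  return text if text.length <= max_length

  "#{text[0...max_length]}..."
end

# Mirrors the nil handling from the diff's text difference reason.
def text_reason(text1, text2)
  return "both missing" if text1.nil? && text2.nil?
  return "missing vs '#{truncate(text2)}'" if text1.nil?
  return "'#{truncate(text1)}' vs missing" if text2.nil?

  "'#{truncate(text1)}' vs '#{truncate(text2)}'"
end

puts text_reason("Hello", nil)      # 'Hello' vs missing
puts text_reason(nil, nil)          # both missing
puts text_reason("a" * 50, "short") # first text is cut to 40 characters plus "..."
```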
@@ -1,5 +1,6 @@
1
1
  # frozen_string_literal: true
2
2
 
3
+ require "set"
3
4
  require_relative "../../diff/diff_node"
4
5
  require_relative "../../diff/path_builder"
5
6
  require_relative "../../diff/node_serializer"
@@ -62,6 +63,21 @@ module Canon
62
63
  end
63
64
  end
64
65
 
66
+ # For attribute presence differences, show what attributes differ
67
+ if dimension == :attribute_presence
68
+ attrs1 = extract_attributes(node1)
69
+ attrs2 = extract_attributes(node2)
70
+ return build_attribute_difference_reason(attrs1, attrs2)
71
+ end
72
+
73
+ # For text content differences, show the actual text (truncated if needed)
74
+ if dimension == :text_content
75
+ text1 = extract_text_content(node1)
76
+ text2 = extract_text_content(node2)
77
+ return build_text_difference_reason(text1, text2)
78
+ end
79
+
80
+ # Default reason
65
81
  "#{diff1} vs #{diff2}"
66
82
  end
67
83
 
@@ -110,6 +126,98 @@ module Canon
110
126
 
111
127
  Canon::Diff::NodeSerializer.extract_attributes(node)
112
128
  end
129
+
130
+ # Build a clear reason message for attribute presence differences
131
+ # Shows which attributes are only in node1, only in node2, or different values
132
+ #
133
+ # @param attrs1 [Hash, nil] First node's attributes
134
+ # @param attrs2 [Hash, nil] Second node's attributes
135
+ # @return [String] Clear explanation of the attribute difference
136
+ def self.build_attribute_difference_reason(attrs1, attrs2)
137
+ return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2
138
+
139
+ keys1 = attrs1.keys.to_set
140
+ keys2 = attrs2.keys.to_set
141
+
142
+ only_in_1 = keys1 - keys2
143
+ only_in_2 = keys2 - keys1
144
+ common = keys1 & keys2
145
+
146
+ # Check if values differ for common keys
147
+ different_values = common.reject { |k| attrs1[k] == attrs2[k] }
148
+
149
+ parts = []
150
+ parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
151
+ parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
152
+ parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?
153
+
154
+ if parts.empty?
155
+ "#{keys1.size} vs #{keys2.size} attributes (same names)"
156
+ else
157
+ parts.join("; ")
158
+ end
159
+ end
160
+
161
+ # Extract text content from a node
162
+ #
163
+ # @param node [Object, nil] Node to extract text from
164
+ # @return [String, nil] Text content or nil
165
+ def self.extract_text_content(node)
166
+ return nil if node.nil?
167
+
168
+ # For Canon::Xml::Nodes::TextNode
169
+ return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)
170
+
171
+ # For XML/HTML nodes with text_content method
172
+ return node.text_content if node.respond_to?(:text_content)
173
+
174
+ # For nodes with text method
175
+ return node.text if node.respond_to?(:text)
176
+
177
+ # For nodes with content method (Moxml::Text)
178
+ return node.content if node.respond_to?(:content)
179
+
180
+ # For nodes with value method (other types)
181
+ return node.value if node.respond_to?(:value)
182
+
183
+ # For simple text nodes or strings
184
+ return node.to_s if node.is_a?(String)
185
+
186
+ # For other node types, try to_s
187
+ node.to_s
188
+ rescue StandardError
189
+ nil
190
+ end
191
+
192
+ # Build a clear reason message for text content differences
193
+ # Shows the actual text content (truncated if too long)
194
+ #
195
+ # @param text1 [String, nil] First text content
196
+ # @param text2 [String, nil] Second text content
197
+ # @return [String] Clear explanation of the text difference
198
+ def self.build_text_difference_reason(text1, text2)
199
+ # Handle nil cases
200
+ return "missing vs '#{truncate(text2)}'" if text1.nil? && text2
201
+ return "'#{truncate(text1)}' vs missing" if text1 && text2.nil?
202
+ return "both missing" if text1.nil? && text2.nil?
203
+
204
+ # Both have content - show truncated versions
205
+ "'#{truncate(text1)}' vs '#{truncate(text2)}'"
206
+ end
207
+
208
+ # Truncate text for display in reason messages
209
+ #
210
+ # @param text [String] Text to truncate
211
+ # @param max_length [Integer] Maximum length
212
+ # @return [String] Truncated text
213
+ def self.truncate(text, max_length = 40)
214
+ return "" if text.nil?
215
+
216
+ text = text.to_s
217
+ return text if text.length <= max_length
218
+
219
+ "#{text[0...max_length]}..."
220
+ end
113
221
  end
114
222
  end
115
223
  end
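
The attribute reason builder above groups attribute names into three buckets: only in the first node, only in the second, and shared names whose values differ. The following standalone sketch reproduces that logic for two non-nil hashes; `attribute_reason` is a name used only for this illustration.

```ruby
require "set"

# Mirrors the set arithmetic from the diff for two attribute hashes.
def attribute_reason(attrs1, attrs2)
  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set

  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  # Shared attribute names whose values are not equal
  different = (keys1 & keys2).reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{different.sort.join(', ')}" if different.any?

  parts.empty? ? "#{keys1.size} vs #{keys2.size} attributes (same names)" : parts.join("; ")
end

puts attribute_reason({ "id" => "a", "class" => "x" }, { "id" => "b", "title" => "t" })
# only in first: class; only in second: title; different values: id
```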