RubyGems - canon - Versions diffs - 0.1.9 → 0.1.11 - Mend

canon 0.1.9 → 0.1.11


Files changed (16)

checksums.yaml +4 -4
data/.rubocop_todo.yml +25 -99
data/README.adoc +220 -26
data/docs/advanced/diff-classification.adoc +118 -26
data/lib/canon/cli.rb +30 -0
data/lib/canon/commands/diff_command.rb +3 -0
data/lib/canon/comparison/markup_comparator.rb +109 -2
data/lib/canon/comparison/xml_comparator/diff_node_builder.rb +108 -0
data/lib/canon/comparison/xml_comparator.rb +192 -0
data/lib/canon/config/env_schema.rb +5 -1
data/lib/canon/config.rb +30 -0
data/lib/canon/diff/diff_classifier.rb +48 -33
data/lib/canon/diff/xml_serialization_formatter.rb +153 -0
data/lib/canon/diff_formatter.rb +102 -12
data/lib/canon/version.rb +1 -1
metadata +3 -2

data/docs/advanced/diff-classification.adoc CHANGED

@@ -80,14 +80,20 @@ Classification depends on `attribute_order` setting:
 │                                                                   │
 │ DiffClassifier examines each DiffNode:                           │
 │                                                                   │
-│ For each dimension:                                               │
-│   behavior = match_options.behavior_for(dimension)                │
+│ 1. Serialization-level formatting (XmlSerializationFormatter)    │
+│    → XML syntax differences: <tag/> vs <tag></tag>               │
+│    → ALWAYS formatting-only (non-normative)                      │
 │                                                                   │
-│   if behavior == :ignore                                          │
-│     → INFORMATIVE (difference doesn't matter)                     │
-│   else  # :strict or :normalize                                   │
-│     → NORMATIVE (difference matters)                              │
+│ 2. Content-level formatting (text_content: :normalize)           │
+│    → Whitespace differences in content                           │
+│    → Formatting-only when normalized content matches             │
 │                                                                   │
+│ 3. CompareProfile policy (normative vs informative)              │
+│    → behavior == :ignore → INFORMATIVE                           │
+│    → behavior == :strict → NORMATIVE                             │
+│    → behavior == :normalize → Check content normalization        │
+│                                                                   │
+│ Sets diff_node.formatting = true/false                           │
 │ Sets diff_node.normative = true/false                            │
 └───────────────────────────────────┬───────────────────────────────┘
                                     ↓
@@ -102,6 +108,27 @@ Classification depends on `attribute_order` setting:
 └──────────────────────────────────────────────────────────────────┘
 ----
+=== Three-Level Classification System
+Canon distinguishes between **three distinct kinds of differences**:
+| Kind | `formatting:` | `normative:` | Meaning | Examples |
+|------|---------------|--------------|---------|----------|
+| **Serialization formatting** | `true` | `false` | XML syntax differences | `<tag/>` vs `<tag></tag>` |
+| **Content formatting** | `true` | `false` | Whitespace in content | `Hello  world` vs `Hello world` |
+| **Informative** | `false` | `false` | Tracked but doesn't affect equivalence | Attribute order (when `:ignore`) |
+| **Normative** | `false` | `true` | Affects equivalence | Different words, missing elements |
+**Key distinction**:
+* **Serialization-level formatting**: XML syntax differences that are ALWAYS non-normative regardless of match options, because they represent different valid serializations of the same semantic content. Detected by `XmlSerializationFormatter`.
+* **Content-level formatting**: Whitespace differences in document content. These are formatting-only (non-normative) when normalized content matches (using `text_content: :normalize`).
+* **Informative**: Differences tracked for reference but don't affect equivalence (when behavior is `:ignore`).
+* **Normative**: Semantic content differences that affect equivalence (when behavior is `:strict` or when normalized content differs).
 == CompareProfile-Based Classification
 === Overview
@@ -120,22 +147,42 @@ DiffNode → DiffClassifier → CompareProfile → normative?
 === Classification Hierarchy
-Canon uses a three-level hierarchy for classifying differences:
+Canon uses a **multi-level hierarchy** for classifying differences:
-1. **Formatting-only** (lowest priority)
-   - Pure whitespace/formatting differences
-   - Normalized content is identical
-   - Markers: `[` and `]` in diff output
+[source]
+----
+DiffNode → DiffClassifier → XmlSerializationFormatter → serialization formatting?
+                                       ↓
+                                  CompareProfile → normative dimension?
+                                       ↓
+                                  FormattingDetector → formatting-only?
+                                       ↓
+                                  Final classification
+----
+**Classification priority (from highest to lowest specificity)**:
+1. **Serialization-level formatting** (highest priority)
+   - XML syntax differences: `<tag/>` vs `<tag></tag>`
+   - Detected by `XmlSerializationFormatter`
+   - **ALWAYS** `formatting: true, normative: false`
+   - Bypasses all other classification logic
-2. **Informative** (medium priority)
+2. **Content-level formatting**
+   - Whitespace differences in document content
+   - Detected by `FormattingDetector` when `text_content: :normalize`
+   - `formatting: true, normative: false` when normalized content matches
+   - Respects element-level whitespace sensitivity
+3. **Informative** (based on `:ignore` behavior)
    - Tracked but doesn't affect equivalence
-   - Based on behavior `:ignore`
-   - Markers: `<` and `>` in diff output
+   - `formatting: false, normative: false`
+   - Example: Attribute order when `attribute_order: :ignore`
-3. **Normative** (highest priority)
+4. **Normative** (based on `:strict` behavior or content mismatch)
    - Affects equivalence
-   - Based on behavior `:strict`
-   - Markers: `-` and `+` in diff output
+   - `formatting: false, normative: true`
+   - Example: Different words, missing elements
 === Format-Specific Policies
@@ -292,6 +339,34 @@ Canon::Comparison.equivalent?(html1, html2, format: :html)
 ----
 ====
+.Self-closing vs explicit closing tags
+====
+Per XML standards, `<tag/>` and `<tag></tag>` are semantically equivalent (both represent empty elements). Canon classifies differences in serialisation format as **formatting-only** (non-normative):
+[source,ruby]
+----
+# Self-closing vs explicit closing - always equivalent
+xml1 = '<svg><rect x="10" y="10"/></svg>'
+xml2 = '<svg><rect x="10" y="10"></rect></svg>'
+Canon::Comparison.equivalent?(xml1, xml2, format: :xml)
+# => true
+# Empty/whitespace-only text nodes from serialisation are formatting-only
+result = Canon::Comparison.equivalent?(xml1, xml2, format: :xml, verbose: true)
+result.differences.each do |diff|
+  if diff.dimension == :text_content
+    puts "Normative: #{diff.normative?}"  # => false
+    puts "Formatting: #{diff.formatting?}"  # => true
+  end
+end
+----
+This applies regardless of `text_content` behavior setting, as these differences are purely serialisation format variations (similar to attribute order).
+The key insight: empty or whitespace-only text nodes created by different serialisation styles (`<tag/>` vs `<tag></tag>`) are always classified as **formatting-only**, not normative.
+====
 === FormattingDetector Integration
 For dimensions that support it (`:text_content`, `:structural_whitespace`),
@@ -319,19 +394,35 @@ With `:normalize` mode:
 === Implementation Details
-The [`CompareProfile`](../../lib/canon/comparison/compare_profile.rb) class provides:
+The classification system uses three main classes:
-* `normative_dimension?(dimension)` - Is this dimension normative?
-* `affects_equivalence?(dimension)` - Does this dimension affect equivalence?
-* `supports_formatting_detection?(dimension)` - Can this dimension have formatting-only diffs?
+* **`XmlSerializationFormatter`** - Detects XML serialization-level formatting differences
+  - Self-closing vs explicit closing tags: `<tag/>` vs `<tag></tag>`
+  - Always returns `formatting: true, normative: false`
+  - These differences are ALWAYS non-normative regardless of match options
-The [`DiffClassifier`](../../lib/canon/diff/diff_classifier.rb) uses CompareProfile to classify differences, with special handling for `text_content: :normalize`:
+* **`CompareProfile`** - Determines dimension behavior and policy
+  - `normative_dimension?(dimension)` - Is this dimension normative?
+  - `affects_equivalence?(dimension)` - Does this dimension affect equivalence?
+  - `supports_formatting_detection?(dimension)` - Can this dimension have formatting-only diffs?
+* **`DiffClassifier`** - Orchestrates classification using the above
+  - First checks `XmlSerializationFormatter` for serialization formatting
+  - Then handles content-level formatting (text_content: :normalize)
+  - Finally applies `CompareProfile` policy for normative vs informative
 [source,ruby]
 ----
 def classify(diff_node)
-  # SPECIAL CASE: text_content with :normalize behavior
-  # Formatting-only differences (whitespace-only) are marked as non-normative
+  # FIRST: Check for XML serialization-level formatting differences
+  # These are ALWAYS non-normative (formatting-only) regardless of match options
+  if XmlSerializationFormatter.serialization_formatting?(diff_node)
+    diff_node.formatting = true
+    diff_node.normative = false
+    return diff_node
+  end
+  # SECOND: Handle content-level formatting for text_content with :normalize
   if diff_node.dimension == :text_content &&
       profile.send(:behavior_for, :text_content) == :normalize &&
       !inside_whitespace_sensitive_element?(diff_node) &&
@@ -341,10 +432,10 @@ def classify(diff_node)
     return diff_node
   end
-  # Standard classification flow
+  # THIRD: Apply CompareProfile policy
   is_normative = profile.normative_dimension?(diff_node.dimension)
-  # Only check formatting for non-normative dimensions
+  # FOURTH: Check FormattingDetector for non-normative dimensions
   if !is_normative && profile.supports_formatting_detection?(diff_node.dimension)
     if formatting_only_diff?(diff_node)
       diff_node.formatting = true
@@ -353,6 +444,7 @@ def classify(diff_node)
     end
   end
+  # FIFTH: Apply normative determination
   diff_node.normative = is_normative
   diff_node
 end
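
The ordering that the updated `classify` method describes can be sketched in isolation. This is a minimal illustration, not canon's implementation: `DiffNode` is a plain `Struct` stand-in, `classify_sketch` is a hypothetical name, and the serialization and normalized-content checks are passed in as booleans instead of calling `XmlSerializationFormatter` or `FormattingDetector`.

```ruby
# Hypothetical stand-in for canon's DiffNode; only the fields used here.
DiffNode = Struct.new(:dimension, :formatting, :normative, keyword_init: true)

# Sketch of the priority order documented in the diff above.
# behavior:           :strict, :normalize or :ignore for the node's dimension
# serialization_only: assumed result of a serialization-formatting check
# normalized_equal:   assumed result of comparing normalized content
def classify_sketch(node, behavior:, serialization_only:, normalized_equal:)
  # 1. Serialization-level formatting wins regardless of behavior.
  if serialization_only
    node.formatting = true
    node.normative = false
    return node
  end

  # 2. Content-level formatting: :normalize with matching normalized content.
  if behavior == :normalize && normalized_equal
    node.formatting = true
    node.normative = false
    return node
  end

  # 3. Policy: :ignore is informative, anything else is normative.
  node.formatting = false
  node.normative = behavior != :ignore
  node
end

node = classify_sketch(DiffNode.new(dimension: :text_content),
                       behavior: :strict,
                       serialization_only: true,
                       normalized_equal: false)
puts "formatting=#{node.formatting} normative=#{node.normative}"
```

The early returns mirror the "bypasses all other classification logic" wording in the documentation: once a difference is recognised as serialization-level, the match options are never consulted.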

data/lib/canon/cli.rb CHANGED

@@ -126,6 +126,24 @@ module Canon
         # Disable color output
         $ canon diff file1.xml file2.xml --no-color
+        # Show raw file contents (for copying to specs)
+        $ canon diff file1.xml file2.xml --show-raw-inputs
+        # Show preprocessed contents (what was actually compared)
+        $ canon diff file1.xml file2.xml --show-preprocessed-inputs
+        # Show both raw and preprocessed (full trace)
+        $ canon diff file1.xml file2.xml --show-raw-inputs --show-preprocessed-inputs
+        # Preprocess with normalization and show what was compared
+        $ canon diff file1.xml file2.xml --preprocessing normalize --show-preprocessed-inputs
+        # Show raw inputs with line numbers (RSpec-style)
+        $ canon diff file1.xml file2.xml --show-line-numbered-inputs
+        # Verbose mode (shows all three input displays)
+        $ canon diff file1.xml file2.xml --verbose
     DESC
     method_option :format,
                   aliases: "-f",
@@ -213,6 +231,18 @@ module Canon
     method_option :diff_grouping_lines,
                   type: :numeric,
                   desc: "Group diffs within N lines into context blocks (default: no grouping)"
+    method_option :show_raw_inputs,
+                  type: :boolean,
+                  default: false,
+                  desc: "Show raw/original file contents before diff"
+    method_option :show_preprocessed_inputs,
+                  type: :boolean,
+                  default: false,
+                  desc: "Show preprocessed contents (what was actually compared)"
+    method_option :show_line_numbered_inputs,
+                  type: :boolean,
+                  default: false,
+                  desc: "Show raw inputs with line numbers (RSpec-style)"
     def diff(file1, file2)
       Commands::DiffCommand.new(options).run(file1, file2)
     end

data/lib/canon/commands/diff_command.rb CHANGED

@@ -53,6 +53,9 @@ module Canon
           context_lines: @options.fetch(:context_lines, 3),
           diff_grouping_lines: @options[:diff_grouping_lines],
           show_diffs: @options[:show_diffs]&.to_sym || :all,
+          show_raw_inputs: @options[:show_raw_inputs] || false,
+          show_preprocessed_inputs: @options[:show_preprocessed_inputs] || false,
+          show_line_numbered_inputs: @options[:show_line_numbered_inputs] || false,
         )
         # Show configuration in verbose mode using shared DebugOutput
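
The three new keys use `|| false` rather than `fetch` with a default. A small sketch of the difference, assuming a plain symbol-keyed `Hash` in place of the options object the CLI actually passes (the `options` contents below are made up for illustration):

```ruby
# Assumed stand-in for the CLI options; real code receives Thor's options.
options = { show_raw_inputs: nil, show_diffs: "normative", context_lines: nil }

# `|| false` turns both a missing key and an explicit nil into false.
show_raw          = options[:show_raw_inputs] || false
show_preprocessed = options[:show_preprocessed_inputs] || false

# `fetch` only falls back when the key is absent, so an explicit nil survives.
context_lines = options.fetch(:context_lines, 3)
grouping      = options.fetch(:diff_grouping_lines, 3)

# Safe navigation plus `||` covers the nil case before calling to_sym.
show_diffs = options[:show_diffs]&.to_sym || :all

puts [show_raw, show_preprocessed, context_lines.inspect, grouping, show_diffs].join(" ")
```

With this pattern the formatter always receives a real boolean for the three display flags, whether or not the flag was given on the command line.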

data/lib/canon/comparison/markup_comparator.rb CHANGED

@@ -239,9 +239,116 @@ module Canon
         # @param diff2 [Symbol] Difference type for node2
         # @param dimension [Symbol] The dimension of the difference
         # @return [String] Human-readable reason
-        def build_difference_reason(_node1, _node2, diff1, diff2, dimension)
+        def build_difference_reason(node1, node2, diff1, diff2, dimension)
+          # For attribute presence differences, show what attributes differ
+          if dimension == :attribute_presence
+            attrs1 = extract_attributes(node1)
+            attrs2 = extract_attributes(node2)
+            return build_attribute_difference_reason(attrs1, attrs2)
+          end
+          # For text content differences, show the actual text (truncated if needed)
+          if dimension == :text_content
+            text1 = extract_text_content_from_node(node1)
+            text2 = extract_text_content_from_node(node2)
+            return build_text_difference_reason(text1, text2)
+          end
           # Default reason - can be overridden in subclasses
-          "Difference in #{dimension}: #{diff1} vs #{diff2}"
+          "#{diff1} vs #{diff2}"
+        end
+        # Build a clear reason message for attribute presence differences
+        # Shows which attributes are only in node1, only in node2, or different values
+        #
+        # @param attrs1 [Hash, nil] First node's attributes
+        # @param attrs2 [Hash, nil] Second node's attributes
+        # @return [String] Clear explanation of the attribute difference
+        def build_attribute_difference_reason(attrs1, attrs2)
+          return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2
+          require "set"
+          keys1 = attrs1.keys.to_set
+          keys2 = attrs2.keys.to_set
+          only_in_1 = keys1 - keys2
+          only_in_2 = keys2 - keys1
+          common = keys1 & keys2
+          # Check if values differ for common keys
+          different_values = common.reject { |k| attrs1[k] == attrs2[k] }
+          parts = []
+          parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
+          parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
+          parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?
+          if parts.empty?
+            "#{keys1.size} vs #{keys2.size} attributes (same names)"
+          else
+            parts.join("; ")
+          end
+        end
+        # Extract text content from a node for diff reason
+        #
+        # @param node [Object, nil] Node to extract text from
+        # @return [String, nil] Text content or nil
+        def extract_text_content_from_node(node)
+          return nil if node.nil?
+          # For Canon::Xml::Nodes::TextNode
+          return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)
+          # For XML/HTML nodes with text_content method
+          return node.text_content if node.respond_to?(:text_content)
+          # For nodes with text method
+          return node.text if node.respond_to?(:text)
+          # For nodes with content method (Moxml::Text)
+          return node.content if node.respond_to?(:content)
+          # For nodes with value method (other types)
+          return node.value if node.respond_to?(:value)
+          # For simple text nodes or strings
+          return node.to_s if node.is_a?(String)
+          # For other node types, try to_s
+          node.to_s
+        rescue StandardError
+          nil
+        end
+        # Build a clear reason message for text content differences
+        # Shows the actual text content (truncated if too long)
+        #
+        # @param text1 [String, nil] First text content
+        # @param text2 [String, nil] Second text content
+        # @return [String] Clear explanation of the text difference
+        def build_text_difference_reason(text1, text2)
+          # Handle nil cases
+          return "missing vs '#{truncate_text(text2)}'" if text1.nil? && text2
+          return "'#{truncate_text(text1)}' vs missing" if text1 && text2.nil?
+          return "both missing" if text1.nil? && text2.nil?
+          # Both have content - show truncated versions
+          "'#{truncate_text(text1)}' vs '#{truncate_text(text2)}'"
+        end
+        # Truncate text for display in reason messages
+        #
+        # @param text [String] Text to truncate
+        # @param max_length [Integer] Maximum length
+        # @return [String] Truncated text
+        def truncate_text(text, max_length = 40)
+          return "" if text.nil?
+          text = text.to_s
+          return text if text.length <= max_length
+          "#{text[0...max_length]}..."
         end
         # Serialize an element node to string
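
To see what the new attribute reason strings look like, the set arithmetic from `build_attribute_difference_reason` can be run on plain hashes. The sketch below is a standalone copy under a hypothetical name (`attribute_reason_sketch`), with made-up attribute hashes; it does not call into canon.

```ruby
require "set"

# Standalone copy of the attribute comparison logic shown in the diff above.
def attribute_reason_sketch(attrs1, attrs2)
  unless attrs1 && attrs2
    return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes"
  end

  keys1 = attrs1.keys.to_set
  keys2 = attrs2.keys.to_set
  only_in_1 = keys1 - keys2
  only_in_2 = keys2 - keys1
  # Names present on both sides whose values differ.
  different_values = (keys1 & keys2).reject { |k| attrs1[k] == attrs2[k] }

  parts = []
  parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
  parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
  parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?

  parts.empty? ? "#{keys1.size} vs #{keys2.size} attributes (same names)" : parts.join("; ")
end

puts attribute_reason_sketch({ "id" => "a", "class" => "x" }, { "id" => "b", "lang" => "en" })
# prints: only in first: class; only in second: lang; different values: id
puts attribute_reason_sketch(nil, { "id" => "a" })
# prints: 0 vs 1 attributes
```

Sorting the names before joining keeps the message stable regardless of the order in which attributes appear in either document.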

data/lib/canon/comparison/xml_comparator/diff_node_builder.rb CHANGED

@@ -1,5 +1,6 @@
 # frozen_string_literal: true
+require "set"
 require_relative "../../diff/diff_node"
 require_relative "../../diff/path_builder"
 require_relative "../../diff/node_serializer"
@@ -62,6 +63,21 @@ module Canon
           end
         end
+        # For attribute presence differences, show what attributes differ
+        if dimension == :attribute_presence
+          attrs1 = extract_attributes(node1)
+          attrs2 = extract_attributes(node2)
+          return build_attribute_difference_reason(attrs1, attrs2)
+        end
+        # For text content differences, show the actual text (truncated if needed)
+        if dimension == :text_content
+          text1 = extract_text_content(node1)
+          text2 = extract_text_content(node2)
+          return build_text_difference_reason(text1, text2)
+        end
+        # Default reason
         "#{diff1} vs #{diff2}"
       end
@@ -110,6 +126,98 @@ module Canon
         Canon::Diff::NodeSerializer.extract_attributes(node)
       end
+      # Build a clear reason message for attribute presence differences
+      # Shows which attributes are only in node1, only in node2, or different values
+      #
+      # @param attrs1 [Hash, nil] First node's attributes
+      # @param attrs2 [Hash, nil] Second node's attributes
+      # @return [String] Clear explanation of the attribute difference
+      def self.build_attribute_difference_reason(attrs1, attrs2)
+        return "#{attrs1&.keys&.size || 0} vs #{attrs2&.keys&.size || 0} attributes" unless attrs1 && attrs2
+        keys1 = attrs1.keys.to_set
+        keys2 = attrs2.keys.to_set
+        only_in_1 = keys1 - keys2
+        only_in_2 = keys2 - keys1
+        common = keys1 & keys2
+        # Check if values differ for common keys
+        different_values = common.reject { |k| attrs1[k] == attrs2[k] }
+        parts = []
+        parts << "only in first: #{only_in_1.to_a.sort.join(', ')}" if only_in_1.any?
+        parts << "only in second: #{only_in_2.to_a.sort.join(', ')}" if only_in_2.any?
+        parts << "different values: #{different_values.sort.join(', ')}" if different_values.any?
+        if parts.empty?
+          "#{keys1.size} vs #{keys2.size} attributes (same names)"
+        else
+          parts.join("; ")
+        end
+      end
+      # Extract text content from a node
+      #
+      # @param node [Object, nil] Node to extract text from
+      # @return [String, nil] Text content or nil
+      def self.extract_text_content(node)
+        return nil if node.nil?
+        # For Canon::Xml::Nodes::TextNode
+        return node.value if node.respond_to?(:value) && node.is_a?(Canon::Xml::Nodes::TextNode)
+        # For XML/HTML nodes with text_content method
+        return node.text_content if node.respond_to?(:text_content)
+        # For nodes with text method
+        return node.text if node.respond_to?(:text)
+        # For nodes with content method (Moxml::Text)
+        return node.content if node.respond_to?(:content)
+        # For nodes with value method (other types)
+        return node.value if node.respond_to?(:value)
+        # For simple text nodes or strings
+        return node.to_s if node.is_a?(String)
+        # For other node types, try to_s
+        node.to_s
+      rescue StandardError
+        nil
+      end
+      # Build a clear reason message for text content differences
+      # Shows the actual text content (truncated if too long)
+      #
+      # @param text1 [String, nil] First text content
+      # @param text2 [String, nil] Second text content
+      # @return [String] Clear explanation of the text difference
+      def self.build_text_difference_reason(text1, text2)
+        # Handle nil cases
+        return "missing vs '#{truncate(text2)}'" if text1.nil? && text2
+        return "'#{truncate(text1)}' vs missing" if text1 && text2.nil?
+        return "both missing" if text1.nil? && text2.nil?
+        # Both have content - show truncated versions
+        "'#{truncate(text1)}' vs '#{truncate(text2)}'"
+      end
+      # Truncate text for display in reason messages
+      #
+      # @param text [String] Text to truncate
+      # @param max_length [Integer] Maximum length
+      # @return [String] Truncated text
+      def self.truncate(text, max_length = 40)
+        return "" if text.nil?
+        text = text.to_s
+        return text if text.length <= max_length
+        "#{text[0...max_length]}..."
+      end
     end
   end
 end
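
The text reason helpers added to `DiffNodeBuilder` can likewise be exercised on their own. The following is a sketch with hypothetical names (`truncate_sketch`, `text_reason_sketch`) that reproduces the 40-character truncation and the nil handling from the diff above, using made-up input strings.

```ruby
# Standalone copy of the truncation rule: keep at most max_length characters
# and append "..." only when something was cut off.
def truncate_sketch(text, max_length = 40)
  return "" if text.nil?

  text = text.to_s
  text.length <= max_length ? text : "#{text[0...max_length]}..."
end

# Standalone copy of the text reason logic, including the nil cases.
def text_reason_sketch(text1, text2)
  return "both missing" if text1.nil? && text2.nil?
  return "missing vs '#{truncate_sketch(text2)}'" if text1.nil?
  return "'#{truncate_sketch(text1)}' vs missing" if text2.nil?

  "'#{truncate_sketch(text1)}' vs '#{truncate_sketch(text2)}'"
end

puts text_reason_sketch(nil, "Hello")
# prints: missing vs 'Hello'
puts text_reason_sketch("Hello", "Hello  world")
# prints: 'Hello' vs 'Hello  world'
puts text_reason_sketch("x" * 45, nil)
```

Because the truncated text is quoted, whitespace-only differences such as a doubled space remain visible in the reason string.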