RubyGems - canon - Versions diffs - 0.1.8 → 0.1.9 - Mend

canon 0.1.8 → 0.1.9

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (98)

checksums.yaml +4 -4
data/.rubocop_todo.yml +112 -25
data/docs/Gemfile +1 -0
data/docs/_config.yml +90 -1
data/docs/advanced/diff-classification.adoc +82 -2
data/docs/features/match-options/index.adoc +239 -1
data/lib/canon/comparison/format_detector.rb +2 -1
data/lib/canon/comparison/html_comparator.rb +19 -8
data/lib/canon/comparison/html_compare_profile.rb +8 -2
data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
data/lib/canon/comparison/xml_comparator.rb +48 -23
data/lib/canon/comparison/xml_node_comparison.rb +25 -3
data/lib/canon/diff/diff_classifier.rb +101 -2
data/lib/canon/diff/formatting_detector.rb +1 -1
data/lib/canon/rspec_matchers.rb +37 -8
data/lib/canon/version.rb +1 -1
data/lib/canon/xml/data_model.rb +24 -13
metadata +3 -78
data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
data/false_positive_analysis.txt +0 -0
data/file1.html +0 -1
data/file2.html +0 -1
data/old-docs/ADVANCED_TOPICS.adoc +0 -20
data/old-docs/BASIC_USAGE.adoc +0 -16
data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
data/old-docs/CLI.adoc +0 -497
data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
data/old-docs/DIFF_FORMATTING.adoc +0 -540
data/old-docs/DIFF_PARAMETERS.adoc +0 -261
data/old-docs/DOM_DIFF.adoc +0 -1017
data/old-docs/ENV_CONFIG.adoc +0 -876
data/old-docs/FORMATS.adoc +0 -867
data/old-docs/INPUT_VALIDATION.adoc +0 -477
data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
data/old-docs/MATCH_OPTIONS.adoc +0 -912
data/old-docs/MODES.adoc +0 -432
data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
data/old-docs/OPTIONS.adoc +0 -1387
data/old-docs/PREPROCESSING.adoc +0 -491
data/old-docs/README.old.adoc +0 -2831
data/old-docs/RSPEC.adoc +0 -814
data/old-docs/RUBY_API.adoc +0 -485
data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
data/old-docs/STRING_COMPARE.adoc +0 -345
data/old-docs/TMP.adoc +0 -3384
data/old-docs/TREE_DIFF.adoc +0 -1080
data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
data/old-docs/VERBOSE.adoc +0 -482
data/old-docs/VISUALIZATION_MAP.adoc +0 -625
data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
data/scripts/analyze_current_state.rb +0 -85
data/scripts/analyze_false_positives.rb +0 -114
data/scripts/analyze_remaining_failures.rb +0 -105
data/scripts/compare_current_failures.rb +0 -95
data/scripts/compare_dom_tree_diff.rb +0 -158
data/scripts/compare_failures.rb +0 -151
data/scripts/debug_attribute_extraction.rb +0 -66
data/scripts/debug_blocks_839.rb +0 -115
data/scripts/debug_meta_matching.rb +0 -52
data/scripts/debug_p_matching.rb +0 -192
data/scripts/debug_signature_matching.rb +0 -118
data/scripts/debug_sourcecode_124.rb +0 -32
data/scripts/debug_whitespace_sensitive.rb +0 -192
data/scripts/extract_false_positives.rb +0 -138
data/scripts/find_actual_false_positives.rb +0 -125
data/scripts/investigate_all_false_positives.rb +0 -161
data/scripts/investigate_batch1.rb +0 -127
data/scripts/investigate_classification.rb +0 -150
data/scripts/investigate_classification_detailed.rb +0 -190
data/scripts/investigate_common_failures.rb +0 -342
data/scripts/investigate_false_negative.rb +0 -80
data/scripts/investigate_false_positive.rb +0 -83
data/scripts/investigate_false_positives.rb +0 -227
data/scripts/investigate_false_positives_batch.rb +0 -163
data/scripts/investigate_mixed_content.rb +0 -125
data/scripts/investigate_remaining_16.rb +0 -214
data/scripts/run_single_test.rb +0 -29
data/scripts/test_all_false_positives.rb +0 -95
data/scripts/test_attribute_details.rb +0 -61
data/scripts/test_both_algorithms.rb +0 -49
data/scripts/test_both_simple.rb +0 -49
data/scripts/test_enhanced_semantic_output.rb +0 -125
data/scripts/test_readme_examples.rb +0 -131
data/scripts/test_semantic_tree_diff.rb +0 -99
data/scripts/test_semantic_ux_improvements.rb +0 -135
data/scripts/test_single_false_positive.rb +0 -119
data/scripts/test_size_limits.rb +0 -99
data/test_html_1.html +0 -21
data/test_html_2.html +0 -21
data/test_nokogiri.rb +0 -33
data/test_normalize.rb +0 -45

data/docs/plans/2025-01-17-html-parser-selection-fix.adoc DELETED

@@ -1,250 +0,0 @@
-= HTML Parser Selection Fix Design
-:doctype: article
-:date: 2025-01-17
-:status: Approved
-== Problem Statement
-When comparing HTML documents with `lang` and `xml:lang` attributes, users see false attribute differences:
-----
-⊖ Expected (File 1):
-  <span> with 1 attribute: xml:lang
-⊕ Actual (File 2):
-  <span> with 2 attributes: lang, xml:lang
-----
-Both HTML strings have identical attributes (`lang="EN-GB" xml:lang="EN-GB"`), but the comparison shows different attribute counts. This happens because:
-. *DOM path* uses `Nokogiri::XML.fragment` for all HTML, which treats `lang` and `xml:lang` as the same attribute (XML namespace behavior)
-. *Semantic path* uses `Nokogiri::HTML5.fragment` or `Nokogiri::HTML4.fragment`, which correctly treats them as distinct
-. *The `parse_html` method ignores the format parameter* and returns raw strings, causing inconsistent parsing
-== Root Cause
-In `lib/canon/comparison.rb`, the `parse_html` method at line 374:
-[source,ruby]
-----
-def parse_html(content, _format)  # format is IGNORED!
-  return content unless content.is_a?(String)
-  # ... returns raw string instead of parsing
-end
-----
-This causes HTML version information to be lost, and `HtmlComparator#parse_node` ends up using `XML.fragment` for all HTML content.
-== Solution
-=== Architecture
-Fix the 4-layer architecture to respect user's parser choice:
-----
-User specifies format: :html5
-         |
-         v
-Level 1: Preprocessing
-  parse_html(html, :html5) -> Nokogiri::HTML5.fragment ✓
-         |
-         v
-Level 2: Diff Algorithm (DiffNode creation)
-  Parsed nodes have accurate attributes ✓
-         |
-         v
-Level 3: Diff Report
-  Enriched metadata is correct ✓
-         |
-         v
-Level 4: Diff Rendering
-  Accurate attribute counts in output ✓
-----
-=== Component Changes
-==== 1. `parse_html` Method (`lib/canon/comparison.rb`)
-*Current behavior:* Ignores format parameter, returns raw string
-*New behavior:* Parse with correct Nokogiri parser based on format
-[source,ruby]
-----
-def parse_html(content, format)
-  return content unless content.is_a?(String)
-  return content if already_parsed?(content)
-  begin
-    case format
-    when :html5
-      Nokogiri::HTML5.fragment(content)
-    when :html4
-      Nokogiri::HTML4.fragment(content)
-    when :html
-      detect_and_parse_html(content)
-    else
-      content
-    end
-  rescue StandardError
-    content
-  end
-end
-private
-def already_parsed?(content)
-  content.is_a?(Nokogiri::HTML::Document) ||
-    content.is_a?(Nokogiri::HTML5::Document) ||
-    content.is_a?(Nokogiri::HTML::DocumentFragment) ||
-    content.is_a?(Nokogiri::HTML5::DocumentFragment) ||
-    content.is_a?(Nokogiri::XML::DocumentFragment)
-end
-def detect_and_parse_html(content)
-  version = detect_html_version(content)
-  version == :html5 ?
-    Nokogiri::HTML5.fragment(content) :
-    Nokogiri::HTML4.fragment(content)
-end
-def detect_html_version(content)
-  content.include?('<!DOCTYPE html>') ? :html5 : :html4
-end
-----
-==== 2. `dom_diff` Method (`lib/canon/comparison.rb`)
-*Current behavior:* Normalizes `html4`/`html5` to `:html` at line 320
-*New behavior:* Preserve format information
-Remove or modify line 320:
-[source,ruby]
-----
-# OLD: format1 = format2 = :html
-# NEW: Keep format1, format2 as html4 or html5
-----
-This ensures the format is passed through to `HtmlComparator` and used consistently.
-=== Error Handling
-. *Parse failures:* Fall back to raw string (maintains backward compatibility)
-. *Already-parsed documents:* Return as-is, don't re-parse
-. *Mixed input types:* Both documents parsed with consistent parser based on format parameter
-=== Testing Strategy
-==== Unit Tests (`spec/canon/comparison_spec.rb`)
-[source,ruby]
-----
-context "parse_html with format parameter" do
-  it "parses HTML5 with HTML5.fragment when format is :html5" do
-    html = '<span lang="en" xml:lang="en">text</span>'
-    result = Canon::Comparison.send(:parse_html, html, :html5)
-    expect(result).to be_a(Nokogiri::HTML5::DocumentFragment)
-    expect(result.at_css('span').attributes.keys).to eq(['lang', 'xml:lang'])
-  end
-  it "parses HTML4 with HTML4.fragment when format is :html4" do
-    html = '<span lang="en" xml:lang="en">text</span>'
-    result = Canon::Comparison.send(:parse_html, html, :html4)
-    expect(result).to be_a(Nokogiri::HTML4::DocumentFragment)
-  end
-  it "returns already-parsed documents as-is" do
-    frag = Nokogiri::HTML5.fragment('<span>text</span>')
-    result = Canon::Comparison.send(:parse_html, frag, :html5)
-    expect(result).to eq(frag)
-  end
-end
-----
-==== Integration Tests (`spec/canon/html_comparison_spec.rb`)
-[source,ruby]
-----
-context "HTML5 lang and xml:lang attributes" do
-  it "treats lang and xml:lang as distinct attributes in HTML5" do
-    html1 = '<span lang="EN-GB" xml:lang="EN-GB">text</span>'
-    html2 = '<span lang="EN-GB" xml:lang="EN-GB">text</span>'
-    result = Canon::Comparison.equivalent?(
-      html1, html2,
-      format: :html5,
-      verbose: true
-    )
-    expect(result).to be_equivalent
-  end
-  it "does NOT show false attribute differences" do
-    html1 = '<span lang="EN-GB" xml:lang="EN-GB">&#xA0;</span>'
-    html2 = '<span lang="EN-GB" xml:lang="EN-GB">␣</span>'
-    result = Canon::Comparison.equivalent?(
-      html1, html2,
-      format: :html5,
-      verbose: true
-    )
-    # Only difference should be the non-breaking space encoding
-    # No attribute differences should be reported
-    attr_diffs = result.differences.select { |d| d.dimension == :attribute_values }
-    expect(attr_diffs).to be_empty
-  end
-end
-----
-==== Backward Compatibility Tests
-[source,ruby]
-----
-context "backward compatibility" do
-  it "works when format is not specified (auto-detect)" do
-    html1 = '<span>text</span>'
-    html2 = '<span>text</span>'
-    expect(Canon::Comparison.equivalent?(html1, html2)).to be true
-  end
-  it "handles strings with :html format (legacy behavior)" do
-    html1 = '<span>text</span>'
-    html2 = '<span>text</span>'
-    expect(Canon::Comparison.equivalent?(html1, html2, format: :html)).to be true
-  end
-end
-----
-== Implementation Checklist
-* [ ] Modify `parse_html` in `lib/canon/comparison.rb`
-* [ ] Add helper methods: `already_parsed?`, `detect_and_parse_html`, `detect_html_version`
-* [ ] Update `dom_diff` to preserve format (line 320)
-* [ ] Add unit tests for `parse_html` method
-* [ ] Add integration tests for lang/xml:lang
-* [ ] Add backward compatibility tests
-* [ ] Run full test suite to ensure no regressions
-== Expected Outcomes
-After this fix:
-. *`lang` and `xml:lang` are treated as distinct attributes in HTML5/HTML4*
-. *No false attribute differences when both documents have identical attributes*
-. *User can explicitly control parser via `format: :html5` or `format: :html4`*
-. *Backward compatible with existing code (auto-detect still works)*
-. *Consistent parsing regardless of input format (string vs DocumentFragment)*
-== Notes
-. HTML entity normalization (`&#xa0;` vs `␣`) is intentionally NOT changed - these are semantically equivalent but different serializations, and the diff correctly shows this difference
-. XML comparison continues to use `XML.fragment` - this fix only affects HTML parsing
-. The semantic path already works correctly via `Canon::Html::DataModel.from_html`
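
The format handling described in the deleted design note above can be sketched without Nokogiri. This is a minimal illustration, assuming the doctype check quoted in the note (`content.include?('<!DOCTYPE html>')`). `resolve_html_format` is a hypothetical helper name introduced here and is not part of canon; the sketch only selects a format symbol and does not parse anything.

```ruby
# Sketch only: mirrors the doctype check quoted in the deleted design note.
# The check is case-sensitive and matches the literal HTML5 doctype only.
def detect_html_version(content)
  content.include?('<!DOCTYPE html>') ? :html5 : :html4
end

# Hypothetical helper (not part of canon): decide which HTML flavour a
# caller-supplied format symbol refers to.
def resolve_html_format(content, format)
  case format
  when :html5, :html4
    format                       # an explicit choice is preserved
  when :html
    detect_html_version(content) # generic :html falls back to detection
  end                            # any other format yields nil
end

resolve_html_format('<!DOCTYPE html><p>x</p>', :html) # => :html5
resolve_html_format('<p>x</p>', :html4)               # => :html4
```

Because `String#include?` is case-sensitive, a lowercase `<!doctype html>` would be classified as `:html4` by this check.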

data/false_positive_analysis.txt DELETED

File without changes

data/file1.html DELETED

@@ -1 +0,0 @@
-<html><body><br><div class="TOC" id="_"><h1 class="IntroTitle">Table of contents</h1></div></body></html>

data/file2.html DELETED

@@ -1 +0,0 @@
-<html><body><br><div id="_" class="TOC"><h1 class="IntroTitle">Table of contents</h1></div></body></html>

data/old-docs/ADVANCED_TOPICS.adoc DELETED

@@ -1,20 +0,0 @@
----
-layout: default
-title: Advanced Topics
-nav_order: 5
-has_children: true
----
-= Advanced topics
-For developers and advanced users:
-* **link:VERBOSE[Verbose mode]** - Two-tier diff output architecture
-* **link:SEMANTIC_DIFF_REPORT[Semantic diff report]** - Detailed report
-format
-* **link:NORMATIVE_INFORMATIVE_DIFFS[Normative vs informative diffs]** - Diff
-classification
-* **link:DIFF_ARCHITECTURE[Diff architecture]** - Six-layer technical
-pipeline
-These documents cover Canon's internal architecture and advanced features for
-developers extending or maintaining Canon.

data/old-docs/BASIC_USAGE.adoc DELETED

@@ -1,16 +0,0 @@
----
-layout: default
-title: Basic Usage
-nav_order: 2
-has_children: true
----
-= Basic usage
-Choose your interface for working with Canon:
-* **link:RUBY_API[Ruby API]** - Using Canon from Ruby code
-* **link:CLI[Command-line interface]** - Terminal commands and options
-* **link:RSPEC[RSpec matchers]** - Testing with Canon in RSpec
-These guides provide practical examples and complete API reference for each
-interface.