canon 0.1.8 → 0.1.9

Files changed (98)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +112 -25
  3. data/docs/Gemfile +1 -0
  4. data/docs/_config.yml +90 -1
  5. data/docs/advanced/diff-classification.adoc +82 -2
  6. data/docs/features/match-options/index.adoc +239 -1
  7. data/lib/canon/comparison/format_detector.rb +2 -1
  8. data/lib/canon/comparison/html_comparator.rb +19 -8
  9. data/lib/canon/comparison/html_compare_profile.rb +8 -2
  10. data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
  11. data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
  12. data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
  13. data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
  14. data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
  15. data/lib/canon/comparison/xml_comparator.rb +48 -23
  16. data/lib/canon/comparison/xml_node_comparison.rb +25 -3
  17. data/lib/canon/diff/diff_classifier.rb +101 -2
  18. data/lib/canon/diff/formatting_detector.rb +1 -1
  19. data/lib/canon/rspec_matchers.rb +37 -8
  20. data/lib/canon/version.rb +1 -1
  21. data/lib/canon/xml/data_model.rb +24 -13
  22. metadata +3 -78
  23. data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
  24. data/false_positive_analysis.txt +0 -0
  25. data/file1.html +0 -1
  26. data/file2.html +0 -1
  27. data/old-docs/ADVANCED_TOPICS.adoc +0 -20
  28. data/old-docs/BASIC_USAGE.adoc +0 -16
  29. data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
  30. data/old-docs/CLI.adoc +0 -497
  31. data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
  32. data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
  33. data/old-docs/DIFF_FORMATTING.adoc +0 -540
  34. data/old-docs/DIFF_PARAMETERS.adoc +0 -261
  35. data/old-docs/DOM_DIFF.adoc +0 -1017
  36. data/old-docs/ENV_CONFIG.adoc +0 -876
  37. data/old-docs/FORMATS.adoc +0 -867
  38. data/old-docs/INPUT_VALIDATION.adoc +0 -477
  39. data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
  40. data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
  41. data/old-docs/MATCH_OPTIONS.adoc +0 -912
  42. data/old-docs/MODES.adoc +0 -432
  43. data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
  44. data/old-docs/OPTIONS.adoc +0 -1387
  45. data/old-docs/PREPROCESSING.adoc +0 -491
  46. data/old-docs/README.old.adoc +0 -2831
  47. data/old-docs/RSPEC.adoc +0 -814
  48. data/old-docs/RUBY_API.adoc +0 -485
  49. data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
  50. data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
  51. data/old-docs/STRING_COMPARE.adoc +0 -345
  52. data/old-docs/TMP.adoc +0 -3384
  53. data/old-docs/TREE_DIFF.adoc +0 -1080
  54. data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
  55. data/old-docs/VERBOSE.adoc +0 -482
  56. data/old-docs/VISUALIZATION_MAP.adoc +0 -625
  57. data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
  58. data/scripts/analyze_current_state.rb +0 -85
  59. data/scripts/analyze_false_positives.rb +0 -114
  60. data/scripts/analyze_remaining_failures.rb +0 -105
  61. data/scripts/compare_current_failures.rb +0 -95
  62. data/scripts/compare_dom_tree_diff.rb +0 -158
  63. data/scripts/compare_failures.rb +0 -151
  64. data/scripts/debug_attribute_extraction.rb +0 -66
  65. data/scripts/debug_blocks_839.rb +0 -115
  66. data/scripts/debug_meta_matching.rb +0 -52
  67. data/scripts/debug_p_matching.rb +0 -192
  68. data/scripts/debug_signature_matching.rb +0 -118
  69. data/scripts/debug_sourcecode_124.rb +0 -32
  70. data/scripts/debug_whitespace_sensitive.rb +0 -192
  71. data/scripts/extract_false_positives.rb +0 -138
  72. data/scripts/find_actual_false_positives.rb +0 -125
  73. data/scripts/investigate_all_false_positives.rb +0 -161
  74. data/scripts/investigate_batch1.rb +0 -127
  75. data/scripts/investigate_classification.rb +0 -150
  76. data/scripts/investigate_classification_detailed.rb +0 -190
  77. data/scripts/investigate_common_failures.rb +0 -342
  78. data/scripts/investigate_false_negative.rb +0 -80
  79. data/scripts/investigate_false_positive.rb +0 -83
  80. data/scripts/investigate_false_positives.rb +0 -227
  81. data/scripts/investigate_false_positives_batch.rb +0 -163
  82. data/scripts/investigate_mixed_content.rb +0 -125
  83. data/scripts/investigate_remaining_16.rb +0 -214
  84. data/scripts/run_single_test.rb +0 -29
  85. data/scripts/test_all_false_positives.rb +0 -95
  86. data/scripts/test_attribute_details.rb +0 -61
  87. data/scripts/test_both_algorithms.rb +0 -49
  88. data/scripts/test_both_simple.rb +0 -49
  89. data/scripts/test_enhanced_semantic_output.rb +0 -125
  90. data/scripts/test_readme_examples.rb +0 -131
  91. data/scripts/test_semantic_tree_diff.rb +0 -99
  92. data/scripts/test_semantic_ux_improvements.rb +0 -135
  93. data/scripts/test_single_false_positive.rb +0 -119
  94. data/scripts/test_size_limits.rb +0 -99
  95. data/test_html_1.html +0 -21
  96. data/test_html_2.html +0 -21
  97. data/test_nokogiri.rb +0 -33
  98. data/test_normalize.rb +0 -45
data/docs/plans/2025-01-17-html-parser-selection-fix.adoc DELETED
@@ -1,250 +0,0 @@
1
- = HTML Parser Selection Fix Design
2
- :doctype: article
3
- :date: 2025-01-17
4
- :status: Approved
5
-
6
- == Problem Statement
7
-
8
- When comparing HTML documents with `lang` and `xml:lang` attributes, users see false attribute differences:
9
-
10
- ----
11
- ⊖ Expected (File 1):
12
- <span> with 1 attribute: xml:lang
13
-
14
- ⊕ Actual (File 2):
15
- <span> with 2 attributes: lang, xml:lang
16
- ----
17
-
18
- Both HTML strings have identical attributes (`lang="EN-GB" xml:lang="EN-GB"`), but the comparison shows different attribute counts. This happens because:
19
-
20
- . *DOM path* uses `Nokogiri::XML.fragment` for all HTML, which treats `lang` and `xml:lang` as the same attribute (XML namespace behavior)
21
- . *Semantic path* uses `Nokogiri::HTML5.fragment` or `Nokogiri::HTML4.fragment`, which correctly treats them as distinct
22
- . *The `parse_html` method ignores the format parameter* and returns raw strings, causing inconsistent parsing
23
-
24
- == Root Cause
25
-
26
- In `lib/canon/comparison.rb`, the `parse_html` method at line 374:
27
-
28
- [source,ruby]
29
- ----
30
- def parse_html(content, _format) # format is IGNORED!
31
- return content unless content.is_a?(String)
32
- # ... returns raw string instead of parsing
33
- end
34
- ----
35
-
36
- This causes HTML version information to be lost, and `HtmlComparator#parse_node` ends up using `XML.fragment` for all HTML content.
37
-
38
- == Solution
39
-
40
- === Architecture
41
-
42
- Fix the 4-layer architecture to respect user's parser choice:
43
-
44
- ----
45
- User specifies format: :html5
46
- |
47
- v
48
- Level 1: Preprocessing
49
- parse_html(html, :html5) -> Nokogiri::HTML5.fragment ✓
50
- |
51
- v
52
- Level 2: Diff Algorithm (DiffNode creation)
53
- Parsed nodes have accurate attributes ✓
54
- |
55
- v
56
- Level 3: Diff Report
57
- Enriched metadata is correct ✓
58
- |
59
- v
60
- Level 4: Diff Rendering
61
- Accurate attribute counts in output ✓
62
- ----
63
-
64
- === Component Changes
65
-
66
- ==== 1. `parse_html` Method (`lib/canon/comparison.rb`)
67
-
68
- *Current behavior:* Ignores format parameter, returns raw string
69
-
70
- *New behavior:* Parse with correct Nokogiri parser based on format
71
-
72
- [source,ruby]
73
- ----
74
- def parse_html(content, format)
75
- return content unless content.is_a?(String)
76
- return content if already_parsed?(content)
77
-
78
- begin
79
- case format
80
- when :html5
81
- Nokogiri::HTML5.fragment(content)
82
- when :html4
83
- Nokogiri::HTML4.fragment(content)
84
- when :html
85
- detect_and_parse_html(content)
86
- else
87
- content
88
- end
89
- rescue StandardError
90
- content
91
- end
92
- end
93
-
94
- private
95
-
96
- def already_parsed?(content)
97
- content.is_a?(Nokogiri::HTML::Document) ||
98
- content.is_a?(Nokogiri::HTML5::Document) ||
99
- content.is_a?(Nokogiri::HTML::DocumentFragment) ||
100
- content.is_a?(Nokogiri::HTML5::DocumentFragment) ||
101
- content.is_a?(Nokogiri::XML::DocumentFragment)
102
- end
103
-
104
- def detect_and_parse_html(content)
105
- version = detect_html_version(content)
106
- version == :html5 ?
107
- Nokogiri::HTML5.fragment(content) :
108
- Nokogiri::HTML4.fragment(content)
109
- end
110
-
111
- def detect_html_version(content)
112
- content.include?('<!DOCTYPE html>') ? :html5 : :html4
113
- end
114
- ----
115
-
116
- ==== 2. `dom_diff` Method (`lib/canon/comparison.rb`)
117
-
118
- *Current behavior:* Normalizes `html4`/`html5` to `:html` at line 320
119
-
120
- *New behavior:* Preserve format information
121
-
122
- Remove or modify line 320:
123
-
124
- [source,ruby]
125
- ----
126
- # OLD: format1 = format2 = :html
127
- # NEW: Keep format1, format2 as html4 or html5
128
- ----
129
-
130
- This ensures the format is passed through to `HtmlComparator` and used consistently.
131
-
132
- === Error Handling
133
-
134
- . *Parse failures:* Fall back to raw string (maintains backward compatibility)
135
- . *Already-parsed documents:* Return as-is, don't re-parse
136
- . *Mixed input types:* Both documents parsed with consistent parser based on format parameter
137
-
138
- === Testing Strategy
139
-
140
- ==== Unit Tests (`spec/canon/comparison_spec.rb`)
141
-
142
- [source,ruby]
143
- ----
144
- context "parse_html with format parameter" do
145
- it "parses HTML5 with HTML5.fragment when format is :html5" do
146
- html = '<span lang="en" xml:lang="en">text</span>'
147
- result = Canon::Comparison.send(:parse_html, html, :html5)
148
-
149
- expect(result).to be_a(Nokogiri::HTML5::DocumentFragment)
150
- expect(result.at_css('span').attributes.keys).to eq(['lang', 'xml:lang'])
151
- end
152
-
153
- it "parses HTML4 with HTML4.fragment when format is :html4" do
154
- html = '<span lang="en" xml:lang="en">text</span>'
155
- result = Canon::Comparison.send(:parse_html, html, :html4)
156
-
157
- expect(result).to be_a(Nokogiri::HTML4::DocumentFragment)
158
- end
159
-
160
- it "returns already-parsed documents as-is" do
161
- frag = Nokogiri::HTML5.fragment('<span>text</span>')
162
- result = Canon::Comparison.send(:parse_html, frag, :html5)
163
-
164
- expect(result).to eq(frag)
165
- end
166
- end
167
- ----
168
-
169
- ==== Integration Tests (`spec/canon/html_comparison_spec.rb`)
170
-
171
- [source,ruby]
172
- ----
173
- context "HTML5 lang and xml:lang attributes" do
174
- it "treats lang and xml:lang as distinct attributes in HTML5" do
175
- html1 = '<span lang="EN-GB" xml:lang="EN-GB">text</span>'
176
- html2 = '<span lang="EN-GB" xml:lang="EN-GB">text</span>'
177
-
178
- result = Canon::Comparison.equivalent?(
179
- html1, html2,
180
- format: :html5,
181
- verbose: true
182
- )
183
-
184
- expect(result).to be_equivalent
185
- end
186
-
187
- it "does NOT show false attribute differences" do
188
- html1 = '<span lang="EN-GB" xml:lang="EN-GB">&#xA0;</span>'
189
- html2 = '<span lang="EN-GB" xml:lang="EN-GB">␣</span>'
190
-
191
- result = Canon::Comparison.equivalent?(
192
- html1, html2,
193
- format: :html5,
194
- verbose: true
195
- )
196
-
197
- # Only difference should be the non-breaking space encoding
198
- # No attribute differences should be reported
199
- attr_diffs = result.differences.select { |d| d.dimension == :attribute_values }
200
- expect(attr_diffs).to be_empty
201
- end
202
- end
203
- ----
204
-
205
- ==== Backward Compatibility Tests
206
-
207
- [source,ruby]
208
- ----
209
- context "backward compatibility" do
210
- it "works when format is not specified (auto-detect)" do
211
- html1 = '<span>text</span>'
212
- html2 = '<span>text</span>'
213
-
214
- expect(Canon::Comparison.equivalent?(html1, html2)).to be true
215
- end
216
-
217
- it "handles strings with :html format (legacy behavior)" do
218
- html1 = '<span>text</span>'
219
- html2 = '<span>text</span>'
220
-
221
- expect(Canon::Comparison.equivalent?(html1, html2, format: :html)).to be true
222
- end
223
- end
224
- ----
225
-
226
- == Implementation Checklist
227
-
228
- * [ ] Modify `parse_html` in `lib/canon/comparison.rb`
229
- * [ ] Add helper methods: `already_parsed?`, `detect_and_parse_html`, `detect_html_version`
230
- * [ ] Update `dom_diff` to preserve format (line 320)
231
- * [ ] Add unit tests for `parse_html` method
232
- * [ ] Add integration tests for lang/xml:lang
233
- * [ ] Add backward compatibility tests
234
- * [ ] Run full test suite to ensure no regressions
235
-
236
- == Expected Outcomes
237
-
238
- After this fix:
239
-
240
- . *`lang` and `xml:lang` are treated as distinct attributes in HTML5/HTML4*
241
- . *No false attribute differences when both documents have identical attributes*
242
- . *User can explicitly control parser via `format: :html5` or `format: :html4`*
243
- . *Backward compatible with existing code (auto-detect still works)*
244
- . *Consistent parsing regardless of input format (string vs DocumentFragment)*
245
-
246
- == Notes
247
-
248
- . HTML entity normalization (`&#xa0;` vs `␣`) is intentionally NOT changed - these are semantically equivalent but different serializations, and the diff correctly shows this difference
249
- . XML comparison continues to use `XML.fragment` - this fix only affects HTML parsing
250
- . The semantic path already works correctly via `Canon::Html::DataModel.from_html`
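The attribute-count mismatch described in the deleted design note (one attribute on the DOM path, two on the semantic path) can be illustrated with a simplified model. The sketch below is not Nokogiri's or Canon's code; the helper names `attributes_by_local_name` and `attributes_by_qualified_name` are hypothetical, and it only assumes that a namespace-aware XML view keys attributes by local name while an HTML view keys them by the full serialized name.

```ruby
# Simplified model of attribute keying, for illustration only.
# Neither helper exists in Nokogiri or Canon; both names are hypothetical.

# Key attributes by local name, as a namespace-aware XML view might.
# "xml:lang" and "lang" share the local name "lang", so the later
# entry overwrites the earlier one and only one key survives.
def attributes_by_local_name(pairs)
  pairs.each_with_object({}) do |(name, value), attrs|
    local_name = name.split(":", 2).last
    attrs[local_name] = value
  end
end

# Key attributes by their full serialized name, as an HTML view might.
# "lang" and "xml:lang" stay distinct.
def attributes_by_qualified_name(pairs)
  pairs.each_with_object({}) { |(name, value), attrs| attrs[name] = value }
end

# The attribute list from the design note's example span.
SPAN_ATTRIBUTES = [["lang", "EN-GB"], ["xml:lang", "EN-GB"]].freeze

xml_view  = attributes_by_local_name(SPAN_ATTRIBUTES)
html_view = attributes_by_qualified_name(SPAN_ATTRIBUTES)

puts "XML-style view:  #{xml_view.size} attribute(s)"
puts "HTML-style view: #{html_view.size} attribute(s)"
```

If the two sides of a comparison are parsed under different views, identical markup can report different attribute counts, which is the symptom the note set out to remove by passing the format through to the parser.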
data/false_positive_analysis.txt DELETED
File without changes
data/file1.html DELETED
@@ -1 +0,0 @@
1
- <html><body><br><div class="TOC" id="_"><h1 class="IntroTitle">Table of contents</h1></div></body></html>
data/file2.html DELETED
@@ -1 +0,0 @@
1
- <html><body><br><div id="_" class="TOC"><h1 class="IntroTitle">Table of contents</h1></div></body></html>
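The two deleted fixtures differ only in the order of the `class` and `id` attributes on the `div`, which is presumably a case an HTML comparison should treat as equivalent. A minimal sketch of that idea follows; `tag_attributes` is a hypothetical helper based on a regular expression, used here only to keep the example self-contained. A real comparison would rely on an HTML parser instead.

```ruby
# Hypothetical helper for illustration only; not part of Canon.
# Matches name="value" pairs with double-quoted values.
ATTRIBUTE_PATTERN = /([\w:-]+)="([^"]*)"/

# Collect the double-quoted attributes of a single start tag into a Hash.
def tag_attributes(start_tag)
  start_tag.scan(ATTRIBUTE_PATTERN).to_h
end

# The div start tags from data/file1.html and data/file2.html.
file1_div = '<div class="TOC" id="_">'
file2_div = '<div id="_" class="TOC">'

# String equality is sensitive to attribute order.
same_text = file1_div == file2_div

# Hash equality ignores insertion order, so the attribute sets match.
same_attributes = tag_attributes(file1_div) == tag_attributes(file2_div)

puts "identical as strings:    #{same_text}"
puts "identical as attributes: #{same_attributes}"
```

Comparing attributes as an unordered mapping rather than as serialized text is what keeps a reordering like this from being reported as a difference.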
data/old-docs/ADVANCED_TOPICS.adoc DELETED
@@ -1,20 +0,0 @@
1
- ---
2
- layout: default
3
- title: Advanced Topics
4
- nav_order: 5
5
- has_children: true
6
- ---
7
- = Advanced topics
8
-
9
- For developers and advanced users:
10
-
11
- * **link:VERBOSE[Verbose mode]** - Two-tier diff output architecture
12
- * **link:SEMANTIC_DIFF_REPORT[Semantic diff report]** - Detailed report
13
- format
14
- * **link:NORMATIVE_INFORMATIVE_DIFFS[Normative vs informative diffs]** - Diff
15
- classification
16
- * **link:DIFF_ARCHITECTURE[Diff architecture]** - Six-layer technical
17
- pipeline
18
-
19
- These documents cover Canon's internal architecture and advanced features for
20
- developers extending or maintaining Canon.
data/old-docs/BASIC_USAGE.adoc DELETED
@@ -1,16 +0,0 @@
1
- ---
2
- layout: default
3
- title: Basic Usage
4
- nav_order: 2
5
- has_children: true
6
- ---
7
- = Basic usage
8
-
9
- Choose your interface for working with Canon:
10
-
11
- * **link:RUBY_API[Ruby API]** - Using Canon from Ruby code
12
- * **link:CLI[Command-line interface]** - Terminal commands and options
13
- * **link:RSPEC[RSpec matchers]** - Testing with Canon in RSpec
14
-
15
- These guides provide practical examples and complete API reference for each
16
- interface.