canon 0.1.8 → 0.1.9
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.rubocop_todo.yml +112 -25
- data/docs/Gemfile +1 -0
- data/docs/_config.yml +90 -1
- data/docs/advanced/diff-classification.adoc +82 -2
- data/docs/features/match-options/index.adoc +239 -1
- data/lib/canon/comparison/format_detector.rb +2 -1
- data/lib/canon/comparison/html_comparator.rb +19 -8
- data/lib/canon/comparison/html_compare_profile.rb +8 -2
- data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
- data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
- data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
- data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
- data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
- data/lib/canon/comparison/xml_comparator.rb +48 -23
- data/lib/canon/comparison/xml_node_comparison.rb +25 -3
- data/lib/canon/diff/diff_classifier.rb +101 -2
- data/lib/canon/diff/formatting_detector.rb +1 -1
- data/lib/canon/rspec_matchers.rb +37 -8
- data/lib/canon/version.rb +1 -1
- data/lib/canon/xml/data_model.rb +24 -13
- metadata +3 -78
- data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
- data/false_positive_analysis.txt +0 -0
- data/file1.html +0 -1
- data/file2.html +0 -1
- data/old-docs/ADVANCED_TOPICS.adoc +0 -20
- data/old-docs/BASIC_USAGE.adoc +0 -16
- data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
- data/old-docs/CLI.adoc +0 -497
- data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
- data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
- data/old-docs/DIFF_FORMATTING.adoc +0 -540
- data/old-docs/DIFF_PARAMETERS.adoc +0 -261
- data/old-docs/DOM_DIFF.adoc +0 -1017
- data/old-docs/ENV_CONFIG.adoc +0 -876
- data/old-docs/FORMATS.adoc +0 -867
- data/old-docs/INPUT_VALIDATION.adoc +0 -477
- data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
- data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
- data/old-docs/MATCH_OPTIONS.adoc +0 -912
- data/old-docs/MODES.adoc +0 -432
- data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
- data/old-docs/OPTIONS.adoc +0 -1387
- data/old-docs/PREPROCESSING.adoc +0 -491
- data/old-docs/README.old.adoc +0 -2831
- data/old-docs/RSPEC.adoc +0 -814
- data/old-docs/RUBY_API.adoc +0 -485
- data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
- data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
- data/old-docs/STRING_COMPARE.adoc +0 -345
- data/old-docs/TMP.adoc +0 -3384
- data/old-docs/TREE_DIFF.adoc +0 -1080
- data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
- data/old-docs/VERBOSE.adoc +0 -482
- data/old-docs/VISUALIZATION_MAP.adoc +0 -625
- data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
- data/scripts/analyze_current_state.rb +0 -85
- data/scripts/analyze_false_positives.rb +0 -114
- data/scripts/analyze_remaining_failures.rb +0 -105
- data/scripts/compare_current_failures.rb +0 -95
- data/scripts/compare_dom_tree_diff.rb +0 -158
- data/scripts/compare_failures.rb +0 -151
- data/scripts/debug_attribute_extraction.rb +0 -66
- data/scripts/debug_blocks_839.rb +0 -115
- data/scripts/debug_meta_matching.rb +0 -52
- data/scripts/debug_p_matching.rb +0 -192
- data/scripts/debug_signature_matching.rb +0 -118
- data/scripts/debug_sourcecode_124.rb +0 -32
- data/scripts/debug_whitespace_sensitive.rb +0 -192
- data/scripts/extract_false_positives.rb +0 -138
- data/scripts/find_actual_false_positives.rb +0 -125
- data/scripts/investigate_all_false_positives.rb +0 -161
- data/scripts/investigate_batch1.rb +0 -127
- data/scripts/investigate_classification.rb +0 -150
- data/scripts/investigate_classification_detailed.rb +0 -190
- data/scripts/investigate_common_failures.rb +0 -342
- data/scripts/investigate_false_negative.rb +0 -80
- data/scripts/investigate_false_positive.rb +0 -83
- data/scripts/investigate_false_positives.rb +0 -227
- data/scripts/investigate_false_positives_batch.rb +0 -163
- data/scripts/investigate_mixed_content.rb +0 -125
- data/scripts/investigate_remaining_16.rb +0 -214
- data/scripts/run_single_test.rb +0 -29
- data/scripts/test_all_false_positives.rb +0 -95
- data/scripts/test_attribute_details.rb +0 -61
- data/scripts/test_both_algorithms.rb +0 -49
- data/scripts/test_both_simple.rb +0 -49
- data/scripts/test_enhanced_semantic_output.rb +0 -125
- data/scripts/test_readme_examples.rb +0 -131
- data/scripts/test_semantic_tree_diff.rb +0 -99
- data/scripts/test_semantic_ux_improvements.rb +0 -135
- data/scripts/test_single_false_positive.rb +0 -119
- data/scripts/test_size_limits.rb +0 -99
- data/test_html_1.html +0 -21
- data/test_html_2.html +0 -21
- data/test_nokogiri.rb +0 -33
- data/test_normalize.rb +0 -45

data/docs/plans/2025-01-17-html-parser-selection-fix.adoc
DELETED
@@ -1,250 +0,0 @@
-= HTML Parser Selection Fix Design
-:doctype: article
-:date: 2025-01-17
-:status: Approved
-
-== Problem Statement
-
-When comparing HTML documents with `lang` and `xml:lang` attributes, users see false attribute differences:
-
-----
-⊖ Expected (File 1):
-<span> with 1 attribute: xml:lang
-
-⊕ Actual (File 2):
-<span> with 2 attributes: lang, xml:lang
-----
-
-Both HTML strings have identical attributes (`lang="EN-GB" xml:lang="EN-GB"`), but the comparison shows different attribute counts. This happens because:
-
-. *DOM path* uses `Nokogiri::XML.fragment` for all HTML, which treats `lang` and `xml:lang` as the same attribute (XML namespace behavior)
-. *Semantic path* uses `Nokogiri::HTML5.fragment` or `Nokogiri::HTML4.fragment`, which correctly treats them as distinct
-. *The `parse_html` method ignores the format parameter* and returns raw strings, causing inconsistent parsing
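Reviewer note: the first two list items describe a name collision, which a toy model can make concrete. The sketch below is an illustration under stated assumptions, not Canon's or Nokogiri's code: it assumes the XML path effectively keys attributes by local name (dropping the `xml:` prefix) while the HTML path keys them by the full name as written. `ATTRS` and `attribute_names` are hypothetical names introduced only for this note.

```ruby
# Toy model only: this is NOT how Nokogiri or Canon store attributes.
# It shows why keying by local name can merge `lang` and `xml:lang`.
ATTRS = [["lang", "EN-GB"], ["xml:lang", "EN-GB"]].freeze

# keep_prefix: true  -> key by the full name as written in the markup
# keep_prefix: false -> key by local name (the part after any prefix)
def attribute_names(attrs, keep_prefix:)
  attrs.map { |name, _value| keep_prefix ? name : name.split(":").last }.uniq
end

puts "full names:  #{attribute_names(ATTRS, keep_prefix: true).inspect}"
puts "local names: #{attribute_names(ATTRS, keep_prefix: false).inspect}"
```

Under that assumption the same markup yields two attribute names on one path and one on the other, which matches the "1 attribute" versus "2 attributes" report quoted in the problem statement.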
-
-== Root Cause
-
-In `lib/canon/comparison.rb`, the `parse_html` method at line 374:
-
-[source,ruby]
-----
-def parse_html(content, _format) # format is IGNORED!
-  return content unless content.is_a?(String)
-  # ... returns raw string instead of parsing
-end
-----
-
-This causes HTML version information to be lost, and `HtmlComparator#parse_node` ends up using `XML.fragment` for all HTML content.
-
-== Solution
-
-=== Architecture
-
-Fix the 4-layer architecture to respect user's parser choice:
-
-----
-User specifies format: :html5
-    |
-    v
-Level 1: Preprocessing
-  parse_html(html, :html5) -> Nokogiri::HTML5.fragment ✓
-    |
-    v
-Level 2: Diff Algorithm (DiffNode creation)
-  Parsed nodes have accurate attributes ✓
-    |
-    v
-Level 3: Diff Report
-  Enriched metadata is correct ✓
-    |
-    v
-Level 4: Diff Rendering
-  Accurate attribute counts in output ✓
-----
-
-=== Component Changes
-
-==== 1. `parse_html` Method (`lib/canon/comparison.rb`)
-
-*Current behavior:* Ignores format parameter, returns raw string
-
-*New behavior:* Parse with correct Nokogiri parser based on format
-
-[source,ruby]
-----
-def parse_html(content, format)
-  return content unless content.is_a?(String)
-  return content if already_parsed?(content)
-
-  begin
-    case format
-    when :html5
-      Nokogiri::HTML5.fragment(content)
-    when :html4
-      Nokogiri::HTML4.fragment(content)
-    when :html
-      detect_and_parse_html(content)
-    else
-      content
-    end
-  rescue StandardError
-    content
-  end
-end
-
-private
-
-def already_parsed?(content)
-  content.is_a?(Nokogiri::HTML::Document) ||
-    content.is_a?(Nokogiri::HTML5::Document) ||
-    content.is_a?(Nokogiri::HTML::DocumentFragment) ||
-    content.is_a?(Nokogiri::HTML5::DocumentFragment) ||
-    content.is_a?(Nokogiri::XML::DocumentFragment)
-end
-
-def detect_and_parse_html(content)
-  version = detect_html_version(content)
-  version == :html5 ?
-    Nokogiri::HTML5.fragment(content) :
-    Nokogiri::HTML4.fragment(content)
-end
-
-def detect_html_version(content)
-  content.include?('<!DOCTYPE html>') ? :html5 : :html4
-end
-----
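Reviewer note: `String#include?` is case-sensitive, so the `detect_html_version` helper in the removed plan would classify `<!doctype html>` as `:html4`. A minimal sketch of a case-insensitive check follows. It is a hypothetical variant, not code from the gem: `HTML5_DOCTYPE` and `detect_html_version_sketch` are names invented for this note, and unlike the original it only looks at the start of the input.

```ruby
# Hypothetical variant of the plan's helper; not part of canon.
# Matches only the short HTML5 doctype at the start of the input,
# ignoring letter case and leading whitespace.
HTML5_DOCTYPE = /\A\s*<!doctype\s+html\s*>/i

def detect_html_version_sketch(content)
  # Anything without the short doctype (including bare fragments)
  # falls back to :html4, mirroring the plan's default.
  content.match?(HTML5_DOCTYPE) ? :html5 : :html4
end

puts detect_html_version_sketch("<!doctype html><p>x</p>")
puts detect_html_version_sketch("<span>text</span>")
```

Inputs with a legacy `PUBLIC` doctype or no doctype at all still resolve to `:html4`, so the only intended behavioural difference from the plan's helper is the handling of letter case and of doctypes that appear later in the string.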
-
-==== 2. `dom_diff` Method (`lib/canon/comparison.rb`)
-
-*Current behavior:* Normalizes `html4`/`html5` to `:html` at line 320
-
-*New behavior:* Preserve format information
-
-Remove or modify line 320:
-
-[source,ruby]
-----
-# OLD: format1 = format2 = :html
-# NEW: Keep format1, format2 as html4 or html5
-----
-
-This ensures the format is passed through to `HtmlComparator` and used consistently.
-
-=== Error Handling
-
-. *Parse failures:* Fall back to raw string (maintains backward compatibility)
-. *Already-parsed documents:* Return as-is, don't re-parse
-. *Mixed input types:* Both documents parsed with consistent parser based on format parameter
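Reviewer note: the three error-handling rules can be sketched without Nokogiri by injecting the parsers. Everything named here (`parse_with_fallback`, `STUB_PARSERS`) is hypothetical and exists only to show the control flow; real code would call the Nokogiri fragment parsers where the stub lambdas are.

```ruby
# Hypothetical sketch of the fallback rules; the lambdas stand in for
# real parsers so the example runs with the standard library alone.
STUB_PARSERS = {
  html5: ->(source) { [:html5_fragment, source] },
  html4: ->(_source) { raise ArgumentError, "simulated parse failure" },
}.freeze

def parse_with_fallback(content, format, parsers)
  # Rule 2: anything that is not a String is treated as already parsed.
  return content unless content.is_a?(String)

  parser = parsers[format]
  # Unknown format: leave the input untouched.
  return content if parser.nil?

  parser.call(content)
rescue StandardError
  # Rule 1: a parse failure falls back to the raw string.
  content
end

p parse_with_fallback("<p>x</p>", :html5, STUB_PARSERS)
p parse_with_fallback("<p>x</p>", :html4, STUB_PARSERS)
```

Passing the same `format` and the same parser table for both documents is what gives rule 3: both sides go through one parser choice rather than each input being handled by whatever path it happens to reach.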
-
-=== Testing Strategy
-
-==== Unit Tests (`spec/canon/comparison_spec.rb`)
-
-[source,ruby]
-----
-context "parse_html with format parameter" do
-  it "parses HTML5 with HTML5.fragment when format is :html5" do
-    html = '<span lang="en" xml:lang="en">text</span>'
-    result = Canon::Comparison.send(:parse_html, html, :html5)
-
-    expect(result).to be_a(Nokogiri::HTML5::DocumentFragment)
-    expect(result.at_css('span').attributes.keys).to eq(['lang', 'xml:lang'])
-  end
-
-  it "parses HTML4 with HTML4.fragment when format is :html4" do
-    html = '<span lang="en" xml:lang="en">text</span>'
-    result = Canon::Comparison.send(:parse_html, html, :html4)
-
-    expect(result).to be_a(Nokogiri::HTML4::DocumentFragment)
-  end
-
-  it "returns already-parsed documents as-is" do
-    frag = Nokogiri::HTML5.fragment('<span>text</span>')
-    result = Canon::Comparison.send(:parse_html, frag, :html5)
-
-    expect(result).to eq(frag)
-  end
-end
-----
-
-==== Integration Tests (`spec/canon/html_comparison_spec.rb`)
-
-[source,ruby]
-----
-context "HTML5 lang and xml:lang attributes" do
-  it "treats lang and xml:lang as distinct attributes in HTML5" do
-    html1 = '<span lang="EN-GB" xml:lang="EN-GB">text</span>'
-    html2 = '<span lang="EN-GB" xml:lang="EN-GB">text</span>'
-
-    result = Canon::Comparison.equivalent?(
-      html1, html2,
-      format: :html5,
-      verbose: true
-    )
-
-    expect(result).to be_equivalent
-  end
-
-  it "does NOT show false attribute differences" do
-    html1 = '<span lang="EN-GB" xml:lang="EN-GB"> </span>'
-    html2 = '<span lang="EN-GB" xml:lang="EN-GB">␣</span>'
-
-    result = Canon::Comparison.equivalent?(
-      html1, html2,
-      format: :html5,
-      verbose: true
-    )
-
-    # Only difference should be the non-breaking space encoding
-    # No attribute differences should be reported
-    attr_diffs = result.differences.select { |d| d.dimension == :attribute_values }
-    expect(attr_diffs).to be_empty
-  end
-end
-----
-
-==== Backward Compatibility Tests
-
-[source,ruby]
-----
-context "backward compatibility" do
-  it "works when format is not specified (auto-detect)" do
-    html1 = '<span>text</span>'
-    html2 = '<span>text</span>'
-
-    expect(Canon::Comparison.equivalent?(html1, html2)).to be true
-  end
-
-  it "handles strings with :html format (legacy behavior)" do
-    html1 = '<span>text</span>'
-    html2 = '<span>text</span>'
-
-    expect(Canon::Comparison.equivalent?(html1, html2, format: :html)).to be true
-  end
-end
-----
-
-== Implementation Checklist
-
-* [ ] Modify `parse_html` in `lib/canon/comparison.rb`
-* [ ] Add helper methods: `already_parsed?`, `detect_and_parse_html`, `detect_html_version`
-* [ ] Update `dom_diff` to preserve format (line 320)
-* [ ] Add unit tests for `parse_html` method
-* [ ] Add integration tests for lang/xml:lang
-* [ ] Add backward compatibility tests
-* [ ] Run full test suite to ensure no regressions
-
-== Expected Outcomes
-
-After this fix:
-
-. *`lang` and `xml:lang` are treated as distinct attributes in HTML5/HTML4*
-. *No false attribute differences when both documents have identical attributes*
-. *User can explicitly control parser via `format: :html5` or `format: :html4`*
-. *Backward compatible with existing code (auto-detect still works)*
-. *Consistent parsing regardless of input format (string vs DocumentFragment)*
-
-== Notes
-
-. HTML entity normalization (` ` vs `␣`) is intentionally NOT changed - these are semantically equivalent but different serializations, and the diff correctly shows this difference
-. XML comparison continues to use `XML.fragment` - this fix only affects HTML parsing
-. The semantic path already works correctly via `Canon::Html::DataModel.from_html`

data/false_positive_analysis.txt
DELETED
File without changes

data/file1.html
DELETED
@@ -1 +0,0 @@
-<html><body><br><div class="TOC" id="_"><h1 class="IntroTitle">Table of contents</h1></div></body></html>

data/file2.html
DELETED
@@ -1 +0,0 @@
-<html><body><br><div id="_" class="TOC"><h1 class="IntroTitle">Table of contents</h1></div></body></html>
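Reviewer note: the two removed fixtures differ only in the order of the `class` and `id` attributes on the `<div>`, so they were presumably used to check that attribute order does not count as a difference. A minimal sketch of that idea follows. It is an illustration only: `ATTR_PATTERN`, `attributes_of`, `TAG1`, and `TAG2` are hypothetical names, and a regular expression is used purely to keep the example dependency-free, not as a recommended way to parse HTML.

```ruby
# Illustrative only: a regex is not an HTML parser. It is enough here
# to pull name="value" pairs out of a single start tag.
ATTR_PATTERN = /([\w:-]+)="([^"]*)"/

def attributes_of(start_tag)
  # scan returns [name, value] pairs; to_h builds a Hash from them.
  start_tag.scan(ATTR_PATTERN).to_h
end

TAG1 = '<div class="TOC" id="_">'
TAG2 = '<div id="_" class="TOC">'

puts TAG1 == TAG2                               # false: the strings differ
puts attributes_of(TAG1) == attributes_of(TAG2) # true: Hash equality ignores order
```

Comparing attributes as a Hash rather than as serialized text is what makes the two `<div>` start tags equivalent despite their byte-level difference.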

data/old-docs/ADVANCED_TOPICS.adoc
DELETED
@@ -1,20 +0,0 @@
----
-layout: default
-title: Advanced Topics
-nav_order: 5
-has_children: true
----
-= Advanced topics
-
-For developers and advanced users:
-
-* **link:VERBOSE[Verbose mode]** - Two-tier diff output architecture
-* **link:SEMANTIC_DIFF_REPORT[Semantic diff report]** - Detailed report
-format
-* **link:NORMATIVE_INFORMATIVE_DIFFS[Normative vs informative diffs]** - Diff
-classification
-* **link:DIFF_ARCHITECTURE[Diff architecture]** - Six-layer technical
-pipeline
-
-These documents cover Canon's internal architecture and advanced features for
-developers extending or maintaining Canon.

data/old-docs/BASIC_USAGE.adoc
DELETED
@@ -1,16 +0,0 @@
----
-layout: default
-title: Basic Usage
-nav_order: 2
-has_children: true
----
-= Basic usage
-
-Choose your interface for working with Canon:
-
-* **link:RUBY_API[Ruby API]** - Using Canon from Ruby code
-* **link:CLI[Command-line interface]** - Terminal commands and options
-* **link:RSPEC[RSpec matchers]** - Testing with Canon in RSpec
-
-These guides provide practical examples and complete API reference for each
-interface.