canon 0.1.8 → 0.1.10

Files changed (101)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +83 -22
  3. data/docs/Gemfile +1 -0
  4. data/docs/_config.yml +90 -1
  5. data/docs/advanced/diff-classification.adoc +196 -24
  6. data/docs/features/match-options/index.adoc +239 -1
  7. data/lib/canon/comparison/format_detector.rb +2 -1
  8. data/lib/canon/comparison/html_comparator.rb +19 -8
  9. data/lib/canon/comparison/html_compare_profile.rb +8 -2
  10. data/lib/canon/comparison/markup_comparator.rb +109 -2
  11. data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
  12. data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
  13. data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
  14. data/lib/canon/comparison/xml_comparator/diff_node_builder.rb +108 -0
  15. data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
  16. data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
  17. data/lib/canon/comparison/xml_comparator.rb +240 -23
  18. data/lib/canon/comparison/xml_node_comparison.rb +25 -3
  19. data/lib/canon/diff/diff_classifier.rb +119 -5
  20. data/lib/canon/diff/formatting_detector.rb +1 -1
  21. data/lib/canon/diff/xml_serialization_formatter.rb +153 -0
  22. data/lib/canon/rspec_matchers.rb +37 -8
  23. data/lib/canon/version.rb +1 -1
  24. data/lib/canon/xml/data_model.rb +24 -13
  25. metadata +4 -78
  26. data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
  27. data/false_positive_analysis.txt +0 -0
  28. data/file1.html +0 -1
  29. data/file2.html +0 -1
  30. data/old-docs/ADVANCED_TOPICS.adoc +0 -20
  31. data/old-docs/BASIC_USAGE.adoc +0 -16
  32. data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
  33. data/old-docs/CLI.adoc +0 -497
  34. data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
  35. data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
  36. data/old-docs/DIFF_FORMATTING.adoc +0 -540
  37. data/old-docs/DIFF_PARAMETERS.adoc +0 -261
  38. data/old-docs/DOM_DIFF.adoc +0 -1017
  39. data/old-docs/ENV_CONFIG.adoc +0 -876
  40. data/old-docs/FORMATS.adoc +0 -867
  41. data/old-docs/INPUT_VALIDATION.adoc +0 -477
  42. data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
  43. data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
  44. data/old-docs/MATCH_OPTIONS.adoc +0 -912
  45. data/old-docs/MODES.adoc +0 -432
  46. data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
  47. data/old-docs/OPTIONS.adoc +0 -1387
  48. data/old-docs/PREPROCESSING.adoc +0 -491
  49. data/old-docs/README.old.adoc +0 -2831
  50. data/old-docs/RSPEC.adoc +0 -814
  51. data/old-docs/RUBY_API.adoc +0 -485
  52. data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
  53. data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
  54. data/old-docs/STRING_COMPARE.adoc +0 -345
  55. data/old-docs/TMP.adoc +0 -3384
  56. data/old-docs/TREE_DIFF.adoc +0 -1080
  57. data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
  58. data/old-docs/VERBOSE.adoc +0 -482
  59. data/old-docs/VISUALIZATION_MAP.adoc +0 -625
  60. data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
  61. data/scripts/analyze_current_state.rb +0 -85
  62. data/scripts/analyze_false_positives.rb +0 -114
  63. data/scripts/analyze_remaining_failures.rb +0 -105
  64. data/scripts/compare_current_failures.rb +0 -95
  65. data/scripts/compare_dom_tree_diff.rb +0 -158
  66. data/scripts/compare_failures.rb +0 -151
  67. data/scripts/debug_attribute_extraction.rb +0 -66
  68. data/scripts/debug_blocks_839.rb +0 -115
  69. data/scripts/debug_meta_matching.rb +0 -52
  70. data/scripts/debug_p_matching.rb +0 -192
  71. data/scripts/debug_signature_matching.rb +0 -118
  72. data/scripts/debug_sourcecode_124.rb +0 -32
  73. data/scripts/debug_whitespace_sensitive.rb +0 -192
  74. data/scripts/extract_false_positives.rb +0 -138
  75. data/scripts/find_actual_false_positives.rb +0 -125
  76. data/scripts/investigate_all_false_positives.rb +0 -161
  77. data/scripts/investigate_batch1.rb +0 -127
  78. data/scripts/investigate_classification.rb +0 -150
  79. data/scripts/investigate_classification_detailed.rb +0 -190
  80. data/scripts/investigate_common_failures.rb +0 -342
  81. data/scripts/investigate_false_negative.rb +0 -80
  82. data/scripts/investigate_false_positive.rb +0 -83
  83. data/scripts/investigate_false_positives.rb +0 -227
  84. data/scripts/investigate_false_positives_batch.rb +0 -163
  85. data/scripts/investigate_mixed_content.rb +0 -125
  86. data/scripts/investigate_remaining_16.rb +0 -214
  87. data/scripts/run_single_test.rb +0 -29
  88. data/scripts/test_all_false_positives.rb +0 -95
  89. data/scripts/test_attribute_details.rb +0 -61
  90. data/scripts/test_both_algorithms.rb +0 -49
  91. data/scripts/test_both_simple.rb +0 -49
  92. data/scripts/test_enhanced_semantic_output.rb +0 -125
  93. data/scripts/test_readme_examples.rb +0 -131
  94. data/scripts/test_semantic_tree_diff.rb +0 -99
  95. data/scripts/test_semantic_ux_improvements.rb +0 -135
  96. data/scripts/test_single_false_positive.rb +0 -119
  97. data/scripts/test_size_limits.rb +0 -99
  98. data/test_html_1.html +0 -21
  99. data/test_html_2.html +0 -21
  100. data/test_nokogiri.rb +0 -33
  101. data/test_normalize.rb +0 -45
@@ -1,17 +0,0 @@
1
- ---
2
- layout: default
3
- title: Understanding Canon
4
- nav_order: 3
5
- has_children: true
6
- ---
7
- = Understanding Canon
8
-
9
- Learn how Canon works internally:
10
-
11
- * **link:FORMATS[Format support]** - XML, HTML, JSON, YAML
12
- canonicalization
13
- * **link:MODES[Diff modes]** - By-line vs by-object comparison modes
14
- * **link:MATCH_ARCHITECTURE[Match architecture]** - Three-phase
15
- comparison flow
16
-
17
- These documents explain Canon's core concepts and architecture.
@@ -1,482 +0,0 @@
1
- ---
2
- layout: default
3
- title: Verbose Mode
4
- nav_order: 40
5
- parent: Advanced Topics
6
- ---
7
- = Canon verbose mode guide
8
- :toc:
9
- :toclevels: 3
10
-
11
- == General
12
-
13
- Canon provides a two-tier verbose output architecture for debugging
14
- comparison failures:
15
-
16
- * **Semantic Diff Report**: Always shown in verbose mode - provides
17
- actionable, dimension-specific details for each difference
18
- * **CANON VERBOSE tables**: Extra detailed option tables shown only when
19
- `CANON_VERBOSE=1` environment variable is set
20
-
21
- This progressive disclosure ensures developers get useful information by
22
- default, with additional debugging details available when needed.
23
-
24
- == Architecture
25
-
26
- The output architecture follows a clear three-tier structure:
27
-
28
- [source]
29
- ----
30
- ╔═════════════════════════════════════════════════════════════════════╗
31
- ║ CANON VERBOSE MODE OUTPUT ARCHITECTURE ║
32
- ╚═════════════════════════════════════════════════════════════════════╝
33
-
34
- When verbose: true is used:
35
-
36
- ┌─────────────────────────────────────────────────────────────────────┐
37
- │ TIER 1: CANON VERBOSE Tables (ONLY if CANON_VERBOSE=1) │
38
- │ │
39
- │ ┌────────────────────────────────────────────────────────────────┐ │
40
- │ │ Match Options Table │ │
41
- │ │ • Shows preprocessing behavior │ │
42
- │ │ • Shows dimension behaviors (strict/normalize/ignore) │ │
43
- │ │ • Explains what each setting means │ │
44
- │ └────────────────────────────────────────────────────────────────┘ │
45
- │ ┌────────────────────────────────────────────────────────────────┐ │
46
- │ │ Formatter Options Table │ │
47
- │ │ • Shows mode (by_line vs by_object) │ │
48
- │ │ • Shows context_lines, diff_grouping_lines │ │
49
- │ │ • Shows show_diffs filter setting │ │
50
- │ └────────────────────────────────────────────────────────────────┘ │
51
- │ ┌────────────────────────────────────────────────────────────────┐ │
52
- │ │ Comparison Result Summary │ │
53
- │ │ • Equivalent? (YES/NO) │ │
54
- │ │ • Normative/Informative/Total diff counts │ │
55
- │ └────────────────────────────────────────────────────────────────┘ │
56
- └─────────────────────────────────────────────────────────────────────┘
57
-
58
-
59
- ┌─────────────────────────────────────────────────────────────────────┐
60
- │ TIER 2: Semantic Diff Report (ALWAYS if diffs exist) │
61
- │ │
62
- │ For each difference: │
63
- │ • XPath location (e.g., /html/body/div/table/pre/text) │
64
- │ • Dimension classification (attribute_presence, text_content) │
65
- │ • Specific changes (Added: +xmlns:v, +xmlns:o) │
66
- │ • Normative/Informative status │
67
- │ • Dimension-specific formatting │
68
- └─────────────────────────────────────────────────────────────────────┘
69
-
70
-
71
- ┌─────────────────────────────────────────────────────────────────────┐
72
- │ TIER 3: Detailed Diff (ALWAYS) │
73
- │ │
74
- │ Either: │
75
- │ • Line-by-line diff (for HTML, or with --by-line flag) │
76
- │ • Object tree diff (for XML/JSON/YAML by default) │
77
- └─────────────────────────────────────────────────────────────────────┘
78
- ----
79
-
80
- === Output flow
81
-
82
- The `DiffFormatter.format_comparison_result()` method orchestrates the
83
- output:
84
-
85
- . Check if `CANON_VERBOSE=1` → Render option tables
86
- . Check if differences exist → Render Semantic Diff Report
87
- . Always render detailed diff (by-line or by-object)
88
-
89
- == Semantic diff report
90
-
91
- === General
92
-
93
- The Semantic Diff Report is the core verbose output, always shown when
94
- differences exist. It provides dimension-specific, actionable details for
95
- each difference.
96
-
97
- Unlike the detailed diff (which shows every changed line), the Semantic
98
- Diff Report shows a high-level summary of WHAT changed and WHY it matters.
99
-
100
- === Output format
101
-
102
- [example]
103
- ====
104
- [source]
105
- ----
106
- ======================================================================
107
- SEMANTIC DIFF REPORT (1 difference)
108
- ======================================================================
109
-
110
- 🔍 DIFFERENCE #1/1 [NORMATIVE]
111
- ──────────────────────────────────────────────────────────────────────
112
- Dimension: attribute_presence
113
- Location: /html
114
-
115
- ⊖ Expected (File 1):
116
- <html> with 2 attributes: lang, xmlns:epub
117
-
118
- ⊕ Actual (File 2):
119
- <html> with 6 attributes: lang, xmlns:epub, xmlns:m, xmlns:o,
120
- xmlns:v, xmlns:w
121
-
122
- ✨ Changes:
123
- Added: +xmlns:m, +xmlns:o, +xmlns:v, +xmlns:w
124
-
125
- ======================================================================
126
- ----
127
- ====
128
-
129
- === Format structure
130
-
131
- Each difference displays:
132
-
133
- * **Status indicator**: `[NORMATIVE]` (green) or `[INFORMATIVE]` (yellow)
134
- * **Dimension**: Which aspect differs (colorized in magenta)
135
- * **Location**: XPath for XML/HTML, path for JSON/YAML (colorized in blue)
136
- * **Expected section**: What was in File 1 (red heading, bold)
137
- * **Actual section**: What was in File 2 (green heading, bold)
138
- * **Changes summary**: Actionable description of the difference (yellow,
139
- bold)
140
-
141
- === Dimension-specific formats
142
-
143
- ==== Attribute presence differences
144
-
145
- For missing or extra attributes:
146
-
147
- [example]
148
- ====
149
- [source]
150
- ----
151
- Dimension: attribute_presence
152
- Location: /html/body/p
153
-
154
- ⊖ Expected: <p> with 2 attributes: id, lang
155
- ⊕ Actual: <p> with 4 attributes: id, lang, data-value, aria-label
156
-
157
- ✨ Changes: Added: +data-value, +aria-label
158
- ----
159
- ====
160
-
161
- Shows:
162
-
163
- * Element name (`<p>`)
164
- * How many attributes each has
165
- * Which attributes were added (green with `+` prefix) or removed (red with
166
- `-` prefix)
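The added/removed summary described in this removed section can be illustrated with plain array difference. This is a minimal sketch, not Canon's implementation: `attribute_presence_changes` is a hypothetical helper, and the `Removed:` label is an assumption (the documentation only shows an `Added:` line).

```ruby
# Hypothetical helper, not part of Canon's API: summarise which attribute
# names were added or removed between the expected and actual element.
def attribute_presence_changes(expected_attrs, actual_attrs)
  added = actual_attrs - expected_attrs     # present only in the actual element
  removed = expected_attrs - actual_attrs   # present only in the expected element

  parts = []
  parts << "Added: #{added.map { |name| "+#{name}" }.join(', ')}" unless added.empty?
  # The "Removed:" wording is assumed; the source only shows "Added:".
  parts << "Removed: #{removed.map { |name| "-#{name}" }.join(', ')}" unless removed.empty?
  parts.join("; ")
end

puts attribute_presence_changes(%w[id lang], %w[id lang data-value aria-label])
```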
167
-
168
- ==== Attribute value differences
169
-
170
- For differing attribute values:
171
-
172
- [example]
173
- ====
174
- [source]
175
- ----
176
- Dimension: attribute_values
177
- Location: /html/body/div
178
-
179
- ⊖ Expected: <div> class=" container fluid "
180
- ⊕ Actual: <div> class="container fluid"
181
-
182
- ✨ Changes: Whitespace normalization difference
183
- ----
184
- ====
185
-
186
- Shows:
187
-
188
- * Which specific attribute differs (highlighted in cyan)
189
- * Exact values on both sides
190
- * Analysis: "Whitespace difference only", "Whitespace normalization
191
- difference", or "Value changed"
192
-
193
- ==== Text content differences
194
-
195
- For text that differs:
196
-
197
- [example]
198
- ====
199
- [source]
200
- ----
201
- Dimension: text_content
202
- Location: /html/body/div/table/tbody/tr/td/pre/text
203
-
204
- ⊖ Expected: <text> "
205
- puts \"Hello, world.\"
206
- "
207
- ⊕ Actual: <text> "puts \"Hello, world.\" "
208
-
209
- ✨ Changes: ⚠️ Whitespace preserved (inside <pre>, <code>, etc. -
210
- whitespace is significant)
211
- ----
212
- ====
213
-
214
- Shows:
215
-
216
- * Text preview (truncated at 100 characters)
217
- * Special warning if inside `<pre>`, `<code>`, `<textarea>`, `<script>`,
218
- or `<style>` elements (where whitespace is significant)
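The two behaviours listed here (a 100-character preview and a warning inside whitespace-preserving elements) could look roughly like the sketch below. All names are hypothetical, the trailing `...` marker is an assumption, and the ancestor list is passed in as plain strings rather than read from a parsed node.

```ruby
# Hypothetical names; the element list and the 100-character limit are taken
# from the removed documentation, everything else is assumed.
PRESERVE_ELEMENTS = %w[pre code textarea script style].freeze
PREVIEW_LIMIT = 100

# Truncate long text for display. The "..." suffix is an assumption.
def text_preview(text, limit = PREVIEW_LIMIT)
  return text if text.length <= limit

  "#{text[0, limit]}..."
end

# ancestors: element names from the root down to the text node's parent.
def whitespace_significant?(ancestors)
  ancestors.any? { |name| PRESERVE_ELEMENTS.include?(name.downcase) }
end

puts whitespace_significant?(%w[html body div table tbody tr td pre])
```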
219
-
220
- ==== Structural whitespace differences
221
-
222
- For whitespace-only differences (usually informative):
223
-
224
- [example]
225
- ====
226
- [source]
227
- ----
228
- Dimension: structural_whitespace
229
- Location: /root/p
230
-
231
- ⊖ Expected: <p> "hello␣␣world"
232
- ⊕ Actual: <p> "hello␣world"
233
-
234
- ✨ Changes: Whitespace-only difference (informative)
235
- ----
236
- ====
237
-
238
- Shows:
239
-
240
- * Whitespace visualized: `␣` for space, `→` for tab, `↵` for newline
241
- * Marked as `[INFORMATIVE]` (yellow)
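The symbol mapping above can be sketched with `String#gsub` and a replacement hash. This is an illustration only; `visualize_whitespace` and `WHITESPACE_SYMBOLS` are hypothetical names, not Canon's.

```ruby
# Symbols taken from the removed documentation: space, tab, newline.
WHITESPACE_SYMBOLS = { " " => "␣", "\t" => "→", "\n" => "↵" }.freeze

# Replace each space, tab, or newline with its visible symbol.
def visualize_whitespace(text)
  text.gsub(/[ \t\n]/, WHITESPACE_SYMBOLS)
end

puts visualize_whitespace("hello  world")
```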
242
-
243
- ==== JSON/YAML differences
244
-
245
- For JSON/YAML path-based differences:
246
-
247
- [example]
248
- ====
249
- [source]
250
- ----
251
- Dimension: 15
252
- Location: user.email
253
-
254
- ⊖ Expected: user.email = "alice@example.com"
255
- ⊕ Actual: user.email = "bob@example.com"
256
-
257
- ✨ Changes: Value changed
258
- ----
259
- ====
260
-
261
- == CANON VERBOSE mode
262
-
263
- === General
264
-
265
- CANON VERBOSE mode adds detailed option tables BEFORE the Semantic Diff
266
- Report. These tables help understand:
267
-
268
- * What match options are in effect
269
- * How the diff formatter is configured
270
- * Statistics about the comparison result
271
-
272
- To enable, set the `CANON_VERBOSE` environment variable:
273
-
274
- [source,bash]
275
- ----
276
- CANON_VERBOSE=1 bundle exec rspec spec/my_failing_spec.rb:123
277
- ----
278
-
279
- === Match options table
280
-
281
- Shows preprocessing and dimension behaviors:
282
-
283
- [example]
284
- ====
285
- [source]
286
- ----
287
- ╭────────────────────────────────────────────────────────────────────╮
288
- │ Match Options (HTML) │
289
- ├────────────────────┬───────────┬────────────────────────────────────┤
290
- │ Dimension │ Behavior │ Meaning │
291
- ├────────────────────┼───────────┼────────────────────────────────────┤
292
- │ preprocessing │ rendered │ As browser-rendered (compacted wh… │
293
- │ text_content │ normalize │ Normalized then compared (normative… │
294
- │ structural_whit… │ ignore │ Differences IGNORED (informative) │
295
- │ attribute_presence │ strict │ Must match exactly (normative) │
296
- │ attribute_values │ normalize │ Normalized then compared (normative… │
297
- │ comments │ ignore │ Differences IGNORED (informative) │
298
- ╰────────────────────┴───────────┴────────────────────────────────────╯
299
- ----
300
- ====
301
-
302
- Preprocessing behaviors:
303
-
304
- * `:none` - No preprocessing (compare as-is)
305
- * `:c14n` - Canonicalize (XML C14N normalization)
306
- * `:normalize` - Normalize (collapse whitespace, trim lines)
307
- * `:format` - Pretty-format (consistent indentation)
308
- * `:rendered` - As browser-rendered (compacted whitespace, to_html)
309
-
310
- Dimension behaviors:
311
-
312
- * `:ignore` - Differences IGNORED (informative, won't fail test)
313
- * `:strict` - Must match exactly (normative, will fail test)
314
- * `:normalize` - Normalized then compared (normative if different after
315
- normalization)
316
- * `:strip` - Strip leading/trailing whitespace only
317
- * `:compact` - Collapse whitespace runs to single space
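To make the five dimension behaviours concrete, here is a minimal sketch of how each might decide whether two text values match. `equivalent_under?` is a hypothetical function, and treating `:normalize` as "collapse whitespace runs, then trim" is an assumption; the removed documentation does not spell out the normalization rules.

```ruby
# Hypothetical sketch, not Canon's implementation: apply one dimension
# behavior to a pair of text values and report whether they match.
def equivalent_under?(behavior, expected, actual)
  case behavior
  when :ignore  then true                                   # never fails
  when :strict  then expected == actual                     # exact match
  when :strip   then expected.strip == actual.strip         # trim ends only
  when :compact then expected.gsub(/\s+/, " ") == actual.gsub(/\s+/, " ")
  when :normalize
    # Assumed meaning: collapse whitespace runs, then trim both ends.
    expected.gsub(/\s+/, " ").strip == actual.gsub(/\s+/, " ").strip
  else
    raise ArgumentError, "unknown behavior: #{behavior.inspect}"
  end
end

puts equivalent_under?(:normalize, " container  fluid ", "container fluid")
```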
318
-
319
- === Formatter options table
320
-
321
- Shows diff formatting settings:
322
-
323
- [example]
324
- ====
325
- [source]
326
- ----
327
- ╭────────────────────────────────────────────────────────────────────╮
328
- │ Formatter Options │
329
- ├─────────────────────┬─────────┬────────────────────────────────────┤
330
- │ Option │ Value │ Impact │
331
- ├─────────────────────┼─────────┼─────────────────────────────────────┤
332
- │ mode │ by_line │ Line-by-line diff │
333
- │ context_lines │ 3 │ 3 lines of context around diffs │
334
- │ show_diffs │ all │ Show all diffs (normative + informative) │
335
- ╰─────────────────────┴─────────┴────────────────────────────────────╯
336
- ----
337
- ====
338
-
339
- === Comparison result summary
340
-
341
- Shows diff statistics:
342
-
343
- [example]
344
- ====
345
- [source]
346
- ----
347
- ╭─────────────────────────────────────────────────────────────────────╮
348
- │ Comparison Result Summary │
349
- ├────────────────┬─────────┬──────────────────────────────────────────┤
350
- │ Equivalent? │ ✗ NO │ Documents have semantic differences │
351
- │ Normative Diffs │ 1 diffs │ Semantic differences that matter │
352
- │ Informative Diffs │ 0 │ Textual/formatting differences (ignored) │
353
- │ Total Diffs │ 1 │ All differences found │
354
- ╰────────────────┴─────────┴──────────────────────────────────────────╯
355
- ----
356
- ====
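The counts in the summary table can be derived from a list of classified differences. The sketch below assumes a hypothetical `Diff` struct with a `normative` flag and assumes that documents count as equivalent when no normative differences remain, which matches the "ignored" wording for informative diffs but is not confirmed by the source.

```ruby
# Hypothetical data shape; Canon's real difference objects may differ.
Diff = Struct.new(:dimension, :normative)

# Build the four summary figures shown in the table above.
def summarize(diffs)
  normative = diffs.count(&:normative)
  {
    equivalent: normative.zero?,          # assumed rule: only normative diffs matter
    normative: normative,
    informative: diffs.size - normative,
    total: diffs.size,
  }
end

p summarize([Diff.new(:attribute_presence, true)])
```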
357
-
358
- == Usage
359
-
360
- === Using in RSpec matchers
361
-
362
- Verbose mode is activated by using `verbose: true` in the comparison:
363
-
364
- [source,ruby]
365
- ----
366
- result = Canon::Comparison::XmlComparator.equivalent?(
367
- xml1,
368
- xml2,
369
- verbose: true
370
- )
371
- # Returns ComparisonResult object
372
- # Semantic Diff Report shown if differences exist
373
- ----
374
-
375
- With RSpec matchers, verbose mode is automatic on test failure:
376
-
377
- [source,ruby]
378
- ----
379
- # Semantic Diff Report automatically shown on failure
380
- expect(actual_html).to be_html4_equivalent_to(expected_html)
381
- ----
382
-
383
- To enable CANON VERBOSE tables:
384
-
385
- [source,bash]
386
- ----
387
- CANON_VERBOSE=1 bundle exec rspec spec/my_spec.rb:123
388
- ----
389
-
390
- === Using via CLI
391
-
392
- [source,bash]
393
- ----
394
- # Verbose mode (shows Semantic Diff Report)
395
- canon diff file1.xml file2.xml --verbose
396
-
397
- # With CANON VERBOSE tables
398
- CANON_VERBOSE=1 canon diff file1.xml file2.xml --verbose
399
- ----
400
-
401
- === Configuration
402
-
403
- You can enable CANON VERBOSE mode permanently for a project:
404
-
405
- [source,ruby]
406
- ----
407
- # In spec/spec_helper.rb
408
- ENV['CANON_VERBOSE'] = '1' if ENV['DEBUG']
409
-
410
- # Or in your test
411
- before(:each) do
412
- ENV['CANON_VERBOSE'] = '1'
413
- end
414
- ----
415
-
416
- == Implementation
417
-
418
- === DiffDetailFormatter module
419
-
420
- Location: `lib/canon/diff_formatter/diff_detail_formatter.rb`
421
-
422
- Responsible for:
423
-
424
- * Formatting the Semantic Diff Report
425
- * Dispatching to dimension-specific formatters
426
- * Extracting XPath/JSON paths
427
- * Detecting whitespace-preserving elements (`<pre>`, `<code>`, etc.)
428
- * Colorizing output
429
-
430
- Key methods:
431
-
432
- * `format_report(differences)` - Main entry point
433
- * `format_attribute_presence_details()` - Format attribute presence diffs
434
- * `format_attribute_values_details()` - Format attribute value diffs
435
- * `format_text_content_details()` - Format text content diffs
436
- * `extract_xpath(node)` - Extract XPath with safety limits
437
- * `inside_preserve_element?(node)` - Detect whitespace preservation
438
-
439
- === DebugOutput module
440
-
441
- Location: `lib/canon/diff_formatter/debug_output.rb`
442
-
443
- Responsible for:
444
-
445
- * Rendering CANON VERBOSE option tables
446
- * Checking if `CANON_VERBOSE=1` is set
447
- * Formatting match options with descriptions
448
- * Formatting formatter options with impact
449
- * Formatting comparison summary statistics
450
-
451
- Key methods:
452
-
453
- * `verbose_tables_only()` - Returns CANON VERBOSE tables or empty string
454
- * `format_match_options_table()` - Render match options as table
455
- * `format_formatter_options_table()` - Render formatter options as table
456
- * `format_comparison_summary()` - Render result summary as table
457
-
458
- === DiffFormatter integration
459
-
460
- Location: `lib/canon/diff_formatter.rb`
461
-
462
- The `format_comparison_result()` method orchestrates output:
463
-
464
- [source,ruby]
465
- ----
466
- def format_comparison_result(comparison_result, expected, actual)
467
- output = []
468
-
469
- # 1. CANON VERBOSE tables (ONLY if CANON_VERBOSE=1)
470
- output << DebugOutput.verbose_tables_only(...)
471
-
472
- # 2. Semantic Diff Report (ALWAYS if diffs exist)
473
- output << DiffDetailFormatter.format_report(...)
474
-
475
- # 3. Detailed diff (ALWAYS)
476
- output << format(differences, ...)
477
-
478
- output.compact.join("\n")
479
- end
480
- ----
481
-
482
- This ensures the correct output order and separation of concerns.