canon 0.1.8 → 0.1.10

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
Files changed (101)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +83 -22
  3. data/docs/Gemfile +1 -0
  4. data/docs/_config.yml +90 -1
  5. data/docs/advanced/diff-classification.adoc +196 -24
  6. data/docs/features/match-options/index.adoc +239 -1
  7. data/lib/canon/comparison/format_detector.rb +2 -1
  8. data/lib/canon/comparison/html_comparator.rb +19 -8
  9. data/lib/canon/comparison/html_compare_profile.rb +8 -2
  10. data/lib/canon/comparison/markup_comparator.rb +109 -2
  11. data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
  12. data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
  13. data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
  14. data/lib/canon/comparison/xml_comparator/diff_node_builder.rb +108 -0
  15. data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
  16. data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
  17. data/lib/canon/comparison/xml_comparator.rb +240 -23
  18. data/lib/canon/comparison/xml_node_comparison.rb +25 -3
  19. data/lib/canon/diff/diff_classifier.rb +119 -5
  20. data/lib/canon/diff/formatting_detector.rb +1 -1
  21. data/lib/canon/diff/xml_serialization_formatter.rb +153 -0
  22. data/lib/canon/rspec_matchers.rb +37 -8
  23. data/lib/canon/version.rb +1 -1
  24. data/lib/canon/xml/data_model.rb +24 -13
  25. metadata +4 -78
  26. data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
  27. data/false_positive_analysis.txt +0 -0
  28. data/file1.html +0 -1
  29. data/file2.html +0 -1
  30. data/old-docs/ADVANCED_TOPICS.adoc +0 -20
  31. data/old-docs/BASIC_USAGE.adoc +0 -16
  32. data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
  33. data/old-docs/CLI.adoc +0 -497
  34. data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
  35. data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
  36. data/old-docs/DIFF_FORMATTING.adoc +0 -540
  37. data/old-docs/DIFF_PARAMETERS.adoc +0 -261
  38. data/old-docs/DOM_DIFF.adoc +0 -1017
  39. data/old-docs/ENV_CONFIG.adoc +0 -876
  40. data/old-docs/FORMATS.adoc +0 -867
  41. data/old-docs/INPUT_VALIDATION.adoc +0 -477
  42. data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
  43. data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
  44. data/old-docs/MATCH_OPTIONS.adoc +0 -912
  45. data/old-docs/MODES.adoc +0 -432
  46. data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
  47. data/old-docs/OPTIONS.adoc +0 -1387
  48. data/old-docs/PREPROCESSING.adoc +0 -491
  49. data/old-docs/README.old.adoc +0 -2831
  50. data/old-docs/RSPEC.adoc +0 -814
  51. data/old-docs/RUBY_API.adoc +0 -485
  52. data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
  53. data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
  54. data/old-docs/STRING_COMPARE.adoc +0 -345
  55. data/old-docs/TMP.adoc +0 -3384
  56. data/old-docs/TREE_DIFF.adoc +0 -1080
  57. data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
  58. data/old-docs/VERBOSE.adoc +0 -482
  59. data/old-docs/VISUALIZATION_MAP.adoc +0 -625
  60. data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
  61. data/scripts/analyze_current_state.rb +0 -85
  62. data/scripts/analyze_false_positives.rb +0 -114
  63. data/scripts/analyze_remaining_failures.rb +0 -105
  64. data/scripts/compare_current_failures.rb +0 -95
  65. data/scripts/compare_dom_tree_diff.rb +0 -158
  66. data/scripts/compare_failures.rb +0 -151
  67. data/scripts/debug_attribute_extraction.rb +0 -66
  68. data/scripts/debug_blocks_839.rb +0 -115
  69. data/scripts/debug_meta_matching.rb +0 -52
  70. data/scripts/debug_p_matching.rb +0 -192
  71. data/scripts/debug_signature_matching.rb +0 -118
  72. data/scripts/debug_sourcecode_124.rb +0 -32
  73. data/scripts/debug_whitespace_sensitive.rb +0 -192
  74. data/scripts/extract_false_positives.rb +0 -138
  75. data/scripts/find_actual_false_positives.rb +0 -125
  76. data/scripts/investigate_all_false_positives.rb +0 -161
  77. data/scripts/investigate_batch1.rb +0 -127
  78. data/scripts/investigate_classification.rb +0 -150
  79. data/scripts/investigate_classification_detailed.rb +0 -190
  80. data/scripts/investigate_common_failures.rb +0 -342
  81. data/scripts/investigate_false_negative.rb +0 -80
  82. data/scripts/investigate_false_positive.rb +0 -83
  83. data/scripts/investigate_false_positives.rb +0 -227
  84. data/scripts/investigate_false_positives_batch.rb +0 -163
  85. data/scripts/investigate_mixed_content.rb +0 -125
  86. data/scripts/investigate_remaining_16.rb +0 -214
  87. data/scripts/run_single_test.rb +0 -29
  88. data/scripts/test_all_false_positives.rb +0 -95
  89. data/scripts/test_attribute_details.rb +0 -61
  90. data/scripts/test_both_algorithms.rb +0 -49
  91. data/scripts/test_both_simple.rb +0 -49
  92. data/scripts/test_enhanced_semantic_output.rb +0 -125
  93. data/scripts/test_readme_examples.rb +0 -131
  94. data/scripts/test_semantic_tree_diff.rb +0 -99
  95. data/scripts/test_semantic_ux_improvements.rb +0 -135
  96. data/scripts/test_single_false_positive.rb +0 -119
  97. data/scripts/test_size_limits.rb +0 -99
  98. data/test_html_1.html +0 -21
  99. data/test_html_2.html +0 -21
  100. data/test_nokogiri.rb +0 -33
  101. data/test_normalize.rb +0 -45
@@ -1,2831 +0,0 @@
1
- = Canon: Canonicalization for serialization formats
2
-
3
- Canon allows you to format, canonicalize, or compare various serialization
4
- formats, DOM-based (XML, HTML) or object-based (JSON, YAML).
5
-
6
- Its main features:
7
-
8
- * Canonicalization and pretty-printing for XML, HTML, JSON, and YAML
9
- * Comparison of XML, HTML, JSON, and YAML documents
10
-
11
-
12
- == Purpose
13
-
14
- Canon provides canonicalization and pretty-printing for various serialization
15
- formats (XML, HTML, JSON, YAML), producing standardized forms suitable for
16
- comparison, testing, digital signatures, and human-readable output.
17
-
18
-
19
- == Architecture
20
-
21
- Canon follows an **orchestrator pattern** with MECE (Mutually Exclusive,
22
- Collectively Exhaustive) principles for clean separation of concerns.
23
-
24
- === Comparison module
25
-
26
- The `Canon::Comparison` module (123 lines) acts as a pure orchestrator that:
27
-
28
- * Detects input format (XML, HTML, JSON, YAML)
29
- * Validates format compatibility
30
- * Delegates to format-specific comparator classes
31
-
32
- Format-specific comparators:
33
-
34
- * `Canon::Comparison::XmlComparator` - XML semantic comparison
35
- * `Canon::Comparison::HtmlComparator` - HTML semantic comparison
36
- * `Canon::Comparison::JsonComparator` - JSON/Ruby object comparison
37
- * `Canon::Comparison::YamlComparator` - YAML comparison (delegates to JsonComparator)
38
-
39
- Each comparator is self-contained and handles all comparison logic for its format.
40
-
41
- === DiffFormatter module
42
-
43
- The `Canon::DiffFormatter` class (171 lines) acts as a pure orchestrator that:
44
-
45
- * Manages diff options (colors, visualization, context)
46
- * Detects diff mode (by-object vs by-line)
47
- * Delegates to mode-specific and format-specific formatters
48
-
49
- Two diff modes:
50
-
51
- **By-object mode** (tree-based semantic diff):
52
-
53
- * `Canon::DiffFormatter::ByObject::BaseFormatter` - Factory and common logic
54
- * `Canon::DiffFormatter::ByObject::XmlFormatter` - XML DOM differences
55
- * `Canon::DiffFormatter::ByObject::JsonFormatter` - Ruby object differences
56
- * `Canon::DiffFormatter::ByObject::YamlFormatter` - YAML differences
57
-
58
- **By-line mode** (line-based diff):
59
-
60
- * `Canon::DiffFormatter::ByLine::BaseFormatter` - LCS algorithm and factory
61
- * `Canon::DiffFormatter::ByLine::XmlFormatter` - DOM-guided XML line diff
62
- * `Canon::DiffFormatter::ByLine::JsonFormatter` - Semantic JSON line diff
63
- * `Canon::DiffFormatter::ByLine::YamlFormatter` - Semantic YAML line diff
64
- * `Canon::DiffFormatter::ByLine::SimpleFormatter` - Fallback line diff
65
-
66
- Each formatter handles format-specific intelligence (DOM parsing, token
67
- highlighting, semantic understanding).
68
-
69
- === Object-oriented diff foundation
70
-
71
- Canon uses three foundational classes for managing diff data:
72
-
73
- * `Canon::Diff::DiffBlock` - Represents a contiguous block of changes
74
- * `Canon::Diff::DiffContext` - Groups diff blocks with surrounding context
75
- * `Canon::Diff::DiffReport` - Top-level container for complete diff results
76
-
77
- These classes ensure MECE compliance by providing clear ownership of diff data
78
- at different granularity levels.
79
-
80
-
81
- == Features
82
-
83
- === Ruby API
84
-
85
- Single API for working with all four formats (XML, HTML, JSON, YAML).
86
-
87
-
88
- === XML canonicalization
89
-
90
- Format XML documents according to the
91
- https://www.w3.org/TR/xml-c14n11/[W3C Canonical XML Version 1.1] specification.
92
-
93
- Key features:
94
-
95
- * Namespace declaration ordering (lexicographic by prefix)
96
- * Attribute ordering (lexicographic by namespace URI, then local name)
97
- * Character encoding normalization to UTF-8
98
- * Special character encoding in text and attributes
99
- * Removal of superfluous namespace declarations
100
- * Support for xml:base, xml:lang, xml:space, and xml:id attributes
101
- * Processing instruction and comment handling
102
- * Document subset support with attribute inheritance
103
-
104
- === HTML canonicalization
105
-
106
- Format HTML 4/5 and XHTML documents with consistent formatting. Automatically
107
- detects HTML vs XHTML and applies appropriate formatting.
108
-
109
- === YAML canonicalization
110
-
111
- Format YAML documents with keys sorted alphabetically at all levels of the
112
- structure.
113
-
114
- === JSON canonicalization
115
-
116
- Format JSON documents with keys sorted alphabetically at all levels of the
117
- structure.
118
-
119
-
120
- === Output modes
121
-
122
- Canon supports two output modes for all formats:
123
-
124
- `c14n` (canonical):: Compact output without indentation, suitable for digital
125
- signatures, hashing, and equivalence testing. Removes formatting whitespace.
126
-
127
- `pretty` (pretty-print):: Human-readable output with consistent indentation.
128
- Configurable indent size and type (spaces or tabs). This is the default mode for
129
- CLI commands.
130
-
131
-
132
- === RSpec matchers
133
-
134
- Provides matchers for testing equivalence between serialized formats.
135
-
136
- NOTE: RSpec matchers always use canonical (c14n) mode for comparison to ensure
137
- formatting differences don't affect test results.
138
-
139
- === Comparison API
140
-
141
- Canon provides a `Canon::Comparison` module for semantic comparison of HTML and
142
- XML documents.
143
-
144
- The `Canon::Comparison.equivalent?` method compares two documents for semantic
145
- equivalence, ignoring formatting differences that don't affect meaning.
146
-
147
- Key features:
148
-
149
- * Semantic comparison (content, not formatting)
150
- * Whitespace normalization
151
- * Comment handling (can ignore or include)
152
- * Attribute sorting
153
- * Support for both HTML and XML documents
154
- * Optional verbose diff output
155
-
156
- NOTE: `Canon::Comparison.equivalent?` adopts option names used by the excellent
157
- https://github.com/vkononov/compare-xml[`compare-xml` gem].
158
-
159
-
160
- == Installation
161
-
162
- Add this line to your application's Gemfile:
163
-
164
- [source,ruby]
165
- ----
166
- gem 'canon'
167
- ----
168
-
169
- And then execute:
170
-
171
- [source,bash]
172
- ----
173
- $ bundle install
174
- ----
175
-
176
- Or install it yourself as:
177
-
178
- [source,bash]
179
- ----
180
- $ gem install canon
181
- ----
182
-
183
-
184
- == Usage
185
-
186
- === Ruby API
187
-
188
- ==== Basic formatting (c14n mode)
189
-
190
- The `Canon.format` method produces canonical output by default.
191
-
192
- Syntax:
193
-
194
- [source,ruby]
195
- ----
196
- Canon.format({content}, {format})
197
- Canon.format_{format}({content}) # Format-specific shorthand
198
- ----
199
-
200
- Where,
201
-
202
- `{content}`:: The input string
203
- `{format}`:: The format type (`:xml`, `:html`, `:json`, or `:yaml`)
204
-
205
- .Canonical formatting examples
206
- [example]
207
- ====
208
- [source,ruby]
209
- ----
210
- require 'canon'
211
-
212
- # XML - compact canonical form
213
- xml = '<root><b>2</b><a>1</a></root>'
214
- Canon.format(xml, :xml)
215
- # => "<root><a>1</a><b>2</b></root>"
216
-
217
- Canon.format_xml(xml) # Shorthand
218
- # => "<root><a>1</a><b>2</b></root>"
219
-
220
- # HTML - compact canonical form
221
- html = '<div><p>Hello</p></div>'
222
- Canon.format(html, :html)
223
- Canon.format_html(html) # Shorthand
224
-
225
- # JSON - canonical with sorted keys
226
- json = '{"z":3,"a":1,"b":2}'
227
- Canon.format(json, :json)
228
- # => {"a":1,"b":2,"z":3}
229
-
230
- # YAML - canonical with sorted keys
231
- yaml = "z: 3\na: 1\nb: 2"
232
- Canon.format(yaml, :yaml)
233
- ----
234
- ====
235
-
236
- ==== Pretty-print mode
237
-
238
- For human-readable output with indentation, use the format-specific pretty
239
- printer classes.
240
-
241
- Syntax:
242
-
243
- [source,ruby]
244
- ----
245
- Canon::{Format}::PrettyPrinter.new(indent: {n}, indent_type: {type}).format({content})
246
- ----
247
-
248
- Where,
249
-
250
- `{Format}`:: The format module (`Xml`, `Html`, `Json`)
251
- `{n}`:: Number of spaces (default: 2) or tabs (use 1 for tabs)
252
- `{type}`:: Indentation type: `'space'` (default) or `'tab'`
253
- `{content}`:: The input string
254
-
255
- .Pretty-print examples
256
- [example]
257
- ====
258
- [source,ruby]
259
- ----
260
- require 'canon/xml/pretty_printer'
261
- require 'canon/html/pretty_printer'
262
- require 'canon/json/pretty_printer'
263
-
264
- xml_input = '<root><b>2</b><a>1</a></root>'
265
-
266
- # XML with 2-space indentation (default)
267
- Canon::Xml::PrettyPrinter.new(indent: 2).format(xml_input)
268
- # =>
269
- # <?xml version="1.0" encoding="UTF-8"?>
270
- # <root>
271
- # <a>1</a>
272
- # <b>2</b>
273
- # </root>
274
-
275
- # XML with 4-space indentation
276
- Canon::Xml::PrettyPrinter.new(indent: 4).format(xml_input)
277
-
278
- # XML with tab indentation
279
- Canon::Xml::PrettyPrinter.new(
280
- indent: 1,
281
- indent_type: 'tab'
282
- ).format(xml_input)
283
-
284
- # HTML with 2-space indentation
285
- html_input = '<div><p>Hello</p></div>'
286
- Canon::Html::PrettyPrinter.new(indent: 2).format(html_input)
287
-
288
- # JSON with 2-space indentation
289
- json_input = '{"z":3,"a":{"b":1}}'
290
- Canon::Json::PrettyPrinter.new(indent: 2).format(json_input)
291
-
292
- # JSON with tab indentation
293
- Canon::Json::PrettyPrinter.new(
294
- indent: 1,
295
- indent_type: 'tab'
296
- ).format(json_input)
297
- ----
298
- ====
299
-
300
- ==== Parsing
301
-
302
- The `Canon.parse` method parses content into Ruby objects or Nokogiri documents.
303
-
304
- Syntax:
305
-
306
- [source,ruby]
307
- ----
308
- Canon.parse({content}, {format})
309
- Canon.parse_{format}({content}) # Format-specific shorthand
310
- ----
311
-
312
- Where,
313
-
314
- `{content}`:: The input string
315
- `{format}`:: The format type (`:xml`, `:html`, `:json`, or `:yaml`)
316
-
317
- .Parsing examples
318
- [example]
319
- ====
320
- [source,ruby]
321
- ----
322
- # Parse XML → Nokogiri::XML::Document
323
- xml_doc = Canon.parse(xml_input, :xml)
324
- xml_doc = Canon.parse_xml(xml_input)
325
-
326
- # Parse HTML → Nokogiri::HTML5::Document (or XML::Document for XHTML)
327
- html_doc = Canon.parse(html_input, :html)
328
- html_doc = Canon.parse_html(html_input)
329
-
330
- # Parse JSON → Ruby Hash/Array
331
- json_obj = Canon.parse(json_input, :json)
332
- json_obj = Canon.parse_json(json_input)
333
-
334
- # Parse YAML → Ruby Hash/Array
335
- yaml_obj = Canon.parse(yaml_input, :yaml)
336
- yaml_obj = Canon.parse_yaml(yaml_input)
337
- ----
338
- ====
339
-
340
- ==== Comparison
341
-
342
- ===== General
343
-
344
- The `Canon::Comparison.equivalent?` method compares two HTML or XML documents.
345
-
346
- The Comparison module uses a depth-first comparison based on the two DOM trees
347
- by traversing them in parallel and comparing nodes.
348
-
349
- In XML mode:
350
-
351
- * Parsing: accepts Moxml (`Moxml::Document`) or Nokogiri
352
- (`Nokogiri::XML::Document`)
353
- * Comments: normalized and compared unless `ignore_comments: true`
354
- * Whitespace: collapses whitespace in text nodes unless `collapse_whitespace: false`
355
- * Sorts attributes alphabetically before comparison
356
-
357
- In HTML mode:
358
-
359
- * Parsing: accepts Nokogiri (`Nokogiri::HTML5` or `Nokogiri::HTML`)
360
- * Normalizes HTML comments in `<style>` and `<script>` tags
361
- * Sorts attributes alphabetically before comparison
362
- * Collapses whitespace for text content comparison
363
- * Removes empty text nodes between elements
364
-
365
- [NOTE]
366
- ====
367
- The comparison module is automatically used by Canon's RSpec matchers
368
- (`be_html_equivalent_to`, `be_xml_equivalent_to`, etc.) to provide reliable
369
- semantic comparison in tests.
370
- ====
371
-
372
- ===== Basic usage
373
-
374
- Syntax:
375
-
376
- [source,ruby]
377
- ----
378
- Canon::Comparison.equivalent?({doc1}, {doc2}, {options})
379
- ----
380
-
381
- Where,
382
-
383
- `{doc1}`:: First document object (String, Nokogiri::HTML::Document, or supported XML document)
384
- `{doc2}`:: Second document object (String, Nokogiri::HTML::Document, or supported XML document)
385
- `{options}`:: Hash of comparison options (optional)
386
-
387
- Canon::Comparison for XML supports Moxml::Document and Nokogiri::XML::Document
388
- as input.
389
-
390
- Returns:
391
-
392
- * `true` if documents are equivalent
393
- * `false` if documents differ
394
- * `Array` of differences if `verbose: true` option is set
395
-
396
- .Basic comparison examples
397
- [example]
398
- ====
399
- [source,ruby]
400
- ----
401
- require 'canon/comparison'
402
-
403
- # HTML comparison - ignores whitespace and comments by default
404
- html1 = '<div><p>Hello</p></div>'
405
- html2 = '<div> <p> Hello </p> </div>'
406
- Canon::Comparison.equivalent?(html1, html2)
407
- # => true
408
-
409
- # HTML with different content
410
- html3 = '<div><p>Goodbye</p></div>'
411
- Canon::Comparison.equivalent?(html1, html3)
412
- # => false
413
-
414
- # XML comparison
415
- xml1 = '<root><a>1</a><b>2</b></root>'
416
- xml2 = '<root> <b>2</b> <a>1</a> </root>'
417
- Canon::Comparison.equivalent?(xml1, xml2)
418
- # => true
419
-
420
- # With Nokogiri documents
421
- doc1 = Nokogiri::HTML5(html1)
422
- doc2 = Nokogiri::HTML5(html2)
423
- Canon::Comparison.equivalent?(doc1, doc2)
424
- # => true
425
- ----
426
- ====
427
-
428
-
429
-
430
- ===== Options at a glance
431
-
432
- The `Canon::Comparison.equivalent?` method has a variety of options that tailor
433
- comparison behavior.
434
-
435
- The following options control comparison behavior:
436
-
437
- `collapse_whitespace`:: (default: `true`) when `true`, trims and collapses whitespace
438
- (<<collapse_whitespace>>)
439
-
440
- `normalize_tag_whitespace`:: (default: `false`) when `true`, normalizes whitespace
441
- boundaries around tags for flexible comparison (<<normalize_tag_whitespace>>)
442
-
443
- `ignore_comments`:: (default: `true`) when `true`, ignores HTML/XML comments
444
- (<<ignore_comments>>)
445
-
446
- `ignore_attr_order`:: (default: `true`) when `true`, ignores attribute ordering
447
- (<<ignore_attr_order>>)
448
-
449
- `ignore_text_nodes`:: (default: `false`) when `true`, ignores all text content
450
- (<<ignore_text_nodes>>)
451
-
452
- `verbose`:: (default: `false`) when `true`, returns array of differences instead of boolean
453
- (<<verbose>>)
454
-
455
-
456
- [[collapse_whitespace]]
457
- ==== collapse_whitespace
458
-
459
- `collapse_whitespace: {true|false}` default: `true`
460
-
461
- When `true`, all text content within the document is trimmed (i.e. space removed
462
- from left and right) and whitespace is collapsed (i.e. tabs, new lines, multiple
463
- whitespace characters are replaced by a single whitespace).
464
-
465
- XML mode:: Whitespace is collapsed in text nodes only. Whitespace within
466
- attribute values is preserved.
467
-
468
- HTML mode:: Whitespace is collapsed in text nodes only. Whitespace within
469
- attribute values is preserved. Additionally, empty text nodes between elements
470
- are removed.
471
-
472
- Usage:
473
-
474
- [source,ruby]
475
- ----
476
- Canon::Comparison.equivalent?(doc1, doc2, collapse_whitespace: true)
477
- ----
478
-
479
- .HTML examples with collapse_whitespace
480
- [example]
481
- ====
482
- When `true` the following HTML strings are considered equal:
483
-
484
- [source,html]
485
- ----
486
- <a href="/admin"> SOME TEXT CONTENT </a>
487
- <a href="/admin">SOME TEXT CONTENT</a>
488
- ----
489
-
490
- [source,ruby]
491
- ----
492
- html1 = '<a href="/admin"> SOME TEXT CONTENT </a>'
493
- html2 = '<a href="/admin">SOME TEXT CONTENT</a>'
494
- Canon::Comparison.equivalent?(html1, html2, collapse_whitespace: true)
495
- # => true
496
- ----
497
-
498
- When `true` the following HTML strings are considered equal:
499
-
500
- [source,html]
501
- ----
502
- <html>
503
- <title>
504
- This is my title
505
- </title>
506
- </html>
507
-
508
- <html><title>This is my title</title></html>
509
- ----
510
-
511
- [source,ruby]
512
- ----
513
- html1 = <<~HTML
514
- <html>
515
- <title>
516
- This is my title
517
- </title>
518
- </html>
519
- HTML
520
- html2 = '<html><title>This is my title</title></html>'
521
- Canon::Comparison.equivalent?(html1, html2, collapse_whitespace: true)
522
- # => true
523
- ----
524
- ====
525
-
526
- .XML examples with collapse_whitespace
527
- [example]
528
- ====
529
- When `true` the following XML strings are considered equal:
530
-
531
- [source,xml]
532
- ----
533
- <root>
534
- <item> Some text </item>
535
- </root>
536
-
537
- <root><item>Some text</item></root>
538
- ----
539
-
540
- [source,ruby]
541
- ----
542
- xml1 = "<root>\n <item> Some text </item>\n</root>"
543
- xml2 = '<root><item>Some text</item></root>'
544
- Canon::Comparison.equivalent?(xml1, xml2, collapse_whitespace: true)
545
- # => true
546
- ----
547
- ====
548
-
549
- [[normalize_tag_whitespace]]
550
- ==== normalize_tag_whitespace
551
-
552
- `normalize_tag_whitespace: {true|false}` default: `false`
553
-
554
- When `true`, normalizes whitespace at tag boundaries by collapsing multiple
555
- whitespace characters (spaces, tabs, newlines) to a single space and stripping
556
- leading/trailing whitespace from text nodes. This enables "forgiving whitespace
557
- mode" for comparing documents that use different pretty-print formatting while
558
- maintaining the same semantic content.
559
-
560
- This option is specifically designed for comparing documents where:
561
-
562
- * One document is compact (no indentation/line breaks)
563
- * The other document is pretty-printed (with indentation/line breaks)
564
- * You want to ignore these formatting differences
565
-
566
- [NOTE]
567
- `normalize_tag_whitespace` is more aggressive than `collapse_whitespace`:
568
-
569
- * `collapse_whitespace` only trims and collapses whitespace within text content
570
- * `normalize_tag_whitespace` additionally handles whitespace at tag boundaries,
571
- making it suitable for comparing compact vs pretty-printed documents
572
-
573
- When both options are enabled, `normalize_tag_whitespace` takes precedence.
574
-
575
- Usage:
576
-
577
- [source,ruby]
578
- ----
579
- Canon::Comparison.equivalent?(doc1, doc2, normalize_tag_whitespace: true)
580
- ----
581
-
582
- .When to use normalize_tag_whitespace
583
- [example]
584
- Use this option when:
585
-
586
- 1. **Comparing generated output with expected fixtures**: Your test generates
587
- pretty-printed XML/HTML but your fixture is compact (or vice versa)
588
-
589
- 2. **Mixed formatting in test suites**: Some tests use pretty-printed expected
590
- values while others use compact format
591
-
592
- 3. **Flexible test fixtures**: You want to maintain human-readable test fixtures
593
- with indentation but compare them against compact generated output
594
-
595
- 4. **Format-agnostic testing**: Testing semantic equivalence regardless of
596
- whether the output is formatted or compact
597
-
598
- .XML examples with normalize_tag_whitespace
599
- [example]
600
- When `true`, documents with different tag boundary whitespace are considered equal:
601
-
602
- [source,xml]
603
- ----
604
- <!-- Pretty-printed with line breaks and indentation -->
605
- <root>
606
- <item>
607
- <name>Widget</name>
608
- <price>10.00</price>
609
- </item>
610
- </root>
611
-
612
- <!-- Compact on a single line -->
613
- <root><item><name>Widget</name><price>10.00</price></item></root>
614
- ----
615
-
616
- [source,ruby]
617
- ----
618
- pretty = <<~XML
619
- <root>
620
- <item>
621
- <name>Widget</name>
622
- <price>10.00</price>
623
- </item>
624
- </root>
625
- XML
626
-
627
- compact = '<root><item><name>Widget</name><price>10.00</price></item></root>'
628
-
629
- Canon::Comparison.equivalent?(pretty, compact, normalize_tag_whitespace: true)
630
- # => true
631
- ----
632
-
633
- When `false` (default), the whitespace differences matter:
634
-
635
- [source,ruby]
636
- ----
637
- Canon::Comparison.equivalent?(pretty, compact, normalize_tag_whitespace: false)
638
- # => false (whitespace at tag boundaries differs)
639
- ----
640
-
641
- This also handles complex nested structures:
642
-
643
- [source,xml]
644
- ----
645
- <!-- Pretty-printed -->
646
- <document>
647
- <metadata>
648
- <title>My Document</title>
649
- <author>
650
- <name>John Doe</name>
651
- <email>john@example.com</email>
652
- </author>
653
- </metadata>
654
- </document>
655
-
656
- <!-- Compact -->
657
- <document><metadata><title>My Document</title><author><name>John Doe</name><email>john@example.com</email></author></metadata></document>
658
- ----
659
-
660
- [source,ruby]
661
- ----
662
- pretty_doc = <<~XML
663
- <document>
664
- <metadata>
665
- <title>My Document</title>
666
- <author>
667
- <name>John Doe</name>
668
- <email>john@example.com</email>
669
- </author>
670
- </metadata>
671
- </document>
672
- XML
673
-
674
- compact_doc = '<document><metadata><title>My Document</title><author><name>John Doe</name><email>john@example.com</email></author></metadata></document>'
675
-
676
- Canon::Comparison.equivalent?(pretty_doc, compact_doc, normalize_tag_whitespace: true)
677
- # => true
678
- ----
679
-
680
- .HTML examples with normalize_tag_whitespace
681
- [example]
682
- When `true`, HTML with different formatting is considered equal:
683
-
684
- [source,html]
685
- ----
686
- <!-- Pretty-printed -->
687
- <div class="container">
688
- <header>
689
- <h1>Welcome</h1>
690
- <p>Introduction text</p>
691
- </header>
692
- </div>
693
-
694
- <!-- Compact -->
695
- <div class="container"><header><h1>Welcome</h1><p>Introduction text</p></header></div>
696
- ----
697
-
698
- [source,ruby]
699
- ----
700
- pretty_html = <<~HTML
701
- <div class="container">
702
- <header>
703
- <h1>Welcome</h1>
704
- <p>Introduction text</p>
705
- </header>
706
- </div>
707
- HTML
708
-
709
- compact_html = '<div class="container"><header><h1>Welcome</h1><p>Introduction text</p></header></div>'
710
-
711
- Canon::Comparison.equivalent?(pretty_html, compact_html, normalize_tag_whitespace: true)
712
- # => true
713
- ----
714
-
715
- .RSpec configuration for normalize_tag_whitespace
716
- [example]
717
- For test suites that consistently need forgiving whitespace mode, configure it
718
- globally:
719
-
720
- [source,ruby]
721
- ----
722
- # spec/spec_helper.rb
723
- require 'canon/rspec_matchers'
724
-
725
- RSpec.configure do |config|
726
- # Enable forgiving whitespace mode globally for all Canon matchers
727
- Canon::RSpecMatchers.configure do |canon_config|
728
- canon_config.normalize_tag_whitespace = true
729
- end
730
- end
731
-
732
- # Now all XML/HTML comparisons will use forgiving whitespace mode
733
- RSpec.describe 'My tests' do
734
- it 'compares pretty-printed with compact XML' do
735
- pretty_xml = <<~XML
736
- <root>
737
- <item>Value</item>
738
- </root>
739
- XML
740
-
741
- compact_xml = '<root><item>Value</item></root>'
742
-
743
- # These will be considered equivalent due to global configuration
744
- expect(pretty_xml).to be_xml_equivalent_to(compact_xml)
745
- end
746
-
747
- it 'compares HTML with different formatting' do
748
- pretty_html = <<~HTML
749
- <div>
750
- <p>Content</p>
751
- </div>
752
- HTML
753
-
754
- compact_html = '<div><p>Content</p></div>'
755
-
756
- expect(pretty_html).to be_html_equivalent_to(compact_html)
757
- end
758
- end
759
- ----
760
-
761
- To disable it for specific tests when globally enabled:
762
-
763
- [source,ruby]
764
- ----
765
- # This test needs exact whitespace matching
766
- it 'checks exact whitespace' do
767
- # Temporarily disable normalize_tag_whitespace
768
- original = Canon::RSpecMatchers.normalize_tag_whitespace
769
- Canon::RSpecMatchers.normalize_tag_whitespace = false
770
-
771
- begin
772
- expect(xml1).to be_xml_equivalent_to(xml2)
773
- ensure
774
- Canon::RSpecMatchers.normalize_tag_whitespace = original
775
- end
776
- end
777
- ----
778
-
779
- .Comparison with collapse_whitespace
780
- [example]
781
- Understanding the difference between `collapse_whitespace` and
782
- `normalize_tag_whitespace`:
783
-
784
- [source,ruby]
785
- ----
786
- # Example XML with whitespace variations
787
- pretty = '<root> <item> Value </item> </root>'
788
- compact = '<root><item>Value</item></root>'
789
-
790
- # With collapse_whitespace only (default)
791
- Canon::Comparison.equivalent?(
792
- pretty,
793
- compact,
794
- collapse_whitespace: true,
795
- normalize_tag_whitespace: false
796
- )
797
- # => false
798
- # Reason: Whitespace at tag boundaries (spaces between > and <) differs
799
-
800
- # With normalize_tag_whitespace
801
- Canon::Comparison.equivalent?(
802
- pretty,
803
- compact,
804
- normalize_tag_whitespace: true
805
- )
806
- # => true
807
- # Reason: All whitespace at tag boundaries is normalized
808
- ----
809
-
810
- Key differences:
811
-
812
- |===
813
- |Feature |collapse_whitespace |normalize_tag_whitespace
814
-
815
- |Trims text content
816
- |✓
817
- |✓
818
-
819
- |Collapses internal whitespace
820
- |✓
821
- |✓
822
-
823
- |Normalizes tag boundaries
824
- |✗
825
- |✓
826
-
827
- |Use case
828
- |Flexible text comparison
829
- |Flexible format comparison
830
- |===
831
-
832
- [[ignore_attr_order]]
833
- ==== ignore_attr_order
834
-
835
- `ignore_attr_order: {true|false}` default: `true`
836
-
837
- When `true`, all attributes are sorted before comparison and only attributes of
838
- the same type are compared.
839
-
840
- Usage:
841
-
842
- [source,ruby]
843
- ----
844
- Canon::Comparison.equivalent?(doc1, doc2, ignore_attr_order: true)
845
- ----
846
-
847
- .HTML examples with ignore_attr_order
848
- [example]
849
- ====
850
- When `true` the following HTML strings are considered equal:
851
-
852
- [source,html]
853
- ----
854
- <a href="/admin" class="button" target="_blank">Link</a>
855
- <a class="button" target="_blank" href="/admin">Link</a>
856
- ----
857
-
858
- [source,ruby]
859
- ----
860
- html1 = '<a href="/admin" class="button" target="_blank">Link</a>'
861
- html2 = '<a class="button" target="_blank" href="/admin">Link</a>'
862
- Canon::Comparison.equivalent?(html1, html2, ignore_attr_order: true)
863
- # => true
864
- ----
865
-
866
- When `false` attributes are compared in order:
867
-
868
- [source,ruby]
869
- ----
870
- html1 = '<a href="/admin" class="button">Link</a>'
871
- html2 = '<a class="button" href="/admin">Link</a>'
872
- Canon::Comparison.equivalent?(html1, html2, ignore_attr_order: false)
873
- # => false
874
- ----
875
- ====
876
-
877
- .XML examples with ignore_attr_order
878
- [example]
879
- ====
880
- When `true` the following XML strings are considered equal:
881
-
882
- [source,xml]
883
- ----
884
- <item id="1" name="Widget" price="10.00"/>
885
- <item price="10.00" id="1" name="Widget"/>
886
- ----
887
-
888
- [source,ruby]
889
- ----
890
- xml1 = '<item id="1" name="Widget" price="10.00"/>'
891
- xml2 = '<item price="10.00" id="1" name="Widget"/>'
892
- Canon::Comparison.equivalent?(xml1, xml2, ignore_attr_order: true)
893
- # => true
894
- ----
895
- ====
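As a rough illustration of what order-insensitive attribute comparison means, the sketch below compares attributes held as name/value pairs. It assumes attributes have already been extracted from the element; `same_attributes?` is a hypothetical helper, not a Canon method.

```ruby
# Hypothetical helper for illustration only; not part of Canon's API.
# Attributes are modelled as arrays of [name, value] pairs in document order.
def same_attributes?(attrs1, attrs2, ignore_attr_order: true)
  return attrs1 == attrs2 unless ignore_attr_order

  # Sorting both sides removes any dependence on the original order.
  attrs1.sort == attrs2.sort
end

a = [["href", "/admin"], ["class", "button"]]
b = [["class", "button"], ["href", "/admin"]]

puts same_attributes?(a, b)                            # true
puts same_attributes?(a, b, ignore_attr_order: false)  # false
```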
896
-
897
- [[ignore_comments]]
898
- ==== ignore_comments
899
-
900
- `ignore_comments: {true|false}` default: `true`
901
-
902
- When `true`, ignores comments such as `<!-- This is a comment -->`.
903
-
904
- Usage:
905
-
906
- [source,ruby]
907
- ----
908
- Canon::Comparison.equivalent?(doc1, doc2, ignore_comments: true)
909
- ----
910
-
911
- .HTML examples with ignore_comments
912
- [example]
913
- ====
914
- When `true` the following HTML strings are considered equal:
915
-
916
- [source,html]
917
- ----
918
- <!-- This is a comment -->
919
- <!-- This is another comment -->
920
- ----
921
-
922
- [source,ruby]
923
- ----
924
- html1 = '<!-- This is a comment -->'
925
- html2 = '<!-- This is another comment -->'
926
- Canon::Comparison.equivalent?(html1, html2, ignore_comments: true)
927
- # => true
928
- ----
929
-
930
- When `true` the following HTML strings are considered equal:
931
-
932
- [source,html]
933
- ----
934
- <a href="/admin"><!-- This is a comment -->Link</a>
935
- <a href="/admin">Link</a>
936
- ----
937
-
938
- [source,ruby]
939
- ----
940
- html1 = '<a href="/admin"><!-- This is a comment -->Link</a>'
941
- html2 = '<a href="/admin">Link</a>'
942
- Canon::Comparison.equivalent?(html1, html2, ignore_comments: true)
943
- # => true
944
- ----
945
-
946
- When `false` comments are compared:
947
-
948
- [source,ruby]
949
- ----
950
- html1 = '<div><!-- comment 1 --><p>Text</p></div>'
951
- html2 = '<div><!-- comment 2 --><p>Text</p></div>'
952
- Canon::Comparison.equivalent?(html1, html2, ignore_comments: false)
953
- # => false
954
- ----
955
- ====
956
-
957
- .XML examples with ignore_comments
958
- [example]
959
- ====
960
- When `true` the following XML strings are considered equal:
961
-
962
- [source,xml]
963
- ----
964
- <root>
965
- <!-- First comment -->
966
- <item>Data</item>
967
- </root>
968
-
969
- <root>
970
- <!-- Different comment -->
971
- <item>Data</item>
972
- </root>
973
- ----
974
-
975
- [source,ruby]
976
- ----
977
- xml1 = '<root><!-- First comment --><item>Data</item></root>'
978
- xml2 = '<root><!-- Different comment --><item>Data</item></root>'
979
- Canon::Comparison.equivalent?(xml1, xml2, ignore_comments: true)
980
- # => true
981
- ----
982
- ====
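The effect of `ignore_comments: true` can be pictured as removing comments before comparing. The sketch below does this with a regular expression on strings; it is an assumption-laden simplification (`strip_comments` is a hypothetical helper, and a regex is not a substitute for a real parser).

```ruby
# Hypothetical helper for illustration only; not part of Canon's API.
# Removes <!-- ... --> comments, including comments that span several lines.
def strip_comments(markup)
  markup.gsub(/<!--.*?-->/m, "")
end

xml1 = '<root><!-- First comment --><item>Data</item></root>'
xml2 = '<root><!-- Different comment --><item>Data</item></root>'

puts strip_comments(xml1) == strip_comments(xml2)  # true
puts xml1 == xml2                                  # false
```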
983
-
984
- [[ignore_text_nodes]]
985
- ==== ignore_text_nodes
986
-
987
- `ignore_text_nodes: {true|false}` default: `false`
988
-
989
- When `true`, ignores all text content. Text content is anything that is included
990
- between an opening and a closing tag, e.g. `<tag>THIS IS TEXT CONTENT</tag>`.
991
-
992
- Usage:
993
-
994
- [source,ruby]
995
- ----
996
- Canon::Comparison.equivalent?(doc1, doc2, ignore_text_nodes: true)
997
- ----
998
-
999
- .HTML examples with ignore_text_nodes
1000
- [example]
1001
- ====
1002
- When `true` the following HTML strings are considered equal:
1003
-
1004
- [source,html]
1005
- ----
1006
- <a href="/admin">SOME TEXT CONTENT</a>
1007
- <a href="/admin">DIFFERENT TEXT CONTENT</a>
1008
- ----
1009
-
1010
- [source,ruby]
1011
- ----
1012
- html1 = '<a href="/admin">SOME TEXT CONTENT</a>'
1013
- html2 = '<a href="/admin">DIFFERENT TEXT CONTENT</a>'
1014
- Canon::Comparison.equivalent?(html1, html2, ignore_text_nodes: true)
1015
- # => true
1016
- ----
1017
-
1018
- When `true` the following HTML strings are considered equal:
1019
-
1020
- [source,html]
1021
- ----
1022
- <i class="icon"></i><b>Warning:</b>
1023
- <i class="icon"></i><b>Message:</b>
1024
- ----
1025
-
1026
- [source,ruby]
1027
- ----
1028
- html1 = '<i class="icon"></i><b>Warning:</b>'
1029
- html2 = '<i class="icon"></i><b>Message:</b>'
1030
- Canon::Comparison.equivalent?(html1, html2, ignore_text_nodes: true)
1031
- # => true
1032
- ----
1033
-
1034
- When `false` text content is compared:
1035
-
1036
- [source,ruby]
1037
- ----
1038
- html1 = '<p>Hello</p>'
1039
- html2 = '<p>Goodbye</p>'
1040
- Canon::Comparison.equivalent?(html1, html2, ignore_text_nodes: false)
1041
- # => false
1042
- ----
1043
- ====
1044
-
1045
- .XML examples with ignore_text_nodes
1046
- [example]
1047
- ====
1048
- When `true` the following XML strings are considered equal:
1049
-
1050
- [source,xml]
1051
- ----
1052
- <item>First value</item>
1053
- <item>Second value</item>
1054
- ----
1055
-
1056
- [source,ruby]
1057
- ----
1058
- xml1 = '<item>First value</item>'
1059
- xml2 = '<item>Second value</item>'
1060
- Canon::Comparison.equivalent?(xml1, xml2, ignore_text_nodes: true)
1061
- # => true
1062
- ----
1063
- ====
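In the same spirit, ignoring text nodes amounts to comparing only the element structure. A minimal string-level sketch, assuming no `<` or `>` characters appear inside attribute values (`strip_text` is a hypothetical helper, not a Canon method):

```ruby
# Hypothetical helper for illustration only; not part of Canon's API.
# Drops everything between a closing ">" and the next opening "<".
def strip_text(markup)
  markup.gsub(/>[^<]*</, "><")
end

html1 = '<i class="icon"></i><b>Warning:</b>'
html2 = '<i class="icon"></i><b>Message:</b>'

puts strip_text(html1)                       # <i class="icon"></i><b></b>
puts strip_text(html1) == strip_text(html2)  # true
```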
1064
-
1065
- [[verbose]]
1066
- ==== verbose
1067
-
1068
- `verbose: {true|false}` default: `false`
1069
-
1070
- When `true`, `Canon::Comparison.equivalent?` returns an array of all differences
1070
- found during the comparison instead of a boolean value.
1072
-
1073
- WARNING: When `true`, the comparison takes longer. More processing is required
1073
- to produce meaningful differences, and the comparison does **NOT** stop at the
1074
- first difference it encounters, because the goal is to capture as many
1075
- differences as possible.
1077
-
1078
- Usage:
1079
-
1080
- [source,ruby]
1081
- ----
1082
- Canon::Comparison.equivalent?(doc1, doc2, verbose: true)
1083
- ----
1084
-
1085
- Return values in verbose mode:
1086
-
1087
- * Empty array `[]` if documents are equivalent
1088
- * Array of difference hashes if documents differ
1089
-
1090
- Each difference hash contains:
1091
-
1092
- `node1`:: The first node involved in the difference
1093
- `node2`:: The second node involved in the difference
1094
- `diff1`:: Difference code for the first node
1095
- `diff2`:: Difference code for the second node
1096
-
1097
- Difference codes:
1098
-
1099
- * `Canon::Comparison::EQUIVALENT` (1) - Nodes are equivalent
1100
- * `Canon::Comparison::MISSING_ATTRIBUTE` (2) - Attribute missing
1101
- * `Canon::Comparison::MISSING_NODE` (3) - Node missing
1102
- * `Canon::Comparison::UNEQUAL_ATTRIBUTES` (4) - Attributes differ
1103
- * `Canon::Comparison::UNEQUAL_COMMENTS` (5) - Comments differ
1104
- * `Canon::Comparison::UNEQUAL_ELEMENTS` (7) - Element names differ
1105
- * `Canon::Comparison::UNEQUAL_NODES_TYPES` (8) - Node types differ
1106
- * `Canon::Comparison::UNEQUAL_TEXT_CONTENTS` (9) - Text content differs
1107
-
1108
- .Verbose mode examples
1109
- [example]
1110
- ====
1111
- [source,ruby]
1112
- ----
1113
- # Verbose mode with equivalent documents
1114
- html1 = '<div>Hello</div>'
1115
- html2 = '<div>Hello</div>'
1116
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
1117
- # => [] (empty array indicates equivalence)
1118
-
1119
- # Verbose mode with different text content
1120
- html1 = '<div>Hello</div>'
1121
- html2 = '<div>Goodbye</div>'
1122
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
1123
- # => [{
1124
- # node1: <Nokogiri::XML::Text>,
1125
- # node2: <Nokogiri::XML::Text>,
1126
- # diff1: 9, # UNEQUAL_TEXT_CONTENTS
1127
- # diff2: 9 # UNEQUAL_TEXT_CONTENTS
1128
- # }]
1129
-
1130
- # Verbose mode with different element names
1131
- html1 = '<div>Test</div>'
1132
- html2 = '<span>Test</span>'
1133
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
1134
- # => [{
1135
- # node1: <Nokogiri::XML::Element: div>,
1136
- # node2: <Nokogiri::XML::Element: span>,
1137
- # diff1: 7, # UNEQUAL_ELEMENTS
1138
- # diff2: 7 # UNEQUAL_ELEMENTS
1139
- # }]
1140
-
1141
- # Verbose mode with missing attributes
1142
- html1 = '<div class="foo" id="bar">Test</div>'
1143
- html2 = '<div class="foo">Test</div>'
1144
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
1145
- # => [{
1146
- # node1: <Nokogiri::XML::Element: div>,
1147
- # node2: <Nokogiri::XML::Element: div>,
1148
- # diff1: 2, # MISSING_ATTRIBUTE
1149
- # diff2: 2 # MISSING_ATTRIBUTE
1150
- # }]
1151
-
1152
- # Check difference type programmatically
1153
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
1154
- if result.empty?
1155
- puts "Documents are equivalent"
1156
- else
1157
- result.each do |diff|
1158
- case diff[:diff1]
1159
- when Canon::Comparison::UNEQUAL_TEXT_CONTENTS
1160
- puts "Text content differs"
1161
- when Canon::Comparison::UNEQUAL_ELEMENTS
1162
- puts "Element names differ"
1163
- when Canon::Comparison::MISSING_ATTRIBUTE
1164
- puts "Attribute is missing"
1165
- end
1166
- end
1167
- end
1168
- ----
1169
- ====
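A verbose result can also be summarized by difference code. The sketch below does not call Canon: it works on a hand-written sample shaped like the difference hashes described above (with `node1`/`node2` omitted), and `DIFF_NAMES` is a local table that simply mirrors the codes listed in this section.

```ruby
# Local lookup table mirroring the difference codes listed above.
# It is an illustration, not a constant provided by Canon.
DIFF_NAMES = {
  1 => "EQUIVALENT",
  2 => "MISSING_ATTRIBUTE",
  3 => "MISSING_NODE",
  4 => "UNEQUAL_ATTRIBUTES",
  5 => "UNEQUAL_COMMENTS",
  7 => "UNEQUAL_ELEMENTS",
  8 => "UNEQUAL_NODES_TYPES",
  9 => "UNEQUAL_TEXT_CONTENTS"
}.freeze

# Count how often each difference code occurs (hypothetical helper).
def summarize(differences)
  differences
    .map { |diff| diff[:diff1] }
    .tally
    .transform_keys { |code| DIFF_NAMES.fetch(code, "UNKNOWN(#{code})") }
end

sample = [
  { diff1: 9, diff2: 9 },
  { diff1: 7, diff2: 7 },
  { diff1: 9, diff2: 9 }
]

p summarize(sample)
```

Because the summary is a plain hash, it can be printed in a test failure message or checked against a threshold.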
1170
-
1171
-
1172
- === Diff formatting configuration
1173
-
1174
- ==== General
1175
-
1176
- Canon provides comprehensive diff formatting capabilities across three interfaces:
1177
- RSpec matchers, CLI commands, and the Ruby API. All interfaces support the same
1178
- set of parameters for consistent behavior.
1179
-
1180
- ==== Parameters
1181
-
1182
- The following table shows all available diff formatting parameters and their
1183
- availability across interfaces:
1184
-
1185
- [cols="1,1,1,1,2,1"]
1186
- |===
1187
- |Parameter |RSpec |CLI |Ruby API |Description |Default
1188
-
1189
- |`use_color`
1190
- |✓
1191
- |✓
1192
- |✓
1193
- |Enable/disable colored output
1194
- |`true`
1195
-
1196
- |`diff_mode`
1197
- |✓
1198
- |✓
1199
- |✓
1200
- |Comparison mode: `:by_object` or `:by_line`
1201
- |`:by_line` (RSpec), `:by_object` (XML/JSON/YAML)
1202
-
1203
- |`context_lines`
1204
- |✓
1205
- |✓
1206
- |✓
1207
- |Number of unchanged lines to show around each change
1208
- |`3`
1209
-
1210
- |`diff_grouping_lines`
1211
- |✓
1212
- |✓
1213
- |✓
1214
- |Maximum line distance to group separate diffs into context blocks
1215
- |`10`
1216
- |===
1217
-
1218
- ==== Interface-specific usage
1219
-
1220
- ===== RSpec matchers configuration
1221
-
1222
- Configure diff formatting for RSpec matchers using `Canon::RspecMatchers`:
1223
-
1224
- [source,ruby]
1225
- ----
1226
- require 'canon/rspec_matchers'
1227
-
1228
- # Configure globally for all matchers
1229
- Canon::RspecMatchers.diff_mode = :by_object
1230
- Canon::RspecMatchers.use_color = true
1231
- Canon::RspecMatchers.context_lines = 5
1232
- Canon::RspecMatchers.diff_grouping_lines = 10
1233
-
1234
- # Use in specs
1235
- RSpec.describe 'My comparison' do
1236
- it 'shows formatted diff' do
1237
- expect(actual_xml).to be_xml_equivalent_to(expected_xml)
1238
- end
1239
- end
1240
- ----
1241
-
1242
- ===== CLI usage
1243
-
1244
- Pass options to the `canon diff` command:
1245
-
1246
- [source,bash]
1247
- ----
1248
- # Basic diff with default settings
1249
- $ canon diff file1.xml file2.xml --verbose
1250
-
1251
- # Customize diff output
1252
- $ canon diff file1.xml file2.xml \
1253
- --verbose \
1254
- --by-line \
1255
- --no-color \
1256
- --context-lines 5 \
1257
- --diff-grouping-lines 10
1258
- ----
1259
-
1260
- ===== Ruby API usage
1261
-
1262
- Use `Canon::DiffFormatter` directly in your code:
1263
-
1264
- [source,ruby]
1265
- ----
1266
- require 'canon/diff_formatter'
1267
- require 'canon/comparison'
1268
-
1269
- # Compare documents
1270
- comparison = Canon::Comparison.new(doc1, doc2)
1271
- result = comparison.compare
1272
-
1273
- # Format diff output
1274
- formatter = Canon::DiffFormatter.new(
1275
- use_color: true,
1276
- mode: :by_object,
1277
- context_lines: 5,
1278
- diff_grouping_lines: 10
1279
- )
1280
-
1281
- diff_output = formatter.format(result)
1282
- puts diff_output
1283
- ----
1284
-
1285
- ==== Parameter details
1286
-
1287
- ===== use_color
1288
-
1289
- Controls whether diff output includes ANSI color codes.
1290
-
1291
- * Type: Boolean
1292
- * Default: `true`
1293
- * Colors used:
1294
- ** Red: Deletions/removed content
1295
- ** Green: Additions/inserted content
1296
- ** Yellow: Modified content
1297
- ** Cyan: Element names and structure
1298
-
1299
- [source,ruby]
1300
- ----
1301
- # Disable colors for plain text output
1302
- Canon::RspecMatchers.use_color = false
1303
-
1304
- # CLI
1305
- $ canon diff file1.xml file2.xml --no-color --verbose
1306
- ----
1307
-
1308
- ===== diff_mode
1309
-
1310
- Determines the comparison and display strategy.
1311
-
1312
- * Type: Symbol (`:by_object` or `:by_line`)
1313
- * Default: `:by_line` for RSpec matchers, format-dependent for CLI/API
1314
- * Modes:
1315
- ** `:by_object` - Semantic tree-based comparison showing structural changes
1316
- ** `:by_line` - Line-by-line diff after canonicalization
1317
-
1318
- [source,ruby]
1319
- ----
1320
- # Use object-based diff for RSpec matchers
1321
- Canon::RspecMatchers.diff_mode = :by_object
1322
-
1323
- # CLI - XML uses by-object by default, force by-line
1324
- $ canon diff file1.xml file2.xml --by-line --verbose
1325
- ----
1326
-
1327
- ===== context_lines
1328
-
1329
- Number of unchanged lines to display around each change for context.
1330
-
1331
- * Type: Numeric
1332
- * Default: `3`
1333
- * Range: `0` to any positive integer
1334
- * Effect: Higher values show more surrounding context, lower values show only changes
1335
-
1336
- [source,ruby]
1337
- ----
1338
- # Show 5 lines of context around each change
1339
- Canon::RspecMatchers.context_lines = 5
1340
-
1341
- # CLI
1342
- $ canon diff file1.xml file2.xml --context-lines 5 --verbose
1343
-
1344
- # Ruby API
1345
- formatter = Canon::DiffFormatter.new(context_lines: 5)
1346
- ----
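To make the effect of `context_lines` concrete, the sketch below computes which line numbers would be displayed for a set of changed lines. It is a simplified model under the assumption that lines are numbered from 1; `visible_lines` is a hypothetical helper, not Canon's formatter logic.

```ruby
# Hypothetical helper for illustration only; not part of Canon's API.
# Returns the sorted line numbers to display: each changed line plus
# `context` unchanged lines on either side, clipped to the document.
def visible_lines(changed, total_lines, context)
  changed
    .flat_map { |n| ((n - context)..(n + context)).to_a }
    .select { |n| n.between?(1, total_lines) }
    .uniq
    .sort
end

p visible_lines([5], 20, 2)      # [3, 4, 5, 6, 7]
p visible_lines([1, 3], 4, 1)    # [1, 2, 3, 4]
p visible_lines([5], 20, 0)      # [5]
```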
1347
-
1348
- ===== diff_grouping_lines
1349
-
1350
- Maximum line distance between separate changes to group them into a single
1351
- context block.
1352
-
1353
- * Type: Numeric or `nil`
1354
- * Default: `nil` (no grouping)
1355
- * Effect: When set, changes within N lines of each other are grouped into
1356
- context blocks with a header showing the number of diffs in the block
1357
-
1358
- [source,ruby]
1359
- ----
1360
- # Group changes that are within 10 lines of each other
1361
- Canon::RspecMatchers.diff_grouping_lines = 10
1362
-
1363
- # CLI
1364
- $ canon diff file1.xml file2.xml --diff-grouping-lines 10 --verbose
1365
-
1366
- # Ruby API
1367
- formatter = Canon::DiffFormatter.new(diff_grouping_lines: 10)
1368
- ----
1369
-
1370
- .Example of grouped diff output
1371
- [example]
1372
- When `diff_grouping_lines` is set to `10`, changes close together are grouped:
1373
-
1374
- [source]
1375
- ----
1376
- Context block has 3 diffs (lines 5-18):
1377
- 5 - | <foreword id="fwd">
1378
- 5 + | <foreword displayorder="2" id="fwd">
1379
- 6 | <p>First paragraph</p>
1380
- ...
1381
- 15 - | <title>Scope</title>
1382
- 15 + | <title>Application Scope</title>
1383
- 16 | </clause>
1384
- 17 + | <p>New content</p>
1385
- 18 | </sections>
1386
- ----
1387
-
1388
- Without grouping, these would appear as separate diff sections.
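The grouping rule can be sketched as follows, assuming "within N lines" means that the gap between two consecutive changed lines is at most N. `group_changes` is a hypothetical helper and may not match Canon's exact boundary handling.

```ruby
# Hypothetical helper for illustration only; not part of Canon's API.
# Groups changed line numbers into blocks when consecutive changes are
# at most `max_distance` lines apart. With nil, every change stands alone.
def group_changes(changed_lines, max_distance)
  sorted = changed_lines.sort
  return sorted.map { |n| [n] } if max_distance.nil?

  sorted.slice_when { |a, b| b - a > max_distance }.to_a
end

p group_changes([5, 15, 17, 40], 10)   # [[5, 15, 17], [40]]
p group_changes([5, 15, 17, 40], nil)  # [[5], [15], [17], [40]]
```

With a distance of 10, the changes at lines 5, 15, and 17 fall into one block, much like the three-diff context block shown in the example output above.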
1389
-
1390
- ==== Enhanced diff output features
1391
-
1392
- Canon's diff formatter includes several enhancements designed to make diffs more
1393
- readable and informative, especially when working with RSpec test failures.
1394
-
1395
- ===== Color-coded line numbers and structure
1396
-
1397
- **Purpose**: Improve readability by distinguishing structural elements from
1398
- content changes.
1399
-
1400
- When color mode is enabled (`use_color: true`), the diff formatter uses a
1401
- consistent color scheme:
1402
-
1403
- * **Yellow**: Line numbers and pipe separators
1404
- * **Red**: Deletion markers (`-`) and removed content
1405
- * **Green**: Addition markers (`+`) and inserted content
1406
- * **Default terminal color**: Unchanged context lines (no ANSI codes applied)
1407
-
1408
- This color scheme helps differentiate between:
1409
-
1410
- * The diff structure (line numbers, pipes)
1411
- * Content that was removed (red)
1412
- * Content that was added (green)
1413
- * Content that stayed the same (your terminal's default color)
1414
-
1415
- .Example colored diff output
1416
- [example]
1417
- In a colored terminal, a typical diff line appears as:
1418
-
1419
- [source]
1420
- ----
1421
- 5| 5 | <p>First paragraph</p> # Context line (yellow numbers/pipes, default text)
1422
- 6| -| <old>Text</old> # Deletion (yellow numbers/pipes, red marker/content)
1423
- | 6+| <new>Text</new> # Addition (yellow numbers/pipes, green marker/content)
1424
- ----
1425
-
1426
- Where:
1427
-
1428
- * Line numbers (`5`, `6`) are in yellow
1429
- * Pipe separators (`|`) are in yellow
1430
- * Markers (`-`, `+`) are in red/green respectively
1431
- * Changed content is highlighted in red (deletions) or green (additions)
1432
- * Unchanged content uses your terminal's default color (no forced white/black)
1433
-
1434
- **Why this matters**: When running tests with RSpec, the framework initially sets
1435
- output to red. Canon's diff formatter explicitly resets colors to prevent RSpec's
1436
- red from bleeding into the diff output, ensuring consistent and readable diffs.
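The reset behaviour can be sketched with standard ANSI escape sequences. The constants and the `colorize` helper are illustrative assumptions, not Canon's internals; the point is only that each fragment starts with a reset so no inherited color leaks in.

```ruby
# Standard ANSI SGR escape codes.
RESET  = "\e[0m"
RED    = "\e[31m"
GREEN  = "\e[32m"
YELLOW = "\e[33m"

# Hypothetical helper for illustration only; not part of Canon's API.
# Emits a reset before applying the color so that a color set earlier
# (for example by a test runner) does not bleed into this fragment.
def colorize(text, color, use_color: true)
  return text unless use_color

  "#{RESET}#{color}#{text}#{RESET}"
end

line = colorize("6", YELLOW) + colorize("|", YELLOW) + colorize("- <old>Text</old>", RED)
puts line
puts colorize("- <old>Text</old>", RED, use_color: false)  # plain text
```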
1437
-
1438
- ===== Whitespace visualization
1439
-
1440
- **Purpose**: Make invisible whitespace and special characters visible in diffs.
1441
-
1442
- Whitespace changes can be difficult to spot in traditional diffs because spaces,
1443
- tabs, and other invisible characters don't appear in output. Canon visualizes
1444
- these changes using a comprehensive set of Unicode symbols that are safe for use
1445
- with CJK (Chinese, Japanese, Korean) text.
1446
-
1447
- **Visualization scope**: Character visualization is applied only to **diff lines**
1448
- (additions, deletions, and changes), not to context lines (unchanged lines). This
1449
- ensures that:
1450
-
1451
- * Context lines display content in its original form without substitution
1452
- * Only actual changes show visualization, making differences easier to spot
1453
- * Within changed lines showing token-level diffs, unchanged tokens are displayed
1454
- in the terminal's default color (not red/green) to distinguish them from actual
1455
- changes
1456
-
1457
- ====== Default character visualization map
1458
-
1459
- Canon provides a comprehensive CJK-safe character mapping for common non-visible
1460
- characters encountered in diffs:
1461
-
1462
- NOTE: These visualization symbols appear **only in diff lines** (additions,
1463
- deletions, and changes), not in context lines (unchanged lines).
1464
-
1465
- .Common whitespace characters
1466
- [cols="1,1,1,2"]
1467
- |===
1468
- |Character |Unicode |Symbol |Description
1469
-
1470
- |Regular space
1471
- |U+0020
1472
- |`░`
1473
- |Light Shade (U+2591)
1474
-
1475
- |Tab
1476
- |U+0009
1477
- |`⇥`
1478
- |Rightwards Arrow to Bar (U+21E5)
1479
-
1480
- |Non-breaking space
1481
- |U+00A0
1482
- |`␣`
1483
- |Open Box (U+2423)
1484
- |===
1485
-
1486
- .Line endings
1487
- [cols="1,1,1,2"]
1488
- |===
1489
- |Character |Unicode |Symbol |Description
1490
-
1491
- |Line feed (LF)
1492
- |U+000A
1493
- |`↵`
1494
- |Downwards Arrow with Corner Leftwards (U+21B5)
1495
-
1496
- |Carriage return (CR)
1497
- |U+000D
1498
- |`⏎`
1499
- |Return Symbol (U+23CE)
1500
-
1501
- |Windows line ending (CRLF)
1502
- |U+000D U+000A
1503
- |`↵`
1504
- |Downwards Arrow with Corner Leftwards (U+21B5)
1505
-
1506
- |Next line (NEL)
1507
- |U+0085
1508
- |`⏎`
1509
- |Return Symbol (U+23CE)
1510
-
1511
- |Line separator
1512
- |U+2028
1513
- |`⤓`
1514
- |Downwards Arrow to Bar (U+2913)
1515
-
1516
- |Paragraph separator
1517
- |U+2029
1518
- |`⤓`
1519
- |Downwards Arrow to Bar (U+2913)
1520
- |===
1521
-
1522
- .Unicode spaces (various widths)
1523
- [cols="1,1,1,2"]
1524
- |===
1525
- |Character |Unicode |Symbol |Description
1526
-
1527
- |En space
1528
- |U+2002
1529
- |`▭`
1530
- |White Rectangle (U+25AD)
1531
-
1532
- |Em space
1533
- |U+2003
1534
- |`▬`
1535
- |Black Rectangle (U+25AC)
1536
-
1537
- |Four-per-em space
1538
- |U+2005
1539
- |`⏓`
1540
- |Metrical Short Over Long (U+23D3)
1541
-
1542
- |Six-per-em space
1543
- |U+2006
1544
- |`⏕`
1545
- |Metrical Two Shorts Over Long (U+23D5)
1546
-
1547
- |Thin space
1548
- |U+2009
1549
- |`▯`
1550
- |White Vertical Rectangle (U+25AF)
1551
-
1552
- |Hair space
1553
- |U+200A
1554
- |`▮`
1555
- |Black Vertical Rectangle (U+25AE)
1556
-
1557
- |Figure space
1558
- |U+2007
1559
- |`□`
1560
- |White Square (U+25A1)
1561
-
1562
- |Narrow no-break space
1563
- |U+202F
1564
- |`▫`
1565
- |White Small Square (U+25AB)
1566
-
1567
- |Medium mathematical space
1568
- |U+205F
1569
- |`▭`
1570
- |White Rectangle (U+25AD)
1571
-
1572
- |Ideographic space
1573
- |U+3000
1574
- |`⎵`
1575
- |Bottom Square Bracket (U+23B5)
1576
-
1577
- |Ideographic half space
1578
- |U+303F
1579
- |`⏑`
1580
- |Metrical Breve (U+23D1)
1581
-
1582
- |===
1583
-
1584
- .Zero-width characters (invisible troublemakers)
1585
- [cols="1,1,1,2"]
1586
- |===
1587
- |Character |Unicode |Symbol |Description
1588
-
1589
- |Zero-width space
1590
- |U+200B
1591
- |`→`
1592
- |Rightwards Arrow (U+2192)
1593
-
1594
- |Zero-width non-joiner
1595
- |U+200C
1596
- |`↛`
1597
- |Rightwards Arrow with Stroke (U+219B)
1598
-
1599
- |Zero-width joiner
1600
- |U+200D
1601
- |`⇢`
1602
- |Rightwards Dashed Arrow (U+21E2)
1603
-
1604
- |Zero-width no-break space (BOM)
1605
- |U+FEFF
1606
- |`⇨`
1607
- |Rightwards White Arrow (U+21E8)
1608
- |===
1609
-
1610
- .Bidirectional/RTL markers
1611
- [cols="1,1,1,2"]
1612
- |===
1613
- |Character |Unicode |Symbol |Description
1614
-
1615
- |Left-to-right mark
1616
- |U+200E
1617
- |`⟹`
1618
- |Long Rightwards Double Arrow (U+27F9)
1619
-
1620
- |Right-to-left mark
1621
- |U+200F
1622
- |`⟸`
1623
- |Long Leftwards Double Arrow (U+27F8)
1624
-
1625
- |LTR embedding
1626
- |U+202A
1627
- |`⇒`
1628
- |Rightwards Double Arrow (U+21D2)
1629
-
1630
- |RTL embedding
1631
- |U+202B
1632
- |`⇐`
1633
- |Leftwards Double Arrow (U+21D0)
1634
-
1635
- |Pop directional formatting
1636
- |U+202C
1637
- |`↔`
1638
- |Left Right Arrow (U+2194)
1639
-
1640
- |LTR override
1641
- |U+202D
1642
- |`⇉`
1643
- |Rightwards Paired Arrows (U+21C9)
1644
-
1645
- |RTL override
1646
- |U+202E
1647
- |`⇇`
1648
- |Leftwards Paired Arrows (U+21C7)
1649
- |===
1650
-
1651
- .Control characters
1652
- [cols="1,1,1,2"]
1653
- |===
1654
- |Character |Unicode |Symbol |Description
1655
-
1656
- |Null
1657
- |U+0000
1658
- |`␀`
1659
- |Symbol for Null (U+2400)
1660
-
1661
- |Soft hyphen
1662
- |U+00AD
1663
- |`‐`
1664
- |Hyphen (U+2010)
1665
-
1666
- |Backspace
1667
- |U+0008
1668
- |`␈`
1669
- |Symbol for Backspace (U+2408)
1670
-
1671
- |Delete
1672
- |U+007F
1673
- |`␡`
1674
- |Symbol for Delete (U+2421)
1675
- |===
1676
-
1677
- ====== CJK safety
1678
-
1679
- The visualization characters are specifically chosen to avoid conflicts with CJK
1680
- text:
1681
-
1682
- * **No middle dots** (`·`) - commonly used as separators in CJK
1683
- * **No bullets** (`∙`) - used in CJK lists
1684
- * **No circles** (`◌◍◎`) - look similar to CJK characters like ○ ●
1685
- * **No small dots** (`⋅`) - conflict with CJK punctuation
1686
-
1687
- Instead, Canon uses:
1688
- * Box characters (`□▭▬▯▮▫`) for various space types
1689
- * Arrow symbols (`→↛⇢⇨⟹⟸⇒⇐`) for zero-width and directional characters
1690
- * Control Pictures block symbols (`␀␈␡`) for control characters
1691
-
1692
- ====== Customizing character visualization
1693
-
1694
- You can customize the character visualization map for your specific needs:
1695
-
1696
- [source,ruby]
1697
- ----
1698
- require 'canon/diff_formatter'
1699
-
1700
- # Create custom visualization map
1701
- custom_map = Canon::DiffFormatter.merge_visualization_map({
1702
- ' ' => '·', # Use middle dot for spaces (if not using CJK)
1703
- "\t" => '→', # Use simple arrow for tabs
1704
- "\u200B" => '⚠' # Warning symbol for zero-width space
1705
- })
1706
-
1707
- # Use custom map with formatter
1708
- formatter = Canon::DiffFormatter.new(
1709
- use_color: true,
1710
- visualization_map: custom_map
1711
- )
1712
-
1713
- # The custom map merges with defaults, so unspecified
1714
- # characters still use the default visualization
1715
- ----
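The merge-then-substitute idea behind a visualization map can be modelled with a plain hash. The sketch below is independent of Canon: `DEFAULT_MAP` contains only four entries taken from the tables above, and `visualize` is a hypothetical helper.

```ruby
# A small subset of the default symbols listed in the tables above.
DEFAULT_MAP = {
  " "      => "░",  # regular space
  "\t"     => "⇥",  # tab
  "\u00A0" => "␣",  # non-breaking space
  "\u200B" => "→"   # zero-width space
}.freeze

# Hypothetical helper for illustration only; not part of Canon's API.
# Replaces every mapped character with its visible symbol.
def visualize(text, map = DEFAULT_MAP)
  text.gsub(Regexp.union(map.keys)) { |char| map.fetch(char) }
end

# Custom entries override defaults; unspecified characters keep theirs.
custom_map = DEFAULT_MAP.merge(" " => "·")

puts visualize("a b\tc")              # a░b⇥c
puts visualize("a b\tc", custom_map)  # a·b⇥c
```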
1716
-
1717
- ====== Visualization in action
1718
-
1719
- .Whitespace visualization examples
1720
- [example]
1721
- [source]
1722
- ----
1723
- # Space added before the text content
1724
- 10| -| <tag>Value</tag> # No space
1725
- | 10+| <tag>░Value</tag> # Space added (green light shade)
1726
-
1727
- # Tab character
1728
- 15| -| <tag>⇥Value</tag> # Tab (red arrow-to-bar)
1729
- | 15+| <tag>░░Value</tag> # Two spaces (green light shades)
1730
-
1731
- # Non-breaking space (U+00A0)
1732
- 20| -| <tag>Value</tag> # Regular space
1733
- | 20+| <tag>Value␣</tag> # Non-breaking space (green open box)
1734
-
1735
- # Zero-width space (U+200B)
1736
- 25| -| <word1><word2> # No zero-width space
1737
- | 25+| <word1>→<word2> # Zero-width space (green arrow)
1738
-
1739
- # Mixed invisible characters
1740
- 30| -| <p>Text▬more</p> # Em space (red black rectangle)
1741
- | 30+| <p>Text░more</p> # Regular space (green light shade)
1742
- ----
1743
-
1744
- Where visualization symbols appear in:
1745
-
1746
- * Red when showing removed/deleted characters
1747
- * Green when showing added/inserted characters
1748
- * Bold to make them more visible
1749
-
1750
- **When is this useful?**
1751
-
1752
- 1. **Test failures due to formatting**: Your test expects compact XML but receives
1753
- pretty-printed XML with different indentation
1754
-
1755
- 2. **Mixed whitespace**: Some parts of your code use tabs while others use spaces
1756
-
1757
- 3. **Non-breaking spaces**: Copy-pasted content from browsers often contains
1758
- U+00A0 instead of regular spaces
1759
-
1760
- 4. **Zero-width characters**: Invisible Unicode characters that cause mysterious
1761
- comparison failures
1762
-
1763
- 5. **RTL/LTR markers**: Bidirectional text markers in internationalized content
1764
-
1765
- 6. **Template differences**: Generated output has invisible character differences
1766
-
1767
- .Real-world example: Non-breaking space from web copy-paste
1768
- [example]
1769
- Without whitespace visualization, these two lines look identical:
1770
-
1771
- [source,xml]
1772
- ----
1773
- <foreword id="fwd">
1774
- <foreword id="fwd">
1775
- ----
1776
-
1777
- With whitespace visualization enabled, the difference is immediately visible:
1778
-
1779
- [source]
1780
- ----
1781
- 4| -| <foreword░id="fwd"> # Regular space (U+0020)
1782
- | 4+| <foreword␣id="fwd"> # Non-breaking space (U+00A0)
1783
- ----
1784
-
1785
- The different symbols (`░` vs `␣`) clearly show that one uses a regular space
1786
- while the other uses a non-breaking space, likely from copying text from a web
1787
- page or word processor.
1788
-
1789
- .Real-world example: Zero-width characters
1790
- [example]
1791
- Zero-width characters are completely invisible but affect comparison:
1792
-
1793
- [source,xml]
1794
- ----
1795
- <item>Widget</item>
1796
- <item>Widget</item> <!-- Contains U+200B zero-width space after "Widget" -->
1797
- ----
1798
-
1799
- The diff shows:
1800
-
1801
- [source]
1802
- ----
1803
- 5| -| <item>Widget</item>
1804
- | 5+| <item>Widget→</item> # Zero-width space visualized as →
1805
- ----
1806
-
1807
- The rightwards arrow (`→`) reveals the presence of a zero-width space that would
1808
- otherwise be impossible to detect.
1809
-
1810
- ===== Non-ASCII character detection
1811
-
1812
- **Purpose**: Alert users when diffs contain non-ASCII characters that might cause
1813
- unexpected comparison failures or encoding issues.
1814
-
1815
- When Canon detects non-ASCII characters (any character with Unicode codepoint >
1816
- U+007F) in a diff block, it displays a yellow warning with the specific
1817
- characters and their Unicode codepoints.
1818
-
1819
- .Non-ASCII warning format
1820
- [example]
1821
- [source]
1822
- ----
1823
- Context block has 3 diffs (lines 10-25):
1824
- (WARNING: non-ASCII characters detected in diff: [' ' (U+00A0, shown as: ␣), '—' (U+2014, shown as: —)])
1825
-
1826
- 10| -| <p>Hello░world</p>
1827
- | 10+| <p>Hello␣world</p> # Contains non-breaking space (U+00A0)
1828
- 15| -| <p>Text - more text</p>
1829
- | 15+| <p>Text — more text</p> # Contains em dash (U+2014)
1830
- ----
1831
-
1832
- The warning appears immediately after the "Context block has X diffs" header.
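Detecting the characters that trigger such a warning is straightforward. The following sketch collects every character above U+007F and formats its codepoint; `non_ascii_report` is a hypothetical helper and the output format is only loosely modelled on the warning shown above.

```ruby
# Hypothetical helper for illustration only; not part of Canon's API.
# Returns one entry per distinct non-ASCII character, in order of
# first appearance, with its Unicode codepoint.
def non_ascii_report(text)
  text.each_char
      .select { |char| char.ord > 0x7F }
      .uniq
      .map { |char| format("'%s' (U+%04X)", char, char.ord) }
end

report = non_ascii_report("<p>Hello\u00A0world \u2014 more\u00A0text</p>")
puts "WARNING: non-ASCII characters detected: [#{report.join(', ')}]" unless report.empty?
```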
1833
-
1834
- **Common non-ASCII characters in diffs**:
1835
-
1836
- |===
1837
- |Character |Unicode |Name |Common source
1838
-
1839
- |` ` (looks like space)
1840
- |U+00A0
1841
- |Non-breaking space
1842
- |Copy-paste from web browsers, word processors
1843
-
1844
- |`—`
1845
- |U+2014
1846
- |Em dash
1847
- |Word processors, smart quotes enabled
1848
-
1849
- |`–`
1850
- |U+2013
1851
- |En dash
1852
- |Word processors, smart quotes enabled
1853
-
1854
- |`‘` `’`
1855
- |U+2018, U+2019
1856
- |Smart single quotes
1857
- |Word processors, text editors with smart quotes
1858
-
1859
- |`“` `”`
1860
- |U+201C, U+201D
1861
- |Smart double quotes
1862
- |Word processors, text editors with smart quotes
1863
-
1864
- |`…`
1865
- |U+2026
1866
- |Ellipsis
1867
- |Word processors
1868
-
1869
- |Various
1870
- |U+2000-U+200B
1871
- |Various spaces
1872
- |HTML entities, special formatting
1873
- |===
1874
-
1875
- **Why this matters**:
1876
-
1877
- 1. **Invisible differences**: Many non-ASCII characters look identical to their
1878
- ASCII equivalents but cause comparison failures
1879
-
1880
- 2. **Encoding issues**: Non-ASCII characters may behave differently across
1881
- systems with different encodings
1882
-
1883
- 3. **Copy-paste errors**: Content copied from browsers or documents often
1884
- includes non-breaking spaces instead of regular spaces
1885
-
1886
- 4. **Smart quotes**: Text editors may automatically convert straight quotes to
1887
- curly quotes
1888
-
1889
- .Practical example
1890
- [example]
1891
- A test fails because the expected output was copied from a web page:
1892
-
1893
- [source,ruby]
1894
- ----
1895
- # Expected (copied from documentation website - contains U+00A0)
1896
- expected = '<p>Hello world</p>' # Space between "Hello" and "world" is U+00A0
1897
-
1898
- # Actual (generated by code - contains regular space)
1899
- actual = '<p>Hello world</p>' # Space is U+0020
1900
-
1901
- expect(actual).to be_xml_equivalent_to(expected)
1902
- # FAILS: Documents appear identical but contain different space characters
1903
- ----
1904
-
1905
- Canon's diff output shows:
1906
-
1907
- [source]
1908
- ----
1909
- Context block has 1 diff (line 1):
1910
- (WARNING: non-ASCII characters detected in diff: [' ' (U+00A0)])
1911
-
1912
- 1| -| <p>Hello░world</p> # U+0020 (regular space, shown as ░)
1912
- | 1+| <p>Hello␣world</p> # U+00A0 (non-breaking space, shown as ␣)
1913
- ----
1914
-
1915
- The warning alerts you to check for non-breaking spaces, and the differing
1916
- symbols (`░` vs `␣`) show where the difference occurs.
1918
-
1919
- ===== Configuration and usage
1920
-
1921
- All enhanced diff features are enabled by default when `use_color` is `true` and
1922
- automatically applied across all Canon interfaces:
1923
-
1924
- [source,ruby]
1925
- ----
1926
- # RSpec matchers (automatically enabled)
1927
- expect(xml1).to be_xml_equivalent_to(xml2)
1928
- # Output includes: colored line numbers, whitespace visualization, non-ASCII warnings
1929
-
1930
- # CLI (enabled by default)
1931
- $ canon diff file1.xml file2.xml --verbose
1932
- # Output includes all enhanced features
1933
-
1934
- # Ruby API (controlled by use_color parameter)
1935
- formatter = Canon::DiffFormatter.new(use_color: true) # Enhanced features enabled
1936
- formatter = Canon::DiffFormatter.new(use_color: false) # Plain text only
1937
- ----
1938
-
1939
- To disable colored output (and all color-dependent enhancements):
1940
-
1941
- [source,ruby]
1942
- ----
1943
- # RSpec
1944
- Canon::RspecMatchers.use_color = false
1945
-
1946
- # CLI
1947
- $ canon diff file1.xml file2.xml --no-color --verbose
1948
-
1949
- # Ruby API
1950
- formatter = Canon::DiffFormatter.new(use_color: false)
1951
- ----
1952
-
1953
- When `use_color` is `false`:
1954
-
1955
- * Line numbers and pipes are plain text
1956
- * Whitespace is not visualized (remains invisible)
1957
- * Non-ASCII warnings are still shown (but without yellow color)
1958
- * Content changes are shown without color highlighting
1959
-
1960
- === Input validation
1961
-
1962
- Canon provides comprehensive input validation for all supported formats (XML,
1963
- HTML, JSON, YAML). When malformed input is detected, Canon raises a
1964
- `Canon::ValidationError` with detailed location information to help you quickly
1965
- identify and fix the problem.
1966
-
1967
- ==== Purpose
1968
-
1969
- Input validation ensures that:
1970
-
1971
- * Malformed documents are detected early with clear error messages
1972
- * Syntax errors show exact line and column numbers
1973
- * Error details appear in RSpec test output (not hidden in log files)
1974
- * Users receive actionable feedback about what's wrong and where
1975
-
1976
- ==== How it works
1977
-
1978
- Canon validates input **before parsing** using format-specific validators:
1979
-
1980
- * `Canon::Validators::XmlValidator` - Strict XML syntax validation
1981
- * `Canon::Validators::HtmlValidator` - HTML5 and XHTML validation
1982
- * `Canon::Validators::JsonValidator` - JSON syntax validation
1983
- * `Canon::Validators::YamlValidator` - YAML syntax validation
1984
-
1985
- Validation happens automatically when you use Canon's formatters or comparison
1986
- methods.
1987
-
1988
- ==== Validation error format
1989
-
1990
- When validation fails, Canon raises `Canon::ValidationError` with:
1991
-
1992
- * `format` - The format being validated (`:xml`, `:html`, `:json`, `:yaml`)
1993
- * `line` - Line number where the error occurred (if available)
1994
- * `column` - Column number where the error occurred (if available)
1995
- * `details` - Additional context about the error
1996
-
1997
- .Validation error example
1998
- [example]
1999
- [source,ruby]
2000
- ----
2001
- require 'canon'
2002
-
2003
- malformed_xml = '<root><unclosed>'
2004
-
2005
- begin
2006
- Canon.format(malformed_xml, :xml)
2007
- rescue Canon::ValidationError => e
2008
- puts e.message
2009
- # XML Validation Error: Premature end of data in tag unclosed line 1
2010
- # Line: 1
2011
- # Column: 18
2012
-
2013
- puts "Format: #{e.format}" # => :xml
2014
- puts "Line: #{e.line}" # => 1
2015
- puts "Column: #{e.column}" # => 18
2016
- end
2017
- ----
2018
-
2019
- ==== Format-specific validation
2020
-
2021
- ===== XML validation
2022
-
2023
- Uses Nokogiri's strict XML parsing to detect:
2024
-
2025
- * Unclosed tags
2026
- * Mismatched tags
2027
- * Invalid XML declaration
2028
- * Malformed attributes
2029
- * Invalid character references
2030
-
2031
- .XML validation examples
2032
- [example]
2033
- [source,ruby]
2034
- ----
2035
- # Unclosed tag
2036
- Canon.format('<root><item>', :xml)
2037
- # => Canon::ValidationError: XML Validation Error: Premature end of data in tag item line 1
2038
- # Line: 1
2039
-
2040
- # Mismatched tags
2041
- Canon.format('<root><item></root>', :xml)
2042
- # => Canon::ValidationError: XML Validation Error: Opening and ending tag mismatch: item line 1 and root
2043
- # Line: 1
2044
- ----
2045
-
2046
- ===== HTML validation
2047
-
2048
- Automatically detects HTML5 vs XHTML and applies appropriate validation:
2049
-
2050
- * HTML5: Uses Nokogiri::HTML5 parser with error filtering
2051
- * XHTML: Uses strict XML parsing
2052
-
2053
- Special handling:
2054
-
2055
- * Strips XML declarations from HTML (common in legacy HTML files)
2056
- * Filters out non-critical HTML5 parser warnings
2057
- * Only reports significant errors (level 2+)
2058
-
2059
- .HTML validation examples
2060
- [example]
2061
- [source,ruby]
2062
- ----
2063
- # Malformed XHTML
2064
- xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Unclosed'
2065
- Canon.format(xhtml, :html)
2066
- # => Canon::ValidationError: HTML Validation Error: Premature end of data in tag p line 1
2067
- # Line: 1
2068
-
2069
- # HTML5 with errors
2070
- html5 = '<div><span></div>'
2071
- Canon.format(html5, :html)
2072
- # => Canon::ValidationError: HTML Validation Error: Unexpected end tag : span
2073
- # Line: 1
2074
- ----
2075
-
2076
- ===== JSON validation
2077
-
2078
- Validates JSON syntax using Ruby's JSON parser:
2079
-
2080
- * Missing/extra braces or brackets
2081
- * Trailing commas
2082
- * Invalid escape sequences
2083
- * Invalid numbers
2084
-
2085
- Provides context showing the error location in the JSON structure.
2086
-
2087
- .JSON validation examples
2088
- [example]
2089
- [source,ruby]
2090
- ----
2091
- # Missing closing brace
2092
- Canon.format('{"key": "value"', :json)
2093
- # => Canon::ValidationError: JSON Validation Error: unexpected token at '{"key": "value"'
2094
- # Details: Error at position 16
2095
-
2096
- # Trailing comma (invalid in JSON)
2097
- Canon.format('{"a": 1,}', :json)
2098
- # => Canon::ValidationError: JSON Validation Error: unexpected token at '{"a": 1,}'
2099
- # Details: Error at position 8
2100
- ----
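As a rough illustration of the JSON checks described above, the sketch below wraps Ruby's standard `JSON.parse` and re-raises parser failures as a format-tagged error. `SketchValidationError` and `validate_json!` are hypothetical names used only for this example; they are not Canon's `Canon::ValidationError` or `Canon::Validators::JsonValidator`, whose internals are not shown in this document.

```ruby
require 'json'

# Hypothetical error class for illustration; not Canon's own ValidationError.
class SketchValidationError < StandardError
  attr_reader :format, :details

  def initialize(message, format:, details: nil)
    super(message)
    @format = format
    @details = details
  end
end

# Parse the input only to check its syntax, and re-raise parser failures
# as a format-tagged error (assumed behaviour, modelled on the text above).
def validate_json!(input)
  JSON.parse(input)
  true
rescue JSON::ParserError => e
  raise SketchValidationError.new(
    "JSON Validation Error: #{e.message}",
    format: :json,
    details: e.class.name
  )
end
```

Calling `validate_json!('{"key": "value"')` raises the sketch error with `format` set to `:json`. The exact parser message depends on the installed `json` version.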
2101
-
2102
- ===== YAML validation
2103
-
2104
- Validates YAML syntax using Psych (Ruby's YAML parser):
2105
-
2106
- * Invalid indentation
2107
- * Unclosed brackets/braces
2108
- * Invalid anchors/aliases
2109
- * Type mismatches
2110
-
2111
- Shows error location with line numbers and context.
2112
-
2113
- .YAML validation examples
2114
- [example]
2115
- [source,ruby]
2116
- ----
2117
- # Unclosed bracket
2118
- Canon.format("key: {unclosed", :yaml)
2119
- # => Canon::ValidationError: YAML Validation Error: (<unknown>): did not find expected node content...
2120
- # Line: 1
2121
- # Details: Shows context around error
2122
-
2123
- # Invalid indentation
2124
- yaml = <<~YAML
2125
- parent:
2126
- child: value
2127
- YAML
2128
- Canon.format(yaml, :yaml)
2129
- # => Canon::ValidationError: YAML Validation Error: mapping values are not allowed in this context
2130
- # Line: 2
2131
- ----
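The line information in the YAML errors above can be obtained from Psych itself. The following is a minimal sketch, assuming only Ruby's bundled `yaml` library. `yaml_syntax_error` is a hypothetical helper for this example and is not part of Canon.

```ruby
require 'yaml'

# Returns nil for well-formed YAML, or a hash describing the first
# syntax error that Psych reports.
def yaml_syntax_error(input)
  Psych.parse(input) # builds the syntax tree only; no Ruby objects are loaded
  nil
rescue Psych::SyntaxError => e
  { line: e.line, column: e.column, problem: e.problem }
end
```

For well-formed input such as `"a: 1\nb: 2"` the helper returns `nil`. For `"key: {unclosed"` it returns a hash whose `:line` and `:column` come straight from `Psych::SyntaxError`.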
2132
-
2133
- ==== Validation in RSpec tests
2134
-
2135
- Canon's RSpec matchers automatically propagate validation errors to test output,
2136
- making it easy to see what's wrong:
2137
-
2138
- .RSpec validation error example
2139
- [example]
2140
- [source,ruby]
2141
- ----
2142
- require 'canon/rspec_matchers'
2143
-
2144
- RSpec.describe 'XML validation' do
2145
- it 'validates input' do
2146
- malformed_xml = '<root><unclosed>'
2147
- expected_xml = '<root><item/></root>'
2148
-
2149
- # This will fail with a clear validation error message
2150
- expect(malformed_xml).to be_xml_equivalent_to(expected_xml)
2151
- end
2152
- end
2153
-
2154
- # Test output shows:
2155
- # Canon::ValidationError:
2156
- # XML Validation Error: Premature end of data in tag unclosed line 1
2157
- # Line: 1
2158
- # Column: 18
2159
- ----
2160
-
2161
- The error appears directly in the RSpec output, not hidden in separate error
2162
- files or logs.
2163
-
2164
- ==== Validation in comparison
2165
-
2166
- Validation also occurs when using `Canon::Comparison.equivalent?`:
2167
-
2168
- .Comparison validation example
2169
- [example]
2170
- [source,ruby]
2171
- ----
2172
- require 'canon/comparison'
2173
-
2174
- xml1 = '<root><item/></root>'
2175
- xml2 = '<root><unclosed>'
2176
-
2177
- Canon::Comparison.equivalent?(xml1, xml2)
2178
- # => Canon::ValidationError: XML Validation Error: Premature end of data in tag unclosed line 1
2179
- # Line: 1
2180
- # Column: 18
2181
- ----
2182
-
2183
- ==== Benefits
2184
-
2185
- Input validation provides several key benefits:
2186
-
2187
- **Early error detection**:: Problems are caught before processing begins, saving
2188
- time and providing clear feedback
2189
-
2190
- **Precise error location**:: Line and column numbers pinpoint exactly where the
2191
- problem is, especially useful in large documents
2192
-
2193
- **Clear error messages**:: Descriptive messages explain what's wrong and often
2194
- suggest how to fix it
2195
-
2196
- **Test-friendly**:: Errors appear in RSpec output where developers expect them,
2197
- not in separate log files
2198
-
2199
- **Format-aware**:: Each validator understands format-specific rules and provides
2200
- relevant error details
2201
-
2202
-
2203
- === RSpec matchers
2204
-
2205
- Canon provides RSpec matchers for testing equivalence between serialized formats. All matchers
2206
- use canonical (c14n) mode for comparison.
2207
-
2208
- See <<Diff formatting configuration>> for details on configuring diff output
2209
- in RSpec matchers.
2210
-
2211
- .RSpec matcher examples
2212
- [example]
2213
- ====
2214
- [source,ruby]
2215
- ----
2216
- require 'rspec'
2217
- require 'canon'
2218
-
2219
- RSpec.describe 'Serialization equivalence' do
2220
- # Unified matcher with format parameter
2221
- it 'compares XML' do
2222
- xml1 = '<root><a>1</a><b>2</b></root>'
2223
- xml2 = '<root> <b>2</b> <a>1</a> </root>'
2224
- expect(xml1).to be_serialization_equivalent_to(xml2, format: :xml)
2225
- end
2226
-
2227
- it 'compares HTML' do
2228
- html1 = '<div><p>Hello</p></div>'
2229
- html2 = '<div> <p> Hello </p> </div>'
2230
- expect(html1).to be_serialization_equivalent_to(html2, format: :html)
2231
- end
2232
-
2233
- it 'compares JSON' do
2234
- json1 = '{"a":1,"b":2}'
2235
- json2 = '{"b":2,"a":1}'
2236
- expect(json1).to be_serialization_equivalent_to(json2, format: :json)
2237
- end
2238
-
2239
- it 'compares YAML' do
2240
- yaml1 = "a: 1\nb: 2"
2241
- yaml2 = "b: 2\na: 1"
2242
- expect(yaml1).to be_serialization_equivalent_to(yaml2, format: :yaml)
2243
- end
2244
-
2245
- # Format-specific matchers (assumes xml1, xml2, etc. are in scope, e.g. defined with `let`)
2246
- it 'uses format-specific matchers' do
2247
- expect(xml1).to be_xml_equivalent_to(xml2) # XML
2248
- expect(xml1).to be_analogous_with(xml2) # XML (legacy)
2249
- expect(html1).to be_html_equivalent_to(html2) # HTML
2250
- expect(json1).to be_json_equivalent_to(json2) # JSON
2251
- expect(yaml1).to be_yaml_equivalent_to(yaml2) # YAML
2252
- end
2253
- end
2254
- ----
2255
- ====
2256
-
2257
- [IMPORTANT]
2258
- ====
2259
- RSpec matchers always canonicalize both sides before comparing, so:
2260
-
2261
- * Formatting differences (whitespace, indentation) are ignored
2262
- * Attribute order in XML/HTML is normalized
2263
- * Key order in JSON/YAML is normalized
2264
- * Tests focus on content equality, not formatting
2265
- ====
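To make the key-order point concrete, here is a small sketch of canonicalizing JSON before comparison, using only Ruby's standard library. `sort_keys` and `json_equivalent?` are hypothetical helpers for this example. Canon's actual canonicalization is not shown in this document and may differ.

```ruby
require 'json'

# Recursively sort hash keys so that key order no longer affects the
# serialized form.
def sort_keys(value)
  case value
  when Hash
    value.sort_by { |k, _| k.to_s }.to_h { |k, v| [k, sort_keys(v)] }
  when Array
    value.map { |v| sort_keys(v) }
  else
    value
  end
end

# Two JSON strings are treated as equivalent when their key-sorted,
# compactly regenerated forms are identical.
def json_equivalent?(a, b)
  JSON.generate(sort_keys(JSON.parse(a))) == JSON.generate(sort_keys(JSON.parse(b)))
end
```

Ruby's `Hash#==` already ignores key order, so comparing the parsed hashes directly would give the same answer. Sorting is shown because it produces a stable serialized form that can also be diffed.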
2266
-
2267
-
2268
- == Command-line interface
2269
-
2270
- === Installation
2271
-
2272
- Installing the gem makes the `canon` command available:
2273
-
2274
- [source,bash]
2275
- ----
2276
- $ gem install canon
2277
- $ canon --help
2278
- ----
2279
-
2280
- === Format command
2281
-
2282
- The `format` command formats files in XML, HTML, JSON, or YAML.
2283
-
2284
- ==== Output modes
2285
-
2286
- `pretty` (default):: Human-readable output with indentation (2 spaces default)
2287
- `c14n`:: Canonical form without indentation
2288
-
2289
- ==== Command syntax
2290
-
2291
- [source,bash]
2292
- ----
2293
- canon format FILE [OPTIONS]
2294
- ----
2295
-
2296
- ==== Options
2297
-
2298
- `-f, --format FORMAT`:: Specify format: `xml`, `html`, `json`, or `yaml`
2299
- (auto-detected from extension if not specified)
2300
-
2301
- `-m, --mode MODE`:: Output mode: `pretty` (default) or `c14n`
2302
-
2303
- `-i, --indent N`:: Indentation spaces for pretty mode (default: 2)
2304
-
2305
- `--indent-type TYPE`:: Indentation type: `space` (default) or `tab`
2306
-
2307
- `-o, --output FILE`:: Write output to file instead of stdout
2308
-
2309
- `-c, --with-comments`:: Include comments in canonical XML output
2310
-
2311
- ==== Examples
2312
-
2313
- [source,bash]
2314
- ----
2315
- # Pretty-print (default mode)
2316
- $ canon format input.xml
2317
- <?xml version="1.0" encoding="UTF-8"?>
2318
- <root>
2319
- <a>1</a>
2320
- <b>2</b>
2321
- </root>
2322
-
2323
- # Canonical mode (compact)
2324
- $ canon format input.xml --mode c14n
2325
- <root><a>1</a><b>2</b></root>
2326
-
2327
- # Custom indentation
2328
- $ canon format input.xml --mode pretty --indent 4
2329
- $ canon format input.json --indent 4
2330
-
2331
- # Tab indentation
2332
- $ canon format input.xml --indent-type tab
2333
- $ canon format input.html --mode pretty --indent-type tab
2334
-
2335
- # Specify format explicitly
2336
- $ canon format data.txt --format xml
2337
-
2338
- # Save to file
2339
- $ canon format input.xml --output formatted.xml
2340
-
2341
- # Include XML comments in canonical output
2342
- $ canon format doc.xml --mode c14n --with-comments
2343
-
2344
- # HTML files
2345
- $ canon format page.html
2346
- $ canon format page.html --mode c14n
2347
- ----
2348
-
2349
- ==== Format detection
2350
-
2351
- [cols="1,1"]
2352
- |===
2353
- |File Extension |Detected Format
2354
-
2355
- |`.xml`
2356
- |XML
2357
-
2358
- |`.html`, `.htm`
2359
- |HTML
2360
-
2361
- |`.json`
2362
- |JSON
2363
-
2364
- |`.yaml`, `.yml`
2365
- |YAML
2366
- |===
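The extension table above amounts to a simple lookup. The sketch below shows one way to express it. `EXTENSION_FORMATS` and `detect_format` are hypothetical names, and the case-insensitive match is an assumption of this example rather than documented Canon behaviour.

```ruby
# Extension-to-format mapping taken from the table above.
EXTENSION_FORMATS = {
  '.xml'  => :xml,
  '.html' => :html,
  '.htm'  => :html,
  '.json' => :json,
  '.yaml' => :yaml,
  '.yml'  => :yaml
}.freeze

# Returns the format symbol for a path, or nil when the extension is
# unknown (in which case --format would have to be given explicitly).
def detect_format(path)
  EXTENSION_FORMATS[File.extname(path).downcase]
end
```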
2367
-
2368
- === Diff command
2369
-
2370
- Compare two files using **semantic comparison** that understands the structure of
2371
- XML, HTML, JSON, and YAML formats. Unlike traditional text-based diff tools,
2372
- `canon diff` compares the meaning and structure of your data, not just the
2373
- characters.
2374
-
2375
- ==== Command syntax
2376
-
2377
- [source,bash]
2378
- ----
2379
- canon diff FILE1 FILE2 [OPTIONS]
2380
- ----
2381
-
2382
- ==== Diff modes
2383
-
2384
- Canon supports two diff modes optimized for different use cases:
2385
-
2386
- ===== by-object mode (default for JSON/YAML)
2387
-
2388
- Compares files **semantically** by their data structure and displays differences
2389
- as a visual tree showing what changed in the structure.
2390
-
2391
- Best for::
2392
- * Configuration files where you care about what values changed
2393
- * API responses where structure matters
2394
- * Comparing semantic equivalence across formats
2395
-
2396
- Features::
2397
- * Tree visualization with box-drawing characters
2398
- * Shows only what changed (additions, removals, modifications)
2399
- * Ignores formatting differences automatically
2400
- * Color-coded output (red=removed, green=added, yellow=changed)
2401
-
2402
- ===== by-line mode (default for HTML, optional for XML)
2403
-
2404
- Compares files **line-by-line** after canonicalization, showing traditional
2405
- diff-style output.
2406
-
2407
- Best for::
2408
- * HTML markup where line-level changes matter
2409
- * Reviewing exact textual differences
2410
- * When you need to see the full document context
2411
-
2412
- Features::
2413
- * Traditional diff format with line numbers
2414
- * Shows before/after for each change
2415
- * Better for understanding markup structure changes
2416
-
2417
- [NOTE]
2418
- * JSON and YAML always use **by-object** mode
2419
- * HTML always uses **by-line** mode
2420
- * XML uses **by-object** mode by default, but can use **by-line** with `--by-line`
2421
-
2422
- ==== Options
2423
-
2424
- ===== Format options
2425
-
2426
- `-f, --format FORMAT`:: Format for both files: `xml`, `html`, `json`, or `yaml`
2427
- (auto-detected from extension if not specified)
2428
-
2429
- `--format1 FORMAT`:: Format for first file (when comparing different formats)
2430
-
2431
- `--format2 FORMAT`:: Format for second file (when comparing different formats)
2432
-
2433
- ===== Comparison options
2434
-
2435
- `-v, --verbose`:: Show detailed differences in tree format (default: just show
2436
- if files differ)
2437
-
2438
- `--by-line`:: Use line-by-line diff for XML (default: by-object mode)
2439
-
2440
- `--collapse-whitespace` / `--no-collapse-whitespace`:: Control whitespace
2441
- normalization in text nodes (default: collapse)
2442
-
2443
- `--ignore-attr-order` / `--no-ignore-attr-order`:: Control whether attribute/key
2444
- ordering matters (default: ignore order)
2445
-
2446
- `--ignore-comments`:: Ignore XML/HTML comments during comparison (overrides
2447
- `--with-comments`)
2448
-
2449
- `--ignore-text-nodes`:: Ignore all text node content, only compare structure
2450
-
2451
- `-c, --with-comments`:: Include comments in comparison (sets `ignore_comments: false`)
2452
-
2453
- ===== Output options
2454
-
2455
- `--color` / `--no-color`:: Enable/disable colored output (default: enabled)
2456
-
2457
- ==== Examples
2458
-
2459
- ===== Basic comparison
2460
-
2461
- [source,bash]
2462
- ----
2463
- # Compare two JSON files (shows if equivalent or different)
2464
- $ canon diff config1.json config2.json
2465
- Files are semantically different
2466
-
2467
- # Compare two XML files
2468
- $ canon diff file1.xml file2.xml
2469
- ✅ Files are semantically equivalent
2470
- ----
2471
-
2472
- ===== Verbose mode examples
2473
-
2474
- ====== JSON comparison (by-object mode)
2475
-
2476
- [example]
2477
- Given these two JSON files:
2478
-
2479
- .config1.json
2480
- [source,json]
2481
- ----
2482
- {
2483
- "name": "myapp",
2484
- "version": "1.0.0",
2485
- "settings": {
2486
- "debug": true,
2487
- "port": 8080
2488
- }
2489
- }
2490
- ----
2491
-
2492
- .config2.json
2493
- [source,json]
2494
- ----
2495
- {
2496
- "version": "2.0.0",
2497
- "name": "myapp",
2498
- "settings": {
2499
- "debug": false,
2500
- "port": 8080
2501
- }
2502
- }
2503
- ----
2504
-
2505
- Running with `--verbose`:
2506
-
2507
- [source,bash]
2508
- ----
2509
- $ canon diff config1.json config2.json --verbose
2510
- Visual Diff:
2511
- ├── settings.debug:
2512
- │ ├── - true
2513
- │ └── + false
2514
- └── version:
2515
- ├── - "1.0.0"
2516
- └── + "2.0.0"
2517
- ----
2518
-
2519
- The tree shows:
2520
- * Key order difference (`version` moved) is ignored
2521
- * Only semantic changes are shown: `debug` and `version` values changed
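The idea behind by-object comparison can be sketched as a recursive walk over the parsed data that records the path of each differing leaf. `object_diff` is a hypothetical helper for this example, not Canon's implementation. It only descends into hashes, and it treats a missing key the same as a `nil` value.

```ruby
require 'json'

# Walk two parsed structures and collect [path, old, new] for every leaf
# that differs. Key order never matters because values are looked up by key.
def object_diff(a, b, path = [])
  if a.is_a?(Hash) && b.is_a?(Hash)
    (a.keys | b.keys).sort.flat_map do |key|
      object_diff(a[key], b[key], path + [key])
    end
  elsif a == b
    []
  else
    [[path.join('.'), a, b]]
  end
end

old_config = JSON.parse('{"name":"myapp","version":"1.0.0","settings":{"debug":true,"port":8080}}')
new_config = JSON.parse('{"version":"2.0.0","name":"myapp","settings":{"debug":false,"port":8080}}')

object_diff(old_config, new_config)
# => [["settings.debug", true, false], ["version", "1.0.0", "2.0.0"]]
```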
2522
-
2523
- ====== XML comparison (by-object mode with DOM-guided semantic matching)
2524
-
2525
- Canon's XML diff uses **hybrid DOM-guided line diff** that semantically matches
2526
- elements across documents using identity attributes (such as `id`, `ref`, `name`,
2527
- `key`) and element paths. This ensures that corresponding elements are compared
2528
- even when they appear at different line positions in the files.
2529
-
2530
- [example]
2531
- Given these two XML files:
2532
-
2533
- .document1.xml
2534
- [source,xml]
2535
- ----
2536
- <standard-document>
2537
- <preface>
2538
- <foreword id="fwd">
2539
- <p>First paragraph</p>
2540
- </foreword>
2541
- </preface>
2542
- <sections>
2543
- <clause id="scope">
2544
- <title>Scope</title>
2545
- </clause>
2546
- </sections>
2547
- </standard-document>
2548
- ----
2549
-
2550
- .document2.xml
2551
- [source,xml]
2552
- ----
2553
- <standard-document>
2554
- <preface>
2555
- <foreword displayorder="2" id="fwd">
2556
- <p>First paragraph</p>
2557
- </foreword>
2558
- </preface>
2559
- <sections>
2560
- <clause id="scope">
2561
- <title>Scope</title>
2562
- <p>New content</p>
2563
- </clause>
2564
- </sections>
2565
- </standard-document>
2566
- ----
2567
-
2568
- Running with `--verbose` using by-object mode (default):
2569
-
2570
- [source,bash]
2571
- ----
2572
- $ canon diff document1.xml document2.xml --verbose
2573
- Visual Diff:
2574
- ├── preface.foreword:
2575
- │ └── + displayorder="2"
2576
- └── sections.clause.p:
2577
- └── + "New content"
2578
- ----
2579
-
2580
- The DOM-guided diff shows:
2581
-
2582
- * The `<foreword id="fwd">` elements are **semantically matched** by their `id`
2583
- attribute, even though they may be at different positions
2584
- * Only the **added** `displayorder` attribute is shown for foreword
2585
- * The **added** `<p>` element in clause is shown
2586
- * Unchanged content is not displayed
2587
-
2588
- [example]
2589
- Example with element matching when positions differ:
2590
-
2591
- .file1.xml
2592
- [source,xml]
2593
- ----
2594
- <root>
2595
- <item id="1" name="Widget" price="10.00"/>
2596
- <item id="2" name="Gadget" price="20.00"/>
2597
- </root>
2598
- ----
2599
-
2600
- .file2.xml
2601
- [source,xml]
2602
- ----
2603
- <root>
2604
- <item price="20.00" name="Gadget" id="2"/>
2605
- <item id="1" name="Widget" price="15.00"/>
2606
- </root>
2607
- ----
2608
-
2609
- Running with `--verbose`:
2610
-
2611
- [source,bash]
2612
- ----
2613
- $ canon diff file1.xml file2.xml --verbose
2614
- Visual Diff:
2615
- └── root.item[id="1"].price:
2616
- ├── - "10.00"
2617
- └── + "15.00"
2618
- ----
2619
-
2620
- The semantic matching shows:
2621
-
2622
- * Elements are matched by `id` attribute (`id="1"` with `id="1"`, `id="2"` with `id="2"`)
2623
- * Position changes are ignored (item with `id="2"` moved from second to first)
2624
- * Attribute reordering is ignored (price/name order changed)
2625
- * Only the semantic change is shown: `price` value changed for item `id="1"`
2626
-
2627
- [NOTE]
2628
- DOM-guided semantic matching features:
2629
-
2630
- * **Identity attributes**: Matches elements using `id`, `ref`, `name`, or `key` attributes
2631
- * **Element paths**: Uses full element path for matching (e.g., `root.item`)
2632
- * **Token-level highlighting**: Shows differences at semantic token level (element
2633
- names, attribute names, attribute values)
2634
- * **Parent filtering**: Skips parent elements that only differ in children to
2635
- avoid redundant output
2636
- * **Line range mapping**: Maps DOM elements to exact line ranges in pretty-printed
2637
- output for accurate diff display
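Identity-attribute matching can be illustrated without an XML parser by modelling each element as a plain hash of its attributes. This is a simplified sketch under that assumption. `IDENTITY_KEYS`, `identity_of`, and `changed_attributes` are hypothetical names, and elements without any identity attribute are not handled.

```ruby
# Identity attributes named in the text above, in lookup order.
IDENTITY_KEYS = %w[id ref name key].freeze

# Pick the first identity attribute present on an element (modelled as a
# Hash of attribute name => value), returning [attribute, value] or nil.
def identity_of(attrs)
  key = IDENTITY_KEYS.find { |k| attrs.key?(k) }
  key && [key, attrs[key]]
end

# Pair up elements from two lists by identity, regardless of position,
# and report each attribute whose value changed.
def changed_attributes(old_items, new_items)
  new_by_identity = new_items.to_h { |attrs| [identity_of(attrs), attrs] }
  old_items.flat_map do |old|
    new = new_by_identity[identity_of(old)]
    next [] unless new

    (old.keys | new.keys)
      .reject { |k| old[k] == new[k] }
      .map { |k| [identity_of(old), k, old[k], new[k]] }
  end
end

old_items = [
  { 'id' => '1', 'name' => 'Widget', 'price' => '10.00' },
  { 'id' => '2', 'name' => 'Gadget', 'price' => '20.00' }
]
new_items = [
  { 'price' => '20.00', 'name' => 'Gadget', 'id' => '2' },
  { 'id' => '1', 'name' => 'Widget', 'price' => '15.00' }
]

changed_attributes(old_items, new_items)
# => [[["id", "1"], "price", "10.00", "15.00"]]
```

As in the `file1.xml` / `file2.xml` example, the reordered items and attributes produce no output. Only the `price` change on the item with `id="1"` is reported.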
2638
-
2639
- ====== XML comparison (by-line mode)
2640
-
2641
- The `--by-line` option switches to traditional line-by-line diff after
2642
- canonicalization, useful when you need to see exact line-level changes.
2643
-
2644
- [example]
2645
- Using the previous example files, but with `--by-line`:
2646
-
2647
- [source,bash]
2648
- ----
2649
- $ canon diff document1.xml document2.xml --by-line --verbose
2650
- Line-by-line diff:
2651
- 4 - | <foreword id="fwd">
2652
- 4 + | <foreword displayorder="2" id="fwd">
2653
- 5 | <p>First paragraph</p>
2654
- 10 + | <p>New content</p>
2655
- 11 | </clause>
2656
- ----
2657
-
2658
- The by-line mode shows:
2659
-
2660
- * Traditional diff format with line numbers
2661
- * Full line context after canonicalization
2662
- * All changes at line level (not semantic level)
2663
- * Useful for reviewing exact textual differences
2664
-
2665
- ====== YAML comparison (by-object mode)
2666
-
2667
- YAML comparison uses by-object mode to show semantic differences in the data
2668
- structure, ignoring formatting and key ordering differences.
2669
-
2670
- [example]
2671
- Given these two YAML files:
2672
-
2673
- .config1.yaml
2674
- [source,yaml]
2675
- ----
2676
- database:
2677
-   host: localhost
2678
-   port: 5432
2679
-   name: mydb
2680
- logging:
2681
-   level: info
2682
-   format: json
2683
- ----
2684
-
2685
- .config2.yaml
2686
- [source,yaml]
2687
- ----
2688
- logging:
2688
-   level: debug
2689
-   format: json
2690
- database:
2691
-   port: 5432
2692
-   host: localhost
2693
-   name: production
2695
- ----
2696
-
2697
- Running with `--verbose`:
2698
-
2699
- [source,bash]
2700
- ----
2701
- $ canon diff config1.yaml config2.yaml --verbose
2702
- Visual Diff:
2703
- ├── database.name:
2704
- │ ├── - "mydb"
2705
- │ └── + "production"
2706
- └── logging.level:
2707
- ├── - "info"
2708
- └── + "debug"
2709
- ----
2710
-
2711
- The by-object mode shows:
2712
-
2713
- * Section reordering (`logging` before `database`) is ignored
2714
- * Key reordering within sections (`port` before `host`) is ignored
2715
- * Only semantic value changes are displayed
2716
- * Tree structure clearly shows the path to each change
2717
-
2718
- ===== Comparison options examples
2719
-
2720
- [source,bash]
2721
- ----
2722
- # Include comments in XML comparison
2723
- $ canon diff doc1.xml doc2.xml --with-comments --verbose
2724
-
2725
- # Ignore all text content, only compare structure
2726
- $ canon diff template1.html template2.html --ignore-text-nodes
2727
-
2728
- # Don't collapse whitespace (exact whitespace comparison)
2729
- $ canon diff file1.xml file2.xml --no-collapse-whitespace
2730
-
2731
- # Compare different formats (must have same structure)
2732
- $ canon diff config.json config.yaml --format1 json --format2 yaml --verbose
2733
- ----
2734
-
2735
- ===== HTML comparison (by-line mode only)
2736
-
2737
- HTML comparison always uses by-line mode after canonicalization, which is ideal
2738
- for reviewing markup structure changes.
2739
-
2740
- [example]
2741
- Given these two HTML files:
2742
-
2743
- .page1.html
2744
- [source,html]
2745
- ----
2746
- <!DOCTYPE html>
2747
- <html>
2748
- <head>
2749
- <title>My Page</title>
2750
- </head>
2751
- <body>
2752
- <div class="header">
2753
- <h1>Welcome</h1>
2754
- <p>Introduction text</p>
2755
- </div>
2756
- <div class="content">
2757
- <p>Main content</p>
2758
- </div>
2759
- </body>
2760
- </html>
2761
- ----
2762
-
2763
- .page2.html
2764
- [source,html]
2765
- ----
2766
- <!DOCTYPE html>
2767
- <html>
2768
- <head>
2769
- <title>My Updated Page</title>
2770
- </head>
2771
- <body>
2772
- <nav class="header">
2773
- <h1>Welcome</h1>
2774
- <p>Updated introduction</p>
2775
- </nav>
2776
- <div class="content">
2777
- <p>Main content</p>
2778
- <p>Additional paragraph</p>
2779
- </div>
2780
- </body>
2781
- </html>
2782
- ----
2783
-
2784
- Running with `--verbose`:
2785
-
2786
- [source,bash]
2787
- ----
2788
- $ canon diff page1.html page2.html --verbose
2789
- Line-by-line diff:
2790
- 4 - | <title>My Page</title>
2791
- 4 + | <title>My Updated Page</title>
2792
- 7 - | <div class="header">
2793
- 7 + | <nav class="header">
2794
- 9 - | <p>Introduction text</p>
2795
- 9 + | <p>Updated introduction</p>
2796
- 10 - | </div>
2797
- 10 + | </nav>
2798
- 13 + | <p>Additional paragraph</p>
2799
- 14 | </div>
2800
- ----
2801
-
2802
- The line-by-line mode shows:
2803
-
2804
- * Element name changes (`<div>` to `<nav>`)
2805
- * Text content changes
2806
- * Added elements with proper indentation context
2807
- * Line numbers help locate changes in the document
2808
-
2809
- ===== Exit codes
2810
-
2811
- * `0` - Files are semantically equivalent
2812
- * `1` - Files are semantically different
2813
-
2814
-
2815
- == Development
2816
-
2817
- After checking out the repo, run `bin/setup` to install dependencies. Then, run
2818
- `rake spec` to run the tests. You can also run `bin/console` for an interactive
2819
- prompt that will allow you to experiment.
2820
-
2821
-
2822
- == Contributing
2823
-
2824
- Bug reports and pull requests are welcome on GitHub at
2825
- https://github.com/lutaml/canon.
2826
-
2827
-
2828
- == Copyright and license
2829
-
2830
- Copyright Ribose.
2831
- https://opensource.org/licenses/BSD-2-Clause[BSD-2-Clause License].