canon 0.1.8 → 0.1.9

Files changed (98)
  1. checksums.yaml +4 -4
  2. data/.rubocop_todo.yml +112 -25
  3. data/docs/Gemfile +1 -0
  4. data/docs/_config.yml +90 -1
  5. data/docs/advanced/diff-classification.adoc +82 -2
  6. data/docs/features/match-options/index.adoc +239 -1
  7. data/lib/canon/comparison/format_detector.rb +2 -1
  8. data/lib/canon/comparison/html_comparator.rb +19 -8
  9. data/lib/canon/comparison/html_compare_profile.rb +8 -2
  10. data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
  11. data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
  12. data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
  13. data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
  14. data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
  15. data/lib/canon/comparison/xml_comparator.rb +48 -23
  16. data/lib/canon/comparison/xml_node_comparison.rb +25 -3
  17. data/lib/canon/diff/diff_classifier.rb +101 -2
  18. data/lib/canon/diff/formatting_detector.rb +1 -1
  19. data/lib/canon/rspec_matchers.rb +37 -8
  20. data/lib/canon/version.rb +1 -1
  21. data/lib/canon/xml/data_model.rb +24 -13
  22. metadata +3 -78
  23. data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
  24. data/false_positive_analysis.txt +0 -0
  25. data/file1.html +0 -1
  26. data/file2.html +0 -1
  27. data/old-docs/ADVANCED_TOPICS.adoc +0 -20
  28. data/old-docs/BASIC_USAGE.adoc +0 -16
  29. data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
  30. data/old-docs/CLI.adoc +0 -497
  31. data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
  32. data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
  33. data/old-docs/DIFF_FORMATTING.adoc +0 -540
  34. data/old-docs/DIFF_PARAMETERS.adoc +0 -261
  35. data/old-docs/DOM_DIFF.adoc +0 -1017
  36. data/old-docs/ENV_CONFIG.adoc +0 -876
  37. data/old-docs/FORMATS.adoc +0 -867
  38. data/old-docs/INPUT_VALIDATION.adoc +0 -477
  39. data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
  40. data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
  41. data/old-docs/MATCH_OPTIONS.adoc +0 -912
  42. data/old-docs/MODES.adoc +0 -432
  43. data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
  44. data/old-docs/OPTIONS.adoc +0 -1387
  45. data/old-docs/PREPROCESSING.adoc +0 -491
  46. data/old-docs/README.old.adoc +0 -2831
  47. data/old-docs/RSPEC.adoc +0 -814
  48. data/old-docs/RUBY_API.adoc +0 -485
  49. data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
  50. data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
  51. data/old-docs/STRING_COMPARE.adoc +0 -345
  52. data/old-docs/TMP.adoc +0 -3384
  53. data/old-docs/TREE_DIFF.adoc +0 -1080
  54. data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
  55. data/old-docs/VERBOSE.adoc +0 -482
  56. data/old-docs/VISUALIZATION_MAP.adoc +0 -625
  57. data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
  58. data/scripts/analyze_current_state.rb +0 -85
  59. data/scripts/analyze_false_positives.rb +0 -114
  60. data/scripts/analyze_remaining_failures.rb +0 -105
  61. data/scripts/compare_current_failures.rb +0 -95
  62. data/scripts/compare_dom_tree_diff.rb +0 -158
  63. data/scripts/compare_failures.rb +0 -151
  64. data/scripts/debug_attribute_extraction.rb +0 -66
  65. data/scripts/debug_blocks_839.rb +0 -115
  66. data/scripts/debug_meta_matching.rb +0 -52
  67. data/scripts/debug_p_matching.rb +0 -192
  68. data/scripts/debug_signature_matching.rb +0 -118
  69. data/scripts/debug_sourcecode_124.rb +0 -32
  70. data/scripts/debug_whitespace_sensitive.rb +0 -192
  71. data/scripts/extract_false_positives.rb +0 -138
  72. data/scripts/find_actual_false_positives.rb +0 -125
  73. data/scripts/investigate_all_false_positives.rb +0 -161
  74. data/scripts/investigate_batch1.rb +0 -127
  75. data/scripts/investigate_classification.rb +0 -150
  76. data/scripts/investigate_classification_detailed.rb +0 -190
  77. data/scripts/investigate_common_failures.rb +0 -342
  78. data/scripts/investigate_false_negative.rb +0 -80
  79. data/scripts/investigate_false_positive.rb +0 -83
  80. data/scripts/investigate_false_positives.rb +0 -227
  81. data/scripts/investigate_false_positives_batch.rb +0 -163
  82. data/scripts/investigate_mixed_content.rb +0 -125
  83. data/scripts/investigate_remaining_16.rb +0 -214
  84. data/scripts/run_single_test.rb +0 -29
  85. data/scripts/test_all_false_positives.rb +0 -95
  86. data/scripts/test_attribute_details.rb +0 -61
  87. data/scripts/test_both_algorithms.rb +0 -49
  88. data/scripts/test_both_simple.rb +0 -49
  89. data/scripts/test_enhanced_semantic_output.rb +0 -125
  90. data/scripts/test_readme_examples.rb +0 -131
  91. data/scripts/test_semantic_tree_diff.rb +0 -99
  92. data/scripts/test_semantic_ux_improvements.rb +0 -135
  93. data/scripts/test_single_false_positive.rb +0 -119
  94. data/scripts/test_size_limits.rb +0 -99
  95. data/test_html_1.html +0 -21
  96. data/test_html_2.html +0 -21
  97. data/test_nokogiri.rb +0 -33
  98. data/test_normalize.rb +0 -45
data/old-docs/TMP.adoc DELETED
@@ -1,3384 +0,0 @@
1
- = Canon: Canonicalization for serialization formats
2
-
3
- A Ruby library for canonicalizing and pretty-printing XML, HTML, YAML, and JSON
4
- with RSpec matchers for equivalence testing.
5
-
6
-
7
- == Purpose
8
-
9
- Canon provides canonicalization and pretty-printing for various serialization
10
- formats (XML, HTML, JSON, YAML), producing standardized forms suitable for
11
- comparison, testing, digital signatures, and human-readable output.
12
-
13
-
14
- == Architecture
15
-
16
- Canon follows an **orchestrator pattern** with clear separation of concerns.
17
-
18
- === Comparison module
19
-
20
- The `Canon::Comparison` module (123 lines) acts as a pure orchestrator that:
21
-
22
- * Detects input format (XML, HTML, JSON, YAML)
23
- * Validates format compatibility
24
- * Delegates to format-specific comparator classes
25
-
26
- Format-specific comparators:
27
-
28
- * `Canon::Comparison::XmlComparator` - XML semantic comparison
29
- * `Canon::Comparison::HtmlComparator` - HTML semantic comparison
30
- * `Canon::Comparison::JsonComparator` - JSON/Ruby object comparison
31
- * `Canon::Comparison::YamlComparator` - YAML comparison (delegates to JsonComparator)
32
-
33
- Each comparator is self-contained and handles all comparison logic for its format.
34
-
35
- === DiffFormatter module
36
-
37
- The `Canon::DiffFormatter` class (171 lines) acts as a pure orchestrator that:
38
-
39
- * Manages diff options (colors, visualization, context)
40
- * Detects diff mode (by-object vs by-line)
41
- * Delegates to mode-specific and format-specific formatters
42
-
43
- Two diff modes:
44
-
45
- **By-object mode** (tree-based semantic diff):
46
-
47
- * `Canon::DiffFormatter::ByObject::BaseFormatter` - Factory and common logic
48
- * `Canon::DiffFormatter::ByObject::XmlFormatter` - XML DOM differences
49
- * `Canon::DiffFormatter::ByObject::JsonFormatter` - Ruby object differences
50
- * `Canon::DiffFormatter::ByObject::YamlFormatter` - YAML differences
51
-
52
- **By-line mode** (line-based diff):
53
-
54
- * `Canon::DiffFormatter::ByLine::BaseFormatter` - LCS algorithm and factory
55
- * `Canon::DiffFormatter::ByLine::XmlFormatter` - DOM-guided XML line diff
56
- * `Canon::DiffFormatter::ByLine::JsonFormatter` - Semantic JSON line diff
57
- * `Canon::DiffFormatter::ByLine::YamlFormatter` - Semantic YAML line diff
58
- * `Canon::DiffFormatter::ByLine::SimpleFormatter` - Fallback line diff
59
-
60
- Each formatter handles format-specific intelligence (DOM parsing, token
61
- highlighting, semantic understanding).
62
-
63
- === Object-oriented diff foundation
64
-
65
- Canon uses three foundational classes for managing diff data:
66
-
67
- * `Canon::Diff::DiffBlock` - Represents a contiguous block of changes
68
- * `Canon::Diff::DiffContext` - Groups diff blocks with surrounding context
69
- * `Canon::Diff::DiffReport` - Top-level container for complete diff results
70
-
71
- These classes ensure clean separation by providing clear ownership of diff data
72
- at different granularity levels.
73
-
74
-
75
- == Features
76
-
77
- === Ruby API
78
-
79
- Single API for working with all four formats (XML, HTML, JSON, YAML).
80
-
81
-
82
- === XML canonicalization
83
-
84
- Format XML documents according to the
85
- https://www.w3.org/TR/xml-c14n11/[W3C Canonical XML Version 1.1] specification.
86
-
87
- Key features:
88
-
89
- * Namespace declaration ordering (lexicographic by prefix)
90
- * Attribute ordering (lexicographic by namespace URI, then local name)
91
- * Character encoding normalization to UTF-8
92
- * Special character encoding in text and attributes
93
- * Removal of superfluous namespace declarations
94
- * Support for xml:base, xml:lang, xml:space, and xml:id attributes
95
- * Processing instruction and comment handling
96
- * Document subset support with attribute inheritance
97
-
98
- === HTML canonicalization
99
-
100
- Format HTML 4/5 and XHTML documents with consistent formatting. Automatically
101
- detects HTML vs XHTML and applies appropriate formatting.
102
-
103
- === YAML canonicalization
104
-
105
- Format YAML documents with keys sorted alphabetically at all levels of the
106
- structure.
107
-
108
- === JSON canonicalization
109
-
110
- Format JSON documents with keys sorted alphabetically at all levels of the
111
- structure.
112
-
113
-
114
- === Output modes
115
-
116
- Canon supports two output modes for all formats:
117
-
118
- `c14n` (canonical):: Compact output without indentation, suitable for digital
119
- signatures, hashing, and equivalence testing. Removes formatting whitespace.
120
-
121
- `pretty` (pretty-print):: Human-readable output with consistent indentation.
122
- Configurable indent size and type (spaces or tabs). This is the default mode for
123
- CLI commands.
124
-
125
-
126
- === RSpec matchers
127
-
128
- Provides matchers for testing equivalence between serialized formats.
129
-
130
- NOTE: RSpec matchers always use canonical (c14n) mode for comparison to ensure
131
- formatting differences don't affect test results.
132
-
133
- === Comparison API
134
-
135
- Canon provides a `Canon::Comparison` module for semantic comparison of HTML and
136
- XML documents.
137
-
138
- The `Canon::Comparison.equivalent?` method compares two documents for semantic
139
- equivalence, ignoring formatting differences that don't affect meaning.
140
-
141
- Key features:
142
-
143
- * Semantic comparison (content, not formatting)
144
- * Whitespace normalization
145
- * Comment handling (can ignore or include)
146
- * Attribute sorting
147
- * Support for both HTML and XML documents
148
- * Optional verbose diff output
149
-
150
- NOTE: `Canon::Comparison.equivalent?` adopts option names used by the excellent
151
- https://github.com/vkononov/compare-xml[`compare-xml` gem].
152
-
153
-
154
-
155
- == Usage
156
-
157
- === Command-line usage
158
-
159
- === Installation
160
-
161
- After installing the gem, the `canon` command will be available:
162
-
163
- [source,bash]
164
- ----
165
- $ gem install canon
166
- $ canon --help
167
- ----
168
-
169
- === Format command
170
-
171
- The `format` command formats files in XML, HTML, JSON, or YAML.
172
-
173
- ==== Output modes
174
-
175
- `pretty` (default):: Human-readable output with indentation (2 spaces default)
176
- `c14n`:: Canonical form without indentation
177
-
178
- ==== Command syntax
179
-
180
- [source,bash]
181
- ----
182
- canon format FILE [OPTIONS]
183
- ----
184
-
185
- ==== Options
186
-
187
- `-f, --format FORMAT`:: Specify format: `xml`, `html`, `json`, or `yaml`
188
- (auto-detected from extension if not specified)
189
-
190
- `-m, --mode MODE`:: Output mode: `pretty` (default) or `c14n`
191
-
192
- `-i, --indent N`:: Indentation spaces for pretty mode (default: 2)
193
-
194
- `--indent-type TYPE`:: Indentation type: `space` (default) or `tab`
195
-
196
- `-o, --output FILE`:: Write output to file instead of stdout
197
-
198
- `-c, --with-comments`:: Include comments in canonical XML output
199
-
200
- ==== Examples
201
-
202
- [source,bash]
203
- ----
204
- # Pretty-print (default mode)
205
- $ canon format input.xml
206
- <?xml version="1.0" encoding="UTF-8"?>
207
- <root>
208
- <a>1</a>
209
- <b>2</b>
210
- </root>
211
-
212
- # Canonical mode (compact)
213
- $ canon format input.xml --mode c14n
214
- <root><a>1</a><b>2</b></root>
215
-
216
- # Custom indentation
217
- $ canon format input.xml --mode pretty --indent 4
218
- $ canon format input.json --indent 4
219
-
220
- # Tab indentation
221
- $ canon format input.xml --indent-type tab
222
- $ canon format input.html --mode pretty --indent-type tab
223
-
224
- # Specify format explicitly
225
- $ canon format data.txt --format xml
226
-
227
- # Save to file
228
- $ canon format input.xml --output formatted.xml
229
-
230
- # Include XML comments in canonical output
231
- $ canon format doc.xml --mode c14n --with-comments
232
-
233
- # HTML files
234
- $ canon format page.html
235
- $ canon format page.html --mode c14n
236
- ----
237
-
238
-
239
- === Diff command
240
-
241
- Compare two files using **semantic comparison** that understands the structure of
242
- XML, HTML, JSON, and YAML formats. Unlike traditional text-based diff tools,
243
- `canon diff` compares the meaning and structure of your data, not just the
244
- characters.
245
-
246
- ==== Command syntax
247
-
248
- [source,bash]
249
- ----
250
- canon diff FILE1 FILE2 [OPTIONS]
251
- ----
252
-
253
- ==== Diff modes
254
-
255
- Canon supports two diff modes optimized for different use cases:
256
-
257
- ===== by-object mode (default for JSON/YAML)
258
-
259
- Compares files **semantically** by their data structure and displays differences
260
- as a visual tree showing what changed in the structure.
261
-
262
- Best for::
263
- * Configuration files where you care about what values changed
264
- * API responses where structure matters
265
- * Comparing semantic equivalence across formats
266
-
267
- Features::
268
- * Tree visualization with box-drawing characters
269
- * Shows only what changed (additions, removals, modifications)
270
- * Ignores formatting differences automatically
271
- * Color-coded output (red=removed, green=added, yellow=changed)
272
-
273
- ===== by-line mode (default for HTML, optional for XML)
274
-
275
- Compares files **line-by-line** after canonicalization, showing traditional
276
- diff-style output.
277
-
278
- Best for::
279
- * HTML markup where line-level changes matter
280
- * Reviewing exact textual differences
281
- * When you need to see the full document context
282
-
283
- Features::
284
- * Traditional diff format with line numbers
285
- * Shows before/after for each change
286
- * Better for understanding markup structure changes
287
-
288
- [NOTE]
289
- * JSON and YAML always use **by-object** mode
290
- * HTML always uses **by-line** mode
291
- * XML uses **by-object** mode by default, but can use **by-line** with `--by-line`
292
-
293
- ==== Options
294
-
295
- ===== Format options
296
-
297
- `-f, --format FORMAT`:: Format for both files: `xml`, `html`, `json`, or `yaml`
298
- (auto-detected from extension if not specified)
299
-
300
- `--format1 FORMAT`:: Format for first file (when comparing different formats)
301
-
302
- `--format2 FORMAT`:: Format for second file (when comparing different formats)
303
-
304
- ===== Comparison options
305
-
306
- `-v, --verbose`:: Show detailed differences in tree format (default: just show
307
- if files differ)
308
-
309
- `--by-line`:: Use line-by-line diff for XML (default: by-object mode)
310
-
311
- `--ignore-attr-order` / `--no-ignore-attr-order`:: Control whether attribute/key
312
- ordering matters (default: ignore order)
313
-
314
- `--ignore-comments`:: Ignore XML/HTML comments during comparison (overrides
315
- `--with-comments`)
316
-
317
- `--ignore-text-nodes`:: Ignore all text node content, only compare structure
318
-
319
- `-c, --with-comments`:: Include comments in comparison (sets `ignore_comments: false`)
320
-
321
- ===== Output options
322
-
323
- `--color` / `--no-color`:: Enable/disable colored output (default: enabled)
324
-
325
- ==== Examples
326
-
327
- ===== Basic comparison
328
-
329
- [source,bash]
330
- ----
331
- # Compare two JSON files (shows if equivalent or different)
332
- $ canon diff config1.json config2.json
333
- Files are semantically different
334
-
335
- # Compare two XML files
336
- $ canon diff file1.xml file2.xml
337
- ✅ Files are semantically equivalent
338
- ----
339
-
340
- ===== Verbose mode examples
341
-
342
- ====== JSON comparison (by-object mode)
343
-
344
- [example]
345
- Given these two JSON files:
346
-
347
- .config1.json
348
- [source,json]
349
- ----
350
- {
351
- "name": "myapp",
352
- "version": "1.0.0",
353
- "settings": {
354
- "debug": true,
355
- "port": 8080
356
- }
357
- }
358
- ----
359
-
360
- .config2.json
361
- [source,json]
362
- ----
363
- {
364
- "version": "2.0.0",
365
- "name": "myapp",
366
- "settings": {
367
- "debug": false,
368
- "port": 8080
369
- }
370
- }
371
- ----
372
-
373
- Running with `--verbose`:
374
-
375
- [source,bash]
376
- ----
377
- $ canon diff config1.json config2.json --verbose
378
- Visual Diff:
379
- ├── settings.debug:
380
- │ ├── - true
381
- │ └── + false
382
- └── version:
383
- ├── - "1.0.0"
384
- └── + "2.0.0"
385
- ----
386
-
387
- The tree shows:
388
-
389
- * Key order difference (`version` moved) is ignored
390
- * Only semantic changes are shown: `debug` and `version` values changed
391
-
392
- ====== XML comparison (by-object mode with DOM-guided semantic matching)
393
-
394
- Canon's XML diff uses **hybrid DOM-guided line diff** that semantically matches
395
- elements across documents using identity attributes (such as `id`, `ref`, `name`,
396
- `key`) and element paths. This ensures that corresponding elements are compared
397
- even when they appear at different line positions in the files.
398
-
399
- [example]
400
- Given these two XML files:
401
-
402
- .document1.xml
403
- [source,xml]
404
- ----
405
- <standard-document>
406
- <preface>
407
- <foreword id="fwd">
408
- <p>First paragraph</p>
409
- </foreword>
410
- </preface>
411
- <sections>
412
- <clause id="scope">
413
- <title>Scope</title>
414
- </clause>
415
- </sections>
416
- </standard-document>
417
- ----
418
-
419
- .document2.xml
420
- [source,xml]
421
- ----
422
- <standard-document>
423
- <preface>
424
- <foreword displayorder="2" id="fwd">
425
- <p>First paragraph</p>
426
- </foreword>
427
- </preface>
428
- <sections>
429
- <clause id="scope">
430
- <title>Scope</title>
431
- <p>New content</p>
432
- </clause>
433
- </sections>
434
- </standard-document>
435
- ----
436
-
437
- Running with `--verbose` using by-object mode (default):
438
-
439
- [source,bash]
440
- ----
441
- $ canon diff document1.xml document2.xml --verbose
442
- Visual Diff:
443
- ├── preface.foreword:
444
- │ └── + displayorder="2"
445
- └── sections.clause.p:
446
- └── + "New content"
447
- ----
448
-
449
- The DOM-guided diff shows:
450
-
451
- * The `<foreword id="fwd">` elements are **semantically matched** by their `id`
452
- attribute, even though they may be at different positions
453
- * Only the **added** `displayorder` attribute is shown for foreword
454
- * The **added** `<p>` element in clause is shown
455
- * Unchanged content is not displayed
456
-
457
- [example]
458
- Example with element matching when positions differ:
459
-
460
- .file1.xml
461
- [source,xml]
462
- ----
463
- <root>
464
- <item id="1" name="Widget" price="10.00"/>
465
- <item id="2" name="Gadget" price="20.00"/>
466
- </root>
467
- ----
468
-
469
- .file2.xml
470
- [source,xml]
471
- ----
472
- <root>
473
- <item price="20.00" name="Gadget" id="2"/>
474
- <item id="1" name="Widget" price="15.00"/>
475
- </root>
476
- ----
477
-
478
- Running with `--verbose`:
479
-
480
- [source,bash]
481
- ----
482
- $ canon diff file1.xml file2.xml --verbose
483
- Visual Diff:
484
- └── root.item[id="1"].price:
485
- ├── - "10.00"
486
- └── + "15.00"
487
- ----
488
-
489
- The semantic matching shows:
490
-
491
- * Elements are matched by `id` attribute (`id="1"` with `id="1"`, `id="2"` with `id="2"`)
492
- * Position changes are ignored (item with `id="2"` moved from second to first)
493
- * Attribute reordering is ignored (price/name order changed)
494
- * Only the semantic change is shown: `price` value changed for item `id="1"`
495
-
496
- [NOTE]
497
- DOM-guided semantic matching features:
498
-
499
- * **Identity attributes**: Matches elements using `id`, `ref`, `name`, or `key` attributes
500
- * **Element paths**: Uses full element path for matching (e.g., `root.item`)
501
- * **Token-level highlighting**: Shows differences at semantic token level (element
502
- names, attribute names, attribute values)
503
- * **Parent filtering**: Skips parent elements that only differ in children to
504
- avoid redundant output
505
- * **Line range mapping**: Maps DOM elements to exact line ranges in pretty-printed
506
- output for accurate diff display
507
-
508
- ====== XML comparison (by-line mode)
509
-
510
- The `--by-line` option switches to traditional line-by-line diff after
511
- canonicalization, useful when you need to see exact line-level changes.
512
-
513
- [example]
514
- Using the previous example files, but with `--by-line`:
515
-
516
- [source,bash]
517
- ----
518
- $ canon diff document1.xml document2.xml --by-line --verbose
519
- Line-by-line diff:
520
- 4 - | <foreword id="fwd">
521
- 4 + | <foreword displayorder="2" id="fwd">
522
- 5 | <p>First paragraph</p>
523
- 10 + | <p>New content</p>
524
- 11 | </clause>
525
- ----
526
-
527
- The by-line mode shows:
528
-
529
- * Traditional diff format with line numbers
530
- * Full line context after canonicalization
531
- * All changes at line level (not semantic level)
532
- * Useful for reviewing exact textual differences
533
-
534
- ====== YAML comparison (by-object mode)
535
-
536
- YAML comparison uses by-object mode to show semantic differences in the data
537
- structure, ignoring formatting and key ordering differences.
538
-
539
- [example]
540
- Given these two YAML files:
541
-
542
- .config1.yaml
543
- [source,yaml]
544
- ----
545
- database:
546
- host: localhost
547
- port: 5432
548
- name: mydb
549
- logging:
550
- level: info
551
- format: json
552
- ----
553
-
554
- .config2.yaml
555
- [source,yaml]
556
- ----
557
- logging:
558
- level: debug
559
- format: json
560
- database:
561
- port: 5432
562
- host: localhost
563
- name: production
564
- ----
565
-
566
- Running with `--verbose`:
567
-
568
- [source,bash]
569
- ----
570
- $ canon diff config1.yaml config2.yaml --verbose
571
- Visual Diff:
572
- ├── database.name:
573
- │ ├── - "mydb"
574
- │ └── + "production"
575
- └── logging.level:
576
- ├── - "info"
577
- └── + "debug"
578
- ----
579
-
580
- The by-object mode shows:
581
-
582
- * Section reordering (`logging` before `database`) is ignored
583
- * Key reordering within sections (`port` before `host`) is ignored
584
- * Only semantic value changes are displayed
585
- * Tree structure clearly shows the path to each change
586
-
587
- ===== Comparison options examples
588
-
589
- [source,bash]
590
- ----
591
- # Include comments in XML comparison
592
- $ canon diff doc1.xml doc2.xml --with-comments --verbose
593
-
594
- # Ignore all text content, only compare structure
595
- $ canon diff template1.html template2.html --ignore-text-nodes
596
-
597
- # Don't collapse whitespace (exact whitespace comparison)
598
- $ canon diff file1.xml file2.xml --no-collapse-whitespace
599
-
600
- # Compare different formats (must have same structure)
601
- $ canon diff config.json config.yaml --format1 json --format2 yaml --verbose
602
- ----
603
-
604
- ===== HTML comparison (by-line mode only)
605
-
606
- HTML comparison always uses by-line mode after canonicalization, which is ideal
607
- for reviewing markup structure changes.
608
-
609
- [example]
610
- Given these two HTML files:
611
-
612
- .page1.html
613
- [source,html]
614
- ----
615
- <!DOCTYPE html>
616
- <html>
617
- <head>
618
- <title>My Page</title>
619
- </head>
620
- <body>
621
- <div class="header">
622
- <h1>Welcome</h1>
623
- <p>Introduction text</p>
624
- </div>
625
- <div class="content">
626
- <p>Main content</p>
627
- </div>
628
- </body>
629
- </html>
630
- ----
631
-
632
- .page2.html
633
- [source,html]
634
- ----
635
- <!DOCTYPE html>
636
- <html>
637
- <head>
638
- <title>My Updated Page</title>
639
- </head>
640
- <body>
641
- <nav class="header">
642
- <h1>Welcome</h1>
643
- <p>Updated introduction</p>
644
- </nav>
645
- <div class="content">
646
- <p>Main content</p>
647
- <p>Additional paragraph</p>
648
- </div>
649
- </body>
650
- </html>
651
- ----
652
-
653
- Running with `--verbose`:
654
-
655
- [source,bash]
656
- ----
657
- $ canon diff page1.html page2.html --verbose
658
- Line-by-line diff:
659
- 4 - | <title>My Page</title>
660
- 4 + | <title>My Updated Page</title>
661
- 7 - | <div class="header">
662
- 7 + | <nav class="header">
663
- 9 - | <p>Introduction text</p>
664
- 9 + | <p>Updated introduction</p>
665
- 10 - | </div>
666
- 10 + | </nav>
667
- 13 + | <p>Additional paragraph</p>
668
- 14 | </div>
669
- ----
670
-
671
- The line-by-line mode shows:
672
-
673
- * Element name changes (`<div>` to `<nav>`)
674
- * Text content changes
675
- * Added elements with proper indentation context
676
- * Line numbers help locate changes in the document
677
-
678
- ===== Exit codes
679
-
680
- * `0` - Files are semantically equivalent
681
- * `1` - Files are semantically different
682
-
683
- === Ruby API usage
684
-
685
- === Basic formatting (c14n mode)
686
-
687
- The `Canon.format` method produces canonical output by default.
688
-
689
- Syntax:
690
-
691
- [source,ruby]
692
- ----
693
- Canon.format({content}, {format})
694
- Canon.format_{format}({content}) # Format-specific shorthand
695
- ----
696
-
697
- Where,
698
-
699
- `{content}`:: The input string
700
- `{format}`:: The format type (`:xml`, `:html`, `:json`, or `:yaml`)
701
-
702
- .Canonical formatting examples
703
- [example]
704
- ====
705
- [source,ruby]
706
- ----
707
- require 'canon'
708
-
709
- # XML - compact canonical form
710
- xml = '<root><b>2</b><a>1</a></root>'
711
- Canon.format(xml, :xml)
712
- # => "<root><a>1</a><b>2</b></root>"
713
-
714
- Canon.format_xml(xml) # Shorthand
715
- # => "<root><a>1</a><b>2</b></root>"
716
-
717
- # HTML - compact canonical form
718
- html = '<div><p>Hello</p></div>'
719
- Canon.format(html, :html)
720
- Canon.format_html(html) # Shorthand
721
-
722
- # JSON - canonical with sorted keys
723
- json = '{"z":3,"a":1,"b":2}'
724
- Canon.format(json, :json)
725
- # => {"a":1,"b":2,"z":3}
726
-
727
- # YAML - canonical with sorted keys
728
- yaml = "z: 3\na: 1\nb: 2"
729
- Canon.format(yaml, :yaml)
730
- ----
731
- ====
732
-
733
- === Pretty-print mode
734
-
735
- For human-readable output with indentation, use the format-specific pretty
736
- printer classes.
737
-
738
- Syntax:
739
-
740
- [source,ruby]
741
- ----
742
- Canon::{Format}::PrettyPrinter.new(indent: {n}, indent_type: {type}).format({content})
743
- ----
744
-
745
- Where,
746
-
747
- `{Format}`:: The format module (`Xml`, `Html`, `Json`)
748
- `{n}`:: Number of spaces (default: 2) or tabs (use 1 for tabs)
749
- `{type}`:: Indentation type: `'space'` (default) or `'tab'`
750
- `{content}`:: The input string
751
-
752
- .Pretty-print examples
753
- [example]
754
- ====
755
- [source,ruby]
756
- ----
757
- require 'canon/xml/pretty_printer'
758
- require 'canon/html/pretty_printer'
759
- require 'canon/json/pretty_printer'
760
-
761
- xml_input = '<root><b>2</b><a>1</a></root>'
762
-
763
- # XML with 2-space indentation (default)
764
- Canon::Xml::PrettyPrinter.new(indent: 2).format(xml_input)
765
- # =>
766
- # <?xml version="1.0" encoding="UTF-8"?>
767
- # <root>
768
- # <a>1</a>
769
- # <b>2</b>
770
- # </root>
771
-
772
- # XML with 4-space indentation
773
- Canon::Xml::PrettyPrinter.new(indent: 4).format(xml_input)
774
-
775
- # XML with tab indentation
776
- Canon::Xml::PrettyPrinter.new(
777
- indent: 1,
778
- indent_type: 'tab'
779
- ).format(xml_input)
780
-
781
- # HTML with 2-space indentation
782
- html_input = '<div><p>Hello</p></div>'
783
- Canon::Html::PrettyPrinter.new(indent: 2).format(html_input)
784
-
785
- # JSON with 2-space indentation
786
- json_input = '{"z":3,"a":{"b":1}}'
787
- Canon::Json::PrettyPrinter.new(indent: 2).format(json_input)
788
-
789
- # JSON with tab indentation
790
- Canon::Json::PrettyPrinter.new(
791
- indent: 1,
792
- indent_type: 'tab'
793
- ).format(json_input)
794
- ----
795
- ====
796
-
797
- === Parsing
798
-
799
- The `Canon.parse` method parses content into Ruby objects or Nokogiri documents.
800
-
801
- Syntax:
802
-
803
- [source,ruby]
804
- ----
805
- Canon.parse({content}, {format})
806
- Canon.parse_{format}({content}) # Format-specific shorthand
807
- ----
808
-
809
- Where,
810
-
811
- `{content}`:: The input string
812
- `{format}`:: The format type (`:xml`, `:html`, `:json`, or `:yaml`)
813
-
814
- .Parsing examples
815
- [example]
816
- ====
817
- [source,ruby]
818
- ----
819
- # Parse XML → Nokogiri::XML::Document
820
- xml_doc = Canon.parse(xml_input, :xml)
821
- xml_doc = Canon.parse_xml(xml_input)
822
-
823
- # Parse HTML → Nokogiri::HTML5::Document (or XML::Document for XHTML)
824
- html_doc = Canon.parse(html_input, :html)
825
- html_doc = Canon.parse_html(html_input)
826
-
827
- # Parse JSON → Ruby Hash/Array
828
- json_obj = Canon.parse(json_input, :json)
829
- json_obj = Canon.parse_json(json_input)
830
-
831
- # Parse YAML → Ruby Hash/Array
832
- yaml_obj = Canon.parse(yaml_input, :yaml)
833
- yaml_obj = Canon.parse_yaml(yaml_input)
834
- ----
835
- ====
836
-
837
- === Comparison
838
-
839
- The `Canon::Comparison.equivalent?` method compares two HTML or XML documents.
840
-
841
- The Comparison module uses a depth-first comparison based on the two DOM trees
842
- by traversing them in parallel and comparing nodes.
843
-
844
- In XML mode:
845
-
846
- * Parsing: accepts Moxml (`Moxml::Document`) or Nokogiri
847
- (`Nokogiri::XML::Document`)
848
- * Comments: normalized and compared unless `ignore_comments: true`
849
- * Whitespace: collapses whitespace in text nodes unless `collapse_whitespace: false`
850
- * Sorts attributes alphabetically before comparison
851
-
852
- In HTML mode:
853
-
854
- * Parsing: accepts Nokogiri (`Nokogiri::HTML5` or `Nokogiri::HTML`)
855
- * Normalizes HTML comments in `<style>` and `<script>` tags
856
- * Sorts attributes alphabetically before comparison
857
- * Collapses whitespace for text content comparison
858
- * Removes empty text nodes between elements
859
-
860
- [NOTE]
861
- ====
862
- The comparison module is automatically used by Canon's RSpec matchers
863
- (`be_html_equivalent_to`, `be_xml_equivalent_to`, etc.) to provide reliable
864
- semantic comparison in tests.
865
- ====
866
-
867
-
868
- Syntax:
869
-
870
- [source,ruby]
871
- ----
872
- Canon::Comparison.equivalent?({doc1}, {doc2}, {options})
873
- ----
874
-
875
- Where,
876
-
877
- `{doc1}`:: First document object (String, Nokogiri::HTML::Document, or supported XML document)
878
- `{doc2}`:: Second document object (String, Nokogiri::HTML::Document, or supported XML document)
879
- `{options}`:: Hash of comparison options (optional)
880
-
881
- Canon::Comparison for XML supports Moxml::Document and Nokogiri::XML::Document
882
- as input.
883
-
884
- Returns:
885
-
886
- * `true` if documents are equivalent
887
- * `false` if documents differ
888
- * `Array` of differences if `verbose: true` option is set
889
-
890
- .Basic comparison examples
891
- [example]
892
- ====
893
- [source,ruby]
894
- ----
895
- require 'canon/comparison'
896
-
897
- # HTML comparison - ignores whitespace and comments by default
898
- html1 = '<div><p>Hello</p></div>'
899
- html2 = '<div> <p> Hello </p> </div>'
900
- Canon::Comparison.equivalent?(html1, html2)
901
- # => true
902
-
903
- # HTML with different content
904
- html3 = '<div><p>Goodbye</p></div>'
905
- Canon::Comparison.equivalent?(html1, html3)
906
- # => false
907
-
908
- # XML comparison
909
- xml1 = '<root><a>1</a><b>2</b></root>'
910
- xml2 = '<root> <b>2</b> <a>1</a> </root>'
911
- Canon::Comparison.equivalent?(xml1, xml2)
912
- # => true
913
-
914
- # With Nokogiri documents
915
- doc1 = Nokogiri::HTML5(html1)
916
- doc2 = Nokogiri::HTML5(html2)
917
- Canon::Comparison.equivalent?(doc1, doc2)
918
- # => true
919
- ----
920
- ====
921
-
922
- === RSpec usage
923
-
924
- ==== General
925
-
926
- RSpec matchers for testing equivalence between serialized formats. All matchers
927
- use canonical (c14n) mode for comparison.
928
-
929
- See <<Diff formatting configuration>> for details on configuring diff output
930
- in RSpec matchers.
931
-
932
- .RSpec matcher examples
933
- [example]
934
- ====
935
- [source,ruby]
936
- ----
937
- require 'rspec'
938
- require 'canon'
939
-
940
- RSpec.describe 'Serialization equivalence' do
941
- # Unified matcher with format parameter
942
- it 'compares XML' do
943
- xml1 = '<root><a>1</a><b>2</b></root>'
944
- xml2 = '<root> <b>2</b> <a>1</a> </root>'
945
- expect(xml1).to be_serialization_equivalent_to(xml2, format: :xml)
946
- end
947
-
948
- it 'compares HTML' do
949
- html1 = '<div><p>Hello</p></div>'
950
- html2 = '<div> <p> Hello </p> </div>'
951
- expect(html1).to be_serialization_equivalent_to(html2, format: :html)
952
- end
953
-
954
- it 'compares JSON' do
955
- json1 = '{"a":1,"b":2}'
956
- json2 = '{"b":2,"a":1}'
957
- expect(json1).to be_serialization_equivalent_to(json2, format: :json)
958
- end
959
-
960
- it 'compares YAML' do
961
- yaml1 = "a: 1\nb: 2"
962
- yaml2 = "b: 2\na: 1"
963
- expect(yaml1).to be_serialization_equivalent_to(yaml2, format: :yaml)
964
- end
965
-
966
- # Format-specific matchers (assumes xml1, xml2, html1, etc. are in scope, e.g. via `let`)
967
- it 'uses format-specific matchers' do
968
- expect(xml1).to be_xml_equivalent_to(xml2) # XML
969
- expect(xml1).to be_analogous_with(xml2) # XML (legacy)
970
- expect(html1).to be_html_equivalent_to(html2) # HTML
971
- expect(json1).to be_json_equivalent_to(json2) # JSON
972
- expect(yaml1).to be_yaml_equivalent_to(yaml2) # YAML
973
- end
974
- end
975
- ----
976
- ====
977
-
978
- [IMPORTANT]
979
- ====
980
- RSpec matchers always canonicalize both sides before comparing, so:
981
-
982
- * Formatting differences (whitespace, indentation) are ignored
983
- * Attribute order in XML/HTML is normalized
984
- * Key order in JSON/YAML is normalized
985
- * Tests focus on content equality, not formatting
986
- ====
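The key-order point above can be illustrated without Canon at all. The following is a minimal sketch using only Ruby's standard `json` and `yaml` libraries; the helper names `json_data_equal?` and `yaml_data_equal?` are hypothetical and are not part of Canon's API. It shows why comparing parsed data, rather than raw strings, makes key order and formatting irrelevant.

```ruby
require 'json'
require 'yaml'

# Illustration only; this does not call Canon. Parsed Hashes compare by
# content, so key order and source formatting drop out of the comparison.
def json_data_equal?(left, right)
  JSON.parse(left) == JSON.parse(right)
end

def yaml_data_equal?(left, right)
  YAML.safe_load(left) == YAML.safe_load(right)
end

puts json_data_equal?('{"a":1,"b":2}', '{ "b": 2, "a": 1 }')
puts yaml_data_equal?("a: 1\nb: 2", "b: 2\na: 1")
```

Both calls report equality because Ruby's `Hash#==` ignores insertion order.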
987
-
988
-
989
- === Usage examples
990
-
991
- ==== Using predefined profiles
992
-
993
- Use a profile for XML comparison:
994
-
995
- [source,ruby]
996
- ----
997
- expect(actual_xml).to be_xml_equivalent_to(
998
- expected_xml,
999
- match_profile: :spec_friendly
1000
- )
1001
- ----
1002
-
1003
- Use a profile for HTML comparison:
1004
-
1005
- [source,ruby]
1006
- ----
1007
- expect(actual_html).to be_html_equivalent_to(
1008
- expected_html,
1009
- match_profile: :content_only
1010
- )
1011
- ----
1012
-
1013
- ==== Using explicit match options
1014
-
1015
- Override specific dimensions:
1016
-
1017
- [source,ruby]
1018
- ----
1019
- expect(actual_xml).to be_xml_equivalent_to(
1020
- expected_xml,
1021
- match_options: {
1022
- text_content: :normalize,
1023
- structural_whitespace: :ignore,
1024
- attribute_whitespace: :strict,
1025
- comments: :ignore
1026
- }
1027
- )
1028
- ----
1029
-
1030
- ==== Combining profiles and explicit options
1031
-
1032
- Explicit options override profile settings:
1033
-
1034
- [source,ruby]
1035
- ----
1036
- expect(actual_xml).to be_xml_equivalent_to(
1037
- expected_xml,
1038
- match_profile: :spec_friendly,
1039
- match_options: {
1040
- attribute_whitespace: :strict # Override just this dimension
1041
- }
1042
- )
1043
- ----
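One way to picture "explicit options override profile settings" is a plain Hash merge. The sketch below is an assumption about the semantics, not Canon's actual resolver: the `PROFILES` constant and `resolve_match_options` method are hypothetical names, and the profile values are copied from the profile table in this document.

```ruby
# Hypothetical profile table, using the dimension values documented here.
PROFILES = {
  strict:        { text_content: :strict,    structural_whitespace: :strict,
                   attribute_whitespace: :strict,    comments: :strict },
  rendered:      { text_content: :normalize, structural_whitespace: :ignore,
                   attribute_whitespace: :normalize, comments: :ignore },
  spec_friendly: { text_content: :normalize, structural_whitespace: :ignore,
                   attribute_whitespace: :normalize, comments: :ignore },
  content_only:  { text_content: :normalize, structural_whitespace: :ignore,
                   attribute_whitespace: :ignore,    comments: :ignore }
}.freeze

# Start from the profile, then let explicit options win per dimension.
def resolve_match_options(profile, overrides = {})
  base = PROFILES.fetch(profile) do
    raise ArgumentError, "unknown profile: #{profile.inspect}"
  end
  base.merge(overrides)
end

p resolve_match_options(:spec_friendly, attribute_whitespace: :strict)
```

Only the overridden dimension changes; the other three keep the values the profile supplies.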
1044
-
1045
- ==== Global configuration
1046
-
1047
- Set a global default profile for all tests:
1048
-
1049
- [source,ruby]
1050
- ----
1051
- # In spec_helper.rb
1052
- Canon::RSpecMatchers.configure do |config|
1053
- config.xml_match_profile = :spec_friendly
1054
- config.html_match_profile = :rendered
1055
- end
1056
- ----
1057
-
1058
- Override global profile in specific tests:
1059
-
1060
- [source,ruby]
1061
- ----
1062
- # This test uses strict matching despite global spec_friendly
1063
- expect(actual_xml).to be_xml_equivalent_to(
1064
- expected_xml,
1065
- match_profile: :strict
1066
- )
1067
- ----
1068
-
1069
- == Configuration
1070
-
1071
- === Comparison options
1072
-
1073
- ==== Overview
1074
-
1075
- Canon provides a flexible matching system for XML, HTML, JSON, and YAML
1076
- comparisons.
1077
-
1078
- This system allows precise control over how whitespace and formatting
1079
- differences are handled during comparison.
1080
-
1081
- These options apply to the `Canon::Comparison.equivalent?` method, to Canon's
1082
- RSpec matchers, and to the command-line `canon diff` tool, all of which perform
1083
- semantic comparison.
1084
-
1085
- The system uses a two-phase architecture:
1086
-
1087
- * *Preprocessing phase*: What to compare (normalization, canonicalization, formatting)
1088
- * *Matching phase*: How to compare (4 dimensions × 3 behaviors)
1089
-
1090
- The system uses `match_options` and `match_profile` parameters that offer
1091
- precise control over comparison behavior.
1092
-
1093
- `ignore_attr_order`:: (default: `true`) when `true`, ignores attribute ordering
1094
- (<<ignore_attr_order>>)
1095
-
1096
- `verbose`:: (default: `false`) when `true`, returns array of differences instead
1097
- of boolean (<<verbose>>)
1098
-
1099
-
1100
- === Preprocessing phase
1101
-
1102
- The preprocessing phase determines what content is compared.
1103
-
1104
- Canon supports four preprocessing options:
1105
-
1106
- [cols="1,3"]
1107
- |===
1108
- | Option | Description
1109
-
1110
- | `:none`
1111
- | No preprocessing - compare raw content as-is
1112
-
1113
- | `:c14n`
1114
- | Apply XML Canonicalization (C14N) to normalize structure before comparison
1115
-
1116
- | `:normalize`
1117
- | Apply whitespace normalization (collapsing, trimming) before comparison
1118
-
1119
- | `:format`
1120
- | Apply format-specific pretty-printing to standardize formatting before comparison
1121
-
1122
- |===
1123
-
1124
- The preprocessing option is controlled via the `preprocessing` parameter and
1125
- defaults based on the format being compared.
1126
-
1127
- === Matching phase
1128
-
1129
- The matching phase defines how content is compared across four independent
1130
- dimensions. Each dimension can be configured with one of three mutually
1131
- exclusive behaviors.
1132
-
1133
- === Match dimensions
1134
-
1135
- The matching phase operates on four collectively exhaustive dimensions:
1136
-
1137
- [cols="1,3"]
1138
- |===
1139
- | Dimension | What it controls
1140
-
1141
- | `text_content`
1142
- | Text content within elements/values
1143
-
1144
- | `structural_whitespace`
1145
- | Whitespace between tags/elements (indentation, line breaks)
1146
-
1147
- | `attribute_whitespace`
1148
- | Whitespace within attribute values
1149
-
1150
- | `comments`
1151
- | How comments are handled
1152
- |===
1153
-
1154
- These four dimensions are collectively exhaustive - they cover all aspects of
1155
- whitespace and formatting in structured documents.
1156
-
1157
- === Match behaviors
1158
-
1159
- For each dimension, you can specify one of three mutually exclusive behaviors:
1160
-
1161
- [cols="1,3"]
1162
- |===
1163
- | Behavior | Description
1164
-
1165
- | `:strict`
1166
- | Exact character-for-character matching (including all whitespace)
1167
-
1168
- | `:normalize`
1169
- | Collapse consecutive whitespace to single spaces, trim leading/trailing whitespace
1170
-
1171
- | `:ignore`
1172
- | Don't compare this dimension at all
1173
- |===
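The three behaviors can be sketched on plain strings. This is an illustration of the descriptions in the table above, under the assumption that "whitespace" means anything matched by Ruby's `\s`; `normalize_ws` and `values_match?` are hypothetical helpers, not Canon methods.

```ruby
# Collapse runs of whitespace to a single space and trim both ends.
def normalize_ws(text)
  text.gsub(/\s+/, ' ').strip
end

# Compare two strings under one of the three documented behaviors.
def values_match?(left, right, behavior)
  case behavior
  when :strict    then left == right
  when :normalize then normalize_ws(left) == normalize_ws(right)
  when :ignore    then true # this dimension is not compared at all
  else raise ArgumentError, "unknown behavior: #{behavior.inspect}"
  end
end

a = "  text   with\tspaces \n"
b = 'text with spaces'
%i[strict normalize ignore].each do |behavior|
  puts "#{behavior}: #{values_match?(a, b, behavior)}"
end
```

With these inputs, `:strict` reports a mismatch while `:normalize` and `:ignore` report a match.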
1174
-
1175
- === Match profiles
1176
-
1177
- ==== Overview
1178
-
1179
- Canon provides a set of predefined match profiles optimized for common use cases.
1180
-
1181
- The following table shows how each profile configures the four match dimensions:
1182
-
1183
- [cols="1,1,1,1,1"]
1184
- |===
1185
- |Profile |text_content |structural_whitespace |attribute_whitespace |comments
1186
-
1187
- |`strict` |`:strict` |`:strict` |`:strict` |`:strict`
1188
-
1189
- |`rendered` |`:normalize` |`:ignore` |`:normalize` |`:ignore`
1190
-
1191
- |`spec_friendly` |`:normalize` |`:ignore` |`:normalize` |`:ignore`
1192
-
1193
- |`content_only` |`:normalize` |`:ignore` |`:ignore` |`:ignore`
1194
-
1195
- |===
1196
-
1197
- The key differences between profiles are:
1198
-
1199
- strict:: Exact matching on all dimensions - use for byte-for-byte comparison
1200
- rendered:: Mimics browser rendering - collapses text, ignores formatting and comments
1201
- spec_friendly:: Same as rendered - ideal for test specifications
1202
- content_only:: Most permissive - only compares text content, ignores all formatting and attribute whitespace
1203
-
1204
- NOTE: The `rendered` and `spec_friendly` profiles have identical configurations
1205
- but serve different semantic purposes in your codebase.
1206
-
1207
- ==== Strict profile
1208
-
1209
- The `strict` profile is the default for XML and requires exact matching:
1210
-
1211
- [source,ruby]
1212
- ----
1213
- {
1214
- text_content: :strict,
1215
- structural_whitespace: :strict,
1216
- attribute_whitespace: :strict,
1217
- comments: :strict
1218
- }
1219
- ----
1220
-
1221
- Use this when:
1222
-
1223
- * You need exact byte-for-byte comparison
1224
- * Whitespace is semantically significant
1225
- * Working with canonicalized or pre-normalized content
1226
-
1227
- ==== Rendered profile
1228
-
1229
- The `rendered` profile mimics how browsers render HTML/XML:
1230
-
1231
- [source,ruby]
1232
- ----
1233
- {
1234
- text_content: :normalize,
1235
- structural_whitespace: :ignore,
1236
- attribute_whitespace: :normalize,
1237
- comments: :ignore
1238
- }
1239
- ----
1240
-
1241
- Use this when:
1242
-
1243
- * Comparing HTML documents where rendering matters
1244
- * Whitespace between tags doesn't affect output
1245
- * Comments are documentation-only
1246
-
1247
- This is the default profile for HTML comparisons.
1248
-
1249
- ==== Spec-friendly profile
1250
-
1251
- The `spec_friendly` profile ignores all formatting differences:
1252
-
1253
- [source,ruby]
1254
- ----
1255
- {
1256
- text_content: :normalize,
1257
- structural_whitespace: :ignore,
1258
- attribute_whitespace: :normalize,
1259
- comments: :ignore
1260
- }
1261
- ----
1262
-
1263
- Use this when:
1264
-
1265
- * Writing test specifications
1266
- * Formatting/indentation style doesn't matter
1267
- * Generated vs. hand-written content comparison
1268
- * CI/CD environments with different formatters
1269
-
1270
- ==== Content-only profile
1271
-
1272
- The `content_only` profile focuses solely on actual content:
1273
-
1274
- [source,ruby]
1275
- ----
1276
- {
1277
- text_content: :normalize,
1278
- structural_whitespace: :ignore,
1279
- attribute_whitespace: :ignore,
1280
- comments: :ignore
1281
- }
1282
- ----
1283
-
1284
- Use this when:
1285
-
1286
- * Only semantic content matters
1287
- * All whitespace (including in attributes) is insignificant
1288
- * Maximum tolerance for formatting differences
1289
-
1290
-
1291
- === Format-specific defaults
1292
-
1293
- ==== General
1294
-
1295
- Different formats have different default behaviors optimized for their typical
1296
- use cases.
1297
-
1298
- ==== XML defaults
1299
-
1300
- [source,ruby]
1301
- ----
1302
- {
1303
- preprocessing: :none,
1304
- match_profile: :strict
1305
- }
1306
- ----
1307
-
1308
- XML defaults to strict matching because:
1309
-
1310
- * XML whitespace can be semantically significant
1311
- * XML is often machine-generated with consistent formatting
1312
- * Canonicalization (C14N) is available for normalization when needed
1313
-
1314
- ==== HTML defaults
1315
-
1316
- [source,ruby]
1317
- ----
1318
- {
1319
- preprocessing: :none,
1320
- match_profile: :rendered
1321
- }
1322
- ----
1323
-
1324
- HTML defaults to rendered-style matching because:
1325
-
1326
- * Browsers collapse whitespace when rendering
1327
- * Indentation and formatting are for readability only
1328
- * Comments are typically documentation
1329
-
1330
- ==== JSON defaults
1331
-
1332
- [source,ruby]
1333
- ----
1334
- {
1335
- preprocessing: :format,
1336
- match_profile: :rendered
1337
- }
1338
- ----
1339
-
1340
- JSON applies pretty-printing before comparison because:
1341
-
1342
- * JSON whitespace is never semantically significant
1343
- * Minified vs. formatted JSON should be equivalent
1344
- * Pretty-printing ensures consistent structure
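The effect of `:format` preprocessing on JSON can be approximated with the standard library. This is a sketch of the idea only, assuming that "pretty-printing" means re-serializing the parsed data; `format_json` is a hypothetical helper and does not reflect how Canon implements the step.

```ruby
require 'json'

# Re-serialize parsed JSON so that source whitespace no longer matters.
def format_json(text)
  JSON.pretty_generate(JSON.parse(text))
end

minified  = '{"a":1,"b":[1,2]}'
formatted = "{\n  \"a\": 1,\n  \"b\": [\n    1,\n    2\n  ]\n}"

puts format_json(minified) == format_json(formatted)
```

Note that re-serialization alone keeps keys in their original order, so key-order differences would still need separate handling, for example by comparing the parsed data.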
1345
-
1346
- ==== YAML defaults
1347
-
1348
- [source,ruby]
1349
- ----
1350
- {
1351
- preprocessing: :format,
1352
- match_profile: :rendered
1353
- }
1354
- ----
1355
-
1356
- YAML applies pretty-printing because:
1357
-
1358
- * YAML formatting can vary significantly
1359
- * Indentation styles differ between generators
1360
- * Content equivalence is what matters
1361
-
1362
-
1363
-
1364
- ==== Dimension-specific examples
1365
-
1366
- === Text content dimension
1367
-
1368
- The `text_content` dimension controls how text within elements is compared.
1369
-
1370
- ==== Strict behavior (exact whitespace)
1371
-
1372
- When `text_content: :strict`, all whitespace in text content must match exactly.
1373
-
1374
- .XML examples with strict text_content
1375
- [example]
1376
- The following XML strings are **not** considered equal because whitespace differs:
1377
-
1378
- [source,xml]
1379
- ----
1380
- <p> text with spaces </p>
1381
- <p>text with spaces</p>
1382
- ----
1383
-
1384
- [source,ruby]
1385
- ----
1386
- actual = "<p> text with spaces </p>"
1387
- expected = "<p>text with spaces</p>"
1388
-
1389
- expect(actual).not_to be_xml_equivalent_to(
1390
- expected,
1391
- match_options: {
1392
- text_content: :strict,
1393
- structural_whitespace: :ignore,
1394
- attribute_whitespace: :strict,
1395
- comments: :ignore
1396
- }
1397
- )
1398
- # => true (documents are NOT equivalent)
1399
- ----
1400
-
1401
- Even differences in leading/trailing whitespace matter:
1402
-
1403
- [source,xml]
1404
- ----
1405
- <item> Value </item>
1406
- <item>Value</item>
1407
- ----
1408
-
1409
- [source,ruby]
1410
- ----
1411
- xml1 = "<item> Value </item>"
1412
- xml2 = "<item>Value</item>"
1413
-
1414
- expect(xml1).not_to be_xml_equivalent_to(
1415
- xml2,
1416
- match_options: { text_content: :strict, structural_whitespace: :ignore }
1417
- )
1418
- # => true (documents are NOT equivalent)
1419
- ----
1420
-
1421
- .HTML examples with strict text_content
1422
- [example]
1423
- [source,html]
1424
- ----
1425
- <a href="/admin"> SOME TEXT </a>
1426
- <a href="/admin">SOME TEXT</a>
1427
- ----
1428
-
1429
- [source,ruby]
1430
- ----
1431
- html1 = '<a href="/admin"> SOME TEXT </a>'
1432
- html2 = '<a href="/admin">SOME TEXT</a>'
1433
-
1434
- expect(html1).not_to be_html_equivalent_to(
1435
- html2,
1436
- match_options: { text_content: :strict, structural_whitespace: :ignore }
1437
- )
1438
- # => true (documents are NOT equivalent)
1439
- ----
1440
-
1441
- ==== Normalize behavior (collapse whitespace)
1442
-
1443
- When `text_content: :normalize`, consecutive whitespace is collapsed to single spaces and leading/trailing whitespace is trimmed.
1444
-
1445
- .XML examples with normalized text_content
1446
- [example]
1447
- The following XML strings **are** considered equal:
1448
-
1449
- [source,xml]
1450
- ----
1451
- <p> text with multiple spaces </p>
1452
- <p>text with multiple spaces</p>
1453
- ----
1454
-
1455
- [source,ruby]
1456
- ----
1457
- actual = "<p> text with multiple spaces </p>"
1458
- expected = "<p>text with multiple spaces</p>"
1459
-
1460
- expect(actual).to be_xml_equivalent_to(
1461
- expected,
1462
- match_options: {
1463
- text_content: :normalize,
1464
- structural_whitespace: :ignore,
1465
- attribute_whitespace: :strict,
1466
- comments: :ignore
1467
- }
1468
- )
1469
- # => true (documents are equivalent)
1470
- ----
1471
-
1472
- Tabs and newlines are also normalized:
1473
-
1474
- [source,xml]
1475
- ----
1476
- <description>
1477
- This is a
1478
- multi-line
1479
- description
1480
- </description>
1481
-
1482
- <description>This is a multi-line description</description>
1483
- ----
1484
-
1485
- [source,ruby]
1486
- ----
1487
- xml1 = <<~XML
1488
- <description>
1489
- This is a
1490
- multi-line
1491
- description
1492
- </description>
1493
- XML
1494
-
1495
- xml2 = "<description>This is a multi-line description</description>"
1496
-
1497
- expect(xml1).to be_xml_equivalent_to(
1498
- xml2,
1499
- match_options: { text_content: :normalize, structural_whitespace: :ignore }
1500
- )
1501
- # => true (documents are equivalent)
1502
- ----
1503
-
1504
- .HTML examples with normalized text_content
1505
- [example]
1506
- [source,html]
1507
- ----
1508
- <a href="/admin"> SOME TEXT CONTENT </a>
1509
- <a href="/admin">SOME TEXT CONTENT</a>
1510
- ----
1511
-
1512
- [source,ruby]
1513
- ----
1514
- html1 = '<a href="/admin"> SOME TEXT CONTENT </a>'
1515
- html2 = '<a href="/admin">SOME TEXT CONTENT</a>'
1516
-
1517
- expect(html1).to be_html_equivalent_to(
1518
- html2,
1519
- match_options: { text_content: :normalize, structural_whitespace: :ignore }
1520
- )
1521
- # => true (documents are equivalent)
1522
- ----
1523
-
1524
- Multi-line HTML text:
1525
-
1526
- [source,html]
1527
- ----
1528
- <p>
1529
- This is a paragraph
1530
- with multiple lines
1531
- of text.
1532
- </p>
1533
-
1534
- <p>This is a paragraph with multiple lines of text.</p>
1535
- ----
1536
-
1537
- [source,ruby]
1538
- ----
1539
- html1 = <<~HTML
1540
- <p>
1541
- This is a paragraph
1542
- with multiple lines
1543
- of text.
1544
- </p>
1545
- HTML
1546
-
1547
- html2 = "<p>This is a paragraph with multiple lines of text.</p>"
1548
-
1549
- expect(html1).to be_html_equivalent_to(
1550
- html2,
1551
- match_options: { text_content: :normalize, structural_whitespace: :ignore }
1552
- )
1553
- # => true (documents are equivalent)
1554
- ----
1555
-
1556
- === Structural whitespace dimension
1557
-
1558
- The `structural_whitespace` dimension controls whitespace between tags (indentation, line breaks, formatting).
1559
-
1560
- ==== Strict behavior
1561
-
1562
- When `structural_whitespace: :strict`, all whitespace between tags must match exactly, including indentation and line breaks.
1563
-
1564
- .XML examples with strict structural_whitespace
1565
- [example]
1566
- These documents are **not** equivalent due to different indentation:
1567
-
1568
- [source,xml]
1569
- ----
1570
- <root>
1571
-   <item>Value</item>
1572
- </root>
1573
-
1574
- <root>
1575
-     <item>Value</item>
1576
- </root>
1577
- ----
1578
-
1579
- [source,ruby]
1580
- ----
1581
- xml1 = "<root>\n  <item>Value</item>\n</root>"
1582
- xml2 = "<root>\n    <item>Value</item>\n</root>"
1583
-
1584
- expect(xml1).not_to be_xml_equivalent_to(
1585
- xml2,
1586
- match_options: {
1587
- text_content: :normalize,
1588
- structural_whitespace: :strict,
1589
- attribute_whitespace: :strict,
1590
- comments: :ignore
1591
- }
1592
- )
1593
- # => true (documents are NOT equivalent - indentation differs)
1594
- ----
1595
-
1596
- ==== Ignore behavior (formatting doesn't matter)
1597
-
1598
- When `structural_whitespace: :ignore`, all whitespace between tags is ignored, making pretty-printed and compact formats equivalent.
1599
-
1600
- .XML examples with ignored structural_whitespace
1601
- [example]
1602
- Pretty-printed vs compact XML **are** considered equal:
1603
-
1604
- [source,xml]
1605
- ----
1606
- <!-- Pretty-printed with indentation -->
1607
- <root>
1608
- <a>
1609
- <b>text</b>
1610
- </a>
1611
- </root>
1612
-
1613
- <!-- Compact on one line -->
1614
- <root><a><b>text</b></a></root>
1615
- ----
1616
-
1617
- [source,ruby]
1618
- ----
1619
- compact = "<root><a><b>text</b></a></root>"
1620
- formatted = <<~XML
1621
- <root>
1622
- <a>
1623
- <b>text</b>
1624
- </a>
1625
- </root>
1626
- XML
1627
-
1628
- expect(compact).to be_xml_equivalent_to(
1629
- formatted,
1630
- match_options: {
1631
- text_content: :normalize,
1632
- structural_whitespace: :ignore,
1633
- attribute_whitespace: :strict,
1634
- comments: :ignore
1635
- }
1636
- )
1637
- # => true (documents are equivalent)
1638
- ----
1639
-
1640
- Complex nested structures with different indentation:
1641
-
1642
- [source,xml]
1643
- ----
1644
- <!-- 2-space indentation -->
1645
- <document>
1646
-   <metadata>
1647
-     <title>My Document</title>
1648
-     <author>
1649
-       <name>John Doe</name>
1650
-     </author>
1651
-   </metadata>
1652
- </document>
1653
-
1654
- <!-- 4-space indentation -->
1655
- <document>
1656
-     <metadata>
1657
-         <title>My Document</title>
1658
-         <author>
1659
-             <name>John Doe</name>
1660
-         </author>
1661
-     </metadata>
1662
- </document>
1663
-
1664
- <!-- Compact -->
1665
- <document><metadata><title>My Document</title><author><name>John Doe</name></author></metadata></document>
1666
- ----
1667
-
1668
- [source,ruby]
1669
- ----
1670
- two_spaces = <<~XML
1671
- <document>
1672
-   <metadata>
1673
-     <title>My Document</title>
1674
-     <author>
1675
-       <name>John Doe</name>
1676
-     </author>
1677
-   </metadata>
1678
- </document>
1679
- XML
1680
-
1681
- four_spaces = "<document>\n    <metadata>\n        <title>My Document</title>\n        <author>\n            <name>John Doe</name>\n        </author>\n    </metadata>\n</document>"
1682
-
1683
- compact = "<document><metadata><title>My Document</title><author><name>John Doe</name></author></metadata></document>"
1684
-
1685
- expect(two_spaces).to be_xml_equivalent_to(
1686
- four_spaces,
1687
- match_options: { structural_whitespace: :ignore }
1688
- )
1689
- # => true
1690
-
1691
- expect(two_spaces).to be_xml_equivalent_to(
1692
- compact,
1693
- match_options: { structural_whitespace: :ignore }
1694
- )
1695
- # => true
1696
- ----
1697
-
1698
- .HTML examples with ignored structural_whitespace
1699
- [example]
1700
- [source,html]
1701
- ----
1702
- <!-- Pretty-printed -->
1703
- <div class="container">
1704
- <header>
1705
- <h1>Welcome</h1>
1706
- <p>Introduction text</p>
1707
- </header>
1708
- </div>
1709
-
1710
- <!-- Compact -->
1711
- <div class="container"><header><h1>Welcome</h1><p>Introduction text</p></header></div>
1712
- ----
1713
-
1714
- [source,ruby]
1715
- ----
1716
- pretty_html = <<~HTML
1717
- <div class="container">
1718
- <header>
1719
- <h1>Welcome</h1>
1720
- <p>Introduction text</p>
1721
- </header>
1722
- </div>
1723
- HTML
1724
-
1725
- compact_html = '<div class="container"><header><h1>Welcome</h1><p>Introduction text</p></header></div>'
1726
-
1727
- expect(pretty_html).to be_html_equivalent_to(
1728
- compact_html,
1729
- match_options: { structural_whitespace: :ignore }
1730
- )
1731
- # => true (documents are equivalent)
1732
- ----
1733
-
1734
- ==== Normalize behavior
1735
-
1736
- When `structural_whitespace: :normalize`, whitespace between tags is collapsed to single spaces.
1737
-
1738
- .XML examples with normalized structural_whitespace
1739
- [example]
1740
- [source,xml]
1741
- ----
1742
- <root>
1743
-
1744
-
1745
- <item>Value</item>
1746
-
1747
-
1748
- </root>
1749
-
1750
- <root> <item>Value</item> </root>
1751
- ----
1752
-
1753
- [source,ruby]
1754
- ----
1755
- xml1 = "<root>\n\n\n <item>Value</item>\n\n\n</root>"
1756
- xml2 = "<root> <item>Value</item> </root>"
1757
-
1758
- expect(xml1).to be_xml_equivalent_to(
1759
- xml2,
1760
- match_options: { structural_whitespace: :normalize }
1761
- )
1762
- # => true (documents are equivalent - whitespace normalized)
1763
- ----
1764
-
1765
- === Attribute whitespace dimension
1766
-
1767
- The `attribute_whitespace` dimension controls whitespace within attribute values.
1768
-
1769
- ==== Strict behavior (exact attribute whitespace)
1770
-
1771
- When `attribute_whitespace: :strict`, whitespace in attribute values must match exactly.
1772
-
1773
- .XML examples with strict attribute_whitespace
1774
- [example]
1775
- These documents are **not** equivalent due to attribute whitespace differences:
1776
-
1777
- [source,xml]
1778
- ----
1779
- <div class=" foo bar ">text</div>
1780
- <div class="foo bar">text</div>
1781
- ----
1782
-
1783
- [source,ruby]
1784
- ----
1785
- actual = '<div class=" foo bar ">text</div>'
1786
- expected = '<div class="foo bar">text</div>'
1787
-
1788
- expect(actual).not_to be_xml_equivalent_to(
1789
- expected,
1790
- match_options: {
1791
- text_content: :normalize,
1792
- structural_whitespace: :ignore,
1793
- attribute_whitespace: :strict,
1794
- comments: :ignore
1795
- }
1796
- )
1797
- # => true (documents are NOT equivalent)
1798
- ----
1799
-
1800
- Leading/trailing whitespace in attributes:
1801
-
1802
- [source,xml]
1803
- ----
1804
- <item id=" 123 " name=" Widget "/>
1805
- <item id="123" name="Widget"/>
1806
- ----
1807
-
1808
- [source,ruby]
1809
- ----
1810
- xml1 = '<item id=" 123 " name=" Widget "/>'
1811
- xml2 = '<item id="123" name="Widget"/>'
1812
-
1813
- expect(xml1).not_to be_xml_equivalent_to(
1814
- xml2,
1815
- match_options: { attribute_whitespace: :strict }
1816
- )
1817
- # => true (documents are NOT equivalent)
1818
- ----
1819
-
1820
- .HTML examples with strict attribute_whitespace
1821
- [example]
1822
- [source,html]
1823
- ----
1824
- <a href="/admin" class=" button primary ">Link</a>
1825
- <a href="/admin" class="button primary">Link</a>
1826
- ----
1827
-
1828
- [source,ruby]
1829
- ----
1830
- html1 = '<a href="/admin" class=" button primary ">Link</a>'
1831
- html2 = '<a href="/admin" class="button primary">Link</a>'
1832
-
1833
- expect(html1).not_to be_html_equivalent_to(
1834
- html2,
1835
- match_options: { attribute_whitespace: :strict }
1836
- )
1837
- # => true (documents are NOT equivalent)
1838
- ----
1839
-
1840
- ==== Normalize behavior (collapse attribute whitespace)
1841
-
1842
- When `attribute_whitespace: :normalize`, whitespace in attribute values is collapsed and trimmed.
1843
-
1844
- .XML examples with normalized attribute_whitespace
1845
- [example]
1846
- These documents **are** considered equal:
1847
-
1848
- [source,xml]
1849
- ----
1850
- <div class=" foo bar ">text</div>
1851
- <div class="foo bar">text</div>
1852
- ----
1853
-
1854
- [source,ruby]
1855
- ----
1856
- actual = '<div class=" foo bar ">text</div>'
1857
- expected = '<div class="foo bar">text</div>'
1858
-
1859
- expect(actual).to be_xml_equivalent_to(
1860
- expected,
1861
- match_options: {
1862
- text_content: :normalize,
1863
- structural_whitespace: :ignore,
1864
- attribute_whitespace: :normalize,
1865
- comments: :ignore
1866
- }
1867
- )
1868
- # => true (documents are equivalent)
1869
- ----
1870
-
1871
- Multiple attributes with whitespace:
1872
-
1873
- [source,xml]
1874
- ----
1875
- <item id=" 123 " name=" Widget " category=" tools "/>
1876
- <item id="123" name="Widget" category="tools"/>
1877
- ----
1878
-
1879
- [source,ruby]
1880
- ----
1881
- xml1 = '<item id=" 123 " name=" Widget " category=" tools "/>'
1882
- xml2 = '<item id="123" name="Widget" category="tools"/>'
1883
-
1884
- expect(xml1).to be_xml_equivalent_to(
1885
- xml2,
1886
- match_options: { attribute_whitespace: :normalize }
1887
- )
1888
- # => true (documents are equivalent)
1889
- ----
1890
-
1891
- .HTML examples with normalized attribute_whitespace
1892
- [example]
1893
- [source,html]
1894
- ----
1895
- <a href="/admin" class=" button primary " id=" main-link ">Link</a>
1896
- <a href="/admin" class="button primary" id="main-link">Link</a>
1897
- ----
1898
-
1899
- [source,ruby]
1900
- ----
1901
- html1 = '<a href="/admin" class=" button primary " id=" main-link ">Link</a>'
1902
- html2 = '<a href="/admin" class="button primary" id="main-link">Link</a>'
1903
-
1904
- expect(html1).to be_html_equivalent_to(
1905
- html2,
1906
- match_options: { attribute_whitespace: :normalize }
1907
- )
1908
- # => true (documents are equivalent)
1909
- ----
1910
-
1911
- ==== Ignore behavior
1912
-
1913
- When `attribute_whitespace: :ignore`, attribute values are not compared at all (only attribute names are checked).
1914
-
1915
- .Example with ignored attribute_whitespace
1916
- [example]
1917
- [source,ruby]
1918
- ----
1919
- xml1 = '<item class="foo">text</item>'
1920
- xml2 = '<item class="completely different">text</item>'
1921
-
1922
- expect(xml1).to be_xml_equivalent_to(
1923
- xml2,
1924
- match_options: { attribute_whitespace: :ignore }
1925
- )
1926
- # => true (attribute values are not compared)
1927
- ----
1928
-
1929
- === Comments dimension
1930
-
1931
- The `comments` dimension controls how XML/HTML comments are compared.
1932
-
1933
- ==== Strict behavior
1934
-
1935
- When `comments: :strict`, comments must match exactly, including their content and position.
1936
-
1937
- .XML examples with strict comments
1938
- [example]
1939
- These documents are **not** equivalent due to different comments:
1940
-
1941
- [source,xml]
1942
- ----
1943
- <root><!-- First comment --><a>text</a></root>
1944
- <root><!-- Different comment --><a>text</a></root>
1945
- ----
1946
-
1947
- [source,ruby]
1948
- ----
1949
- xml1 = "<root><!-- First comment --><a>text</a></root>"
1950
- xml2 = "<root><!-- Different comment --><a>text</a></root>"
1951
-
1952
- expect(xml1).not_to be_xml_equivalent_to(
1953
- xml2,
1954
- match_options: { comments: :strict }
1955
- )
1956
- # => true (documents are NOT equivalent - comments differ)
1957
- ----
1958
-
1959
- ==== Ignore behavior (comments don't affect comparison)
1960
-
1961
- When `comments: :ignore`, comments are completely ignored during comparison.
1962
-
1963
- .XML examples with ignored comments
1964
- [example]
1965
- These documents **are** considered equal despite different comments:
1966
-
1967
- [source,xml]
1968
- ----
1969
- <root><!-- comment --><a>text</a></root>
1970
- <root><!-- different --><a>text</a></root>
1971
- <root><a>text</a></root>
1972
- ----
1973
-
1974
- [source,ruby]
1975
- ----
1976
- with_comment = "<root><!-- comment --><a>text</a></root>"
1977
- different_comment = "<root><!-- different --><a>text</a></root>"
1978
- no_comment = "<root><a>text</a></root>"
1979
-
1980
- expect(with_comment).to be_xml_equivalent_to(
1981
- different_comment,
1982
- match_options: {
1983
- text_content: :normalize,
1984
- structural_whitespace: :ignore,
1985
- attribute_whitespace: :strict,
1986
- comments: :ignore
1987
- }
1988
- )
1989
- # => true (documents are equivalent - comments ignored)
1990
-
1991
- expect(with_comment).to be_xml_equivalent_to(
1992
- no_comment,
1993
- match_options: {
1994
- text_content: :normalize,
1995
- structural_whitespace: :ignore,
1996
- attribute_whitespace: :strict,
1997
- comments: :ignore
1998
- }
1999
- )
2000
- # => true (documents are equivalent - comments ignored)
2001
- ----
2002
-
2003
- Complex document with multiple comments:
2004
-
2005
- [source,xml]
2006
- ----
2007
- <!-- Document header -->
2008
- <document>
2009
- <!-- Metadata section -->
2010
- <metadata>
2011
- <title>My Document</title>
2012
- <!-- Author information -->
2013
- <author>John Doe</author>
2014
- </metadata>
2015
- <!-- Main content -->
2016
- <content>
2017
- <p>Text</p>
2018
- </content>
2019
- </document>
2020
-
2021
- <document>
2022
- <metadata>
2023
- <title>My Document</title>
2024
- <author>John Doe</author>
2025
- </metadata>
2026
- <content>
2027
- <p>Text</p>
2028
- </content>
2029
- </document>
2030
- ----
2031
-
2032
- [source,ruby]
2033
- ----
2034
- with_comments = <<~XML
2035
- <!-- Document header -->
2036
- <document>
2037
- <!-- Metadata section -->
2038
- <metadata>
2039
- <title>My Document</title>
2040
- <!-- Author information -->
2041
- <author>John Doe</author>
2042
- </metadata>
2043
- <!-- Main content -->
2044
- <content>
2045
- <p>Text</p>
2046
- </content>
2047
- </document>
2048
- XML
2049
-
2050
- without_comments = <<~XML
2051
- <document>
2052
- <metadata>
2053
- <title>My Document</title>
2054
- <author>John Doe</author>
2055
- </metadata>
2056
- <content>
2057
- <p>Text</p>
2058
- </content>
2059
- </document>
2060
- XML
2061
-
2062
- expect(with_comments).to be_xml_equivalent_to(
2063
- without_comments,
2064
- match_options: { comments: :ignore }
2065
- )
2066
- # => true (documents are equivalent)
2067
- ----
2068
-
2069
- .HTML examples with ignored comments
2070
- [example]
2071
- [source,html]
2072
- ----
2073
- <!-- Navigation -->
2074
- <nav>
2075
- <!-- Primary menu -->
2076
- <ul>
2077
- <li>Home</li>
2078
- </ul>
2079
- </nav>
2080
-
2081
- <nav>
2082
- <ul>
2083
- <li>Home</li>
2084
- </ul>
2085
- </nav>
2086
- ----
2087
-
2088
- [source,ruby]
2089
- ----
2090
- html_with_comments = <<~HTML
2091
- <!-- Navigation -->
2092
- <nav>
2093
- <!-- Primary menu -->
2094
- <ul>
2095
- <li>Home</li>
2096
- </ul>
2097
- </nav>
2098
- HTML
2099
-
2100
- html_without_comments = <<~HTML
2101
- <nav>
2102
- <ul>
2103
- <li>Home</li>
2104
- </ul>
2105
- </nav>
2106
- HTML
2107
-
2108
- expect(html_with_comments).to be_html_equivalent_to(
2109
- html_without_comments,
2110
- match_options: { comments: :ignore }
2111
- )
2112
- # => true (documents are equivalent)
2113
- ----
2114
-
2115
- ==== Normalize behavior
2116
-
2117
- When `comments: :normalize`, comment content is trimmed and whitespace is collapsed before comparison.
2118
-
2119
- .Example with normalized comments
2120
- [example]
2121
- [source,ruby]
2122
- ----
2123
- xml1 = "<root><!-- comment with spaces --><a>text</a></root>"
2124
- xml2 = "<root><!-- comment with spaces --><a>text</a></root>"
2125
-
2126
- expect(xml1).to be_xml_equivalent_to(
2127
- xml2,
2128
- match_options: { comments: :normalize }
2129
- )
2130
- # => true (comments are normalized before comparison)
2131
- ----
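As a rough illustration of what "trimmed and whitespace collapsed" means here, the plain-Ruby sketch below normalizes comment text before comparing it. The helper name `normalize_comment` is hypothetical and not part of Canon's API; Canon's own implementation may differ.

```ruby
# Hypothetical helper, not part of Canon: trims a comment's text and
# collapses every run of whitespace into a single space.
def normalize_comment(text)
  text.strip.gsub(/\s+/, " ")
end

a = normalize_comment("  comment   with    spaces  ")
b = normalize_comment("comment with spaces")

puts a == b # prints "true"
```

Under this assumption, two comments that differ only in leading, trailing, or repeated internal whitespace compare as equal.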
2132
-
2133
- ==== Precedence resolution
2134
-
2135
- When multiple configuration sources are present, Canon resolves them in this order (highest to lowest precedence):
2136
-
2137
- . Explicit `match_options` hash in the test
2138
- . Named `match_profile` in the test
2139
- . Global format-specific profile (e.g., `xml_match_profile`)
2140
- . Format-specific defaults (e.g., XML → strict, HTML → rendered)
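One way to picture the precedence order is as a chain of hash merges, from lowest to highest precedence. The sketch below assumes each source can be represented as a hash of dimension settings; the hash contents are hypothetical placeholders, not Canon's real profile definitions, and Canon's resolver may work differently.

```ruby
# Hypothetical stand-ins for each configuration source; the values are
# illustrative only and are not Canon's real profile definitions.
format_defaults = { text_content: :strict, comments: :strict, attribute_whitespace: :strict }
global_profile  = { text_content: :normalize, comments: :ignore, attribute_whitespace: :normalize }
named_profile   = {} # no per-test profile given in this example
explicit_opts   = { attribute_whitespace: :strict }

# Later merges win, so sources are merged from lowest to highest precedence.
resolved = format_defaults
  .merge(global_profile)
  .merge(named_profile)
  .merge(explicit_opts)

p resolved
```

In this model the explicit option wins for `attribute_whitespace`, while the remaining dimensions fall through to the global profile.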
2141
-
2142
- .Example of precedence resolution
2143
- ====
2144
- [source,ruby]
2145
- ----
2146
- # Global configuration
2147
- Canon::RSpecMatchers.configure do |config|
2148
- config.xml_match_profile = :spec_friendly
2149
- end
2150
-
2151
- # This uses strict for attribute_whitespace (explicit option)
2152
- # and spec_friendly for other dimensions (global profile)
2153
- expect(actual).to be_xml_equivalent_to(
2154
- expected,
2155
- match_options: {
2156
- attribute_whitespace: :strict
2157
- }
2158
- )
2159
- ----
2160
- ====
2161
-
2162
-
2163
- [[ignore_attr_order]]
2164
- ==== ignore_attr_order
2165
-
2166
- `ignore_attr_order: {true|false}` default: `true`
2167
-
2168
- When `true`, all attributes are sorted before comparison and only attributes of
2169
- the same type are compared.
2170
-
2171
- Usage:
2172
-
2173
- [source,ruby]
2174
- ----
2175
- Canon::Comparison.equivalent?(doc1, doc2, ignore_attr_order: true)
2176
- ----
2177
-
2178
- .HTML examples with ignore_attr_order
2179
- [example]
2180
- ====
2181
- When `true` the following HTML strings are considered equal:
2182
-
2183
- [source,html]
2184
- ----
2185
- <a href="/admin" class="button" target="_blank">Link</a>
2186
- <a class="button" target="_blank" href="/admin">Link</a>
2187
- ----
2188
-
2189
- [source,ruby]
2190
- ----
2191
- html1 = '<a href="/admin" class="button" target="_blank">Link</a>'
2192
- html2 = '<a class="button" target="_blank" href="/admin">Link</a>'
2193
- Canon::Comparison.equivalent?(html1, html2, ignore_attr_order: true)
2194
- # => true
2195
- ----
2196
-
2197
- When `false`, attributes are compared in document order, so the following strings are not considered equal:
2198
-
2199
- [source,ruby]
2200
- ----
2201
- html1 = '<a href="/admin" class="button">Link</a>'
2202
- html2 = '<a class="button" href="/admin">Link</a>'
2203
- Canon::Comparison.equivalent?(html1, html2, ignore_attr_order: false)
2204
- # => false
2205
- ----
2206
- ====
2207
-
2208
- .XML examples with ignore_attr_order
2209
- [example]
2210
- ====
2211
- When `true` the following XML strings are considered equal:
2212
-
2213
- [source,xml]
2214
- ----
2215
- <item id="1" name="Widget" price="10.00"/>
2216
- <item price="10.00" id="1" name="Widget"/>
2217
- ----
2218
-
2219
- [source,ruby]
2220
- ----
2221
- xml1 = '<item id="1" name="Widget" price="10.00"/>'
2222
- xml2 = '<item price="10.00" id="1" name="Widget"/>'
2223
- Canon::Comparison.equivalent?(xml1, xml2, ignore_attr_order: true)
2224
- # => true
2225
- ----
2226
- ====
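The effect of sorting attributes before comparison can be sketched in plain Ruby. Attributes are modeled here as ordered `[name, value]` pairs; this model and the helper `attrs_equal?` are assumptions for illustration and do not reflect Canon's internal representation.

```ruby
# Attributes modeled as ordered [name, value] pairs; this is an
# illustrative model, not Canon's internal representation.
attrs1 = [["id", "1"], ["name", "Widget"], ["price", "10.00"]]
attrs2 = [["price", "10.00"], ["id", "1"], ["name", "Widget"]]

# Hypothetical helper: sort both lists first when order should be ignored,
# otherwise compare them position by position.
def attrs_equal?(a, b, ignore_attr_order: true)
  ignore_attr_order ? a.sort == b.sort : a == b
end

puts attrs_equal?(attrs1, attrs2)                           # prints "true"
puts attrs_equal?(attrs1, attrs2, ignore_attr_order: false) # prints "false"
```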
2227
-
2228
-
2229
- [[verbose]]
2230
- ==== verbose
2231
-
2232
- `verbose: {true|false}` default: `false`
2233
-
2234
- When `true`, `Canon::Comparison.equivalent?` returns an array of all differences
2234
- found during the comparison instead of a boolean value.
2236
-
2237
- WARNING: When `true`, the comparison takes longer, for two reasons: more
2237
- processing is required to produce meaningful differences, and the comparison
2238
- does **NOT** stop at the first difference, because the goal is to capture as
2239
- many differences as possible.
2241
-
2242
- Usage:
2243
-
2244
- [source,ruby]
2245
- ----
2246
- Canon::Comparison.equivalent?(doc1, doc2, verbose: true)
2247
- ----
2248
-
2249
- Return values in verbose mode:
2250
-
2251
- * Empty array `[]` if documents are equivalent
2252
- * Array of difference hashes if documents differ
2253
-
2254
- Each difference hash contains:
2255
-
2256
- `node1`:: The first node involved in the difference
2257
- `node2`:: The second node involved in the difference
2258
- `diff1`:: Difference code for the first node
2259
- `diff2`:: Difference code for the second node
2260
-
2261
- Difference codes:
2262
-
2263
- * `Canon::Comparison::EQUIVALENT` (1) - Nodes are equivalent
2264
- * `Canon::Comparison::MISSING_ATTRIBUTE` (2) - Attribute missing
2265
- * `Canon::Comparison::MISSING_NODE` (3) - Node missing
2266
- * `Canon::Comparison::UNEQUAL_ATTRIBUTES` (4) - Attributes differ
2267
- * `Canon::Comparison::UNEQUAL_COMMENTS` (5) - Comments differ
2268
- * `Canon::Comparison::UNEQUAL_ELEMENTS` (7) - Element names differ
2269
- * `Canon::Comparison::UNEQUAL_NODES_TYPES` (8) - Node types differ
2270
- * `Canon::Comparison::UNEQUAL_TEXT_CONTENTS` (9) - Text content differs
2271
-
2272
- .Verbose mode examples
2273
- [example]
2274
- ====
2275
- [source,ruby]
2276
- ----
2277
- # Verbose mode with equivalent documents
2278
- html1 = '<div>Hello</div>'
2279
- html2 = '<div>Hello</div>'
2280
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
2281
- # => [] (empty array indicates equivalence)
2282
-
2283
- # Verbose mode with different text content
2284
- html1 = '<div>Hello</div>'
2285
- html2 = '<div>Goodbye</div>'
2286
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
2287
- # => [{
2288
- # node1: <Nokogiri::XML::Text>,
2289
- # node2: <Nokogiri::XML::Text>,
2290
- # diff1: 9, # UNEQUAL_TEXT_CONTENTS
2291
- # diff2: 9 # UNEQUAL_TEXT_CONTENTS
2292
- # }]
2293
-
2294
- # Verbose mode with different element names
2295
- html1 = '<div>Test</div>'
2296
- html2 = '<span>Test</span>'
2297
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
2298
- # => [{
2299
- # node1: <Nokogiri::XML::Element: div>,
2300
- # node2: <Nokogiri::XML::Element: span>,
2301
- # diff1: 7, # UNEQUAL_ELEMENTS
2302
- # diff2: 7 # UNEQUAL_ELEMENTS
2303
- # }]
2304
-
2305
- # Verbose mode with missing attributes
2306
- html1 = '<div class="foo" id="bar">Test</div>'
2307
- html2 = '<div class="foo">Test</div>'
2308
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
2309
- # => [{
2310
- # node1: <Nokogiri::XML::Element: div>,
2311
- # node2: <Nokogiri::XML::Element: div>,
2312
- # diff1: 2, # MISSING_ATTRIBUTE
2313
- # diff2: 2 # MISSING_ATTRIBUTE
2314
- # }]
2315
-
2316
- # Check difference type programmatically
2317
- result = Canon::Comparison.equivalent?(html1, html2, verbose: true)
2318
- if result.empty?
2319
- puts "Documents are equivalent"
2320
- else
2321
- result.each do |diff|
2322
- case diff[:diff1]
2323
- when Canon::Comparison::UNEQUAL_TEXT_CONTENTS
2324
- puts "Text content differs"
2325
- when Canon::Comparison::UNEQUAL_ELEMENTS
2326
- puts "Element names differ"
2327
- when Canon::Comparison::MISSING_ATTRIBUTE
2328
- puts "Attribute missing"
2329
- end
2330
- end
2331
- end
2332
- ----
2333
- ====
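When a verbose result contains many entries, a per-type summary can be easier to read than the raw array. The sketch below uses the numeric codes listed in this section; the label table and the sample result are hand-written stand-ins, not output produced by Canon.

```ruby
# Numeric codes as listed in this section; the label table and the sample
# result below are illustrative, not produced by Canon.
DIFF_LABELS = {
  2 => "missing attribute",
  3 => "missing node",
  4 => "unequal attributes",
  5 => "unequal comments",
  7 => "unequal elements",
  8 => "unequal node types",
  9 => "unequal text contents"
}.freeze

# Hand-written stand-in for a verbose result (node objects omitted).
sample_result = [
  { diff1: 9, diff2: 9 },
  { diff1: 7, diff2: 7 },
  { diff1: 9, diff2: 9 }
]

# Translate each code to a label and count occurrences per label.
summary = sample_result
  .map { |diff| DIFF_LABELS.fetch(diff[:diff1], "unknown") }
  .tally

summary.each { |label, count| puts "#{label}: #{count}" }
```

`Enumerable#tally` requires Ruby 2.7 or later.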
2334
-
2335
- === Input validation
2336
-
2337
- Canon provides comprehensive input validation for all supported formats (XML,
2338
- HTML, JSON, YAML). When malformed input is detected, Canon raises a
2339
- `Canon::ValidationError` with detailed location information to help you quickly
2340
- identify and fix the problem.
2341
-
2342
- ==== Purpose
2343
-
2344
- Input validation ensures that:
2345
-
2346
- * Malformed documents are detected early with clear error messages
2347
- * Syntax errors show exact line and column numbers
2348
- * Error details appear in RSpec test output (not hidden in log files)
2349
- * Users receive actionable feedback about what's wrong and where
2350
-
2351
- ==== How it works
2352
-
2353
- Canon validates input **before parsing** using format-specific validators:
2354
-
2355
- * `Canon::Validators::XmlValidator` - Strict XML syntax validation
2356
- * `Canon::Validators::HtmlValidator` - HTML5 and XHTML validation
2357
- * `Canon::Validators::JsonValidator` - JSON syntax validation
2358
- * `Canon::Validators::YamlValidator` - YAML syntax validation
2359
-
2360
- Validation happens automatically when you use Canon's formatters or comparison
2361
- methods.
2362
-
2363
- ==== Validation error format
2364
-
2365
- When validation fails, Canon raises `Canon::ValidationError` with:
2366
-
2367
- * `format` - The format being validated (`:xml`, `:html`, `:json`, `:yaml`)
2368
- * `line` - Line number where the error occurred (if available)
2369
- * `column` - Column number where the error occurred (if available)
2370
- * `details` - Additional context about the error
2371
-
2372
- .Validation error example
2373
- [example]
2374
- [source,ruby]
2375
- ----
2376
- require 'canon'
2377
-
2378
- malformed_xml = '<root><unclosed>'
2379
-
2380
- begin
2381
- Canon.format(malformed_xml, :xml)
2382
- rescue Canon::ValidationError => e
2383
- puts e.message
2384
- # XML Validation Error: Premature end of data in tag unclosed line 1
2385
- # Line: 1
2386
- # Column: 18
2387
-
2388
- puts "Format: #{e.format}" # => :xml
2389
- puts "Line: #{e.line}" # => 1
2390
- puts "Column: #{e.column}" # => 18
2391
- end
2392
- ----
2393
-
2394
- ==== Format-specific validation
2395
-
2396
- ===== XML validation
2397
-
2398
- Uses Nokogiri's strict XML parsing to detect:
2399
-
2400
- * Unclosed tags
2401
- * Mismatched tags
2402
- * Invalid XML declaration
2403
- * Malformed attributes
2404
- * Invalid character references
2405
-
2406
- .XML validation examples
2407
- [example]
2408
- [source,ruby]
2409
- ----
2410
- # Unclosed tag
2411
- Canon.format('<root><item>', :xml)
2412
- # => Canon::ValidationError: XML Validation Error: Premature end of data in tag item line 1
2413
- # Line: 1
2414
-
2415
- # Mismatched tags
2416
- Canon.format('<root><item></root>', :xml)
2417
- # => Canon::ValidationError: XML Validation Error: Opening and ending tag mismatch: item line 1 and root
2418
- # Line: 1
2419
- ----
2420
-
2421
- ===== HTML validation
2422
-
2423
- Automatically detects HTML5 vs XHTML and applies appropriate validation:
2424
-
2425
- * HTML5: Uses Nokogiri::HTML5 parser with error filtering
2426
- * XHTML: Uses strict XML parsing
2427
-
2428
- Special handling:
2429
-
2430
- * Strips XML declarations from HTML (common in legacy HTML files)
2431
- * Filters out non-critical HTML5 parser warnings
2432
- * Only reports significant errors (level 2+)
2433
-
2434
- .HTML validation examples
2435
- [example]
2436
- [source,ruby]
2437
- ----
2438
- # Malformed XHTML
2439
- xhtml = '<html xmlns="http://www.w3.org/1999/xhtml"><body><p>Unclosed'
2440
- Canon.format(xhtml, :html)
2441
- # => Canon::ValidationError: HTML Validation Error: Premature end of data in tag p line 1
2442
- # Line: 1
2443
-
2444
- # HTML5 with errors
2445
- html5 = '<div><span></div>'
2446
- Canon.format(html5, :html)
2447
- # => Canon::ValidationError: HTML Validation Error: Unexpected end tag : span
2448
- # Line: 1
2449
- ----
2450
-
2451
- ===== JSON validation
2452
-
2453
- Validates JSON syntax using Ruby's JSON parser:
2454
-
2455
- * Missing/extra braces or brackets
2456
- * Trailing commas
2457
- * Invalid escape sequences
2458
- * Invalid numbers
2459
-
2460
- Provides context showing the error location in the JSON structure.
2461
-
2462
- .JSON validation examples
2463
- [example]
2464
- [source,ruby]
2465
- ----
2466
- # Missing closing brace
2467
- Canon.format('{"key": "value"', :json)
2468
- # => Canon::ValidationError: JSON Validation Error: unexpected token at '{"key": "value"'
2469
- # Details: Error at position 16
2470
-
2471
- # Trailing comma (invalid in JSON)
2472
- Canon.format('{"a": 1,}', :json)
2473
- # => Canon::ValidationError: JSON Validation Error: unexpected token at '{"a": 1,}'
2474
- # Details: Error at position 8
2475
- ----
2476
-
2477
- ===== YAML validation
2478
-
2479
- Validates YAML syntax using Psych (Ruby's YAML parser):
2480
-
2481
- * Invalid indentation
2482
- * Unclosed brackets/braces
2483
- * Invalid anchors/aliases
2484
- * Type mismatches
2485
-
2486
- Shows error location with line numbers and context.
2487
-
2488
- .YAML validation examples
2489
- [example]
2490
- [source,ruby]
2491
- ----
2492
- # Unclosed bracket
2493
- Canon.format("key: {unclosed", :yaml)
2494
- # => Canon::ValidationError: YAML Validation Error: (<unknown>): did not find expected node content...
2495
- # Line: 1
2496
- # Details: Shows context around error
2497
-
2498
- # Invalid indentation
2499
- yaml = <<~YAML
2500
- parent:
2501
- child: value
2502
- YAML
2503
- Canon.format(yaml, :yaml)
2504
- # => Canon::ValidationError: YAML Validation Error: mapping values are not allowed in this context
2505
- # Line: 2
2506
- ----
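Psych itself exposes the location of a syntax error, which is presumably where line information of this kind comes from. The sketch below uses only Ruby's standard `yaml` library; the helper `yaml_syntax_error` is hypothetical and is not Canon's `YamlValidator`.

```ruby
require "yaml"

# Hypothetical helper, not Canon's validator: returns nil for valid YAML,
# otherwise the Psych::SyntaxError that was raised.
def yaml_syntax_error(input)
  YAML.safe_load(input)
  nil
rescue Psych::SyntaxError => e
  e
end

error = yaml_syntax_error("key: {unclosed")
puts "line=#{error.line} column=#{error.column} problem=#{error.problem}" if error
```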
2507
-
2508
- ==== Validation in RSpec tests
2509
-
2510
- Canon's RSpec matchers automatically propagate validation errors to test output,
2511
- making it easy to see what's wrong:
2512
-
2513
- .RSpec validation error example
2514
- [example]
2515
- [source,ruby]
2516
- ----
2517
- require 'canon/rspec_matchers'
2518
-
2519
- RSpec.describe 'XML validation' do
2520
- it 'validates input' do
2521
- malformed_xml = '<root><unclosed>'
2522
- expected_xml = '<root><item/></root>'
2523
-
2524
- # This will fail with a clear validation error message
2525
- expect(malformed_xml).to be_xml_equivalent_to(expected_xml)
2526
- end
2527
- end
2528
-
2529
- # Test output shows:
2530
- # Canon::ValidationError:
2531
- # XML Validation Error: Premature end of data in tag unclosed line 1
2532
- # Line: 1
2533
- # Column: 18
2534
- ----
2535
-
2536
- The error appears directly in the RSpec output, not hidden in separate error
2537
- files or logs.
2538
-
2539
- ==== Validation in comparison
2540
-
2541
- Validation also occurs when using `Canon::Comparison.equivalent?`:
2542
-
2543
- .Comparison validation example
2544
- [example]
2545
- [source,ruby]
2546
- ----
2547
- require 'canon/comparison'
2548
-
2549
- xml1 = '<root><item/></root>'
2550
- xml2 = '<root><unclosed>'
2551
-
2552
- Canon::Comparison.equivalent?(xml1, xml2)
2553
- # => Canon::ValidationError: XML Validation Error: Premature end of data in tag unclosed line 1
2554
- # Line: 1
2555
- # Column: 18
2556
- ----
2557
-
2558
- ==== Benefits
2559
-
2560
- Input validation provides several key benefits:
2561
-
2562
- **Early error detection**:: Problems are caught before processing begins, saving
2563
- time and providing clear feedback
2564
-
2565
- **Precise error location**:: Line and column numbers pinpoint exactly where the
2566
- problem is, especially useful in large documents
2567
-
2568
- **Clear error messages**:: Descriptive messages explain what's wrong and often
2569
- suggest how to fix it
2570
-
2571
- **Test-friendly**:: Errors appear in RSpec output where developers expect them,
2572
- not in separate log files
2573
-
2574
- **Format-aware**:: Each validator understands format-specific rules and provides
2575
- relevant error details
2576
-
2577
- === Reporting options (diff options)
2578
-
2579
- ==== General
2580
-
2581
- Canon provides comprehensive diff formatting capabilities across three interfaces:
2582
- RSpec matchers, CLI commands, and the Ruby API. All interfaces support the same
2583
- set of parameters for consistent behavior.
2584
-
2585
- ==== Parameters
2586
-
2587
- The following table shows all available diff formatting parameters and their
2588
- availability across interfaces:
2589
-
2590
- [cols="1,1,1,1,2,1"]
2591
- |===
2592
- |Parameter |RSpec |CLI |Ruby API |Description |Default
2593
-
2594
- |`use_color`
2595
- |✓
2596
- |✓
2597
- |✓
2598
- |Enable/disable colored output
2599
- |`true`
2600
-
2601
- |`diff_mode`
2602
- |✓
2603
- |✓
2604
- |✓
2605
- |Comparison mode: `:by_object` or `:by_line`
2606
- |`:by_line` (RSpec), `:by_object` (XML/JSON/YAML)
2607
-
2608
- |`context_lines`
2609
- |✓
2610
- |✓
2611
- |✓
2612
- |Number of unchanged lines to show around each change
2613
- |`3`
2614
-
2615
- |`diff_grouping_lines`
2616
- |✓
2617
- |✓
2618
- |✓
2619
- |Maximum line distance to group separate diffs into context blocks
2620
- |`10`
2621
- |===
2622
-
2623
- ==== Interface-specific usage
2624
-
2625
- ===== RSpec matchers configuration
2626
-
2627
- Configure diff formatting for RSpec matchers using `Canon::RspecMatchers`:
2628
-
2629
- [source,ruby]
2630
- ----
2631
- require 'canon/rspec_matchers'
2632
-
2633
- # Configure globally for all matchers
2634
- Canon::RspecMatchers.diff_mode = :by_object
2635
- Canon::RspecMatchers.use_color = true
2636
- Canon::RspecMatchers.context_lines = 5
2637
- Canon::RspecMatchers.diff_grouping_lines = 10
2638
-
2639
- # Use in specs
2640
- RSpec.describe 'My comparison' do
2641
- it 'shows formatted diff' do
2642
- expect(actual_xml).to be_xml_equivalent_to(expected_xml)
2643
- end
2644
- end
2645
- ----
2646
-
2647
- ===== CLI usage
2648
-
2649
- Pass options to the `canon diff` command:
2650
-
2651
- [source,bash]
2652
- ----
2653
- # Basic diff with default settings
2654
- $ canon diff file1.xml file2.xml --verbose
2655
-
2656
- # Customize diff output
2657
- $ canon diff file1.xml file2.xml \
2658
- --verbose \
2659
- --by-line \
2660
- --no-color \
2661
- --context-lines 5 \
2662
- --diff-grouping-lines 10
2663
- ----
2664
-
2665
- ===== Ruby API usage
2666
-
2667
- Use `Canon::DiffFormatter` directly in your code:
2668
-
2669
- [source,ruby]
2670
- ----
2671
- require 'canon/diff_formatter'
2672
- require 'canon/comparison'
2673
-
2674
- # Compare documents
2675
- comparison = Canon::Comparison.new(doc1, doc2)
2676
- result = comparison.compare
2677
-
2678
- # Format diff output
2679
- formatter = Canon::DiffFormatter.new(
2680
- use_color: true,
2681
- mode: :by_object,
2682
- context_lines: 5,
2683
- diff_grouping_lines: 10
2684
- )
2685
-
2686
- diff_output = formatter.format(result)
2687
- puts diff_output
2688
- ----
2689
-
2690
- ==== Parameter details
2691
-
2692
- ===== use_color
2693
-
2694
- Controls whether diff output includes ANSI color codes.
2695
-
2696
- * Type: Boolean
2697
- * Default: `true`
2698
- * Colors used:
2699
- ** Red: Deletions/removed content
2700
- ** Green: Additions/inserted content
2701
- ** Yellow: Modified content
2702
- ** Cyan: Element names and structure
2703
-
2704
- [source,ruby]
2705
- ----
2706
- # Disable colors for plain text output
2707
- Canon::RspecMatchers.use_color = false
2708
-
2709
- # CLI
2710
- $ canon diff file1.xml file2.xml --no-color --verbose
2711
- ----
2712
-
2713
- ===== diff_mode
2714
-
2715
- Determines the comparison and display strategy.
2716
-
2717
- * Type: Symbol (`:by_object` or `:by_line`)
2718
- * Default: `:by_line` for RSpec matchers, format-dependent for CLI/API
2719
- * Modes:
2720
- ** `:by_object` - Semantic tree-based comparison showing structural changes
2721
- ** `:by_line` - Line-by-line diff after canonicalization
2722
-
2723
- [source,ruby]
2724
- ----
2725
- # Use object-based diff for RSpec matchers
2726
- Canon::RspecMatchers.diff_mode = :by_object
2727
-
2728
- # CLI - XML uses by-object by default, force by-line
2729
- $ canon diff file1.xml file2.xml --by-line --verbose
2730
- ----
2731
-
2732
- ===== context_lines
2733
-
2734
- Number of unchanged lines to display around each change for context.
2735
-
2736
- * Type: Numeric
2737
- * Default: `3`
2738
- * Range: `0` to any positive integer
2739
- * Effect: Higher values show more surrounding context, lower values show only changes
2740
-
2741
- [source,ruby]
2742
- ----
2743
- # Show 5 lines of context around each change
2744
- Canon::RspecMatchers.context_lines = 5
2745
-
2746
- # CLI
2747
- $ canon diff file1.xml file2.xml --context-lines 5 --verbose
2748
-
2749
- # Ruby API
2750
- formatter = Canon::DiffFormatter.new(context_lines: 5)
2751
- ----
2752
-
2753
- ===== diff_grouping_lines
2754
-
2755
- Maximum line distance between separate changes to group them into a single
2756
- context block.
2757
-
2758
- * Type: Numeric or `nil`
2759
- * Default: `nil` (no grouping)
2760
- * Effect: When set, changes within N lines of each other are grouped into
2761
- context blocks with a header showing the number of diffs in the block
2762
-
2763
- [source,ruby]
2764
- ----
2765
- # Group changes that are within 10 lines of each other
2766
- Canon::RspecMatchers.diff_grouping_lines = 10
2767
-
2768
- # CLI
2769
- $ canon diff file1.xml file2.xml --diff-grouping-lines 10 --verbose
2770
-
2771
- # Ruby API
2772
- formatter = Canon::DiffFormatter.new(diff_grouping_lines: 10)
2773
- ----
2774
-
2775
- .Example of grouped diff output
2776
- [example]
2777
- When `diff_grouping_lines` is set to `10`, changes close together are grouped:
2778
-
2779
- [source]
2780
- ----
2781
- Context block has 3 diffs (lines 5-18):
2782
- 5 - | <foreword id="fwd">
2783
- 5 + | <foreword displayorder="2" id="fwd">
2784
- 6 | <p>First paragraph</p>
2785
- ...
2786
- 15 - | <title>Scope</title>
2787
- 15 + | <title>Application Scope</title>
2788
- 16 | </clause>
2789
- 17 + | <p>New content</p>
2790
- 18 | </sections>
2791
- ----
2792
-
2793
- Without grouping, these would appear as separate diff sections.
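The grouping rule can be sketched in a few lines of plain Ruby. The helper `group_changes` is hypothetical, and the sketch assumes that two changes belong to the same block when they are at most `max_gap` lines apart; Canon's exact boundary condition may differ.

```ruby
# Hypothetical helper: groups changed-line numbers so that neighbours
# at most `max_gap` lines apart end up in the same block.
def group_changes(line_numbers, max_gap)
  line_numbers.sort.slice_when { |a, b| b - a > max_gap }.to_a
end

p group_changes([5, 15, 17, 40], 10) # prints [[5, 15, 17], [40]]
```

With a gap of `10`, the changes on lines 5, 15, and 17 form one block, while the change on line 40 starts a new one.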
2794
-
2795
- === Visualization options
2796
-
2797
- ==== Enhanced diff output features
2798
-
2799
- Canon's diff formatter includes several enhancements designed to make diffs more
2800
- readable and informative, especially when working with RSpec test failures.
2801
-
2802
- ===== Color-coded line numbers and structure
2803
-
2804
- **Purpose**: Improve readability by distinguishing structural elements from
2805
- content changes.
2806
-
2807
- When color mode is enabled (`use_color: true`), the diff formatter uses a
2808
- consistent color scheme:
2809
-
2810
- * **Yellow**: Line numbers and pipe separators
2811
- * **Red**: Deletion markers (`-`) and removed content
2812
- * **Green**: Addition markers (`+`) and inserted content
2813
- * **Default terminal color**: Unchanged context lines (no ANSI codes applied)
2814
-
2815
- This color scheme helps differentiate between:
2816
-
2817
- * The diff structure (line numbers, pipes)
2818
- * Content that was removed (red)
2819
- * Content that was added (green)
2820
- * Content that stayed the same (your terminal's default color)
2821
-
2822
- .Example colored diff output
2823
- [example]
2824
- In a colored terminal, a typical diff line appears as:
2825
-
2826
- [source]
2827
- ----
2828
- 5| 5 | <p>First paragraph</p> # Context line (yellow numbers/pipes, default text)
2829
- 6| -| <old>Text</old> # Deletion (yellow numbers/pipes, red marker/content)
2830
- | 6+| <new>Text</new> # Addition (yellow numbers/pipes, green marker/content)
2831
- ----
2832
-
2833
- Where:
2834
-
2835
- * Line numbers (`5`, `6`) are in yellow
2836
- * Pipe separators (`|`) are in yellow
2837
- * Markers (`-`, `+`) are in red/green respectively
2838
- * Changed content is highlighted in red (deletions) or green (additions)
2839
- * Unchanged content uses your terminal's default color (no forced white/black)
2840
-
2841
- **Why this matters**: When running tests with RSpec, the framework initially sets
2842
- output to red. Canon's diff formatter explicitly resets colors to prevent RSpec's
2843
- red from bleeding into the diff output, ensuring consistent and readable diffs.
2844
-
2845
- ===== Whitespace visualization
2846
-
2847
- **Purpose**: Make invisible whitespace and special characters visible in diffs.
2848
-
2849
- Whitespace changes can be difficult to spot in traditional diffs because spaces,
2850
- tabs, and other invisible characters don't appear in output. Canon visualizes
2851
- these changes using a comprehensive set of Unicode symbols that are safe for use
2852
- with CJK (Chinese, Japanese, Korean) text.
2853
-
2854
- **Visualization scope**: Character visualization is applied only to **diff lines**
2855
- (additions, deletions, and changes), not to context lines (unchanged lines). This
2856
- ensures that:
2857
-
2858
- * Context lines display content in its original form without substitution
2859
- * Only actual changes show visualization, making differences easier to spot
2860
- * Within changed lines showing token-level diffs, unchanged tokens are displayed
2861
- in the terminal's default color (not red/green) to distinguish them from actual
2862
- changes
2863
-
2864
- ====== Default character visualization map
2865
-
2866
- Canon provides a comprehensive CJK-safe character mapping for common non-visible
2867
- characters encountered in diffs:
2868
-
2869
- NOTE: These visualization symbols appear **only in diff lines** (additions,
2870
- deletions, and changes), not in context lines (unchanged lines).
2871
-
2872
- .Common whitespace characters
2873
- [cols="1,1,1,2"]
2874
- |===
2875
- |Character |Unicode |Symbol |Description
2876
-
2877
- |Regular space
2878
- |U+0020
2879
- |`░`
2880
- |Light Shade (U+2591)
2881
-
2882
- |Tab
2883
- |U+0009
2884
- |`⇥`
2885
- |Rightwards Arrow to Bar (U+21E5)
2886
-
2887
- |Non-breaking space
2888
- |U+00A0
2889
- |`␣`
2890
- |Open Box (U+2423)
2891
- |===
2892
-
2893
- .Line endings
2894
- [cols="1,1,1,2"]
2895
- |===
2896
- |Character |Unicode |Symbol |Description
2897
-
2898
- |Line feed (LF)
2899
- |U+000A
2900
- |`↵`
2901
- |Downwards Arrow with Corner Leftwards (U+21B5)
2902
-
2903
- |Carriage return (CR)
2904
- |U+000D
2905
- |`⏎`
2906
- |Return Symbol (U+23CE)
2907
-
2908
- |Windows line ending (CRLF)
2909
- |U+000D U+000A
2910
- |`↵`
2911
- |Downwards Arrow with Corner Leftwards (U+21B5)
2912
-
2913
- |Next line (NEL)
2914
- |U+0085
2915
- |`⏎`
2916
- |Return Symbol (U+23CE)
2917
-
2918
- |Line separator
2919
- |U+2028
2920
- |`⤓`
2921
- |Downwards Arrow to Bar (U+2913)
2922
-
2923
- |Paragraph separator
2924
- |U+2029
2925
- |`⤓`
2926
- |Downwards Arrow to Bar (U+2913)
2927
- |===
2928
-
2929
- .Unicode spaces (various widths)
2930
- [cols="1,1,1,2"]
2931
- |===
2932
- |Character |Unicode |Symbol |Description
2933
-
2934
- |En space
2935
- |U+2002
2936
- |`▭`
2937
- |White Rectangle (U+25AD)
2938
-
2939
- |Em space
2940
- |U+2003
2941
- |`▬`
2942
- |Black Rectangle (U+25AC)
2943
-
2944
- |Four-per-em space
2945
- |U+2005
2946
- |`⏓`
2947
- |Metrical Short Over Long (U+23D3)
2948
-
2949
- |Six-per-em space
2950
- |U+2006
2951
- |`⏕`
2952
- |Metrical Two Shorts Over Long (U+23D5)
2953
-
2954
- |Thin space
2955
- |U+2009
2956
- |`▯`
2957
- |White Vertical Rectangle (U+25AF)
2958
-
2959
- |Hair space
2960
- |U+200A
2961
- |`▮`
2962
- |Black Vertical Rectangle (U+25AE)
2963
-
2964
- |Figure space
2965
- |U+2007
2966
- |`□`
2967
- |White Square (U+25A1)
2968
-
2969
- |Narrow no-break space
2970
- |U+202F
2971
- |`▫`
2972
- |White Small Square (U+25AB)
2973
-
2974
- |Medium mathematical space
2975
- |U+205F
2976
- |`▭`
2977
- |White Rectangle (U+25AD)
2978
-
2979
- |Ideographic space
2980
- |U+3000
2981
- |`⎵`
2982
- |Bottom Square Bracket (U+23B5)
2983
-
2984
- |Ideographic half space
2985
- |U+303F
2986
- |`⏑`
2987
- |Metrical Breve (U+23D1)
2988
-
2989
- |===
2990
-
2991
- .Zero-width characters (invisible troublemakers)
2992
- [cols="1,1,1,2"]
2993
- |===
2994
- |Character |Unicode |Symbol |Description
2995
-
2996
- |Zero-width space
2997
- |U+200B
2998
- |`→`
2999
- |Rightwards Arrow (U+2192)
3000
-
3001
- |Zero-width non-joiner
3002
- |U+200C
3003
- |`↛`
3004
- |Rightwards Arrow with Stroke (U+219B)
3005
-
3006
- |Zero-width joiner
3007
- |U+200D
3008
- |`⇢`
3009
- |Rightwards Dashed Arrow (U+21E2)
3010
-
3011
- |Zero-width no-break space (BOM)
3012
- |U+FEFF
3013
- |`⇨`
3014
- |Rightwards White Arrow (U+21E8)
3015
- |===
3016
-
3017
- .Bidirectional/RTL markers
3018
- [cols="1,1,1,2"]
3019
- |===
3020
- |Character |Unicode |Symbol |Description
3021
-
3022
- |Left-to-right mark
3023
- |U+200E
3024
- |`⟹`
3025
- |Long Rightwards Double Arrow (U+27F9)
3026
-
3027
- |Right-to-left mark
3028
- |U+200F
3029
- |`⟸`
3030
- |Long Leftwards Double Arrow (U+27F8)
3031
-
3032
- |LTR embedding
3033
- |U+202A
3034
- |`⇒`
3035
- |Rightwards Double Arrow (U+21D2)
3036
-
3037
- |RTL embedding
3038
- |U+202B
3039
- |`⇐`
3040
- |Leftwards Double Arrow (U+21D0)
3041
-
3042
- |Pop directional formatting
3043
- |U+202C
3044
- |`↔`
3045
- |Left Right Arrow (U+2194)
3046
-
3047
- |LTR override
3048
- |U+202D
3049
- |`⇉`
3050
- |Rightwards Paired Arrows (U+21C9)
3051
-
3052
- |RTL override
3053
- |U+202E
3054
- |`⇇`
3055
- |Leftwards Paired Arrows (U+21C7)
3056
- |===
3057
-
3058
- .Control characters
3059
- [cols="1,1,1,2"]
3060
- |===
3061
- |Character |Unicode |Symbol |Description
3062
-
3063
- |Null
3064
- |U+0000
3065
- |`␀`
3066
- |Symbol for Null (U+2400)
3067
-
3068
- |Soft hyphen
3069
- |U+00AD
3070
- |`­‐`
3071
- |Hyphen (U+2010)
3072
-
3073
- |Backspace
3074
- |U+0008
3075
- |`␈`
3076
- |Symbol for Backspace (U+2408)
3077
-
3078
- |Delete
3079
- |U+007F
3080
- |`␡`
3081
- |Symbol for Delete (U+2421)
3082
- |===
3083
-
3084
- ====== CJK safety
3085
-
3086
- The visualization characters are specifically chosen to avoid conflicts with CJK
3087
- text:
3088
-
3089
- * **No middle dots** (`·`) - commonly used as separators in CJK
3090
- * **No bullets** (`∙`) - used in CJK lists
3091
- * **No circles** (`◌◍◎`) - look similar to CJK characters like ○ ●
3092
- * **No small dots** (`⋅`) - conflict with CJK punctuation
3093
-
3094
- Instead, Canon uses:
3095
- * Box characters (`□▭▬▯▮▫`) for various space types
3096
- * Arrow symbols (`→↛⇢⇨⟹⟸⇒⇐`) for zero-width and directional characters
3097
- * Control Pictures block symbols (`␀␈␡`) for control characters
3098
-
3099
- ====== Customizing character visualization
3100
-
3101
- You can customize the character visualization map for your specific needs:
3102
-
3103
- [source,ruby]
3104
- ----
3105
- require 'canon/diff_formatter'
3106
-
3107
- # Create custom visualization map
3108
- custom_map = Canon::DiffFormatter.merge_visualization_map({
3109
- ' ' => '·', # Use middle dot for spaces (if not using CJK)
3110
- "\t" => '→', # Use simple arrow for tabs
3111
- "\u200B" => '⚠' # Warning symbol for zero-width space
3112
- })
3113
-
3114
- # Use custom map with formatter
3115
- formatter = Canon::DiffFormatter.new(
3116
- use_color: true,
3117
- visualization_map: custom_map
3118
- )
3119
-
3120
- # The custom map merges with defaults, so unspecified
3121
- # characters still use the default visualization
3122
- ----
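The merge behaviour described above can be illustrated with plain Ruby hashes. This is only a minimal sketch, not Canon's implementation: `DEFAULT_MAP`, `merge_map`, and `visualize` are hypothetical names, and the default symbols are taken from the examples in this document rather than from the library source.

```ruby
# Hypothetical defaults, taken from the symbols used in this document's examples.
# Canon's real default map may contain more entries or different symbols.
DEFAULT_MAP = {
  " "      => "░", # regular space
  "\t"     => "⇥", # tab
  "\u00A0" => "␣", # non-breaking space
  "\u200B" => "→"  # zero-width space
}.freeze

# Overrides win; every character not overridden keeps its default symbol.
def merge_map(overrides)
  DEFAULT_MAP.merge(overrides)
end

# Replace each mapped character with its symbol and leave all others untouched.
def visualize(text, map)
  text.each_char.map { |ch| map.fetch(ch, ch) }.join
end

custom = merge_map(" " => "·", "\u200B" => "⚠")
puts visualize("a b\tc\u00A0d\u200Be", custom)
# prints: a·b⇥c␣d⚠e
```

Only the space and zero-width space were overridden here, so the tab and the non-breaking space still use the assumed defaults.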
3123
-
3124
- ====== Visualization in action
3125
-
3126
- .Whitespace visualization examples
3127
- [example]
3128
- [source]
3129
- ----
3130
- # Space added between tags
3131
- 10| -| <tag>Value</tag> # No space
3132
- | 10+| <tag>░Value</tag> # Space added (green light shade)
3133
-
3134
- # Tab character
3135
- 15| -| <tag>⇥Value</tag> # Tab (red arrow-to-bar)
3136
- | 15+| <tag>░░Value</tag> # Two spaces (green light shades)
3137
-
3138
- # Non-breaking space (U+00A0)
3139
- 20| -| <tag>Value░</tag> # Regular space
3140
- | 20+| <tag>Value␣</tag> # Non-breaking space (green open box)
3141
-
3142
- # Zero-width space (U+200B)
3143
- 25| -| <word1><word2> # No zero-width space
3144
- | 25+| <word1>→<word2> # Zero-width space (green arrow)
3145
-
3146
- # Mixed invisible characters
3147
- 30| -| <p>Text▬more</p> # Em space (red black rectangle)
3148
- | 30+| <p>Text░more</p> # Regular space (green light shade)
3149
- ----
3150
-
3151
- Visualization symbols are rendered in:
3152
-
3153
- * Red when showing removed/deleted characters
3154
- * Green when showing added/inserted characters
3155
- * Bold to make them more visible
3156
-
3157
- **When is this useful?**
3158
-
3159
- 1. **Test failures due to formatting**: Your test expects compact XML but receives
3160
- pretty-printed XML with different indentation
3161
-
3162
- 2. **Mixed whitespace**: Some parts of your code use tabs while others use spaces
3163
-
3164
- 3. **Non-breaking spaces**: Copy-pasted content from browsers often contains
3165
- U+00A0 instead of regular spaces
3166
-
3167
- 4. **Zero-width characters**: Invisible Unicode characters that cause mysterious
3168
- comparison failures
3169
-
3170
- 5. **RTL/LTR markers**: Bidirectional text markers in internationalized content
3171
-
3172
- 6. **Template differences**: Generated output has invisible character differences
3173
-
3174
- .Real-world example: Non-breaking space from web copy-paste
3175
- [example]
3176
- Without whitespace visualization, these two lines look identical:
3177
-
3178
- [source,xml]
3179
- ----
3180
- <foreword id="fwd">
3181
- <foreword id="fwd">
3182
- ----
3183
-
3184
- With whitespace visualization enabled, the difference is immediately visible:
3185
-
3186
- [source]
3187
- ----
3188
- 4| -| <foreword░id="fwd"> # Regular space (U+0020)
3189
- | 4+| <foreword␣id="fwd"> # Non-breaking space (U+00A0)
3190
- ----
3191
-
3192
- The different symbols (`░` vs `␣`) clearly show that one uses a regular space
3193
- while the other uses a non-breaking space, likely from copying text from a web
3194
- page or word processor.
3195
-
3196
- .Real-world example: Zero-width characters
3197
- [example]
3198
- Zero-width characters are completely invisible but affect comparison:
3199
-
3200
- [source,xml]
3201
- ----
3202
- <item>Widget</item>
3203
- <item>Widget</item> <!-- Contains U+200B zero-width space after "Widget" -->
3204
- ----
3205
-
3206
- The diff shows:
3207
-
3208
- [source]
3209
- ----
3210
- 5| -| <item>Widget</item>
3211
- | 5+| <item>Widget→</item> # Zero-width space visualized as →
3212
- ----
3213
-
3214
- The rightwards arrow (`→`) reveals the presence of a zero-width space that would
3215
- otherwise be impossible to detect.
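When a diff is not at hand, zero-width characters can also be located directly in a string. The following is a rough sketch in plain Ruby, independent of Canon; `ZERO_WIDTH` and `zero_width_positions` are hypothetical names, and the character list covers only a few common zero-width codepoints.

```ruby
# A few common zero-width codepoints (not an exhaustive list):
# zero-width space, non-joiner, joiner, word joiner, zero-width no-break space.
ZERO_WIDTH = ["\u200B", "\u200C", "\u200D", "\u2060", "\uFEFF"].freeze

# Return [index, codepoint] pairs for every zero-width character in the text.
def zero_width_positions(text)
  text.each_char.with_index
      .select { |ch, _index| ZERO_WIDTH.include?(ch) }
      .map { |ch, index| [index, format("U+%04X", ch.ord)] }
end

p zero_width_positions("<item>Widget\u200B</item>")
# prints: [[12, "U+200B"]]
```

The index is a character offset, so it can be used to point at the exact position in the original string.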
3216
-
3217
- ===== Non-ASCII character detection
3218
-
3219
- **Purpose**: Alert users when diffs contain non-ASCII characters that might cause
3220
- unexpected comparison failures or encoding issues.
3221
-
3222
- When Canon detects non-ASCII characters (any character with Unicode codepoint >
3223
- U+007F) in a diff block, it displays a yellow warning with the specific
3224
- characters and their Unicode codepoints.
3225
-
3226
- .Non-ASCII warning format
3227
- [example]
3228
- [source]
3229
- ----
3230
- Context block has 3 diffs (lines 10-25):
3231
- (WARNING: non-ASCII characters detected in diff: [' ' (U+00A0, shown as: ␣), '—' (U+2014, shown as: —)])
3232
-
3233
- 10| -| <p>Hello░world</p>
3234
- | 10+| <p>Hello␣world</p> # Contains non-breaking space (U+00A0)
3235
- 15| -| <p>Text - more text</p>
3236
- | 15+| <p>Text — more text</p> # Contains em dash (U+2014)
3237
- ----
3238
-
3239
- The warning appears immediately after the "Context block has X diffs" header.
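The detection rule (any codepoint above U+007F) is simple enough to sketch in a few lines of Ruby. This is an illustration under stated assumptions, not Canon's code: `non_ascii_chars` and `non_ascii_warning` are hypothetical helpers, and the message mimics only the shorter warning form shown in this document, without the "shown as" part.

```ruby
# Collect the distinct non-ASCII characters (codepoint > U+007F), in first-seen order.
def non_ascii_chars(text)
  text.each_char.select { |ch| ch.ord > 0x7F }.uniq
end

# Build a warning line listing each character with its codepoint,
# or return nil when the text is pure ASCII.
def non_ascii_warning(text)
  chars = non_ascii_chars(text)
  return nil if chars.empty?

  listing = chars.map { |ch| "'#{ch}' (#{format('U+%04X', ch.ord)})" }.join(", ")
  "(WARNING: non-ASCII characters detected in diff: [#{listing}])"
end

puts non_ascii_warning("<p>Wait\u2026</p>")
# prints: (WARNING: non-ASCII characters detected in diff: ['…' (U+2026)])
```

Returning `nil` for pure-ASCII input lets a caller skip the warning line entirely instead of printing an empty list.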
3240
-
3241
- **Common non-ASCII characters in diffs**:
3242
-
3243
- |===
3244
- |Character |Unicode |Name |Common source
3245
-
3246
- |` ` (looks like space)
3247
- |U+00A0
3248
- |Non-breaking space
3249
- |Copy-paste from web browsers, word processors
3250
-
3251
- |`—`
3252
- |U+2014
3253
- |Em dash
3254
- |Word processors with automatic dash substitution
3255
-
3256
- |`–`
3257
- |U+2013
3258
- |En dash
3259
- |Word processors with automatic dash substitution
3260
-
3261
- |`‘` `’`
3262
- |U+2018, U+2019
3263
- |Smart single quotes
3264
- |Word processors, text editors with smart quotes
3265
-
3266
- |`“` `”`
3267
- |U+201C, U+201D
3268
- |Smart double quotes
3269
- |Word processors, text editors with smart quotes
3270
-
3271
- |`…`
3272
- |U+2026
3273
- |Ellipsis
3274
- |Word processors
3275
-
3276
- |Various
3277
- |U+2000-U+200B
3278
- |Various spaces
3279
- |HTML entities, special formatting
3280
- |===
3281
-
3282
- **Why this matters**:
3283
-
3284
- 1. **Invisible differences**: Many non-ASCII characters look identical to their
3285
- ASCII equivalents but cause comparison failures
3286
-
3287
- 2. **Encoding issues**: Non-ASCII characters may behave differently across
3288
- systems with different encodings
3289
-
3290
- 3. **Copy-paste errors**: Content copied from browsers or documents often
3291
- includes non-breaking spaces instead of regular spaces
3292
-
3293
- 4. **Smart quotes**: Text editors may automatically convert straight quotes to
3294
- curly quotes
3295
-
3296
- .Practical example
3297
- [example]
3298
- A test fails because the expected output was copied from a web page:
3299
-
3300
- [source,ruby]
3301
- ----
3302
- # Expected (copied from documentation website - contains U+00A0)
3303
- expected = '<p>Hello world</p>' # Space between "Hello" and "world" is U+00A0
3304
-
3305
- # Actual (generated by code - contains regular space)
3306
- actual = '<p>Hello world</p>' # Space is U+0020
3307
-
3308
- expect(actual).to be_xml_equivalent_to(expected)
3309
- # FAILS: Documents appear identical but contain different space characters
3310
- ----
3311
-
3312
- Canon's diff output shows:
3313
-
3314
- [source]
3315
- ----
3316
- Context block has 1 diff (line 1):
3317
- (WARNING: non-ASCII characters detected in diff: [' ' (U+00A0)])
3318
-
3319
- 1| -| <p>Hello░world</p> # U+0020 (regular space, shown as light shade)
3320
- | 1+| <p>Hello␣world</p> # U+00A0 (non-breaking space, shown as open box)
3321
- ----
3322
-
3323
- The warning alerts you to check for non-breaking spaces, and the differing
3324
- symbols (`░` vs `␣`) show where the difference occurs.
3325
-
3326
- ===== Configuration and usage
3327
-
3328
- All enhanced diff features are enabled by default when `use_color` is `true`, and
3329
- they are applied automatically across all Canon interfaces:
3330
-
3331
- [source,ruby]
3332
- ----
3333
- # RSpec matchers (automatically enabled)
3334
- expect(xml1).to be_xml_equivalent_to(xml2)
3335
- # Output includes: colored line numbers, whitespace visualization, non-ASCII warnings
3336
-
3337
- # CLI (enabled by default)
3338
- #   $ canon diff file1.xml file2.xml --verbose
3339
- # Output includes all enhanced features
3340
-
3341
- # Ruby API (controlled by use_color parameter)
3342
- formatter = Canon::DiffFormatter.new(use_color: true) # Enhanced features enabled
3343
- formatter = Canon::DiffFormatter.new(use_color: false) # Plain text only
3344
- ----
3345
-
3346
- To disable colored output (and all color-dependent enhancements):
3347
-
3348
- [source,ruby]
3349
- ----
3350
- # RSpec
3351
- Canon::RspecMatchers.use_color = false
3352
-
3353
- # CLI
3354
- #   $ canon diff file1.xml file2.xml --no-color --verbose
3355
-
3356
- # Ruby API
3357
- formatter = Canon::DiffFormatter.new(use_color: false)
3358
- ----
3359
-
3360
- When `use_color` is `false`:
3361
-
3362
- * Line numbers and pipes are plain text
3363
- * Whitespace is not visualized (remains invisible)
3364
- * Non-ASCII warnings are still shown (but without yellow color)
3365
- * Content changes are shown without color highlighting
3366
-
3367
-
3368
- == Development
3369
-
3370
- After checking out the repo, run `bin/setup` to install dependencies. Then, run
3371
- `rake spec` to run the tests. You can also run `bin/console` for an interactive
3372
- prompt that will allow you to experiment.
3373
-
3374
-
3375
- == Contributing
3376
-
3377
- Bug reports and pull requests are welcome on GitHub at
3378
- https://github.com/lutaml/canon.
3379
-
3380
-
3381
- == Copyright and license
3382
-
3383
- Copyright Ribose.
3384
- https://opensource.org/licenses/BSD-2-Clause[BSD-2-Clause License].