canon 0.1.8 → 0.1.9
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- checksums.yaml +4 -4
- data/.rubocop_todo.yml +112 -25
- data/docs/Gemfile +1 -0
- data/docs/_config.yml +90 -1
- data/docs/advanced/diff-classification.adoc +82 -2
- data/docs/features/match-options/index.adoc +239 -1
- data/lib/canon/comparison/format_detector.rb +2 -1
- data/lib/canon/comparison/html_comparator.rb +19 -8
- data/lib/canon/comparison/html_compare_profile.rb +8 -2
- data/lib/canon/comparison/match_options/base_resolver.rb +7 -0
- data/lib/canon/comparison/whitespace_sensitivity.rb +208 -0
- data/lib/canon/comparison/xml_comparator/child_comparison.rb +15 -7
- data/lib/canon/comparison/xml_comparator/node_parser.rb +10 -5
- data/lib/canon/comparison/xml_comparator/node_type_comparator.rb +14 -7
- data/lib/canon/comparison/xml_comparator.rb +48 -23
- data/lib/canon/comparison/xml_node_comparison.rb +25 -3
- data/lib/canon/diff/diff_classifier.rb +101 -2
- data/lib/canon/diff/formatting_detector.rb +1 -1
- data/lib/canon/rspec_matchers.rb +37 -8
- data/lib/canon/version.rb +1 -1
- data/lib/canon/xml/data_model.rb +24 -13
- metadata +3 -78
- data/docs/plans/2025-01-17-html-parser-selection-fix.adoc +0 -250
- data/false_positive_analysis.txt +0 -0
- data/file1.html +0 -1
- data/file2.html +0 -1
- data/old-docs/ADVANCED_TOPICS.adoc +0 -20
- data/old-docs/BASIC_USAGE.adoc +0 -16
- data/old-docs/CHARACTER_VISUALIZATION.adoc +0 -567
- data/old-docs/CLI.adoc +0 -497
- data/old-docs/CUSTOMIZING_BEHAVIOR.adoc +0 -19
- data/old-docs/DIFF_ARCHITECTURE.adoc +0 -435
- data/old-docs/DIFF_FORMATTING.adoc +0 -540
- data/old-docs/DIFF_PARAMETERS.adoc +0 -261
- data/old-docs/DOM_DIFF.adoc +0 -1017
- data/old-docs/ENV_CONFIG.adoc +0 -876
- data/old-docs/FORMATS.adoc +0 -867
- data/old-docs/INPUT_VALIDATION.adoc +0 -477
- data/old-docs/MATCHER_BEHAVIOR.adoc +0 -90
- data/old-docs/MATCH_ARCHITECTURE.adoc +0 -463
- data/old-docs/MATCH_OPTIONS.adoc +0 -912
- data/old-docs/MODES.adoc +0 -432
- data/old-docs/NORMATIVE_INFORMATIVE_DIFFS.adoc +0 -219
- data/old-docs/OPTIONS.adoc +0 -1387
- data/old-docs/PREPROCESSING.adoc +0 -491
- data/old-docs/README.old.adoc +0 -2831
- data/old-docs/RSPEC.adoc +0 -814
- data/old-docs/RUBY_API.adoc +0 -485
- data/old-docs/SEMANTIC_DIFF_REPORT.adoc +0 -646
- data/old-docs/SEMANTIC_TREE_DIFF.adoc +0 -765
- data/old-docs/STRING_COMPARE.adoc +0 -345
- data/old-docs/TMP.adoc +0 -3384
- data/old-docs/TREE_DIFF.adoc +0 -1080
- data/old-docs/UNDERSTANDING_CANON.adoc +0 -17
- data/old-docs/VERBOSE.adoc +0 -482
- data/old-docs/VISUALIZATION_MAP.adoc +0 -625
- data/old-docs/WHITESPACE_TREATMENT.adoc +0 -1155
- data/scripts/analyze_current_state.rb +0 -85
- data/scripts/analyze_false_positives.rb +0 -114
- data/scripts/analyze_remaining_failures.rb +0 -105
- data/scripts/compare_current_failures.rb +0 -95
- data/scripts/compare_dom_tree_diff.rb +0 -158
- data/scripts/compare_failures.rb +0 -151
- data/scripts/debug_attribute_extraction.rb +0 -66
- data/scripts/debug_blocks_839.rb +0 -115
- data/scripts/debug_meta_matching.rb +0 -52
- data/scripts/debug_p_matching.rb +0 -192
- data/scripts/debug_signature_matching.rb +0 -118
- data/scripts/debug_sourcecode_124.rb +0 -32
- data/scripts/debug_whitespace_sensitive.rb +0 -192
- data/scripts/extract_false_positives.rb +0 -138
- data/scripts/find_actual_false_positives.rb +0 -125
- data/scripts/investigate_all_false_positives.rb +0 -161
- data/scripts/investigate_batch1.rb +0 -127
- data/scripts/investigate_classification.rb +0 -150
- data/scripts/investigate_classification_detailed.rb +0 -190
- data/scripts/investigate_common_failures.rb +0 -342
- data/scripts/investigate_false_negative.rb +0 -80
- data/scripts/investigate_false_positive.rb +0 -83
- data/scripts/investigate_false_positives.rb +0 -227
- data/scripts/investigate_false_positives_batch.rb +0 -163
- data/scripts/investigate_mixed_content.rb +0 -125
- data/scripts/investigate_remaining_16.rb +0 -214
- data/scripts/run_single_test.rb +0 -29
- data/scripts/test_all_false_positives.rb +0 -95
- data/scripts/test_attribute_details.rb +0 -61
- data/scripts/test_both_algorithms.rb +0 -49
- data/scripts/test_both_simple.rb +0 -49
- data/scripts/test_enhanced_semantic_output.rb +0 -125
- data/scripts/test_readme_examples.rb +0 -131
- data/scripts/test_semantic_tree_diff.rb +0 -99
- data/scripts/test_semantic_ux_improvements.rb +0 -135
- data/scripts/test_single_false_positive.rb +0 -119
- data/scripts/test_size_limits.rb +0 -99
- data/test_html_1.html +0 -21
- data/test_html_2.html +0 -21
- data/test_nokogiri.rb +0 -33
- data/test_normalize.rb +0 -45
@@ -1,345 +0,0 @@
-= String Comparison in Canon
-:toc:
-:toclevels: 3
-
-== General
-
-Canon provides advanced string comparison capabilities with character-level visualization and diff rendering. This feature is particularly useful for:
-
-* Comparing strings with invisible whitespace differences (spaces, tabs, trailing newlines)
-* Detecting Unicode differences (non-breaking spaces, zero-width characters, etc.)
-* Visualizing multi-line string differences with context
-* Identifying trailing newline differences in text output
-
-== Character Visualization
-
-Canon automatically visualizes invisible and special characters to make differences clear:
-
-[cols="2,3,2",options="header"]
-|===
-|Character
-|Description
-|Visualization
-
-|Space (U+0020)
-|Regular space
-|`░`
-
-|Tab (U+0009)
-|Tab character
-|`⇥`
-
-|Non-breaking space (U+00A0)
-|Non-breaking space
-|`␣`
-
-|Zero-width space (U+200B)
-|Zero-width space
-|`→`
-
-|Other invisible characters
-|Various Unicode invisibles
-|See Unicode legend in output
-|===
-
-== Usage in RSpec
-
-=== Auto-detection with `be_equivalent_to`
-
-The `be_equivalent_to` matcher automatically detects the format (XML, JSON, YAML, or string) and uses the appropriate comparison mode.
-
-.Auto-detecting string mode
-[source,ruby]
-----
-RSpec.describe "String comparison" do
-  it "auto-detects string format" do
-    actual = "Hello World"
-    expected = "Hello Universe"
-
-    expect(actual).to be_equivalent_to(expected)
-    # Automatically uses STRING mode for plain text
-  end
-end
-----
-
-=== Explicit string mode with `be_string_equivalent_to`
-
-For explicit string comparison, use the `be_string_equivalent_to` matcher.
-
-.Basic string comparison
-[source,ruby]
-----
-RSpec.describe "String comparison" do
-  it "compares strings exactly" do
-    expect("Hello World").to be_string_equivalent_to("Hello World")
-  end
-
-  it "detects differences" do
-    expect("Hello World").not_to be_string_equivalent_to("Hello Universe")
-  end
-end
-----
-
-=== Whitespace differences
-
-.Detecting extra spaces
-[source,ruby]
-----
-it "detects extra spaces" do
-  actual = "Hello World"
-  expected = "Hello  World"  # Two spaces
-
-  expect(actual).not_to be_string_equivalent_to(expected)
-end
-----
-
-When this test fails, the output shows:
-
-----
-expected STRING to be equivalent
-
-Line-by-line diff (STRING mode):
-1| 1- | Hello░World
-| 1+ | Hello░░World
-----
-
-=== Trailing newline differences
-
-.Detecting trailing newlines
-[source,ruby]
-----
-it "detects trailing newline" do
-  actual = ""
-  expected = "\n"
-
-  expect(actual).not_to be_string_equivalent_to(expected)
-end
-----
-
-When this test fails, the output shows two separate lines:
-
-----
-expected STRING to be equivalent
-
-Line-by-line diff (STRING mode):
-1| 1 |
-| 2+ |
-----
-
-The empty line on line 2 (shown in green with `+`) represents the trailing newline character.
-
-=== Unicode character differences
-
-.Detecting Unicode differences
-[source,ruby]
-----
-it "detects non-breaking space vs regular space" do
-  actual = "Hello World"
-  expected = "Hello\u00A0World" # Non-breaking space
-
-  expect(actual).not_to be_string_equivalent_to(expected)
-end
-----
-
-When this test fails, the output includes a Unicode legend:
-
-----
-expected STRING to be equivalent
-
-Character Visualization Legend:
-░ = U+0020 (Space)
-␣ = U+00A0 (Non-breaking space)
-
-Line-by-line diff (STRING mode):
-1| 1- | Hello░World
-| 1+ | Hello␣World
-----
-
-=== Multi-line string differences
-
-.Comparing multi-line strings
-[source,ruby]
-----
-it "shows line-by-line diff for multi-line strings" do
-  actual = <<~TEXT
-    Line 1
-    Line 2
-    Line 3
-  TEXT
-
-  expected = <<~TEXT
-    Line 1
-    Line 2 Modified
-    Line 3
-  TEXT
-
-  expect(actual).not_to be_string_equivalent_to(expected)
-end
-----
-
-Output shows context around the changed line:
-
-----
-expected STRING to be equivalent
-
-Line-by-line diff (STRING mode):
-1| 1 | Line 1
-2| 2- | Line 2
-| 2+ | Line 2 Modified
-3| 3 | Line 3
-----
-
-== Usage in CLI
-
-=== Comparing string files
-
-.Using the diff command
-[source,bash]
-----
-canon diff actual.txt expected.txt --format string
-----
-
-.Output
-----
-Line-by-line diff (STRING mode):
-1| 1 | Hello World
-2| 2- | Line with spaces
-| 2+ | Line with spaces
-3| 3 | Final line
-----
-
-=== Format auto-detection
-
-If you don't specify `--format`, Canon will auto-detect the format based on file content:
-
-[source,bash]
-----
-canon diff actual.txt expected.txt
-# Auto-detects as string if content isn't XML/JSON/YAML
-----
-
-== Usage in API
-
-=== Using DiffFormatter directly
-
-.Programmatic string comparison
-[source,ruby]
-----
-require 'canon/diff_formatter'
-
-actual = "Hello World"
-expected = "Hello  World"  # Two spaces
-
-formatter = Canon::DiffFormatter.new(
-  use_color: true,
-  mode: :by_line,
-  context_lines: 3
-)
-
-diff = formatter.format([], :string,
-                        doc1: actual,
-                        doc2: expected)
-puts diff
-----
-
-.Output
-----
-Line-by-line diff (STRING mode):
-1| 1- | Hello░World
-| 1+ | Hello░░World
-----
-
-=== Using Canon.format for canonicalization
-
-Note: For strings, Canon does not perform canonicalization (no formatting changes are applied). The string is compared exactly as-is.
-
-.String comparison example
-[source,ruby]
-----
-require 'canon'
-
-actual = "Hello World"
-expected = "Hello World"
-
-# Strings are compared as-is (no canonicalization)
-result = Canon.format(actual, :string)
-# => "Hello World"
-
-result == expected # => true
-----
-
-== Configuration
-
-=== Customizing character visualization
-
-You can customize how characters are visualized by configuring the visualization map:
-
-.Customizing character map
-[source,ruby]
-----
-Canon::RSpecMatchers.configure do |config|
-  config.use_color = true
-  config.context_lines = 3
-  config.diff_mode = :by_line
-end
-----
-
-=== Color output
-
-Color output is enabled by default in RSpec. To disable:
-
-.Disabling colors
-[source,ruby]
-----
-Canon::RSpecMatchers.configure do |config|
-  config.use_color = false
-end
-----
-
-== Advanced Features
-
-=== Trailing newline detection
-
-Canon properly handles trailing newlines by preserving them during line splitting. This ensures that strings like:
-
-* `"text"` (no trailing newline)
-* `"text\n"` (with trailing newline)
-
-Are shown as different, with the trailing newline visualized as an empty line in the diff output.
-
-=== Context lines
-
-By default, Canon shows 3 lines of context around changes. This can be configured:
-
-.Adjusting context lines
-[source,ruby]
-----
-Canon::RSpecMatchers.configure do |config|
-  config.context_lines = 5 # Show 5 lines of context
-end
-----
-
-== Technical Implementation
-
-=== Line splitting
-
-Canon uses `split("\n", -1)` to preserve trailing empty strings, ensuring that:
-
-[source,ruby]
-----
-"abc\n".split("\n", -1) # => ["abc", ""]
-"abc".split("\n", -1) # => ["abc"]
-----
-
-This allows proper detection and visualization of trailing newlines.
-
-=== Character visualization map
-
-The character visualization is configurable via `lib/canon/diff_formatter/character_map.yml`. See link:VISUALIZATION_MAP[Visualization Map Documentation] for details on customizing character representations.
-
-== See Also
-
-* link:README[Canon README] - General Canon documentation
-* link:VISUALIZATION_MAP[Visualization Map] - Character visualization customization
-* link:DIFF_PARAMETERS[Diff Parameters] - Diff formatting options
-* link:MATCHER_BEHAVIOR[Matcher Behavior] - RSpec matcher details
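
Note on the removed document: the line-splitting rule and the character table it describes can be illustrated with a minimal standalone Ruby sketch. This is not Canon's code. The names `VISUALIZATION`, `split_lines`, and `visualize` are hypothetical and introduced only for illustration; the symbols are copied from the table in the removed file, and the real mapping lived in a YAML file whose contents are not shown in this diff.

```ruby
# Hypothetical sketch, not Canon's implementation.
# Symbols are taken from the "Character Visualization" table in the removed doc.
VISUALIZATION = {
  " "      => "░", # U+0020 space
  "\t"     => "⇥", # U+0009 tab
  "\u00A0" => "␣", # U+00A0 non-breaking space
  "\u200B" => "→"  # U+200B zero-width space
}.freeze

# Split on newlines with a negative limit so a trailing newline
# survives as a final empty string instead of being dropped.
def split_lines(text)
  text.split("\n", -1)
end

# Replace each mapped invisible character with its visible stand-in.
def visualize(line)
  line.gsub(/[ \t\u00A0\u200B]/) { |ch| VISUALIZATION.fetch(ch) }
end

p split_lines("abc\n")        # trailing newline kept as a final ""
puts visualize("Hello  World") # both spaces become visible
```

With a negative limit, `"abc\n"` and `"abc"` split into arrays of different lengths, which is what lets a line-based diff report the trailing newline as an extra empty line.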