sitediff 0.0.6 → 1.2.0

Files changed (55)
  1. checksums.yaml +5 -5
  2. data/.eslintignore +1 -0
  3. data/.eslintrc.json +28 -0
  4. data/.project +11 -0
  5. data/.rubocop.yml +179 -0
  6. data/.rubocop_todo.yml +51 -0
  7. data/CHANGELOG.md +28 -0
  8. data/Dockerfile +33 -0
  9. data/Gemfile +11 -0
  10. data/Gemfile.lock +85 -0
  11. data/INSTALLATION.md +146 -0
  12. data/LICENSE +339 -0
  13. data/README.md +810 -0
  14. data/Rakefile +12 -0
  15. data/Thorfile +135 -0
  16. data/bin/sitediff +9 -2
  17. data/config/.gitkeep +0 -0
  18. data/config/sanitize_domains.example.yaml +8 -0
  19. data/config/sitediff.example.yaml +81 -0
  20. data/docker-compose.test.yml +3 -0
  21. data/lib/sitediff/api.rb +276 -0
  22. data/lib/sitediff/cache.rb +57 -8
  23. data/lib/sitediff/cli.rb +156 -176
  24. data/lib/sitediff/config/creator.rb +61 -77
  25. data/lib/sitediff/config/preset.rb +75 -0
  26. data/lib/sitediff/config.rb +436 -31
  27. data/lib/sitediff/crawler.rb +27 -21
  28. data/lib/sitediff/diff.rb +32 -9
  29. data/lib/sitediff/fetch.rb +10 -3
  30. data/lib/sitediff/files/diff.html.erb +20 -2
  31. data/lib/sitediff/files/jquery.min.js +2 -0
  32. data/lib/sitediff/files/normalize.css +349 -0
  33. data/lib/sitediff/files/report.html.erb +171 -0
  34. data/lib/sitediff/files/sidebyside.html.erb +5 -2
  35. data/lib/sitediff/files/sitediff.css +303 -30
  36. data/lib/sitediff/files/sitediff.js +367 -0
  37. data/lib/sitediff/presets/drupal.yaml +63 -0
  38. data/lib/sitediff/report.rb +254 -0
  39. data/lib/sitediff/result.rb +50 -20
  40. data/lib/sitediff/sanitize/dom_transform.rb +47 -8
  41. data/lib/sitediff/sanitize/regexp.rb +24 -3
  42. data/lib/sitediff/sanitize.rb +81 -12
  43. data/lib/sitediff/uriwrapper.rb +65 -23
  44. data/lib/sitediff/webserver/resultserver.rb +30 -33
  45. data/lib/sitediff/webserver.rb +15 -3
  46. data/lib/sitediff.rb +130 -83
  47. data/misc/sitediff - overview report.png +0 -0
  48. data/misc/sitediff - page report.png +0 -0
  49. data/package-lock.json +878 -0
  50. data/package.json +25 -0
  51. data/sitediff.gemspec +51 -0
  52. metadata +91 -29
  53. data/lib/sitediff/files/html_report.html.erb +0 -66
  54. data/lib/sitediff/files/rules/drupal.yaml +0 -63
  55. data/lib/sitediff/rules.rb +0 -65
data/README.md ADDED
@@ -0,0 +1,810 @@
1
+ # SiteDiff CLI
2
+
3
+ **Warning:** SiteDiff 1.2.0 requires at least Ruby 3.1.2.
4
+
5
+ **Warning:** SiteDiff 1.0.0 introduces some backwards incompatible changes.
6
+
7
+ [![Build Status](https://travis-ci.org/evolvingweb/sitediff.svg?branch=master)](https://travis-ci.org/evolvingweb/sitediff)
8
+
9
+ ## Table of contents
10
+
11
+ - [Introduction](#introduction)
12
+ - [Installation](#installation)
13
+ - [Demo](#demo)
14
+ - [Usage](#usage)
15
+ - [Getting Started](#getting-started)
16
+ - [Comparing 2 Sites](#comparing-2-sites)
17
+ - [Spurious Diffs](#spurious-diffs)
18
+ - [Command Line Options](#command-line-options)
19
+ - [Finding Configuration Files](#finding-configuration-files)
20
+ - [Specifying Paths](#specifying-paths)
21
+ - [Debugging Rules](#debugging-rules)
22
+ - [Including and Excluding URLs](#including-and-excluding-urls)
23
+ - [Paths and Paths-file](#paths--paths-file)
24
+ - [Report Export](#export)
25
+ - [Running inside containers](#running-inside-containers)
26
+ - [Configuration](#configuration)
27
+ - [before_url / after_url](#before_url--after_url)
28
+ - [selector](#selector)
29
+ - [sanitization](#sanitization)
30
+ - [ignore_whitespace](#ignore_whitespace)
31
+ - [before / after](#before--after)
32
+ - [includes](#includes)
33
+ - [dom_transform](#dom_transform)
34
+ - [remove](#remove)
35
+ - [strip](#strip)
36
+ - [unwrap](#unwrap)
37
+ - [remove_class](#remove_class)
38
+ - [unwrap_root](#unwrap_root)
39
+ - [Organizing configuration files](#organizing-configuration-files)
40
+ - [Named regions](#named-regions)
41
+ - [report](#report)
42
+ - [title](#title)
43
+ - [details](#details)
44
+ - [before_note](#before_note)
45
+ - [after_note](#after_note)
46
+ - [before_url_report / after_url_report](#before_url_report--after_url_report)
47
+ - [Miscellaneous](#miscellaneous)
48
+ - [preset](#preset)
49
+ - [Include / Exclude Paths](#includeexclude-paths)
50
+ - [Curl Options](#curl-options)
51
+ - [Throttling](#throttling)
52
+ - [Timeouts](#timeouts)
53
+ - [Handling security](#handling-security)
54
+ - [interval](#interval)
55
+ - [concurrency](#concurrency)
56
+ - [depth](#depth)
57
+ - [curl_opts](#curl_opts)
58
+ - [Tips and Tricks](#tips-and-tricks)
59
+ - [Removing empty elements](#removing-empty-elements)
60
+ - [HTML Tag Formatting](#html-tag-formatting)
61
+ - [Empty Attributes](#empty-attributes)
62
+ - [Acknowledgements](#acknowledgements)
63
+
64
+ ## Introduction
65
+ SiteDiff makes it easy to see how a website changes. It can compare two similar
66
+ sites or it can show how a single site changed over time. It helps identify
67
+ undesirable changes to the site's HTML and it's a useful tool for conducting QA
68
+ on re-deployments, site upgrades, and more!
69
+
70
+ When you run SiteDiff, it produces an HTML report showing whether pages on
71
+ your site have changed or not. For pages that have changed, you can see a
72
+ colorized diff showing exactly what changed, or compare the visual differences
73
+ side-by-side in a browser.
74
+
75
+ SiteDiff supports a range of normalization / sanitization rules. These allow
76
+ you to eliminate spurious differences, narrowing down differences to the ones
77
+ that materially affect the site.
78
+
79
+ ## Installation
80
+
81
+ SiteDiff is fairly easy to install. Please refer to the
82
+ [installation docs](INSTALLATION.md).
83
+
84
+ ## Demo
85
+
86
+ After installing all dependencies, including version 2 of the `bundler` gem, you can quickly
87
+ see what SiteDiff can do. Simply use the following commands:
88
+
89
+ ```sh
90
+ git clone https://github.com/evolvingweb/sitediff
91
+ cd sitediff
92
+ bundle install
93
+ bundle exec thor fixture:serve
94
+ ```
95
+
96
+ Then visit `http://localhost:13080/` to view the report.
97
+
98
+ SiteDiff shows you an overview of all the pages and clearly indicates which
99
+ pages have changed and not changed.
100
+ ![page report preview](misc/sitediff%20-%20overview%20report.png?raw=true)
101
+
102
+ When you click on a changed page, you see a colorized diff of the page's markup
103
+ showing exactly what changed on the page.
104
+ ![page report preview](misc/sitediff%20-%20page%20report.png?raw=true)
105
+
106
+ ## Usage
107
+
108
+ Here are some instructions on getting started with SiteDiff. To see a list of
109
+ commands that SiteDiff offers, you can run:
110
+
111
+ ```sitediff help```
112
+
113
+ To get help for a particular command, say, `diff`, you can run:
114
+
115
+ ```sitediff help diff```
116
+
117
+ ### Getting started
118
+
119
+ To use SiteDiff on your site, create a configuration for your site:
120
+
121
+ ```sitediff init http://mysite.example.com```
122
+
123
+ SiteDiff will generate a configuration file named `sitediff.yaml` by default.
124
+
125
+ You can open the configuration file ```sitediff/sitediff.yaml``` to see the
126
+ default configuration generated by SiteDiff.
127
+ The [configuration reference](#configuration) section explains the contents
128
+ of this file and helps you customize it as per your requirements.
129
+
130
+ Then get SiteDiff to crawl your site by using:
131
+
132
+ ```sitediff crawl```
133
+
134
+ SiteDiff will then crawl your site, finding pages and caching their
135
+ contents. A list of discovered paths will be saved to a `paths.txt` file.
136
+
137
+ Now, you can make alterations to your site. For example, change a word on your
138
+ site's front page. After you're done, you can check what actually changed:
139
+
140
+ ```sitediff diff```
141
+
142
+ For each page, SiteDiff will report whether it did or did not change. For pages
143
+ that changed, it will display a diff. You can also see an HTML version of the
144
+ report using the following command:
145
+
146
+ ```sitediff serve```
147
+
148
+ SiteDiff will start an internal web server and open a report page on your
149
+ browser. For each page, you can see the diff and a side-by-side view of the
150
+ old and new versions.
151
+
152
+ You can now see if the changes were as you expected, or if some things didn't
153
+ quite work out as you hoped. If you noticed unexpected changes, congratulations:
154
+ SiteDiff just helped you find an issue you would have otherwise missed!
155
+
156
+ As you fix any issues, you can continue to alter your site and run
157
+ ```sitediff diff``` to check the changes against the old version. Once you're
158
+ satisfied with the state of your site, you can inform SiteDiff that it should
159
+ re-cache your site:
160
+
161
+ ```sitediff store```
162
+
163
+ This takes a snapshot of your website and the next time you run
164
+ ```sitediff diff```, it will use this new version as the reference for
165
+ comparison.
166
+
167
+ Happy diffing!
168
+
169
+ ### Comparing 2 sites
170
+
171
+ Sometimes you have two sites that you want to compare, for example a production
172
+ site hosted on a public server and a development site hosted on your computer.
173
+ SiteDiff can handle this situation, too! Just inform SiteDiff that there are
174
+ two sites to compare:
175
+
176
+ ```sitediff init http://mysite.example.com http://localhost/mysite```
177
+
178
+ Then when you run `sitediff diff`, it will compare the cached version of the
179
+ first site with the current version of the second site.
180
+
181
+ If both the first and second sites may be changing, you should tell SiteDiff
182
+ not to cache either site:
183
+
184
+ ```sitediff diff --cached=none```
185
+
186
+ ### Spurious diffs
187
+
188
+ Sometimes sites have spurious differences that you don't want to show up in a
189
+ comparison. For example, many sites protect against Cross-Site Request Forgery
190
+ using a [semi-random token](http://en.wikipedia.org/wiki/Cross-site_request_forgery#Synchronizer_token_pattern).
191
+ Since this token changes on each HTTP GET, you probably don't care about such
192
+ a change.
193
+
194
+ To help with issues such as this, SiteDiff allows you to normalize the HTML it
195
+ fetches as it compares pages. In the ```sitediff.yaml``` configuration file,
196
+ you can add "sanitization rules", which specify either DOM transformations or
197
+ regular expression substitutions.
198
+
199
+ Here's an example of a rule you might add to remove CSRF-protection tokens
200
+ generated by Django:
201
+
202
+ ```yaml
203
+ dom_transform:
204
+   - title: Remove CSRF tokens
205
+     type: remove
206
+     selector: input[name=csrfmiddlewaretoken]
207
+ ```
208
+
209
+ You can use one of the presets to apply framework-specific sanitization.
210
+ Currently, SiteDiff only comes with Drupal-specific presets.
211
+
212
+ See the [preset](#preset) section for more details.
213
+
214
+ ## Command Line Options
215
+
216
+ ### Finding configuration files
217
+
218
+ By default, SiteDiff puts everything in the `sitediff` folder. You can use
219
+ the `--directory` flag (or `-C`, as in the example below) to specify a different directory.
220
+
221
+ ```bash
222
+ sitediff init -C my_project_folder https://example.com
223
+ sitediff diff -C my_project_folder
224
+ sitediff serve -C my_project_folder
225
+ ```
226
+
227
+ ### Specifying paths
228
+
229
+ When you run ```sitediff diff```, you can specify which pages to look at in
230
+ 2 ways:
231
+
232
+ 1. The option ```--paths /foo /bar ...```.
233
+
234
+ If you're trying to fix one page in particular, specifying just that one
235
+ path will make ```sitediff diff``` run quickly!
236
+
237
+ 2. The option ```--paths-file FILE``` with a newline-delimited text file.
238
+
239
+ This is particularly useful when you're trying to eliminate all diffs.
240
+ SiteDiff creates a file ```sitediff/failures.txt``` containing all paths
241
+ which had differences, so as you try to fix differences, you can run:
242
+
243
+ ```sitediff diff --paths-file sitediff/failures.txt```
244
+
245
+ ### Debugging rules
246
+
247
+ When a sanitization rule isn't working quite right for you, you might run
248
+ `sitediff diff` many times over. If fetching all the pages is taking too long,
249
+ try adding the option ```--cached=all```. This tells SiteDiff not to re-fetch
250
+ the content, but just compare previously cached versions — it's a lot faster!
251
+
252
+ ### Including and Excluding URLs
253
+
254
+ By default, SiteDiff crawls pages that are linked through HTML anchors
255
+ (`<a href="...">`). Most linked pages will be HTML pages, but some links
256
+ will point to binary files such as PDF documents and images.
257
+
258
+ Using the option `--exclude='.*\.pdf'` ensures the crawler skips links
259
+ to documents with a `.pdf` extension. Note that the regular expression is
260
+ applied to the path of the URL, not the base of the URL.
261
+
262
+ For example `--include='.*\.com'` will not match `http://www.google.com/`,
263
+ because the path of that URL is `/` while the base is `www.google.com`.
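
The path-only matching described above can be sketched in Ruby. This is an illustration rather than SiteDiff's crawler code: `path_matches?` is a hypothetical helper, and it assumes the pattern is applied unanchored to the URL's path component.

```ruby
require 'uri'

# Hypothetical helper (not part of SiteDiff): tests a crawl filter
# against the path component of a URL, ignoring the host.
def path_matches?(url, pattern)
  Regexp.new(pattern).match?(URI.parse(url).path)
end

# The host contains ".com", but the path is just "/", so this prints false.
puts path_matches?('http://www.google.com/', '.*\.com')

# The path "/files/report.pdf" ends in ".pdf", so this prints true.
puts path_matches?('http://example.com/files/report.pdf', '.*\.pdf')
```

When a filter seems to have no effect, checking the pattern against the path alone, as here, is a quick way to see why.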
264
+
265
+ ### paths / paths-file
266
+
267
+ SiteDiff allows you to specify a list of paths that you want it to work with.
268
+ Alternatively, it can crawl the entire site and detect all paths.
269
+
270
+ * Running `sitediff init` configures SiteDiff for crawling and seeing differences.
271
+
272
+ * Running `sitediff crawl` makes sitediff crawl your site and detect
273
+ available paths. These paths are written to a `paths.txt` file which you
274
+ can modify according to your needs.
275
+
276
+ * You can also compute diffs only for paths specified in a custom paths file
277
+ using the `--paths-file` parameter. This file should contain paths starting
278
+ with a `/`, having one path per line.
279
+
280
+ ```
281
+ sitediff diff --paths-file=/path/to/paths.txt
282
+ ```
283
+
284
+ * You can also compute diffs for a handful of specific paths by specifying
285
+ them directly on the command line using the `--paths` parameter. Each path
286
+ should be separated by a space.
287
+
288
+ ```
289
+ sitediff diff --paths=/home /about /contact
290
+ ```
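
A paths file is easy to get subtly wrong, so it can help to validate it before running a diff. The Ruby sketch below is only an illustration: `parse_paths` is a hypothetical helper, and skipping blank lines is an assumption of this sketch, not documented SiteDiff behavior.

```ruby
# Hypothetical helper (not part of SiteDiff): reads newline-delimited
# paths and rejects entries that do not start with "/".
def parse_paths(text)
  # Assumption of this sketch: surrounding whitespace and blank lines are ignored.
  paths = text.lines.map(&:strip).reject(&:empty?)
  invalid = paths.reject { |path| path.start_with?('/') }
  raise ArgumentError, "Paths must start with '/': #{invalid.join(', ')}" unless invalid.empty?

  paths
end

puts parse_paths("/home\n/about\n/contact\n").inspect
```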
291
+
292
+ ### export
293
+ Generates a gzipped tar file containing the HTML report instead of generating
294
+ and serving live web pages. This option overrides `--report-format`, forcing
295
+ HTML.
296
+
297
+ ### Running inside containers
298
+
299
+ If you run SiteDiff inside a container or virtual machine, the URLs in its
300
+ report (such as ```localhost```) might not work from your host. You can fix
301
+ this by using the ```--before-url-report``` and ```--after-url-report```
302
+ options to tell SiteDiff to use a different URL in the report than the one
303
+ it uses for fetching.
304
+
305
+ For example, if you ran `sitediff init http://mysite.com http://localhost`
306
+ inside a [Vagrant](https://www.vagrantup.com/) VM, you might then run
307
+ something like:
308
+
309
+ ```sitediff diff --after-url-report=http://vagrant:8080```
310
+
311
+ ## Configuration
312
+
313
+ SiteDiff relies on a [YAML](http://yaml.org/) configuration file, usually
314
+ called `sitediff.yaml`. You can create a reasonable one using `sitediff init`,
315
+ but there are many useful things you may want to add or change manually.
316
+
317
+ In the `sitediff.yaml`, SiteDiff recognizes the keys described below. The
318
+ `config` directory contains some example `sitediff.yaml` files. For example,
319
+ [sitediff.example.yaml](config/sitediff.example.yaml).
320
+
321
+ ### before_url / after_url
322
+
323
+ ```yaml
324
+ before_url: http://example.com/subsite
325
+ after_url: http://localhost:8080/subsite
326
+ ```
327
+
328
+ They can also be paths to directories on the local filesystem.
329
+
330
+ The `after_url` MUST be provided either on the command line or in the
331
+ `sitediff.yaml`. If the `before_url` is provided, SiteDiff will compare the
332
+ two sites. Otherwise, it will compare the current version of the `after` site
333
+ with the stored version of that site, as created by `sitediff init` or
334
+ `sitediff store`.
335
+
336
+ ### selector
337
+
338
+ Chooses the sections of HTML we wish to compare, if you don't
339
+ want to compare the entire page. For example if you only want to compare
340
+ breadcrumbs between your two sites, you might specify:
341
+
342
+ ```yaml
343
+ selector: '#breadcrumb'
344
+ ```
345
+
346
+ ### sanitization
347
+
348
+ A list of regular expression rules to normalize your HTML for comparison.
349
+
350
+ Each rule should have a **pattern** regex, which is used to search the HTML.
351
+ Each found instance is replaced with the provided **substitute** or deleted
352
+ if no substitute is provided. A rule may also have a **selector**, which
353
+ constrains it to operate only on HTML fragments which match that CSS selector.
354
+
355
+ For example, forms on Drupal sites have a randomly generated `form_build_id`
356
+ on form pages:
357
+
358
+ ```html
359
+ <input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>
360
+ ```
361
+
362
+ We're not interested in comparing random content, so we could use the
363
+ following rule to fix this:
364
+
365
+ ```yaml
366
+ sanitization:
367
+   # Remove form build IDs
368
+   - pattern: '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>'
369
+     selector: 'input'
370
+     substitute: '<input type="hidden" name="form_build_id" value="__form_build_id__">'
371
+ ```
372
+
373
+ Sanitization rules may also have a **path** attribute, whose value is a
374
+ regular expression. If present, the rule will only apply to matching paths.
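
To see what a rule like this does, the Ruby sketch below applies a pattern and substitute to two versions of the form token. It is a simplified illustration, not SiteDiff's implementation: `apply_rule` is a hypothetical helper, the rule is given as a plain hash, and the `selector` constraint is left out.

```ruby
# Hypothetical helper (not SiteDiff's API): applies one sanitization rule,
# given as a hash with the same keys as the YAML rule, to a piece of HTML.
def apply_rule(html, rule, path: '/')
  # A rule with a `path` regex only applies to paths that match it.
  return html if rule['path'] && !Regexp.new(rule['path']).match?(path)

  # Replace every match with the substitute, or delete it if none is given.
  html.gsub(Regexp.new(rule['pattern']), rule['substitute'] || '')
end

rule = {
  'pattern' => '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>',
  'substitute' => '<input type="hidden" name="form_build_id" value="__form_build_id__">'
}

# Two fetches of the same form; the second token value is made up for this example.
old_page = '<input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>'
new_page = '<input type="hidden" name="form_build_id" value="form-0a1b2c3d"/>'

# Both sides normalize to the same placeholder, so the token stops showing up as a diff.
puts apply_rule(old_page, rule) == apply_rule(new_page, rule)
```

Substituting a placeholder instead of deleting the match keeps the element visible in the diff, which makes it easier to notice if the form itself disappears.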
375
+
376
+ ### ignore_whitespace
377
+ Ignore whitespace when doing the diff. This passes the `-w` option to the native OS `diff` command.
378
+
379
+ ```yaml
380
+ ignore_whitespace: true
381
+ ```
382
+
383
+ On the command line, use `-w` or `--ignore-whitespace`.
384
+
385
+ ```bash
386
+ sitediff diff -w
387
+ ```
388
+
389
+ ### before / after
390
+
391
+ Applies rules to just one side of the comparison.
392
+
393
+ These blocks can contain any of the following sections: `selector`,
394
+ `sanitization`, `dom_transform`. Such a section placed in `before` will be
395
+ applied just to the `before` side of the comparison and similarly for `after`.
396
+
397
+ For example, to keep different date formats on the two sides from causing
398
+ diff failures, you might use the following:
399
+
400
+ ```yaml
401
+ before:
402
+   sanitization:
403
+     - pattern: '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'
404
+       substitute: '__date__'
405
+ after:
406
+   sanitization:
407
+     - pattern: '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'
408
+       substitute: '__date__'
409
+ ```
410
+
411
+ The above rule will replace dates of the form `2004/12/05` in `before` and
412
+ dates of the form `May 12th 2004` in `after` with `__date__`.
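
The effect of these two per-side rules can be checked with a few lines of Ruby. This is a sketch, not SiteDiff code: `mask_dates` is a hypothetical helper and the sample markup is invented for the illustration.

```ruby
# Hypothetical helper: replaces every match of a date pattern with a placeholder.
def mask_dates(html, pattern)
  html.gsub(Regexp.new(pattern), '__date__')
end

before_pattern = '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'
after_pattern  = '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'

# Sample markup, made up for this example.
before_html = '<span class="date">2004/12/05</span>'
after_html  = '<span class="date">May 12th 2004</span>'

# Each side uses its own pattern, yet both end up with the same placeholder.
puts mask_dates(before_html, before_pattern)
puts mask_dates(after_html, after_pattern)
```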
413
+
414
+ ### includes
415
+
416
+ The names of other configuration YAML files to merge with this one.
417
+
418
+ ```yaml
419
+ includes:
420
+ - config/sanitize_domains.yaml
421
+ - config/strip_css_js.yaml
422
+ ```
423
+
424
+ ### dom_transform
425
+
426
+ A list of transformations to apply to the HTML before comparing.
427
+
428
+ This is similar to _sanitization_, but it applies transformations to the
429
+ structure of the HTML, instead of to the text. Each transformation has a
430
+ **type**, and potentially other attributes. The following types are available:
431
+
432
+ #### remove
433
+
434
+ Given a **selector**, removes all elements that match it.
435
+
436
+ For example, say we have a block containing the current time, which is
437
+ expected to change. To ignore that, we might choose to delete the block
438
+ before comparison:
439
+
440
+ ```yaml
441
+ dom_transform:
442
+   # Remove current time block
443
+   - type: remove
444
+     selector: div#block-time
445
+ ```
446
+
447
+ #### strip
448
+
449
+ Strip leading and trailing whitespace from the contents of a tag.
450
+
451
+ Uses the Ruby string `strip()` method. Whitespace is defined as any of the
452
+ following characters: null, horizontal tab, line feed, vertical tab, form
453
+ feed, carriage return, space.
454
+
455
+ To transform `<h1> Foo and Bar\n </h1>` to `<h1>Foo and Bar</h1>`:
456
+
457
+ ```yaml
458
+ dom_transform:
459
+   # Strip H1 tags
460
+   - type: strip
461
+     selector: h1
462
+ ```
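
Because this transform relies on Ruby's `String#strip`, its effect on the text of a tag is easy to try out directly. The helper below is hypothetical and only mimics the result on a single tag; it does not parse HTML the way SiteDiff does.

```ruby
# Hypothetical helper: mimics the effect of a `strip` transform on one tag's text.
def strip_tag_text(tag, text)
  # String#strip removes leading and trailing whitespace only;
  # whitespace between words is left untouched.
  "<#{tag}>#{text.strip}</#{tag}>"
end

puts strip_tag_text('h1', " Foo and Bar\n ")
```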
463
+
464
+ #### unwrap
465
+
466
+ Given a **selector**, replaces all matching elements with
467
+ their children. For example, your content on one side of the comparison might
468
+ look like this:
469
+
470
+ ```html
471
+ <p>This is some text</p>
472
+ <img src="lola.png" alt="Lola is a cute kitten." />
473
+ ```
474
+
475
+ But on the other side, it might be wrapped in an `article` tag:
476
+ ```html
477
+ <article>
478
+   <p>This is some text</p>
479
+   <img src="lola.png" alt="Lola is a cute kitten." />
480
+ </article>
481
+ ```
482
+
483
+ You could fix it with the following configuration:
484
+
485
+ ```yaml
486
+ dom_transform:
487
+   - type: unwrap
488
+     selector: article
489
+ ```
490
+
491
+ #### remove_class
492
+
493
+ Given a **selector** and a **class**, removes that class
494
+ from each element that matches the selector. It can also take a list of
495
+ classes, instead of just one.
496
+
497
+ For example, here are two sample rules for removing a single class and
498
+ removing multiple classes from all `div` elements:
499
+
500
+ ```yaml
501
+ dom_transform:
502
+   # Remove class foo from div elements
503
+   - type: remove_class
504
+     selector: div
505
+     class: class-foo
506
+   # Remove class bar and class baz from div elements
507
+   - type: remove_class
508
+     selector: div
509
+     class:
510
+       - class-bar
511
+       - class-baz
512
+ ```
513
+
514
+ #### unwrap_root
515
+
516
+ Replaces the entire root element with its children.
517
+
518
+ ### report
519
+
520
+ The settings under the `report` key allow you to display helpful details on the report.
521
+
522
+ ```yaml
523
+ report:
524
+   title: "Updates to example.com"
525
+   details: "This report verifies updates to example.com."
526
+   before_note: "The old site"
527
+   after_note: "The new site"
528
+   before_url_report: http://example.com
529
+   after_url_report: http://staging.example.com
530
+ ```
531
+
532
+ #### title
533
+
534
+ Display a title string at the top of the report.
535
+
536
+ #### details
537
+
538
+ Display text as a paragraph at the top of the report, below the title.
539
+
540
+ #### before_note
541
+
542
+ Display a brief explanatory note next to the `before` URL.
543
+
544
+ #### after_note
545
+
546
+ Display a brief explanatory note next to the `after` URL.
547
+
548
+ #### before_url_report / after_url_report
549
+
550
+ Changes how SiteDiff reports which URLs it is comparing, but doesn't change what
551
+ it actually compares.
552
+
553
+ Suppose you are serving your 'after' website on a virtual machine with
554
+ IP 192.168.2.3, and you are also running SiteDiff inside that VM. To make links
555
+ in the report accessible from outside the VM, you might provide:
556
+
557
+ ```yaml
558
+ after_url: http://localhost
559
+ report:
560
+   after_url_report: http://192.168.2.3
561
+ ```
562
+
563
+ If you don't wish to have the "Before" or "After" links in the report, set the corresponding option to `false`:
564
+
565
+ ```yaml
566
+ report:
567
+   after_url_report: false
568
+ ```
569
+
570
+ ### Miscellaneous
571
+
572
+ #### preset
573
+
574
+ Presets are stored in the `/lib/sitediff/presets` directory of this gem. You
575
+ can select a preset as follows:
576
+
577
+ ```yaml
578
+ settings:
579
+   preset: drupal
580
+ ```
581
+
582
+ #### Include/Exclude Paths
583
+
584
+ ##### exclude paths
585
+
586
+ A RegEx indicating the paths that should not be crawled.
587
+
588
+ ##### include paths
589
+
590
+ A RegEx indicating the paths that should be crawled.
591
+
592
+ ### Organizing configuration files
593
+
594
+ If your configuration file starts getting really big, SiteDiff lets you
595
+ separate it out into multiple files. Just have one base file that includes
596
+ other files:
597
+
598
+ ```yaml
599
+ includes:
600
+ - sanitization.yaml
601
+ - paths.yaml
602
+ ```
603
+
604
+ This allows you to separate your configuration into logical groups.
605
+ For example, generic rules for your site could live in a `generic.yaml` file,
606
+ while rules pertaining to a particular update you're conducting could
607
+ live in `update-8.2.yaml`.
608
+
609
+ ### Named regions
610
+
611
+ In major upgrades and migrations where there are significant changes to the markup,
612
+ simple diffs will not be of much value. To assist in these cases, `named
613
+ regions` let you define regions in the page markup and specify the order in which
614
+ they should be compared. Specifying the order helps in cases where the fields are
615
+ not in the same order on the new site.
616
+
617
+ For example, if you have a CMS displaying `title`, `author`, and `body` fields, you
618
+ could define the named regions and the selectors for the three fields as follows:
619
+
620
+ ```yaml
621
+ regions:
622
+   - name: title
623
+     selector: h1.title
624
+   - name: author
625
+     selector: .field-name-attribution
626
+   - name: body
627
+     selector: .field-name-body
628
+ ```
629
+
630
+ (You need to define `regions` for both the `before` and `after` sections.)
631
+
632
+ You must then define the order in which the fields should be compared, using the
633
+ `output` key.
634
+
635
+ ```yaml
636
+ output:
637
+ - title
638
+ - author
639
+ - body
640
+ ```
641
+
642
+ Before the two versions are compared, SiteDiff generates markup with
643
+ `<region>` tags and each `region` contains the markup matching the
644
+ corresponding selector.
645
+
646
+ For example:
647
+
648
+ ```html
649
+ <region id="title">
650
+   <h1 class="title">My Blog Post</h1>
651
+ </region>
652
+ <region id="author">
653
+   <div class="field-name-attribution">
654
+     <span class="label">By:</span> Alfred E. Neuman
655
+   </div>
656
+ </region>
657
+ <region id="body">
658
+   <div class="field-name-body">
659
+     <p>Lorem ipsum...
660
+   </div>
661
+ </region>
662
+ ```
663
+
664
+ The regions are processed first, so you can reference the `<region>` tags to
665
+ be more specific in your selectors for `dom_transform` and `sanitization`
666
+ sections.
667
+
668
+ For example:
669
+
670
+ ```yaml
671
+ dom_transform:
672
+   - name: Remove body div wrapper
673
+     type: unwrap
674
+     selector: region#body .field-name-body
675
+ ```
676
+
677
+ ### Curl Options
678
+
679
+ [Many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html) can be
680
+ passed to the underlying curl library. Add `--curl_options=name1:value1 name2:value2`
681
+ to the command line, for example `--curl_options=max_recv_speed_large:100000`
682
+ (remove the `CURLOPT_` prefix and write the name in lowercase), or add them to
683
+ your configuration file.
684
+
685
+ ```yaml
686
+ settings:
687
+   curl_opts:
688
+     max_recv_speed_large: 10000
689
+     ssl_verifypeer: false
690
+ ```
691
+
692
+ These CURL options can be put under the `settings` section of `sitediff.yaml`
693
+ as demonstrated above.
694
+
695
+ #### Throttling
696
+
697
+ A few options are also available to control how aggressively SiteDiff crawls.
698
+
699
+ - There's a command line option `--concurrency=N` for `sitediff init`
700
+ which controls the maximum number of simultaneous connections made.
701
+ A lower N means less aggressive crawling. The default is 3. You can specify this in the
702
+ `sitediff.yaml` file under the `settings` key.
703
+
704
+ - The underlying curl library has [many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)
705
+ such as `max_recv_speed_large` which can be helpful.
706
+
707
+ - There is a special command line option `--interval=T` for `sitediff init`.
708
+ This option allows the fetcher to delay for T milliseconds between
709
+ fetching pages. You can specify this in the `sitediff.yaml` file under the
710
+ `settings` key.
711
+
712
+ #### Timeouts
713
+
714
+ By default, no timeout is set, but one can be added with `--curl_options=timeout:60`
715
+ or in your configuration file.
716
+
717
+ ```yaml
718
+ settings:
719
+   curl_opts:
720
+     timeout: 60 # In seconds; or...
721
+     timeout_ms: 60000 # In milliseconds.
722
+ ```
723
+
724
+ #### Handling security
725
+
726
+ Often development or staging sites are protected by [HTTP Authentication](http://en.wikipedia.org/wiki/Basic_access_authentication).
727
+ SiteDiff allows you to specify a username and password, by using a URL like
728
+ `http://user:pass@example.com` or by adding a `userpwd` setting to your file.
729
+
730
+ SiteDiff ignores untrusted certificates by default, which is equivalent to the `ssl_verifypeer` and `ssl_verifyhost` settings below. The `userpwd` line shows how to supply HTTP credentials:
731
+
732
+ ```yaml
733
+ settings:
734
+   curl_opts:
735
+     ssl_verifypeer: false
736
+     ssl_verifyhost: 0
737
+     userpwd: "username:password"
738
+ ```
739
+
740
+ The `settings` section contains various parameters which affect the way SiteDiff works. You can
741
+ have the following keys under `settings`.
742
+
743
+ #### interval
744
+ An integer indicating the number of milliseconds SiteDiff should wait
745
+ between requests.
746
+
747
+ #### concurrency
748
+ The maximum number of simultaneous requests that SiteDiff should make.
749
+
750
+ #### depth
751
+
752
+ The depth to which SiteDiff should crawl the website. Defaults to 3,
753
+ which means 3 levels deep.
754
+
755
+ #### curl_opts
756
+
757
+ Options to pass to the underlying curl library. Remove the `CURLOPT_` prefix in
758
+ this [full list of options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)
759
+ and write in lowercase. Useful for throttling.
760
+
761
+ ```yaml
762
+ settings:
763
+   curl_opts:
764
+     connecttimeout: 3
765
+     followlocation: true
766
+     max_recv_speed_large: 10000
767
+ ```
768
+
769
+ ## Tips and Tricks
770
+
771
+ Here are some tips and tricks that we've learned using SiteDiff:
772
+
773
+ - Use single quotes or double quotes around selectors. Remember that the `#` is a comment in YAML.
774
+ - Be specific enough with selectors to not affect elements on other pages.
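
The quoting tip is easy to verify with Ruby's bundled YAML parser. The snippet below is a small illustration using only the standard library; the selectors are the ones used elsewhere in this README.

```ruby
require 'yaml'

# Unquoted: "#breadcrumb" follows a space, so YAML reads it as a comment
# and the selector ends up empty (nil).
unquoted = YAML.safe_load('selector: #breadcrumb')

# Quoted: the selector survives as a string.
quoted = YAML.safe_load("selector: '#breadcrumb'")

# A "#" in the middle of a word is not a comment, so this one works unquoted.
inline = YAML.safe_load('selector: div#block-time')

p unquoted['selector']
p quoted['selector']
p inline['selector']
```

Quoting every selector avoids having to remember which of these cases applies.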
775
+
776
+ ### Removing Empty Elements
777
+
778
+ If you have an empty `<p/>` tag appearing in the diff, you can write the following in your sanitization lists:
779
+ ```yaml
780
+ - name: remove_empty_p
781
+   pattern: '<p/>'
782
+   substitute: ''
783
+ ```
784
+
785
+ ### HTML Tag Formatting
786
+
787
+ There are times when the HTML tags do not have newlines between them on one of the sites you wish to compare. In this
788
+ case, these sanitization rules are useful:
789
+ ```yaml
790
+ - name: remove_space_before
791
+   pattern: '\s*(\n)<'
792
+   substitute: '\1<'
793
+
794
+ - name: remove_space_after
795
+   pattern: '>(\n)\s*'
796
+   substitute: '>\1'
797
+ ```
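
To see what these two rules normalize, the Ruby sketch below applies them in order to an indented and a compact version of the same markup. `normalize_tag_whitespace` is a hypothetical helper written for this illustration, and the sample markup is invented. It uses plain `gsub` and is not SiteDiff's own sanitizer.

```ruby
# Hypothetical helper: applies the two whitespace rules, in order.
def normalize_tag_whitespace(html)
  rules = [
    { 'pattern' => '\s*(\n)<', 'substitute' => '\1<' }, # remove_space_before
    { 'pattern' => '>(\n)\s*', 'substitute' => '>\1' }  # remove_space_after
  ]
  rules.reduce(html) do |text, rule|
    text.gsub(Regexp.new(rule['pattern']), rule['substitute'])
  end
end

# The same markup with and without extra whitespace around line breaks.
indented = "<div>\n  <p>Hello   \n</p>\n</div>"
compact  = "<div>\n<p>Hello\n</p>\n</div>"

# After normalization both versions are identical, so no diff is reported.
puts normalize_tag_whitespace(indented) == normalize_tag_whitespace(compact)
```

Note that both rules only act on whitespace next to a line break; they do not insert newlines between tags that sit on the same line.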
798
+
799
+ ### Empty Attributes
800
+
801
+ After writing rules, you may end up with empty attributes, like `width=""`. Here's a sanitization rule that removes empty `class` attributes:
802
+ ```yaml
803
+ - name: remove_empty_class
804
+   pattern: ' class=""'
805
+   substitute: ''
806
+ ```
807
+
808
+ ## Acknowledgements
809
+
810
+ SiteDiff is brought to you by [Evolving Web](https://evolvingweb.ca/).