sitediff 1.1.1 → 1.2.0

data/README.md ADDED
@@ -0,0 +1,810 @@
1
+ # SiteDiff CLI
2
+
3
+ **Warning:** SiteDiff 1.2.0 requires at least Ruby 3.1.2.
4
+
5
+ **Warning:** SiteDiff 1.0.0 introduces some backwards incompatible changes.
6
+
7
+ [![Build Status](https://travis-ci.org/evolvingweb/sitediff.svg?branch=master)](https://travis-ci.org/evolvingweb/sitediff)
8
+
9
+ ## Table of contents
10
+
11
+ - [Introduction](#introduction)
12
+ - [Installation](#installation)
13
+ - [Demo](#demo)
14
+ - [Usage](#usage)
15
+ - [Getting Started](#getting-started)
16
+ - [Comparing 2 Sites](#comparing-2-sites)
17
+ - [Spurious Diffs](#spurious-diffs)
18
+ - [Command Line Options](#command-line-options)
19
+ - [Finding Configuration Files](#finding-configuration-files)
20
+ - [Specifying Paths](#specifying-paths)
21
+ - [Debugging Rules](#debugging-rules)
22
+ - [Including and Excluding URLs](#including-and-excluding-urls)
23
+ - [Paths and Paths-file](#paths--paths-file)
24
+ - [Report Export](#export)
25
+ - [Running inside containers](#running-inside-containers)
26
+ - [Configuration](#configuration)
27
+ - [before_url / after_url](#before_url--after_url)
28
+ - [selector](#selector)
29
+ - [sanitization](#sanitization)
30
+ - [ignore_whitespace](#ignore_whitespace)
31
+ - [before / after](#before--after)
32
+ - [includes](#includes)
33
+ - [dom_transform](#dom_transform)
34
+ - [remove](#remove)
35
+ - [strip](#strip)
36
+ - [unwrap](#unwrap)
37
+ - [remove_class](#remove_class)
38
+ - [unwrap_root](#unwrap_root)
39
+ - [Organizing configuration files](#organizing-configuration-files)
40
+ - [Named regions](#named-regions)
41
+ - [report](#report)
42
+ - [title](#title)
43
+ - [details](#details)
44
+ - [before_note](#before_note)
45
+ - [after_note](#after_note)
46
+ - [before_url_report / after_url_report](#before_url_report--after_url_report)
47
+ - [Miscellaneous](#miscellaneous)
48
+ - [preset](#preset)
49
+ - [Include / Exclude Paths](#includeexclude-paths)
50
+ - [Curl Options](#curl-options)
51
+ - [Throttling](#throttling)
52
+ - [Timeouts](#timeouts)
53
+ - [Handling security](#handling-security)
54
+ - [interval](#interval)
55
+ - [concurrency](#concurrency)
56
+ - [depth](#depth)
57
+ - [curl_opts](#curl_opts)
58
+ - [Tips and Tricks](#tips-and-tricks)
59
+ - [Removing empty elements](#removing-empty-elements)
60
+ - [HTML Tag Formatting](#html-tag-formatting)
61
+ - [Empty Attributes](#empty-attributes)
62
+ - [Acknowledgements](#acknowledgements)
63
+
64
+ ## Introduction
65
+ SiteDiff makes it easy to see how a website changes. It can compare two similar
66
+ sites or it can show how a single site changed over time. It helps identify
67
+ undesirable changes to the site's HTML and it's a useful tool for conducting QA
68
+ on re-deployments, site upgrades, and more!
69
+
70
+ When you run SiteDiff, it produces an HTML report showing whether pages on
71
+ your site have changed or not. For pages that have changed, you can see a
72
+ colorized diff showing exactly what changed, or compare the visual differences
73
+ side-by-side in a browser.
74
+
75
+ SiteDiff supports a range of normalization / sanitization rules. These allow
76
+ you to eliminate spurious differences, narrowing down differences to the ones
77
+ that materially affect the site.
78
+
79
+ ## Installation
80
+
81
+ SiteDiff is fairly easy to install. Please refer to the
82
+ [installation docs](INSTALLATION.md).
83
+
84
+ ## Demo
85
+
86
+ After installing all dependencies, including version 2 of the `bundler` gem, you can quickly
87
+ see what SiteDiff can do. Simply use the following commands:
88
+
89
+ ```sh
90
+ git clone https://github.com/evolvingweb/sitediff
91
+ cd sitediff
92
+ bundle install
93
+ bundle exec thor fixture:serve
94
+ ```
95
+
96
+ Then visit `http://localhost:13080/` to view the report.
97
+
98
+ SiteDiff shows you an overview of all the pages and clearly indicates which
99
+ pages have changed and not changed.
100
+ ![page report preview](misc/sitediff%20-%20overview%20report.png?raw=true)
101
+
102
+ When you click on a changed page, you see a colorized diff of the page's markup
103
+ showing exactly what changed on the page.
104
+ ![page report preview](misc/sitediff%20-%20page%20report.png?raw=true)
105
+
106
+ ## Usage
107
+
108
+ Here are some instructions on getting started with SiteDiff. To see a list of
109
+ commands that SiteDiff offers, you can run:
110
+
111
+ ```sitediff help```
112
+
113
+ To get help for a particular command, say, `diff`, you can run:
114
+
115
+ ```sitediff help diff```
116
+
117
+ ### Getting started
118
+
119
+ To use SiteDiff on your site, create a configuration for your site:
120
+
121
+ ```sitediff init http://mysite.example.com```
122
+
123
+ SiteDiff will generate a configuration file named `sitediff.yaml` by default.
124
+
125
+ You can open the configuration file ```sitediff/sitediff.yaml``` to see the
126
+ default configuration generated by SiteDiff.
127
+ The [configuration reference](#configuration) section explains the contents
128
+ of this file and helps you customize it as per your requirements.
129
+
130
+ Then get SiteDiff to crawl your site by using:
131
+
132
+ ```sitediff crawl```
133
+
134
+ SiteDiff will then crawl your site, finding pages and caching their
135
+ contents. A list of discovered paths will be saved to a `paths.txt` file.
136
+
137
+ Now, you can make alterations to your site. For example, change a word on your
138
+ site's front page. After you're done, you can check what actually changed:
139
+
140
+ ```sitediff diff```
141
+
142
+ For each page, SiteDiff will report whether it did or did not change. For pages
143
+ that changed, it will display a diff. You can also see an HTML version of the
144
+ report using the following command:
145
+
146
+ ```sitediff serve```
147
+
148
+ SiteDiff will start an internal web server and open a report page on your
149
+ browser. For each page, you can see the diff and a side-by-side view of the
150
+ old and new versions.
151
+
152
+ You can now see if the changes were as you expected, or if some things didn't
153
+ quite work out as you hoped. If you noticed unexpected changes, congratulations:
154
+ SiteDiff just helped you find an issue you would have otherwise missed!
155
+
156
+ As you fix any issues, you can continue to alter your site and run
157
+ ```sitediff diff``` to check the changes against the old version. Once you're
158
+ satisfied with the state of your site, you can inform SiteDiff that it should
159
+ re-cache your site:
160
+
161
+ ```sitediff store```
162
+
163
+ This takes a snapshot of your website and the next time you run
164
+ ```sitediff diff```, it will use this new version as the reference for
165
+ comparison.
166
+
167
+ Happy diffing!
168
+
169
+ ### Comparing 2 sites
170
+
171
+ Sometimes you have two sites that you want to compare, for example a production
172
+ site hosted on a public server and a development site hosted on your computer.
173
+ SiteDiff can handle this situation, too! Just inform SiteDiff that there are
174
+ two sites to compare:
175
+
176
+ ```sitediff init http://mysite.example.com http://localhost/mysite```
177
+
178
+ Then when you run `sitediff diff`, it will compare the cached version of the
179
+ first site with the current version of the second site.
180
+
181
+ If both the first and second sites may be changing, you should tell SiteDiff
182
+ not to cache either site:
183
+
184
+ ```sitediff diff --cached=none```
185
+
186
+ ### Spurious diffs
187
+
188
+ Sometimes sites have spurious differences that you don't want to show up in a
189
+ comparison. For example, many sites protect against Cross-Site Request Forgery
190
+ using a [semi-random token](http://en.wikipedia.org/wiki/Cross-site_request_forgery#Synchronizer_token_pattern).
191
+ Since this token changes on each HTTP GET, you probably don't care about such
192
+ a change.
193
+
194
+ To help with issues such as this, SiteDiff allows you to normalize the HTML it
195
+ fetches as it compares pages. In the ```sitediff.yaml``` configuration file,
196
+ you can add "sanitization rules", which specify either DOM transformations or
197
+ regular expression substitutions.
198
+
199
+ Here's an example of a rule you might add to remove CSRF-protection tokens
200
+ generated by Django:
201
+
202
+ ```yaml
203
+ dom_transform:
204
+   - title: Remove CSRF tokens
205
+     type: remove
206
+     selector: input[name=csrfmiddlewaretoken]
207
+ ```
208
+
209
+ You can use one of the presets to apply framework-specific sanitization.
210
+ Currently, SiteDiff only comes with Drupal-specific presets.
211
+
212
+ See the [preset](#preset) section for more details.
213
+
214
+ ## Command Line Options
215
+
216
+ ### Finding configuration files
217
+
218
+ By default, SiteDiff will put everything in the `sitediff` folder. You can use
219
+ the `--directory` flag (`-C` for short) to specify a different directory.
220
+
221
+ ```bash
222
+ sitediff init -C my_project_folder https://example.com
223
+ sitediff diff -C my_project_folder
224
+ sitediff serve -C my_project_folder
225
+ ```
226
+
227
+ ### Specifying paths
228
+
229
+ When you run ```sitediff diff```, you can specify which pages to look at in
230
+ 2 ways:
231
+
232
+ 1. The option ```--paths /foo /bar ...```.
233
+
234
+ If you're trying to fix one page in particular, specifying just that one
235
+ path will make ```sitediff diff``` run quickly!
236
+
237
+ 2. The option ```--paths-file FILE``` with a newline-delimited text file.
238
+
239
+ This is particularly useful when you're trying to eliminate all diffs.
240
+ SiteDiff creates a file ```sitediff/failures.txt``` containing all paths
241
+ which had differences, so as you try to fix differences, you can run:
242
+
243
+ ```sitediff diff --paths-file sitediff/failures.txt```
244
+
245
+ ### Debugging rules
246
+
247
+ When a sanitization rule isn't working quite right for you, you might run
248
+ `sitediff diff` many times over. If fetching all the pages is taking too long,
249
+ try adding the option ```--cached=all```. This tells SiteDiff not to re-fetch
250
+ the content, but just compare previously cached versions — it's a lot faster!
251
+
252
+ ### Including and Excluding URLs
253
+
254
+ By default sitediff crawls pages that are indicated with an HTML anchor using
255
+ the `<A HREF` syntax. Most pages linked will be HTML pages, but some links
256
+ will contain binaries such as PDF documents and images.
257
+
258
+ Using the option `--exclude='.*\.pdf'` ensures the crawler skips links
259
+ to documents with a `.pdf` extension. Note that the regular expression is
260
+ applied to the path of the URL, not the base of the URL.
261
+
262
+ For example `--include='.*\.com'` will not match `http://www.google.com/`,
263
+ because the path of that URL is `/` while the base is `www.google.com`.
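
The path-only matching described above can be sketched in Ruby. This is a minimal illustration, not SiteDiff's actual implementation: the helper name `path_matches?` is hypothetical, and it assumes the pattern is applied unanchored to the path component of the URL.

```ruby
require 'uri'

# Hypothetical helper (not part of SiteDiff): tests a pattern against
# the path component of a URL only, ignoring the scheme and host.
def path_matches?(url, pattern)
  path = URI.parse(url).path
  path = '/' if path.empty? # a bare host has an empty path
  Regexp.new(pattern).match?(path)
end

puts path_matches?('http://www.google.com/', '.*\.com')            # path is "/"
puts path_matches?('http://example.com/docs/guide.pdf', '.*\.pdf') # path is "/docs/guide.pdf"
```

The first call sees only `/`, so a pattern aimed at the host name has nothing to match.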
264
+
265
+ ### paths / paths-file
266
+
267
+ SiteDiff allows you to specify a list of paths that you want it to work with.
268
+ Alternatively, it can crawl the entire site and detect all paths.
269
+
270
+ * Running `sitediff init` configures SiteDiff for crawling and seeing differences.
271
+
272
+ * Running `sitediff crawl` makes sitediff crawl your site and detect
273
+ available paths. These paths are written to a `paths.txt` file which you
274
+ can modify according to your needs.
275
+
276
+ * You can also compute diffs only for paths specified in a custom paths file
277
+ using the `--paths-file` parameter. This file should contain paths starting
278
+ with a `/`, having one path per line.
279
+
280
+ ```
281
+ sitediff diff --paths-file=/path/to/paths.txt
282
+ ```
283
+
284
+ * You can also compute diffs for a handful of specific paths by specifying
285
+ them directly on the command line using the `--paths` parameter. Each path
286
+ should be separated by a space.
287
+
288
+ ```
289
+ sitediff diff --paths=/home /about /contact
290
+ ```
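
The format expected in a paths file (one path per line, each starting with `/`) can be checked with a short Ruby sketch. The helper `parse_paths` is hypothetical and only illustrates the format; SiteDiff's own parsing may differ, for example in how it treats blank lines.

```ruby
# Hypothetical helper (not part of SiteDiff): parses the contents of a
# newline-delimited paths file, skipping blank lines and rejecting
# entries that do not start with "/".
def parse_paths(text)
  paths = text.each_line.map(&:strip).reject(&:empty?)
  invalid = paths.reject { |path| path.start_with?('/') }
  raise ArgumentError, "paths must start with '/': #{invalid.join(', ')}" unless invalid.empty?

  paths
end

p parse_paths("/home\n/about\n\n/contact\n")
```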
291
+
292
+ ### export
293
+ Generate a gzipped tar file containing the HTML report instead of generating
294
+ and serving live web pages. This option overrides `--report-format`, forcing
295
+ HTML.
296
+
297
+ ### Running inside containers
298
+
299
+ If you run SiteDiff inside a container or virtual machine, the URLs in its
300
+ report might not work from your host, such as ```localhost```. You can fix
301
+ this by using the ```--before-url-report``` and ```--after-url-report```
302
+ options, to tell SiteDiff to use a different URL in the report than the one
303
+ it uses for fetching.
304
+
305
+ For example, if you ran `sitediff init http://mysite.com http://localhost`
306
+ inside a [Vagrant](https://www.vagrantup.com/) VM, you might then run
307
+ something like:
308
+
309
+ ```sitediff diff --after-url-report=http://vagrant:8080```
310
+
311
+ ## Configuration
312
+
313
+ SiteDiff relies on a [YAML](http://yaml.org/) configuration file, usually
314
+ called `sitediff.yaml`. You can create a reasonable one using `sitediff init`,
315
+ but there are many useful things you may want to add or change manually.
316
+
317
+ In the `sitediff.yaml`, SiteDiff recognizes the keys described below. The
318
+ `config` directory contains some example `sitediff.yaml` files. For example,
319
+ [sitediff.example.yaml](config/sitediff.example.yaml).
320
+
321
+ ### before_url / after_url
322
+
323
+ ```yaml
324
+ before_url: http://example.com/subsite
325
+ after_url: http://localhost:8080/subsite
326
+ ```
327
+
328
+ They can also be paths to directories on the local filesystem.
329
+
330
+ The `after_url` MUST be provided either at the command-line or in the
331
+ `sitediff.yaml`. If the `before_url` is provided, SiteDiff will compare the
332
+ two sites. Otherwise, it will compare the current version of the `after` site
333
+ with the stored version of that site, as created by `sitediff init` or
334
+ `sitediff store`.
335
+
336
+ ### selector
337
+
338
+ Chooses the sections of HTML we wish to compare, if you don't
339
+ want to compare the entire page. For example if you only want to compare
340
+ breadcrumbs between your two sites, you might specify:
341
+
342
+ ```yaml
343
+ selector: '#breadcrumb'
344
+ ```
345
+
346
+ ### sanitization
347
+
348
+ A list of regular expression rules to normalize your HTML for comparison.
349
+
350
+ Each rule should have a **pattern** regex, which is used to search the HTML.
351
+ Each found instance is replaced with the provided **substitute** or deleted
352
+ if no substitute is provided. A rule may also have a **selector**, which
353
+ constrains it to operate only on HTML fragments which match that CSS selector.
354
+
355
+ For example, forms on Drupal sites have a randomly generated `form_build_id`
356
+ on form pages:
357
+
358
+ ```html
359
+ <input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>
360
+ ```
361
+
362
+ We're not interested in comparing random content, so we could use the
363
+ following rule to fix this:
364
+
365
+ ```yaml
366
+ sanitization:
367
+   # Remove form build IDs
368
+   - pattern: '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>'
369
+     selector: 'input'
370
+     substitute: '<input type="hidden" name="form_build_id" value="__form_build_id__">'
371
+ ```
372
+
373
+ Sanitization rules may also have a **path** attribute, whose value is a
374
+ regular expression. If present, the rule will only apply to matching paths.
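
To see what the pattern/substitute pair above does, here is a minimal Ruby sketch that applies it with `String#gsub`. The helper name and the second token value are made up for illustration, and the sketch ignores the `selector` and `path` constraints; it only shows the regular expression substitution itself.

```ruby
# Hypothetical helper (not part of SiteDiff): applies one
# pattern/substitute pair to an HTML string.
def sanitize_form_build_id(html)
  pattern = Regexp.new('<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>')
  substitute = '<input type="hidden" name="form_build_id" value="__form_build_id__">'
  html.gsub(pattern, substitute)
end

before = '<input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>'
after  = '<input type="hidden" name="form_build_id" value="form-0a1b2c3d4e5f"/>' # made-up token

# Both sides normalize to the same markup, so no diff is reported.
puts sanitize_form_build_id(before) == sanitize_form_build_id(after)
```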
375
+
376
+ ### ignore_whitespace
377
+ Ignore whitespace when doing the diff. This passes the `-w` option to the native OS `diff` command.
378
+
379
+ ```yaml
380
+ ignore_whitespace: true
381
+ ```
382
+
383
+ On the command line, use `-w` or `--ignore-whitespace`.
384
+
385
+ ```bash
386
+ sitediff diff -w
387
+ ```
388
+
389
+ ### before / after
390
+
391
+ Applies rules to just one side of the comparison.
392
+
393
+ These blocks can contain any of the following sections: `selector`,
394
+ `sanitization`, `dom_transform`. Such a section placed in `before` will be
395
+ applied just to the `before` side of the comparison and similarly for `after`.
396
+
397
+ For example, if you wanted to let different date formatting not create diff
398
+ failures, you might use the following:
399
+
400
+ ```yaml
401
+ before:
402
+   sanitization:
403
+     - pattern: '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'
404
+       substitute: '__date__'
405
+ after:
406
+   sanitization:
407
+     - pattern: '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'
408
+       substitute: '__date__'
409
+ ```
410
+
411
+ The above rule will replace dates of the form `2004/12/05` in `before` and
412
+ dates of the form `May 12th 2004` in `after` with `__date__`.
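
As a rough check of the two date patterns, the following Ruby sketch applies each one to a sample fragment. The surrounding `<span>` markup is invented for illustration; only the patterns and the `__date__` placeholder come from the configuration above.

```ruby
# Patterns copied from the `before` and `after` sanitization rules.
before_pattern = Regexp.new('[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}')
after_pattern  = Regexp.new('[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}')

# Sample fragments (assumed markup, for illustration only).
before_html = '<span>Posted on 2004/12/05</span>'
after_html  = '<span>Posted on May 12th 2004</span>'

before_clean = before_html.gsub(before_pattern, '__date__')
after_clean  = after_html.gsub(after_pattern, '__date__')

puts before_clean
puts after_clean
```

Once both sides contain the same placeholder, the differing date formats no longer produce a diff.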
413
+
414
+ ### includes
415
+
416
+ The names of other configuration YAML files to merge with this one.
417
+
418
+ ```yaml
419
+ includes:
420
+ - config/sanitize_domains.yaml
421
+ - config/strip_css_js.yaml
422
+ ```
423
+
424
+ ### dom_transform
425
+
426
+ A list of transformations to apply to the HTML before comparing.
427
+
428
+ This is similar to _sanitization_, but it applies transformations to the
429
+ structure of the HTML, instead of to the text. Each transformation has a
430
+ **type**, and potentially other attributes. The following types are available:
431
+
432
+ #### remove
433
+
434
+ Given a **selector**, removes all elements that match it.
435
+
436
+ For example, say we have a block containing the current time, which is
437
+ expected to change. To ignore that, we might choose to delete the block
438
+ before comparison:
439
+
440
+ ```yaml
441
+ dom_transform:
442
+   # Remove current time block
443
+   - type: remove
444
+     selector: div#block-time
445
+ ```
446
+
447
+ #### strip
448
+
449
+ Strip leading and trailing whitespace from the contents of a tag.
450
+
451
+ Uses the Ruby string `strip()` method. Whitespace is defined as any of the
452
+ following characters: null, horizontal tab, line feed, vertical tab, form
453
+ feed, carriage return, space.
454
+
455
+ To transform `<h1> Foo and Bar\n </h1>` to `<h1>Foo and Bar</h1>`:
456
+
457
+ ```yaml
458
+ dom_transform:
459
+   # Strip H1 tags
460
+   - type: strip
461
+     selector: h1
462
+ ```
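
The effect on the tag contents can be reproduced directly with Ruby's `String#strip`. This sketch works on a plain string rather than a parsed DOM, so it only illustrates the whitespace handling, not how SiteDiff locates the element.

```ruby
# Contents of the <h1> element from the example above.
contents = " Foo and Bar\n "

# String#strip removes leading and trailing whitespace only;
# the spaces between the words are kept.
stripped = contents.strip

puts "<h1>#{stripped}</h1>"
```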
463
+
464
+ #### unwrap
465
+
466
+ Given a **selector**, replaces all matching elements with
467
+ their children. For example, your content on one side of the comparison might
468
+ look like this:
469
+
470
+ ```html
471
+ <p>This is some text</p>
472
+ <img src="lola.png" alt="Lola is a cute kitten." />
473
+ ```
474
+
475
+ But on the other side, it might be wrapped in an `article` tag:
476
+ ```html
477
+ <article>
478
+   <p>This is some text</p>
479
+   <img src="lola.png" alt="Lola is a cute kitten." />
480
+ </article>
481
+ ```
482
+
483
+ You could fix it with the following configuration:
484
+
485
+ ```yaml
486
+ dom_transform:
487
+   - type: unwrap
488
+     selector: article
489
+ ```
490
+
491
+ #### remove_class
492
+
493
+ Given a **selector** and a **class**, removes that class
494
+ from each element that matches the selector. It can also take a list of
495
+ classes, instead of just one.
496
+
497
+ For example, here are two sample rules for removing a single class and
498
+ removing multiple classes from all `div` elements:
499
+
500
+ ```yaml
501
+ dom_transform:
502
+   # Remove class foo from div elements
503
+   - type: remove_class
504
+     selector: div
505
+     class: class-foo
506
+   # Remove class bar and class baz from div elements
507
+   - type: remove_class
508
+     selector: div
509
+     class:
510
+       - class-bar
511
+       - class-baz
512
+ ```
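
The "one class or a list of classes" behavior can be sketched in Ruby on the value of a `class` attribute. The helper `remove_classes` is hypothetical and works on a plain string; SiteDiff itself operates on the parsed HTML.

```ruby
# Hypothetical helper (not part of SiteDiff): removes one class or a
# list of classes from the value of a `class` attribute.
def remove_classes(class_attr, unwanted)
  # Array() turns a single class name into a one-element list.
  (class_attr.split - Array(unwanted)).join(' ')
end

puts remove_classes('class-foo teaser', 'class-foo')
puts remove_classes('class-bar teaser class-baz', %w[class-bar class-baz])
```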
513
+
514
+ #### unwrap_root
515
+
516
+ Replaces the entire root element with its children.
517
+
518
+ ### report
519
+
520
+ The settings under the `report` key allow you to display helpful details on the report.
521
+
522
+ ```yaml
523
+ report:
524
+   title: "Updates to example.com"
525
+   details: "This report verifies updates to example.com."
526
+   before_note: "The old site"
527
+   after_note: "The new site"
528
+   before_url_report: http://example.com
529
+   after_url_report: http://staging.example.com
530
+ ```
531
+
532
+ #### title
533
+
534
+ Display a title string at the top of the report.
535
+
536
+ #### details
537
+
538
+ Display text as a paragraph at the top of the report, below the title.
539
+
540
+ #### before_note
541
+
542
+ Display a brief explanatory note next to the `before` URL.
543
+
544
+ #### after_note
545
+
546
+ Display a brief explanatory note next to the `after` URL.
547
+
548
+ #### before_url_report / after_url_report
549
+
550
+ Changes how SiteDiff reports which URLs it is comparing, but doesn't change what
551
+ it actually compares.
552
+
553
+ Suppose you are serving your 'after' website on a virtual machine with
554
+ IP 192.168.2.3, and you are also running SiteDiff inside that VM. To make links
555
+ in the report accessible from outside the VM, you might provide:
556
+
557
+ ```yaml
558
+ after_url: http://localhost
559
+ report:
560
+   after_url_report: http://192.168.2.3
561
+ ```
562
+
563
+ If you don't wish to have the "Before" or "After" links in the report, set the corresponding option to `false`:
564
+
565
+ ```yaml
566
+ report:
567
+   after_url_report: false
568
+ ```
569
+
570
+ ### Miscellaneous
571
+
572
+ #### preset
573
+
574
+ Presets are stored in the `/lib/sitediff/presets` directory of this gem. You
575
+ can select a preset as follows:
576
+
577
+ ```yaml
578
+ settings:
579
+   preset: drupal
580
+ ```
581
+
582
+ #### Include/Exclude Paths
583
+
584
+ ##### exclude paths
585
+
586
+ A RegEx indicating the paths that should not be crawled.
587
+
588
+ ##### include paths
589
+
590
+ A RegEx indicating the paths that should be crawled.
591
+
592
+ ### Organizing configuration files
593
+
594
+ If your configuration file starts getting really big, SiteDiff lets you
595
+ separate it out into multiple files. Just have one base file that includes
596
+ other files:
597
+
598
+ ```yaml
599
+ includes:
600
+ - sanitization.yaml
601
+ - paths.yaml
602
+ ```
603
+
604
+ This allows you to separate your configuration into logical groups.
605
+ For example, generic rules for your site could live in a `generic.yaml` file,
606
+ while rules pertaining to a particular update you're conducting could
607
+ live in `update-8.2.yaml`.
608
+
609
+ ### Named regions
610
+
611
+ In major upgrades and migrations where there are significant changes to the markup,
612
+ simple diffs will not be of much value. To assist in these cases, `named
613
+ regions` let you define regions in the page markup and specify the order in which
614
+ they should be compared. Specifying the order helps in cases where the fields are
615
+ not in the same order on the new site.
616
+
617
+ For example, if you have a CMS displaying `title`, `author`, and `body` fields, you
618
+ could define the named regions and the selectors for the three fields as follows:
619
+
620
+ ```yaml
621
+ regions:
622
+   - name: title
623
+     selector: h1.title
624
+   - name: author
625
+     selector: .field-name-attribution
626
+   - name: body
627
+     selector: .field-name-body
628
+ ```
629
+
630
+ (You need to define `regions` for both the `before` and `after` sections.)
631
+
632
+ You must then define the order that the fields should be compared, using the
633
+ `output` key.
634
+
635
+ ```yaml
636
+ output:
637
+ - title
638
+ - author
639
+ - body
640
+ ```
641
+
642
+ Before the two versions are compared, SiteDiff generates markup with
643
+ `<region>` tags and each `region` contains the markup matching the
644
+ corresponding selector.
645
+
646
+ For example:
647
+
648
+ ```html
649
+ <region id="title">
650
+   <h1 class="title">My Blog Post</h1>
651
+ </region>
652
+ <region id="author">
653
+   <div class="field-name-attribution">
654
+     <span class="label">By:</span> Alfred E. Neuman
655
+   </div>
656
+ </region>
657
+ <region id="body">
658
+   <div class="field-name-body">
659
+     <p>Lorem ipsum...
660
+   </div>
661
+ </region>
662
+ ```
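
The way `output` fixes the comparison order can be sketched in Ruby. This is an assumption-laden illustration, not SiteDiff's implementation: the helper `build_regions` is hypothetical, and the fragments are passed in as already-extracted strings instead of being selected from a parsed page.

```ruby
# Hypothetical helper (not part of SiteDiff): wraps already-extracted
# fragments in <region> tags, following the order given by `output`.
def build_regions(fragments, output)
  output.map do |name|
    "<region id=\"#{name}\">\n#{fragments.fetch(name)}\n</region>"
  end.join("\n")
end

output = %w[title author body]

# The same fields, appearing in a different order on each site.
before = { 'author' => '<div>By: A</div>', 'title' => '<h1>Post</h1>', 'body' => '<p>Text</p>' }
after  = { 'title' => '<h1>Post</h1>', 'body' => '<p>Text</p>', 'author' => '<div>By: A</div>' }

puts build_regions(before, output)
```

Because both sides are emitted in the `output` order, a field that moved on the new site does not show up as a difference.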
663
+
664
+ The regions are processed first, so you can reference the `<region>` tags to
665
+ be more specific in your selectors for `dom_transform` and `sanitization`
666
+ sections.
667
+
668
+ For example:
669
+
670
+ ```yaml
671
+ dom_transform:
672
+   - name: Remove body div wrapper
673
+     type: unwrap
674
+     selector: region#body .field-name-body
675
+ ```
676
+
677
+ ### Curl Options
678
+
679
+ [Many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html) can be
680
+ passed to the underlying curl library. Add `--curl_options=name1:value1 name2:value2`
681
+ to the command line, for example `--curl_options=max_recv_speed_large:100000`
682
+ (remove the `CURLOPT_` prefix and write the name in lowercase), or add them to
683
+ your configuration file.
684
+
685
+ ```yaml
686
+ settings:
687
+   curl_opts:
688
+     max_recv_speed_large: 10000
689
+     ssl_verifypeer: false
690
+ ```
691
+
692
+ These CURL options can be put under the `settings` section of `sitediff.yaml`
693
+ as demonstrated above.
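
The naming rule (drop the `CURLOPT_` prefix, then lowercase) can be written as a one-line Ruby helper. The name `curl_opt_name` is hypothetical; it only converts libcurl constant names into the form used in `curl_opts` and does not check whether SiteDiff accepts a given option.

```ruby
# Hypothetical helper (not part of SiteDiff): converts a libcurl option
# constant name into the lowercase form without the CURLOPT_ prefix.
def curl_opt_name(constant)
  constant.delete_prefix('CURLOPT_').downcase
end

puts curl_opt_name('CURLOPT_MAX_RECV_SPEED_LARGE')
puts curl_opt_name('CURLOPT_SSL_VERIFYPEER')
```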
694
+
695
+ #### Throttling
696
+
697
+ A few options are also available to control how aggressively SiteDiff crawls.
698
+
699
+ - There's a command line option `--concurrency=N` for `sitediff init`
700
+ which controls the maximum number of simultaneous connections made.
701
+ A lower N means less aggressive crawling. The default is 3. You can specify this in the
702
+ `sitediff.yaml` file under the `settings` key.
703
+
704
+ - The underlying curl library has [many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)
705
+ such as `max_recv_speed_large` which can be helpful.
706
+
707
+ - There is a special command line option `--interval=T` for `sitediff init`.
708
+ This option allows the fetcher to delay for T milliseconds between
709
+ fetching pages. You can specify this in the `sitediff.yaml` file under the
710
+ `settings` key.
711
+
712
+ #### Timeouts
713
+
714
+ By default, no timeout is set, but one can be added with `--curl_options=timeout:60`
715
+ or in your configuration file.
716
+
717
+ ```yaml
718
+ settings:
719
+   curl_opts:
720
+     timeout: 60 # In seconds; or...
721
+     timeout_ms: 60000 # In milliseconds.
722
+ ```
723
+
724
+ #### Handling security
725
+
726
+ Often development or staging sites are protected by [HTTP Authentication](http://en.wikipedia.org/wiki/Basic_access_authentication).
727
+ SiteDiff allows you to specify a username and password, by using a URL like
728
+ `http://user:pass@example.com` or by adding a `userpwd` setting to your file.
729
+
730
+ SiteDiff ignores untrusted certificates by default. This is equivalent to the following settings:
731
+
732
+ ```yaml
733
+ settings:
734
+   curl_opts:
735
+     ssl_verifypeer: false
736
+     ssl_verifyhost: 0
737
+     userpwd: "username:password"
738
+ ```
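
Credentials embedded in a URL carry the same `username:password` form as the `userpwd` setting. The Ruby sketch below shows this with the standard `URI` library; whether SiteDiff extracts them this way internally is an assumption, not something the README states.

```ruby
require 'uri'

# The URL form from the text above, with credentials before the host.
uri = URI.parse('http://user:pass@example.com')

# `userinfo` holds the "username:password" part of the URL.
userpwd = uri.userinfo

puts userpwd
puts uri.host
```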
739
+
740
+ The `settings` key contains various parameters which affect the way SiteDiff works. You can
741
+ have the following keys under `settings`.
742
+
743
+ #### interval
744
+ An integer indicating the number of milliseconds SiteDiff should wait
745
+ between requests.
746
+
747
+ #### concurrency
748
+ The maximum number of simultaneous requests that SiteDiff should make.
749
+
750
+ #### depth
751
+
752
+ The depth to which SiteDiff should crawl the website. Defaults to 3,
753
+ which means, 3 levels deep.
754
+
755
+ #### curl_opts
756
+
757
+ Options to pass to the underlying curl library. Remove the `CURLOPT_` prefix in
758
+ this [full list of options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)
759
+ and write in lowercase. Useful for throttling.
760
+
761
+ ```yaml
762
+ settings:
763
+   curl_opts:
764
+     connecttimeout: 3
765
+     followlocation: true
766
+     max_recv_speed_large: 10000
767
+ ```
768
+
769
+ ## Tips and Tricks
770
+
771
+ Here are some tips and tricks that we've learned using SiteDiff:
772
+
773
+ - Use single quotes or double quotes around selectors. Remember that the `#` is a comment in YAML.
774
+ - Be specific enough with selectors to not affect elements on other pages.
775
+
776
+ ### Removing Empty Elements
777
+
778
+ If you have an empty `<p/>` tag appearing in the diff, you can write the following in your sanitization lists:
779
+ ```yaml
780
+ - name: remove_empty_p
781
+   pattern: '<p/>'
782
+   substitute: ''
783
+ ```
784
+
785
+ ### HTML Tag Formatting
786
+
787
+ There are times when the HTML tags do not have newlines between them on one of the sites you wish to compare. In this
788
+ case, these sanitization rules are useful:
789
+ ```yaml
790
+ - name: remove_space_before
791
+   pattern: '\s*(\n)<'
792
+   substitute: '\1<'
793
+
794
+ - name: remove_space_after
795
+   pattern: '>(\n)\s*'
796
+   substitute: '>\1'
797
+ ```
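
To see what these two rules do, here is a minimal Ruby sketch that applies them in order with `String#gsub`. The helper `normalize_tag_whitespace` and the sample markup are invented for illustration; only the patterns and substitutes come from the rules above.

```ruby
# Hypothetical helper (not part of SiteDiff): applies the two
# whitespace rules one after the other.
def normalize_tag_whitespace(html)
  rules = [
    [/\s*(\n)</, '\1<'], # remove_space_before
    [/>(\n)\s*/, '>\1']  # remove_space_after
  ]
  rules.reduce(html) { |text, (pattern, substitute)| text.gsub(pattern, substitute) }
end

# Sample markup with indentation and trailing spaces around the tags.
html = "<ul>\n    <li>One</li>   \n</ul>"

puts normalize_tag_whitespace(html)
```

The first rule drops whitespace that precedes a newline directly followed by a tag, and the second drops indentation that follows a tag and a newline, so both sites end up with one tag per line and no surrounding spaces.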
798
+
799
+ ### Empty Attributes
800
+
801
+ After writing rules, you may end up with empty attributes, like `class=""`. Here's a sanitization rule that removes them:
802
+ ```yaml
803
+ - name: remove_empty_class
804
+   pattern: ' class=""'
805
+   substitute: ''
806
+ ```
807
+
808
+ ## Acknowledgements
809
+
810
+ SiteDiff is brought to you by [Evolving Web](https://evolvingweb.ca/).