sitediff 1.1.1 → 1.2.0
- checksums.yaml +4 -4
- data/.eslintignore +1 -0
- data/.eslintrc.json +28 -0
- data/.project +11 -0
- data/.rubocop.yml +179 -0
- data/.rubocop_todo.yml +51 -0
- data/CHANGELOG.md +28 -0
- data/Dockerfile +33 -0
- data/Gemfile +11 -0
- data/Gemfile.lock +85 -0
- data/INSTALLATION.md +146 -0
- data/LICENSE +339 -0
- data/README.md +810 -0
- data/Rakefile +12 -0
- data/Thorfile +135 -0
- data/config/.gitkeep +0 -0
- data/config/sanitize_domains.example.yaml +8 -0
- data/config/sitediff.example.yaml +81 -0
- data/docker-compose.test.yml +3 -0
- data/lib/sitediff/api.rb +17 -6
- data/lib/sitediff/cache.rb +5 -3
- data/lib/sitediff/cli.rb +4 -3
- data/lib/sitediff/config/creator.rb +13 -13
- data/lib/sitediff/config/preset.rb +6 -6
- data/lib/sitediff/config.rb +9 -9
- data/lib/sitediff/crawler.rb +12 -2
- data/lib/sitediff/diff.rb +1 -1
- data/lib/sitediff/fetch.rb +2 -2
- data/lib/sitediff/files/report.html.erb +1 -1
- data/lib/sitediff/presets/drupal.yaml +63 -0
- data/lib/sitediff/report.rb +6 -6
- data/lib/sitediff/result.rb +5 -5
- data/lib/sitediff/sanitize/dom_transform.rb +2 -2
- data/lib/sitediff/sanitize/regexp.rb +2 -2
- data/lib/sitediff/sanitize.rb +5 -5
- data/lib/sitediff/uriwrapper.rb +8 -10
- data/lib/sitediff/webserver/resultserver.rb +2 -0
- data/lib/sitediff/webserver.rb +3 -0
- data/lib/sitediff.rb +9 -9
- data/misc/sitediff - overview report.png +0 -0
- data/misc/sitediff - page report.png +0 -0
- data/package-lock.json +878 -0
- data/package.json +25 -0
- data/sitediff.gemspec +51 -0
- metadata +62 -18
data/README.md
ADDED
@@ -0,0 +1,810 @@
|
|
1
|
+
# SiteDiff CLI
|
2
|
+
|
3
|
+
**Warning:** SiteDiff 1.2.0 requires at least Ruby 3.1.2.
|
4
|
+
|
5
|
+
**Warning:** SiteDiff 1.0.0 introduces some backwards incompatible changes.
|
6
|
+
|
7
|
+
[![Build Status](https://travis-ci.org/evolvingweb/sitediff.svg?branch=master)](https://travis-ci.org/evolvingweb/sitediff)
|
8
|
+
|
9
|
+
## Table of contents
|
10
|
+
|
11
|
+
- [Introduction](#introduction)
|
12
|
+
- [Installation](#installation)
|
13
|
+
- [Demo](#demo)
|
14
|
+
- [Usage](#usage)
|
15
|
+
- [Getting Started](#getting-started)
|
16
|
+
- [Comparing 2 Sites](#comparing-2-sites)
|
17
|
+
- [Spurious Diffs](#spurious-diffs)
|
18
|
+
- [Command Line Options](#command-line-options)
|
19
|
+
- [Finding Configuration Files](#finding-configuration-files)
|
20
|
+
- [Specifying Paths](#specifying-paths)
|
21
|
+
- [Debugging Rules](#debugging-rules)
|
22
|
+
- [Including and Excluding URLs](#including-and-excluding-urls)
|
23
|
+
- [Paths and Paths-file](#paths--paths-file)
|
24
|
+
- [Report Export](#export)
|
25
|
+
- [Running inside containers](#running-inside-containers)
|
26
|
+
- [Configuration](#configuration)
|
27
|
+
- [before_url / after_url](#before_url--after_url)
|
28
|
+
- [selector](#selector)
|
29
|
+
- [sanitization](#sanitization)
|
30
|
+
- [ignore_whitespace](#ignore_whitespace)
|
31
|
+
- [before / after](#before--after)
|
32
|
+
- [includes](#includes)
|
33
|
+
- [dom_transform](#dom_transform)
|
34
|
+
- [remove](#remove)
|
35
|
+
- [strip](#strip)
|
36
|
+
- [unwrap](#unwrap)
|
37
|
+
- [remove_class](#remove_class)
|
38
|
+
- [unwrap_root](#unwrap_root)
|
39
|
+
- [Organizing configuration files](#organizing-configuration-files)
|
40
|
+
- [Named regions](#named-regions)
|
41
|
+
- [report](#report)
|
42
|
+
- [title](#title)
|
43
|
+
- [details](#details)
|
44
|
+
- [before_note](#before_note)
|
45
|
+
- [after_note](#after_note)
|
46
|
+
- [before_url_report / after_url_report](#before_url_report--after_url_report)
|
47
|
+
- [Miscellaneous](#miscellaneous)
|
48
|
+
- [preset](#preset)
|
49
|
+
- [Include / Exclude Paths](#includeexclude-paths)
|
50
|
+
- [Curl Options](#curl-options)
|
51
|
+
- [Throttling](#throttling)
|
52
|
+
- [Timeouts](#timeouts)
|
53
|
+
- [Handling security](#handling-security)
|
54
|
+
- [interval](#interval)
|
55
|
+
- [concurrency](#concurrency)
|
56
|
+
- [depth](#depth)
|
57
|
+
- [curl_opts](#curl_opts)
|
58
|
+
- [Tips and Tricks](#tips-and-tricks)
|
59
|
+
- [Removing empty elements](#removing-empty-elements)
|
60
|
+
- [HTML Tag Formatting](#html-tag-formatting)
|
61
|
+
- [Empty Attributes](#empty-attributes)
|
62
|
+
- [Acknowledgements](#acknowledgements)
|
63
|
+
|
64
|
+
## Introduction
|
65
|
+
SiteDiff makes it easy to see how a website changes. It can compare two similar
|
66
|
+
sites or it can show how a single site changed over time. It helps identify
|
67
|
+
undesirable changes to the site's HTML and it's a useful tool for conducting QA
|
68
|
+
on re-deployments, site upgrades, and more!
|
69
|
+
|
70
|
+
When you run SiteDiff, it produces an HTML report showing whether pages on
|
71
|
+
your site have changed or not. For pages that have changed, you can see a
|
72
|
+
colorized diff showing exactly what changed, or compare the visual differences
|
73
|
+
side-by-side in a browser.
|
74
|
+
|
75
|
+
SiteDiff supports a range of normalization / sanitization rules. These allow
|
76
|
+
you to eliminate spurious differences, narrowing down differences to the ones
|
77
|
+
that materially affect the site.
|
78
|
+
|
79
|
+
## Installation
|
80
|
+
|
81
|
+
SiteDiff is fairly easy to install. Please refer to the
|
82
|
+
[installation docs](INSTALLATION.md).
|
83
|
+
|
84
|
+
## Demo
|
85
|
+
|
86
|
+
After installing all dependencies, including version 2 of the `bundler` gem, you can quickly
|
87
|
+
see what SiteDiff can do. Simply use the following commands:
|
88
|
+
|
89
|
+
```sh
|
90
|
+
git clone https://github.com/evolvingweb/sitediff
|
91
|
+
cd sitediff
|
92
|
+
bundle install
|
93
|
+
bundle exec thor fixture:serve
|
94
|
+
```
|
95
|
+
|
96
|
+
Then visit `http://localhost:13080/` to view the report.
|
97
|
+
|
98
|
+
SiteDiff shows you an overview of all the pages and clearly indicates which
|
99
|
+
pages have changed and which have not.
|
100
|
+
![overview report preview](misc/sitediff%20-%20overview%20report.png?raw=true)
|
101
|
+
|
102
|
+
When you click on a changed page, you see a colorized diff of the page's markup
|
103
|
+
showing exactly what changed on the page.
|
104
|
+
![page report preview](misc/sitediff%20-%20page%20report.png?raw=true)
|
105
|
+
|
106
|
+
## Usage
|
107
|
+
|
108
|
+
Here are some instructions on getting started with SiteDiff. To see a list of
|
109
|
+
commands that SiteDiff offers, you can run:
|
110
|
+
|
111
|
+
```sitediff help```
|
112
|
+
|
113
|
+
To get help for a particular command, say, `diff`, you can run:
|
114
|
+
|
115
|
+
```sitediff help diff```
|
116
|
+
|
117
|
+
### Getting started
|
118
|
+
|
119
|
+
To use SiteDiff on your site, create a configuration for your site:
|
120
|
+
|
121
|
+
```sitediff init http://mysite.example.com```
|
122
|
+
|
123
|
+
SiteDiff will generate a configuration file named `sitediff.yaml` by default.
|
124
|
+
|
125
|
+
You can open the configuration file ```sitediff/sitediff.yaml``` to see the
|
126
|
+
default configuration generated by SiteDiff.
|
127
|
+
The [configuration reference](#configuration) section explains the contents
|
128
|
+
of this file and helps you customize it as per your requirements.
|
129
|
+
|
130
|
+
Then get SiteDiff to crawl your site by using:
|
131
|
+
|
132
|
+
```sitediff crawl```
|
133
|
+
|
134
|
+
SiteDiff will then crawl your site, finding pages and caching their
|
135
|
+
contents. A list of discovered paths will be saved to a `paths.txt` file.
|
136
|
+
|
137
|
+
Now, you can make alterations to your site. For example, change a word on your
|
138
|
+
site's front page. After you're done, you can check what actually changed:
|
139
|
+
|
140
|
+
```sitediff diff```
|
141
|
+
|
142
|
+
For each page, SiteDiff will report whether it did or did not change. For pages
|
143
|
+
that changed, it will display a diff. You can also see an HTML version of the
|
144
|
+
report using the following command:
|
145
|
+
|
146
|
+
```sitediff serve```
|
147
|
+
|
148
|
+
SiteDiff will start an internal web server and open a report page in your
|
149
|
+
browser. For each page, you can see the diff and a side-by-side view of the
|
150
|
+
old and new versions.
|
151
|
+
|
152
|
+
You can now see if the changes were as you expected, or if some things didn't
|
153
|
+
quite work out as you hoped. If you noticed unexpected changes, congratulations:
|
154
|
+
SiteDiff just helped you find an issue you would have otherwise missed!
|
155
|
+
|
156
|
+
As you fix any issues, you can continue to alter your site and run
|
157
|
+
```sitediff diff``` to check the changes against the old version. Once you're
|
158
|
+
satisfied with the state of your site, you can inform SiteDiff that it should
|
159
|
+
re-cache your site:
|
160
|
+
|
161
|
+
```sitediff store```
|
162
|
+
|
163
|
+
This takes a snapshot of your website and the next time you run
|
164
|
+
```sitediff diff```, it will use this new version as the reference for
|
165
|
+
comparison.
|
166
|
+
|
167
|
+
Happy diffing!
|
168
|
+
|
169
|
+
### Comparing 2 sites
|
170
|
+
|
171
|
+
Sometimes you have two sites that you want to compare, for example a production
|
172
|
+
site hosted on a public server and a development site hosted on your computer.
|
173
|
+
SiteDiff can handle this situation, too! Just inform SiteDiff that there are
|
174
|
+
two sites to compare:
|
175
|
+
|
176
|
+
```sitediff init http://mysite.example.com http://localhost/mysite```
|
177
|
+
|
178
|
+
Then when you run `sitediff diff`, it will compare the cached version of the
|
179
|
+
first site with the current version of the second site.
|
180
|
+
|
181
|
+
If both the first and second sites may be changing, you should tell SiteDiff
|
182
|
+
not to cache either site:
|
183
|
+
|
184
|
+
```sitediff diff --cached=none```
|
185
|
+
|
186
|
+
### Spurious diffs
|
187
|
+
|
188
|
+
Sometimes sites have spurious differences that you don't want to show up in a
|
189
|
+
comparison. For example, many sites protect against Cross-Site Request Forgery
|
190
|
+
using a [semi-random token](http://en.wikipedia.org/wiki/Cross-site_request_forgery#Synchronizer_token_pattern).
|
191
|
+
Since this token changes on each HTTP GET, you probably don't care about such
|
192
|
+
a change.
|
193
|
+
|
194
|
+
To help with issues such as this, SiteDiff allows you to normalize the HTML it
|
195
|
+
fetches as it compares pages. In the ```sitediff.yaml``` configuration file,
|
196
|
+
you can add "sanitization rules", which specify either DOM transformations or
|
197
|
+
regular expression substitutions.
|
198
|
+
|
199
|
+
Here's an example of a rule you might add to remove CSRF-protection tokens
|
200
|
+
generated by Django:
|
201
|
+
|
202
|
+
```yaml
|
203
|
+
dom_transform:
|
204
|
+
- title: Remove CSRF tokens
|
205
|
+
  type: remove
|
206
|
+
  selector: input[name=csrfmiddlewaretoken]
|
207
|
+
```
|
208
|
+
|
209
|
+
You can use one of the presets to apply framework-specific sanitization.
|
210
|
+
Currently, SiteDiff only comes with Drupal-specific presets.
|
211
|
+
|
212
|
+
See the [preset](#preset) section for more details.
|
213
|
+
|
214
|
+
## Command Line Options
|
215
|
+
|
216
|
+
### Finding configuration files
|
217
|
+
|
218
|
+
By default SiteDiff will put everything in the `sitediff` folder. You can use
|
219
|
+
the `--directory` flag (short form `-C`) to specify a different directory.
|
220
|
+
|
221
|
+
```bash
|
222
|
+
sitediff init -C my_project_folder https://example.com
|
223
|
+
sitediff diff -C my_project_folder
|
224
|
+
sitediff serve -C my_project_folder
|
225
|
+
```
|
226
|
+
|
227
|
+
### Specifying paths
|
228
|
+
|
229
|
+
When you run ```sitediff diff```, you can specify which pages to look at in
|
230
|
+
2 ways:
|
231
|
+
|
232
|
+
1. The option ```--paths /foo /bar ...```.
|
233
|
+
|
234
|
+
If you're trying to fix one page in particular, specifying just that one
|
235
|
+
path will make ```sitediff diff``` run quickly!
|
236
|
+
|
237
|
+
2. The option ```--paths-file FILE``` with a newline-delimited text file.
|
238
|
+
|
239
|
+
This is particularly useful when you're trying to eliminate all diffs.
|
240
|
+
SiteDiff creates a file ```sitediff/failures.txt``` containing all paths
|
241
|
+
which had differences, so as you try to fix differences, you can run:
|
242
|
+
|
243
|
+
```sitediff diff --paths-file sitediff/failures.txt```
|
244
|
+
|
245
|
+
### Debugging rules
|
246
|
+
|
247
|
+
When a sanitization rule isn't working quite right for you, you might run
|
248
|
+
`sitediff diff` many times over. If fetching all the pages is taking too long,
|
249
|
+
try adding the option ```--cached=all```. This tells SiteDiff not to re-fetch
|
250
|
+
the content, but just compare previously cached versions — it's a lot faster!
|
251
|
+
|
252
|
+
### Including and Excluding URLs
|
253
|
+
|
254
|
+
By default, SiteDiff crawls pages that are indicated with an HTML anchor using
|
255
|
+
the `<A HREF` syntax. Most pages linked will be HTML pages, but some links
|
256
|
+
will contain binaries such as PDF documents and images.
|
257
|
+
|
258
|
+
Using the option `--exclude='.*\.pdf'` ensures the crawler skips links
|
259
|
+
for documents with a `.pdf` extension. Note that the regular expression is
|
260
|
+
applied to the path of the URL, not the base of the URL.
|
261
|
+
|
262
|
+
For example `--include='.*\.com'` will not match `http://www.google.com/`,
|
263
|
+
because the path of that URL is `/` while the base is `www.google.com`.
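
The difference between the path and the base of a URL can be checked with a short Ruby sketch. It assumes only that the pattern is matched against the path component. `path_matches?` is a hypothetical helper written for this illustration, not part of SiteDiff.

```ruby
require 'uri'

# Hypothetical helper (not part of SiteDiff): returns true when the
# pattern matches the path component of the URL.
def path_matches?(url, pattern)
  path = URI.parse(url).path
  path = '/' if path.empty? # a bare host has an empty path
  Regexp.new(pattern).match?(path)
end

# The host is www.google.com, but the path is just "/".
google_matches = path_matches?('http://www.google.com/', '.*\.com')

# Here the path itself ends in ".pdf".
pdf_matches = path_matches?('http://example.com/files/report.pdf', '.*\.pdf')

puts "google: #{google_matches}, pdf: #{pdf_matches}"
```

Only the second URL matches, because `.com` appears in the host and never in the path.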
|
264
|
+
|
265
|
+
### paths / paths-file
|
266
|
+
|
267
|
+
SiteDiff allows you to specify a list of paths that you want it to work with.
|
268
|
+
Alternatively, it can crawl the entire site and detect all paths.
|
269
|
+
|
270
|
+
* Running `sitediff init` configures SiteDiff for crawling and seeing differences.
|
271
|
+
|
272
|
+
* Running `sitediff crawl` makes SiteDiff crawl your site and detect
|
273
|
+
available paths. These paths are written to a `paths.txt` file which you
|
274
|
+
can modify according to your needs.
|
275
|
+
|
276
|
+
* You can also compute diffs only for paths specified in a custom paths file
|
277
|
+
using the `--paths-file` parameter. This file should contain paths starting
|
278
|
+
with a `/`, having one path per line.
|
279
|
+
|
280
|
+
```
|
281
|
+
sitediff diff --paths-file=/path/to/paths.txt
|
282
|
+
```
|
283
|
+
|
284
|
+
* You can also compute diffs for a handful of specific paths by specifying
|
285
|
+
them directly on the command line using the `--paths` parameter. Each path
|
286
|
+
should be separated by a space.
|
287
|
+
|
288
|
+
```
|
289
|
+
sitediff diff --paths=/home /about /contact
|
290
|
+
```
|
291
|
+
|
292
|
+
### export
|
293
|
+
Generate a gzipped tar file containing the HTML report instead of generating
|
294
|
+
and serving live web pages. This option overrides `--report-format`, forcing
|
295
|
+
HTML.
|
296
|
+
|
297
|
+
### Running inside containers
|
298
|
+
|
299
|
+
If you run SiteDiff inside a container or virtual machine, the URLs in its
|
300
|
+
report, such as ```localhost```, might not work from your host. You can fix
|
301
|
+
this by using the ```--before-url-report``` and ```--after-url-report```
|
302
|
+
options, to tell SiteDiff to use a different URL in the report than the one
|
303
|
+
it uses for fetching.
|
304
|
+
|
305
|
+
For example, if you ran `sitediff init http://mysite.com http://localhost`
|
306
|
+
inside a [Vagrant](https://www.vagrantup.com/) VM, you might then run
|
307
|
+
something like:
|
308
|
+
|
309
|
+
```sitediff diff --after-url-report=http://vagrant:8080```
|
310
|
+
|
311
|
+
## Configuration
|
312
|
+
|
313
|
+
SiteDiff relies on a [YAML](http://yaml.org/) configuration file, usually
|
314
|
+
called `sitediff.yaml`. You can create a reasonable one using `sitediff init`,
|
315
|
+
but there are many useful things you may want to add or change manually.
|
316
|
+
|
317
|
+
In the `sitediff.yaml`, SiteDiff recognizes the keys described below. The
|
318
|
+
`config` directory contains some example `sitediff.yaml` files. For example,
|
319
|
+
[sitediff.example.yaml](config/sitediff.example.yaml).
|
320
|
+
|
321
|
+
### before_url / after_url
|
322
|
+
|
323
|
+
```yaml
|
324
|
+
before_url: http://example.com/subsite
|
325
|
+
after_url: http://localhost:8080/subsite
|
326
|
+
```
|
327
|
+
|
328
|
+
They can also be paths to directories on the local filesystem.
|
329
|
+
|
330
|
+
The `after_url` MUST be provided either at the command-line or in the
|
331
|
+
`sitediff.yaml`. If the `before_url` is provided, SiteDiff will compare the
|
332
|
+
two sites. Otherwise, it will compare the current version of the `after` site
|
333
|
+
with the stored version of that site, as created by `sitediff init` or
|
334
|
+
`sitediff store`.
|
335
|
+
|
336
|
+
### selector
|
337
|
+
|
338
|
+
Chooses the sections of HTML we wish to compare, if you don't
|
339
|
+
want to compare the entire page. For example if you only want to compare
|
340
|
+
breadcrumbs between your two sites, you might specify:
|
341
|
+
|
342
|
+
```yaml
|
343
|
+
selector: '#breadcrumb'
|
344
|
+
```
|
345
|
+
|
346
|
+
### sanitization
|
347
|
+
|
348
|
+
A list of regular expression rules to normalize your HTML for comparison.
|
349
|
+
|
350
|
+
Each rule should have a **pattern** regex, which is used to search the HTML.
|
351
|
+
Each found instance is replaced with the provided **substitute** or deleted
|
352
|
+
if no substitute is provided. A rule may also have a **selector**, which
|
353
|
+
constrains it to operate only on HTML fragments which match that CSS selector.
|
354
|
+
|
355
|
+
For example, forms on Drupal sites have a randomly generated `form_build_id`
|
356
|
+
on form pages:
|
357
|
+
|
358
|
+
```html
|
359
|
+
<input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>
|
360
|
+
```
|
361
|
+
|
362
|
+
We're not interested in comparing random content, so we could use the
|
363
|
+
following rule to fix this:
|
364
|
+
|
365
|
+
```yaml
|
366
|
+
sanitization:
|
367
|
+
# Remove form build IDs
|
368
|
+
- pattern: '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>'
|
369
|
+
  selector: 'input'
|
370
|
+
  substitute: '<input type="hidden" name="form_build_id" value="__form_build_id__">'
|
371
|
+
```
|
372
|
+
|
373
|
+
Sanitization rules may also have a **path** attribute, whose value is a
|
374
|
+
regular expression. If present, the rule will only apply to matching paths.
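
To see what a **pattern** / **substitute** pair does to the markup, here is a minimal Ruby sketch. It assumes a rule behaves like a global regular expression replacement on the HTML text. `apply_rule` and the second `form_build_id` value are made up for the illustration and are not SiteDiff's API.

```ruby
# The rule from the example above, written as a plain Ruby hash.
rule = {
  'pattern' => '<input type="hidden" name="form_build_id" value="form-[a-zA-Z0-9_-]+" *\/?>',
  'substitute' => '<input type="hidden" name="form_build_id" value="__form_build_id__">'
}

# Hypothetical helper: replace every match of the pattern, or delete
# the match when the rule has no substitute.
def apply_rule(html, rule)
  html.gsub(Regexp.new(rule['pattern']), rule['substitute'] || '')
end

# Two fetches of the same form, each with a different random ID.
before = '<input type="hidden" name="form_build_id" value="form-1cac6b5b6141a72b2382928249605fb1"/>'
after  = '<input type="hidden" name="form_build_id" value="form-9f8e7d6c5b4a39281706f5e4d3c2b1a0"/>'

clean_before = apply_rule(before, rule)
clean_after  = apply_rule(after, rule)

puts clean_before == clean_after
```

After the rule runs, both sides contain the same placeholder, so the random ID no longer shows up as a difference.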
|
375
|
+
|
376
|
+
### ignore_whitespace
|
377
|
+
Ignore whitespace when doing the diff. This passes the `-w` option to the native OS `diff` command.
|
378
|
+
|
379
|
+
```yaml
|
380
|
+
ignore_whitespace: true
|
381
|
+
```
|
382
|
+
|
383
|
+
On the command line, use `-w` or `--ignore-whitespace`.
|
384
|
+
|
385
|
+
```bash
|
386
|
+
sitediff diff -w
|
387
|
+
```
|
388
|
+
|
389
|
+
### before / after
|
390
|
+
|
391
|
+
Applies rules to just one side of the comparison.
|
392
|
+
|
393
|
+
These blocks can contain any of the following sections: `selector`,
|
394
|
+
`sanitization`, `dom_transform`. Such a section placed in `before` will be
|
395
|
+
applied just to the `before` side of the comparison and similarly for `after`.
|
396
|
+
|
397
|
+
For example, if you wanted to let different date formatting not create diff
|
398
|
+
failures, you might use the following:
|
399
|
+
|
400
|
+
```yaml
|
401
|
+
before:
|
402
|
+
  sanitization:
|
403
|
+
    - pattern: '[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}'
|
404
|
+
      substitute: '__date__'
|
405
|
+
after:
|
406
|
+
  sanitization:
|
407
|
+
    - pattern: '[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}'
|
408
|
+
      substitute: '__date__'
|
409
|
+
```
|
410
|
+
|
411
|
+
The above rule will replace dates of the form `2004/12/05` in `before` and
|
412
|
+
dates of the form `May 12th 2004` in `after` with `__date__`.
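
The effect of the two side-specific rules can be sketched in Ruby. The sketch assumes each pattern is applied as a global replacement to its own side only, and the `<span>` markup is invented for the example.

```ruby
# Patterns copied from the configuration above.
before_rule = Regexp.new('[1-2][0-9]{3}/[0-1][0-9]/[0-9]{2}')
after_rule  = Regexp.new('[A-Z][a-z]{2} [0-9]{1,2}(st|nd|rd|th) [1-2][0-9]{3}')

# Made-up markup showing the same date in the two formats.
before_html = '<span class="date">2004/12/05</span>'
after_html  = '<span class="date">May 12th 2004</span>'

# Each rule is applied to its own side of the comparison.
before_clean = before_html.gsub(before_rule, '__date__')
after_clean  = after_html.gsub(after_rule, '__date__')

puts before_clean
puts after_clean
```

Both sides end up with the same `__date__` placeholder, so the differing date formats no longer produce a diff.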
|
413
|
+
|
414
|
+
### includes
|
415
|
+
|
416
|
+
The names of other configuration YAML files to merge with this one.
|
417
|
+
|
418
|
+
```yaml
|
419
|
+
includes:
|
420
|
+
- config/sanitize_domains.yaml
|
421
|
+
- config/strip_css_js.yaml
|
422
|
+
```
|
423
|
+
|
424
|
+
### dom_transform
|
425
|
+
|
426
|
+
A list of transformations to apply to the HTML before comparing.
|
427
|
+
|
428
|
+
This is similar to _sanitization_, but it applies transformations to the
|
429
|
+
structure of the HTML, instead of to the text. Each transformation has a
|
430
|
+
**type**, and potentially other attributes. The following types are available:
|
431
|
+
|
432
|
+
#### remove
|
433
|
+
|
434
|
+
Given a **selector**, removes all elements that match it.
|
435
|
+
|
436
|
+
For example, say we have a block containing the current time, which is
|
437
|
+
expected to change. To ignore that, we might choose to delete the block
|
438
|
+
before comparison:
|
439
|
+
|
440
|
+
```yaml
|
441
|
+
dom_transform:
|
442
|
+
# Remove current time block
|
443
|
+
- type: remove
|
444
|
+
  selector: div#block-time
|
445
|
+
```
|
446
|
+
|
447
|
+
#### strip
|
448
|
+
|
449
|
+
Strip leading and trailing whitespace from the contents of a tag.
|
450
|
+
|
451
|
+
Uses the Ruby string `strip()` method. Whitespace is defined as any of the
|
452
|
+
following characters: null, horizontal tab, line feed, vertical tab, form
|
453
|
+
feed, carriage return, space.
|
454
|
+
|
455
|
+
To transform `<h1> Foo and Bar\n </h1>` to `<h1>Foo and Bar</h1>`:
|
456
|
+
|
457
|
+
```yaml
|
458
|
+
dom_transform:
|
459
|
+
# Strip H1 tags
|
460
|
+
- type: strip
|
461
|
+
  selector: h1
|
462
|
+
```
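
Since the transformation relies on Ruby's `String#strip`, its effect on the heading text can be tried directly. This sketch only shows what `strip` does to a string and does not go through SiteDiff's DOM handling.

```ruby
# Text content of the <h1> element, with leading and trailing whitespace.
heading_text = " Foo and Bar\n "

# String#strip removes whitespace at both ends and keeps inner spaces.
stripped = heading_text.strip

puts "<h1>#{stripped}</h1>"
```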
|
463
|
+
|
464
|
+
#### unwrap
|
465
|
+
|
466
|
+
Given a **selector**, replaces all matching elements with
|
467
|
+
their children. For example, your content on one side of the comparison might
|
468
|
+
look like this:
|
469
|
+
|
470
|
+
```html
|
471
|
+
<p>This is some text</p>
|
472
|
+
<img src="lola.png" alt="Lola is a cute kitten." />
|
473
|
+
```
|
474
|
+
|
475
|
+
But on the other side, it might be wrapped in an `article` tag:
|
476
|
+
```html
|
477
|
+
<article>
|
478
|
+
<p>This is some text</p>
|
479
|
+
<img src="lola.png" alt="Lola is a cute kitten." />
|
480
|
+
</article>
|
481
|
+
```
|
482
|
+
|
483
|
+
You could fix it with the following configuration:
|
484
|
+
|
485
|
+
```yaml
|
486
|
+
dom_transform:
|
487
|
+
- type: unwrap
|
488
|
+
  selector: article
|
489
|
+
```
|
490
|
+
|
491
|
+
#### remove_class
|
492
|
+
|
493
|
+
Given a **selector** and a **class**, removes that class
|
494
|
+
from each element that matches the selector. It can also take a list of
|
495
|
+
classes, instead of just one.
|
496
|
+
|
497
|
+
For example, here are two sample rules for removing a single class and
|
498
|
+
removing multiple classes from all `div` elements:
|
499
|
+
|
500
|
+
```yaml
|
501
|
+
dom_transform:
|
502
|
+
# Remove class foo from div elements
|
503
|
+
- type: remove_class
|
504
|
+
  selector: div
|
505
|
+
  class: class-foo
|
506
|
+
# Remove class bar and class baz from div elements
|
507
|
+
- type: remove_class
|
508
|
+
  selector: div
|
509
|
+
  class:
|
510
|
+
    - class-bar
|
511
|
+
    - class-baz
|
512
|
+
```
|
513
|
+
|
514
|
+
#### unwrap_root
|
515
|
+
|
516
|
+
Replaces the entire root element with its children.
|
517
|
+
|
518
|
+
### report
|
519
|
+
|
520
|
+
The settings under the `report` key allow you to display helpful details on the report.
|
521
|
+
|
522
|
+
```yaml
|
523
|
+
report:
|
524
|
+
  title: "Updates to example.com"
|
525
|
+
  details: "This report verifies updates to example.com."
|
526
|
+
  before_note: "The old site"
|
527
|
+
  after_note: "The new site"
|
528
|
+
  before_url_report: http://example.com
|
529
|
+
  after_url_report: http://staging.example.com
|
530
|
+
```
|
531
|
+
|
532
|
+
#### title
|
533
|
+
|
534
|
+
Display a title string at the top of the report.
|
535
|
+
|
536
|
+
#### details
|
537
|
+
|
538
|
+
Text displayed as a paragraph at the top of the report, below the title.
|
539
|
+
|
540
|
+
#### before_note
|
541
|
+
|
542
|
+
Display a brief explanatory note next to the `before` URL.
|
543
|
+
|
544
|
+
#### after_note
|
545
|
+
|
546
|
+
Display a brief explanatory note next to the `after` URL.
|
547
|
+
|
548
|
+
#### before_url_report / after_url_report
|
549
|
+
|
550
|
+
Changes how SiteDiff reports which URLs it is comparing, but doesn't change what
|
551
|
+
it actually compares.
|
552
|
+
|
553
|
+
Suppose you are serving your 'after' website on a virtual machine with
|
554
|
+
IP 192.168.2.3, and you are also running SiteDiff inside that VM. To make links
|
555
|
+
in the report accessible from outside the VM, you might provide:
|
556
|
+
|
557
|
+
```yaml
|
558
|
+
after_url: http://localhost
|
559
|
+
report:
|
560
|
+
  after_url_report: http://192.168.2.3
|
561
|
+
```
|
562
|
+
|
563
|
+
If you don't wish to have the "Before" or "After" links in the report, set the corresponding option to `false`:
|
564
|
+
|
565
|
+
```yaml
|
566
|
+
report:
|
567
|
+
  after_url_report: false
|
568
|
+
```
|
569
|
+
|
570
|
+
### Miscellaneous
|
571
|
+
|
572
|
+
#### preset
|
573
|
+
|
574
|
+
Presets are stored in the `/lib/sitediff/presets` directory of this gem. You
|
575
|
+
can select a preset as follows:
|
576
|
+
|
577
|
+
```yaml
|
578
|
+
settings:
|
579
|
+
  preset: drupal
|
580
|
+
```
|
581
|
+
|
582
|
+
#### Include/Exclude Paths
|
583
|
+
|
584
|
+
##### exclude paths
|
585
|
+
|
586
|
+
A RegEx indicating the paths that should not be crawled.
|
587
|
+
|
588
|
+
##### include paths
|
589
|
+
|
590
|
+
A RegEx indicating the paths that should be crawled.
|
591
|
+
|
592
|
+
### Organizing configuration files
|
593
|
+
|
594
|
+
If your configuration file starts getting really big, SiteDiff lets you
|
595
|
+
separate it out into multiple files. Just have one base file that includes
|
596
|
+
other files:
|
597
|
+
|
598
|
+
```yaml
|
599
|
+
includes:
|
600
|
+
- sanitization.yaml
|
601
|
+
- paths.yaml
|
602
|
+
```
|
603
|
+
|
604
|
+
This allows you to separate your configuration into logical groups.
|
605
|
+
For example, generic rules for your site could live in a `generic.yaml` file,
|
606
|
+
while rules pertaining to a particular update you're conducting could
|
607
|
+
live in `update-8.2.yaml`.
|
608
|
+
|
609
|
+
### Named regions
|
610
|
+
|
611
|
+
In major upgrades and migrations where there are significant changes to the markup,
|
612
|
+
simple diffs will not be of much value. To assist in these cases, `named
|
613
|
+
regions` let you define regions in the page markup and specify the order in which
|
614
|
+
they should be compared. Specifying the order helps in cases where the fields are
|
615
|
+
not in the same order on the new site.
|
616
|
+
|
617
|
+
For example, if you have a CMS displaying `title`, `author`, and `body` fields, you
|
618
|
+
could define the named regions and the selectors for the three fields as follows:
|
619
|
+
|
620
|
+
```yaml
|
621
|
+
regions:
|
622
|
+
- name: title
|
623
|
+
  selector: h1.title
|
624
|
+
- name: author
|
625
|
+
  selector: .field-name-attribution
|
626
|
+
- name: body
|
627
|
+
  selector: .field-name-body
|
628
|
+
```
|
629
|
+
|
630
|
+
(You need to define `regions` for both the `before` and `after` sections.)
|
631
|
+
|
632
|
+
You must then define the order in which the fields should be compared, using the
|
633
|
+
`output` key.
|
634
|
+
|
635
|
+
```yaml
|
636
|
+
output:
|
637
|
+
- title
|
638
|
+
- author
|
639
|
+
- body
|
640
|
+
```
|
641
|
+
|
642
|
+
Before the two versions are compared, SiteDiff generates markup with
|
643
|
+
`<region>` tags and each `region` contains the markup matching the
|
644
|
+
corresponding selector.
|
645
|
+
|
646
|
+
For example:
|
647
|
+
|
648
|
+
```html
|
649
|
+
<region id="title">
|
650
|
+
<h1 class="title">My Blog Post</h1>
|
651
|
+
</region>
|
652
|
+
<region id="author">
|
653
|
+
<div class="field-name-attribution">
|
654
|
+
<span class="label">By:</span> Alfred E. Neuman
|
655
|
+
</div>
|
656
|
+
</region>
|
657
|
+
<region id="body">
|
658
|
+
<div class="field-name-body">
|
659
|
+
<p>Lorem ipsum...
|
660
|
+
</div>
|
661
|
+
</region>
|
662
|
+
```
|
663
|
+
|
664
|
+
The regions are processed first, so you can reference the `<region>` tags to
|
665
|
+
be more specific in your selectors for `dom_transform` and `sanitization`
|
666
|
+
sections.
|
667
|
+
|
668
|
+
For example:
|
669
|
+
|
670
|
+
```yaml
|
671
|
+
dom_transform:
|
672
|
+
- name: Remove body div wrapper
|
673
|
+
  type: unwrap
|
674
|
+
  selector: region#body .field-name-body
|
675
|
+
```
|
676
|
+
|
677
|
+
### Curl Options
|
678
|
+
|
679
|
+
[Many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html) can be
|
680
|
+
passed to the underlying curl library. Add `--curl_options=name1:value1 name2:value2`
|
681
|
+
to the command line, such as `--curl_options=max_recv_speed_large:100000`
|
682
|
+
(remove the `CURLOPT_` prefix and write the name in lowercase) or add them to
|
683
|
+
your configuration file.
|
684
|
+
|
685
|
+
```yaml
|
686
|
+
settings:
|
687
|
+
  curl_opts:
|
688
|
+
    max_recv_speed_large: 10000
|
689
|
+
    ssl_verifypeer: false
|
690
|
+
```
|
691
|
+
|
692
|
+
These CURL options can be put under the `settings` section of `sitediff.yaml`
|
693
|
+
as demonstrated above.
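
The naming rule for these options (drop the `CURLOPT_` prefix, write the rest in lowercase) is mechanical, so it can be shown as a tiny Ruby sketch. `curl_option_key` is a hypothetical helper for illustration only. SiteDiff expects you to write the converted names yourself.

```ruby
# Hypothetical helper: convert a libcurl constant name into the key
# form used on the command line and in sitediff.yaml.
def curl_option_key(constant_name)
  constant_name.sub(/\ACURLOPT_/, '').downcase
end

speed_key  = curl_option_key('CURLOPT_MAX_RECV_SPEED_LARGE')
verify_key = curl_option_key('CURLOPT_SSL_VERIFYPEER')

puts speed_key
puts verify_key
```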
|
694
|
+
|
695
|
+
#### Throttling
|
696
|
+
|
697
|
+
A few options are also available to control how aggressively SiteDiff crawls.
|
698
|
+
|
699
|
+
- There's a command line option `--concurrency=N` for `sitediff init`
|
700
|
+
which controls the maximum number of simultaneous connections made.
|
701
|
+
A lower N means less aggressive crawling. The default is 3. You can specify this in the
|
702
|
+
`sitediff.yaml` file under the `settings` key.
|
703
|
+
|
704
|
+
- The underlying curl library has [many options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)
|
705
|
+
such as `max_recv_speed_large` which can be helpful.
|
706
|
+
|
707
|
+
- There is a special command line option `--interval=T` for `sitediff init`.
|
708
|
+
This option allows the fetcher to delay for T milliseconds between
|
709
|
+
fetching pages. You can specify this in the `sitediff.yaml` file under the
|
710
|
+
`settings` key.
|
711
|
+
|
712
|
+
#### Timeouts
|
713
|
+
|
714
|
+
By default, no timeout is set, but one can be added with `--curl_options=timeout:60`
|
715
|
+
or in your configuration file.
|
716
|
+
|
717
|
+
```yaml
|
718
|
+
settings:
|
719
|
+
  curl_opts:
|
720
|
+
    timeout: 60 # In seconds; or...
|
721
|
+
    timeout_ms: 60000 # In milliseconds.
|
722
|
+
```
|
723
|
+
|
724
|
+
#### Handling security
|
725
|
+
|
726
|
+
Often development or staging sites are protected by [HTTP Authentication](http://en.wikipedia.org/wiki/Basic_access_authentication).
|
727
|
+
SiteDiff allows you to specify a username and password, by using a URL like
|
728
|
+
`http://user:pass@example.com` or by adding a `userpwd` setting to your file.
|
729
|
+
|
730
|
+
SiteDiff ignores untrusted certificates by default. This is equivalent to the following settings:
|
731
|
+
|
732
|
+
```yaml
|
733
|
+
settings:
|
734
|
+
  curl_opts:
|
735
|
+
    ssl_verifypeer: false
|
736
|
+
    ssl_verifyhost: 0
|
737
|
+
    userpwd: "username:password"
|
738
|
+
```
|
739
|
+
|
740
|
+
The `settings` key contains various parameters which affect the way SiteDiff works. You can
|
741
|
+
have the following keys under `settings`.
|
742
|
+
|
743
|
+
#### interval
|
744
|
+
An integer indicating the number of milliseconds SiteDiff should wait
|
745
|
+
between requests.
|
746
|
+
|
747
|
+
#### concurrency
|
748
|
+
The maximum number of simultaneous requests that SiteDiff should make.
|
749
|
+
|
750
|
+
#### depth
|
751
|
+
|
752
|
+
The depth to which SiteDiff should crawl the website. Defaults to 3,
|
753
|
+
which means 3 levels deep.
|
754
|
+
|
755
|
+
#### curl_opts
|
756
|
+
|
757
|
+
Options to pass to the underlying curl library. Remove the `CURLOPT_` prefix in
|
758
|
+
this [full list of options](https://curl.haxx.se/libcurl/c/curl_easy_setopt.html)
|
759
|
+
and write in lowercase. Useful for throttling.
|
760
|
+
|
761
|
+
```yaml
|
762
|
+
settings:
|
763
|
+
  curl_opts:
|
764
|
+
    connecttimeout: 3
|
765
|
+
    followlocation: true
|
766
|
+
    max_recv_speed_large: 10000
|
767
|
+
```
|
768
|
+
|
769
|
+
## Tips and Tricks
|
770
|
+
|
771
|
+
Here are some tips and tricks that we've learned using SiteDiff:
|
772
|
+
|
773
|
+
- Use single quotes or double quotes around selectors. Remember that `#` starts a comment in YAML.
|
774
|
+
- Be specific enough with selectors to not affect elements on other pages.
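
The quoting tip is easy to verify with Ruby's standard YAML parser. The sketch below only shows how the YAML text is parsed. It does not involve SiteDiff itself.

```ruby
require 'yaml'

# Unquoted: everything after " #" is a comment, so the value is empty.
unquoted = YAML.safe_load('selector: #breadcrumb')

# Quoted: the selector survives as a string.
quoted = YAML.safe_load("selector: '#breadcrumb'")

# A "#" that does not follow whitespace is part of the value.
inline = YAML.safe_load('selector: div#block-time')

p unquoted['selector']
p quoted['selector']
p inline['selector']
```

Only the unquoted selector that begins with `#` is lost, which is why quoting every selector is the safer habit.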
|
775
|
+
|
776
|
+
### Removing Empty Elements
|
777
|
+
|
778
|
+
If you have an empty `<p/>` tag appearing in the diff, you can write the following in your sanitization lists:
|
779
|
+
```yaml
|
780
|
+
- name: remove_empty_p
|
781
|
+
  pattern: '<p/>'
|
782
|
+
  substitute: ''
|
783
|
+
```
|
784
|
+
|
785
|
+
### HTML Tag Formatting
|
786
|
+
|
787
|
+
There are times when the HTML tags do not have newlines between them on one of the sites you wish to compare. In this
|
788
|
+
case, these sanitization rules are useful:
|
789
|
+
```yaml
|
790
|
+
- name: remove_space_before
|
791
|
+
  pattern: '\s*(\n)<'
|
792
|
+
  substitute: '\1<'
|
793
|
+
|
794
|
+
- name: remove_space_after
|
795
|
+
  pattern: '>(\n)\s*'
|
796
|
+
  substitute: '>\1'
|
797
|
+
```
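
To get a feel for what these two rules do, here is a Ruby sketch that applies them in order to a small made-up fragment. It assumes each rule acts as a global regular expression replacement on the HTML text. The sample markup is an illustration only.

```ruby
# The two rules from above, as plain Ruby hashes.
rules = [
  { 'pattern' => '\s*(\n)<', 'substitute' => '\1<' }, # remove_space_before
  { 'pattern' => '>(\n)\s*', 'substitute' => '>\1' }  # remove_space_after
]

# Made-up fragment: trailing spaces before a newline, indentation after one.
html = "<ul>\n    <li>One</li>   \n</ul>"

# Apply each rule in turn to the result of the previous one.
cleaned = rules.reduce(html) do |text, rule|
  text.gsub(Regexp.new(rule['pattern']), rule['substitute'])
end

puts cleaned
```

The first rule drops whitespace that sits before a newline directly followed by a tag, and the second drops whitespace that follows a newline directly after a tag, so both sides end up with one tag per line and no stray indentation.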
|
798
|
+
|
799
|
+
### Empty Attributes
|
800
|
+
|
801
|
+
After writing rules, you may end up with empty attributes, like `width=""` or `class=""`. Here's a sanitization rule that removes empty `class` attributes:
|
802
|
+
```yaml
|
803
|
+
- name: remove_empty_class
|
804
|
+
  pattern: ' class=""'
|
805
|
+
  substitute: ''
|
806
|
+
```
|
807
|
+
|
808
|
+
## Acknowledgements
|
809
|
+
|
810
|
+
SiteDiff is brought to you by [Evolving Web](https://evolvingweb.ca/).
|