websitiary 0.1.0
- data/History.txt +4 -0
- data/Manifest.txt +6 -0
- data/README.txt +474 -0
- data/Rakefile +20 -0
- data/bin/websitiary +1351 -0
- data/setup.rb +1585 -0
- metadata +71 -0
data/History.txt
ADDED
data/Manifest.txt
ADDED
data/README.txt
ADDED
@@ -0,0 +1,474 @@
websitiary by Thomas Link
http://rubyforge.org/projects/websitiary/

This is a script for monitoring webpages that reuses other programs to
do the actual work. By default, it works on an ASCII basis, i.e. with
the output of text-based webbrowsers. With the help of some friends, it
can also work with HTML.


== DESCRIPTION:
This is a script for monitoring webpages that reuses other programs
(w3m, diff, webdiff etc.) to do most of the actual work. By default, it
works on an ASCII basis, i.e. with the output of text-based webbrowsers
like w3m (or lynx, links etc.) as the output can easily be
post-processed. With the help of some friends (see the section below on
requirements), it can also work with HTML. E.g., if you have websec
installed, you can also use its webdiff program to show colored diffs.

By default, this script will use w3m to dump HTML pages and then run
diff over the current page and the previous backup. Some pages are
better viewed with lynx or links. Downloaded documents (HTML or ASCII)
can be post-processed (e.g., filtered through some ruby block that
extracts elements via hpricot and the like). Please see the
configuration options below to find out how to change this globally or
for a single source.

=== CAVEAT:
The script also includes experimental support for monitoring whole
websites. Basically, this script supports robots.txt directives (see
requirements) but this is hardly tested and may not work in some cases.

While it is okay for your own websites to ignore robots.txt, it is not
for others. Please make sure that the webpages you run this program on
allow such a use. Some webpages disallow the use of any automatic
downloader or offline reader in their user agreements.


== FEATURES/PROBLEMS:
* Download webpages at defined intervals
* Compare webpages with previous backups
* Display differences between the current version and the backup
* Provide hooks to post-process the downloaded documents and the diff
* Display a one-page report summarizing all news
* Automatically open the report in your favourite web-browser
* Quite customizable

ISSUES, TODO:
* Improved support for robots.txt (test it)
* The use of :website_below and :website is hardly tested (please
  report errors).
* download => :body_html tries to rewrite references (a, img) which may
  fail on certain kinds of URLs (please report errors).
* When using :body_html for download, it may happen that some
  JavaScript code is stripped, which breaks some JavaScript-generated
  links.


== SYNOPSIS:

=== Usage
Example:
    # Run "profile"
    websitiary profile

    # Edit "~/.websitiary/profile.rb"
    websitiary --edit=profile

    # View the latest report
    websitiary --review

    # Refetch all sources regardless of :days and :hours restrictions
    websitiary -signore_age=true

    # Create html and rss reports for my websites
    websitiary -fhtml,rss mysites

For example output see:
* html[http://deplate.sourceforge.net/websitiary.html]
* rss[http://deplate.sourceforge.net/websitiary.rss]
* text[http://deplate.sourceforge.net/websitiary.txt]


=== Configuration
Profiles are plain ruby files (with the '.rb' suffix) stored in
~/.websitiary/.

The profile config.rb is always loaded if available.


==== default 'PROFILE1', 'PROFILE2' ...
Set the default profile(s).

Example:
    default 'my_profile'


==== diff 'CMD "%s" "%s"'
Use this shell command to make the diff.
The two %s placeholders will be replaced with the old and the new
filename, in that order.

diff is used by default.
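
As a rough illustration of how the two placeholders get filled in, the
following sketch substitutes an old and a new filename into a command
template. It assumes plain <tt>Kernel#format</tt>-style substitution;
the file names and the <tt>-u</tt> flag are made up for this example
and are not websitiary defaults.

```ruby
# Hypothetical template, as it might be passed to the +diff+ command.
template = 'diff -u "%s" "%s"'

# Hypothetical backup (old) and freshly downloaded (new) files.
old_file = 'backup/example.txt'
new_file = 'latest/example.txt'

# The first %s receives the old filename, the second the new one.
command = format(template, old_file, new_file)
puts command
```

Quoting each placeholder, as the heading above does, keeps filenames
with spaces intact when the shell parses the command.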


==== diffprocess lambda {|text| ...}
Use this ruby snippet to post-process the diff.
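
A minimal sketch of such a snippet follows. It assumes the lambda
receives the diff output as one string and returns the processed
string; the name +only_new+ and the sample text are hypothetical.

```ruby
# Hypothetical post-processor: keep only the lines that plain diff
# marks as new (prefixed with "> ").
only_new = lambda do |text|
  text.lines.select { |line| line.start_with?('> ') }.join
end

# Made-up output in the style of plain diff.
sample = "2c2\n< old line\n---\n> new line\n"
puts only_new.call(sample)
```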


==== download 'CMD "%s"'
Use this shell command to download a page.
%s will be replaced with the url.

w3m is used by default.

Example:
    download 'lynx -dump "%s"'


==== downloadprocess lambda {|text| ...}
Use this ruby snippet to post-process what was downloaded.
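
For instance, a page that prints a timestamp on every request would
show up as changed on each run. The sketch below drops such a line
before the diff is made. It assumes the lambda gets the downloaded text
as one string; the name +drop_noise+ and the "Last updated:" marker are
invented for the example.

```ruby
# Hypothetical filter: remove a line that changes on every download.
drop_noise = lambda do |text|
  text.lines.reject { |line| line =~ /^Last updated:/ }.join
end

# Made-up page dump.
page = "Headline\nLast updated: 12:00\nBody text\n"
puts drop_noise.call(page)
```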


==== edit 'CMD "%s"'
Use this shell command to edit a profile. %s will be replaced with the filename.

vi is used by default.

Example:
    edit 'gvim "%s"&'


==== option TYPE, OPTION => VALUE
Set a global option.

TYPE can be one of:
<tt>:diff</tt>::
    Generate a diff
<tt>:diffprocess</tt>::
    Post-process a diff (if necessary)
<tt>:format</tt>::
    Format the diff for output
<tt>:download</tt>::
    Download webpages
<tt>:downloadprocess</tt>::
    Post-process downloaded webpages
<tt>:page</tt>::
    The :format field defines the format of the final report. Here VALUE
    is a format string that takes 3 variables as arguments: report title,
    toc, contents.

OPTION is a symbol.

VALUE is either a format string or a block of code (of class Proc).

Example:
    set :download, :foo => lambda {|url| get_url(url)}
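
To make the :page format string more concrete, here is a small sketch
of a template with three placeholders (report title, toc, contents)
being filled in. The HTML skeleton and the sample values are
assumptions for illustration only; how websitiary registers and applies
the template internally is not shown.

```ruby
# Hypothetical page template with three %s slots:
# report title, table of contents, report contents.
page_template = "<html><head><title>%s</title></head>\n" \
                "<body>\n%s\n%s\n</body></html>\n"

# Made-up values standing in for what the report generator would supply.
title    = 'My report'
toc      = '<ul><li>Example</li></ul>'
contents = '<pre>1c1</pre>'

report = format(page_template, title, toc, contents)
puts report
```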


==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
Set the output format.
Format can be one of:

* html
* text, txt (this only works with text-based downloaders)
* rss (proof of concept only;
  it requires :rss[:url] to be set to the url, where the rss feed will
  be published, using the <tt>option :rss, :url => URL</tt>
  configuration command; you either have to use a text-based downloader
  or add <tt>:rss_format => 'html'</tt> to the url options)


==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
(Un)Set an option for the following source commands.

Example:
    set :download, :foo => lambda {|url| get_url(url)}
    set :days => 7, :sort => true
    unset :days, :sort


==== source URL(S), [OPTIONS]
Options

<tt>:cols => FROM..TO</tt>::
    Use only these columns from the output (used after applying the :lines
    option)

<tt>:depth => INTEGER</tt>::
    In conjunction with a :website type of :download option, fetch url up
    to this depth.

<tt>:diff => "CMD", :diff => SHORTCUT</tt>::
    Use this command to make the diff for this page. Possible values for
    SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
    :wget, or :body_html). :body_html, :website_below, :website and
    :openuri are synonyms for :webdiff.

<tt>:diffprocess => lambda {|text| ...}</tt>::
    Use this ruby snippet to post-process this diff

<tt>:download => "CMD", :download => SHORTCUT</tt>::
    Use this command to download this page. For possible values for
    SHORTCUT see the section on shortcuts below.

<tt>:downloadprocess => lambda {|text| ...}</tt>::
    Use this ruby snippet to post-process what was downloaded. This is the
    place where, e.g., hpricot can be used to extract certain elements
    from the HTML code.
    Example:
      lambda {|text| Hpricot(text).at('div#content').inner_html}

<tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
    The format string for the diff text. The default (the :diff shortcut)
    wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
    :website, and :openuri will simply add a newline character.

<tt>:hours => HOURS, :days => DAYS</tt>::
    Don't download the file unless it's older than that

<tt>:ignore_age => true</tt>::
    Ignore any :days and :hours settings. This is useful in some cases
    when set on the command line.

<tt>:lines => FROM..TO</tt>::
    Use only these lines from the output

<tt>:match => REGEXP</tt>::
    When recursively walking a website, follow only links that match this
    regexp.

<tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
    Sort lines in output

<tt>:strip => true</tt>::
    Strip empty lines

<tt>:title => "TEXT"</tt>::
    Display TEXT instead of URL

<tt>:use => SYMBOL</tt>::
    Use SYMBOL for any other option. I.e. <tt>:download => :body_html,
    :diff => :webdiff</tt> can be abbreviated as <tt>:use =>
    :body_html</tt> (because for :diff :body_html is a synonym for
    :webdiff).


Example configuration file extract:
    source 'URL', :days => 7, :download => :lynx

    # Daily
    set :days => 1
    source 'http://www.example.com', :use => :body_html,
        :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}

    # Weekly
    set :days => 7
    source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'

    # Bi-weekly
    set :days => 14
    source <<URLS
    http://www.example.com
    http://www.example.com/page.html
    URLS

    # Make HTML diffs and highlight occurrences of a word
    source 'http://www.example.com',
        :title => 'Example',
        :use => :body_html,
        :diffprocess => highlighter(/word/i)

    # Download the whole website below this path (only pages with
    # html-suffix)
    # Download only php and html pages
    # Follow links 2 levels deep
    source 'http://www.example.com/foo/bar.html',
        :title => 'Example -- Bar', :use => :website_below,
        :match => /\.(php|html)\b/, :depth => 2

    unset :days
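
The :lines, :cols, :strip and :sort options all operate on the text of
a downloaded page. The following stand-alone sketch shows one way such
a selection could work on a string. The method name +select_text+ is
hypothetical, and the order in which :strip and :sort are applied is an
assumption; only ":cols after :lines" is stated in the option list.

```ruby
# Hypothetical helper mimicking the :lines, :cols, :strip and :sort
# options on plain text. Not websitiary's actual implementation.
def select_text(text, lines: nil, cols: nil, strip: false, sort: false)
  rows = text.split("\n")
  # :lines => FROM..TO keeps a range of lines (negative ends allowed).
  rows = rows[lines] || [] if lines
  # :cols => FROM..TO keeps a range of characters within each line.
  rows = rows.map { |row| row[cols] || '' } if cols
  # :strip => true removes empty lines.
  rows = rows.reject { |row| row.strip.empty? } if strip
  # :sort => true sorts by default order, a lambda gives a custom order.
  rows = sort.respond_to?(:call) ? rows.sort(&sort) : rows.sort if sort
  rows.join("\n")
end

# Made-up page dump: a header, an empty line, and two data lines.
text = "header\n\ncherry 1\nbanana 2\n"
puts select_text(text, lines: 1..-1, cols: 0..5, strip: true, sort: true)
```

A range such as <tt>10..-1</tt>, as used in the configuration extract
above, means "from the eleventh line to the end", because Ruby ranges
count from zero and -1 refers to the last element.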


==== view 'CMD "%s"'
Use this shell command to view the output (usually an HTML file).
%s will be replaced with the filename.

w3m is used by default.

Example:
    view 'gnome-open "%s"' # Gnome Desktop
    view 'kfmclient "%s"' # KDE
    view 'cygstart "%s"' # Cygwin
    view 'start "%s"' # Windows
    view 'firefox "%s"'


=== Shortcuts for use with :use, :download and other options
<tt>:w3m</tt>::
    Use w3m for downloading the source. Use diff for generating diffs.

<tt>:lynx</tt>::
    Use lynx for downloading the source. Use diff for generating diffs.

<tt>:links</tt>::
    Use links for downloading the source. Use diff for generating diffs.

<tt>:curl</tt>::
    Use curl for downloading the source. Use webdiff for generating diffs.

<tt>:wget</tt>::
    Use wget for downloading the source. Use webdiff for generating diffs.

<tt>:openuri</tt>::
    Use open-uri for downloading the source. Use webdiff for generating
    diffs.

<tt>:body_html</tt>::
    This requires hpricot to be installed. Use open-uri for downloading
    the source, use only the body. Use webdiff for generating diffs. Try
    to rewrite references (a, img) so that they point to the webpage. By
    default, this will also strip tags like script, form, object ...

<tt>:website</tt>::
    Use :body_html to download the source. Follow all links referring to
    the same host with the same file suffix. Use webdiff for generating
    diffs.

<tt>:website_below</tt>::
    Use :body_html to download the source. Follow all links referring to
    the same host and a file below the top directory with the same file
    suffix. Use webdiff for generating diffs.

<tt>:website_txt</tt>::
    Use :website to download the source but convert the output to plain
    text.

<tt>:website_txt_below</tt>::
    Use :website_below to download the source but convert the output to
    plain text.

Any shortcuts relying on :body_html will also try to rewrite any
references so that the links point to the webpage.
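
The link-following rule of :website_below can be pictured with the
sketch below. It is only an approximation under stated assumptions: the
method name +follow_link?+ is hypothetical, the "top directory" is
taken to be the directory of the start URL, and a :match regexp is
assumed to replace the same-suffix check. The real implementation may
differ in all of these points.

```ruby
require 'uri'

# Hypothetical predicate: should +link+ be followed when walking the
# site that starts at +start_url+? Not websitiary's actual code.
def follow_link?(start_url, link, match = nil)
  start  = URI.parse(start_url)
  target = URI.parse(link)

  # Stay on the same host.
  return false unless target.host == start.host

  # Stay below the directory of the start page (assumed meaning of
  # "below the top directory").
  top = start.path.empty? ? '/' : File.dirname(start.path)
  top += '/' unless top.end_with?('/')
  return false unless target.path.start_with?(top)

  if match
    # Assumption: a :match regexp decides which links qualify.
    match.match?(link)
  else
    # Otherwise require the same file suffix as the start page.
    File.extname(target.path) == File.extname(start.path)
  end
end

start = 'http://www.example.com/foo/bar.html'
p follow_link?(start, 'http://www.example.com/foo/baz.php', /\.(php|html)\b/)
p follow_link?(start, 'http://www.example.com/other/page.html', /\.(php|html)\b/)
```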


== REQUIREMENTS:
websitiary is a ruby-based application. You thus need a ruby
interpreter.

Whether you actually need the following libraries and applications
depends on how you use websitiary.

By default this script expects the following applications to be
present:

* diff
* vi (or some other editor)

and one of:

* w3m[http://w3m.sourceforge.net/] (default)
* lynx[http://lynx.isc.org/]
* links[http://links.twibright.com/]
* websec[http://baruch.ev-en.org/proj/websec/]
  (or at Savannah[http://savannah.nongnu.org/projects/websec/])

The use of :webdiff as :diff application requires
websec[http://download.savannah.gnu.org/releases/websec/] to be
installed. In conjunction with :body_html, :openuri, or :curl, this
will give you colored HTML diffs.
Why not use +websec+ if I have to install it, you might ask. Well,
+websec+ is written in perl and I didn't quite manage to make it work
the way I want it to. websitiary is meant to be easier to configure.

For downloading HTML, you need one of these:

* open-uri (should be part of ruby)
* hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
  :body_html, :website, and :website_below)
* curl[http://curl.haxx.se/]
* wget[http://www.gnu.org/software/wget/]

The following ruby libraries are needed in conjunction with :body_html
and :website related shortcuts:

* hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
  only the body etc.)
* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
  for parsing robots.txt

I would personally suggest the following setup:

* w3m[http://w3m.sourceforge.net/]
* websec[http://baruch.ev-en.org/proj/websec/]
* hpricot[http://code.whytheluckystiff.net/hpricot]
* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]


== INSTALL:
=== Use rubygems
Run

    gem install websitiary

This will download the package and install it.


=== Use the zip
The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
setup.rb that does the work. Run

    ruby setup.rb


=== Copy Manually
Get the single-file websitiary[http://rubyforge.org/frs/?group_id=4030]
script and copy it to some directory in $PATH.


=== Initial Configuration
Please check the requirements section above and get the extra libraries
needed:
* hpricot
* robot_rules.rb

You might then want to create a profile ~/.websitiary/config.rb that is
loaded on every run. In this profile you could set the default output
viewer and profile editor, as well as a default profile.

Example:

    # Load standard.rb if no profile is given on the command line.
    default 'standard'

    # Use cygwin's cygstart to view the output with the default HTML
    # viewer
    view '/usr/bin/cygstart "%s"'

    # Use Windows gvim from cygwin ruby which is why we convert the path
    # first
    edit 'gvim $(cygpath -w -- "%s")'

Where these configuration files reside may differ. If the environment
variable $HOME is defined, the default is $HOME/.websitiary/ unless one
of the following directories exists, which will then be used instead:

* $USERPROFILE/websitiary (on Windows)
* SYSCONFDIR/websitiary (where SYSCONFDIR usually is /etc but you can
  run ruby to find out more:
  <tt>ruby -e "p Config::CONFIG['sysconfdir']"</tt>)

If neither directory exists and no $HOME variable is defined, the
current directory will be used.
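
The lookup order described above can be summarized in a short sketch.
This is not the code websitiary runs: the method name +config_dir+ and
its keyword arguments are hypothetical, the two alternative directories
are assumed to be checked in the order listed, and
<tt>RbConfig::CONFIG</tt> is used because current ruby versions no
longer provide the <tt>Config</tt> constant.

```ruby
require 'rbconfig'

# Hypothetical resolver for the configuration directory.
# +env+ and +exists+ are injectable so the logic can be tried out
# without touching the real environment or file system.
def config_dir(env: ENV,
               sysconfdir: RbConfig::CONFIG['sysconfdir'],
               exists: File.method(:directory?))
  candidates = []
  candidates << File.join(env['USERPROFILE'], 'websitiary') if env['USERPROFILE']
  candidates << File.join(sysconfdir, 'websitiary') if sysconfdir

  # An existing alternative directory wins over $HOME/.websitiary.
  found = candidates.find { |dir| exists.call(dir) }
  return found if found

  # Otherwise fall back to $HOME, then to the current directory.
  return File.join(env['HOME'], '.websitiary') if env['HOME']
  Dir.pwd
end

puts config_dir
```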


== LICENSE:
websitiary Webpage Monitor
Copyright (C) 2007 Thomas Link

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
USA


# vi: ft=rd:tw=72:ts=4