websitary 0.2.0
- data/History.txt +57 -0
- data/Manifest.txt +11 -0
- data/README.txt +732 -0
- data/Rakefile +27 -0
- data/bin/websitary +43 -0
- data/lib/websitary.rb +610 -0
- data/lib/websitary/applog.rb +39 -0
- data/lib/websitary/configuration.rb +1505 -0
- data/lib/websitary/filemtimes.rb +50 -0
- data/lib/websitary/htmldiff.rb +93 -0
- data/setup.rb +1585 -0
- metadata +76 -0
data/History.txt
ADDED
@@ -0,0 +1,57 @@
|
|
1
|
+
= 0.2.0
|
2
|
+
|
3
|
+
* Renamed the project from websitiary to websitary (without the
|
4
|
+
additional "i")
|
5
|
+
* The default output filename is now constructed from the profile
|
6
|
+
names joined with a comma.
|
7
|
+
* Apply rewrite-rules to URLs in text output.
|
8
|
+
* Set user-agent (:body_html)
|
9
|
+
* Exit with 1 if differences were found
|
10
|
+
* Command line options have slightly changed: -e now is the short form
|
11
|
+
for --execute
|
12
|
+
* Commands that can be triggered by the -e command-line switch: downdiff
|
13
|
+
(default), configuration (list currently configured urls), latest
|
14
|
+
(show the current version of all urls), review (show the latest
|
15
|
+
report)
|
16
|
+
* Protect against filenames being too long (max size can be configured
|
17
|
+
via: <tt>option :global, :filename_size => N</tt>)
|
18
|
+
* Try to migrate local copies from the older flat to the new
|
19
|
+
hierarchical cache layout
|
20
|
+
* Disabled -E/--edit, --review command-line options (use -e instead)
|
21
|
+
* Try to maintain file atime/mtime when copying/moving files
|
22
|
+
* FIX: Problem with loading robots.txt
|
23
|
+
* Respect meta tag: robots="nofollow" (noindex is only checked in
|
24
|
+
conjunction with :download => :website*)
|
25
|
+
* quicklist profile: register urls via the -eadd command-line switch;
|
26
|
+
see "Usage" for an example
|
27
|
+
* Temporarily save diffs so that we can reuse them when websitary should
|
28
|
+
exit ungracefully.
|
29
|
+
* Renamed :inner_html to :body_html
|
30
|
+
* New shortcuts: :ftp, :ftp_recursive, :img, :rss, :opml (rudimentary)
|
31
|
+
* New experimental commands: aggregate, show ... can be used to
|
32
|
+
periodically check for changes (e.g. of rss feeds) but to review these
|
33
|
+
changes only once in a while
|
34
|
+
* Experimental --timer command-line option to re-run websitary every X
|
35
|
+
seconds.
|
36
|
+
* The :rss differ has an option :rss_enclosure (true or directory name)
|
37
|
+
that will be used for automatically saving new enclosures (e.g. mp3
|
38
|
+
files in podcasts); in theory, one should thus be able to use
|
39
|
+
websitary as pod catcher etc.
|
40
|
+
* Cache mtimes in order to reduce disk access.
|
41
|
+
* Special profile "__END__": The section in the script file after the
|
42
|
+
__END__ line. This seems useful in some situations when employing a
|
43
|
+
single script.
|
44
|
+
* Don't follow javascript links.
|
45
|
+
* New date constraint for sources:
|
46
|
+
:daily => true ... Once a day
|
47
|
+
:days_of_month => BEGIN..END ... download URL only once per month
|
48
|
+
within this range of days.
|
49
|
+
:days_of_week => BEGIN..END ... download URL only once per week
|
50
|
+
within this range of days.
|
51
|
+
:months => N (calculated on basis of the calendar month, not the
|
52
|
+
number of days)
|
53
|
+
|
54
|
+
== 0.1.0 / 2007-07-16
|
55
|
+
|
56
|
+
* Initial release
|
57
|
+
|
data/Manifest.txt
ADDED
data/README.txt
ADDED
@@ -0,0 +1,732 @@
|
|
1
|
+
websitary by Thomas Link
|
2
|
+
http://rubyforge.org/projects/websitiary/
|
3
|
+
|
4
|
+
This script monitors webpages, rss feeds, podcasts etc. and reports
|
5
|
+
what's new. For many tasks, it reuses other programs to do the actual
|
6
|
+
work. By default, it works on an ASCII basis, i.e. with the output of
|
7
|
+
text-based webbrowsers. With the help of some friends, it can also work
|
8
|
+
with HTML.
|
9
|
+
|
10
|
+
|
11
|
+
== DESCRIPTION:
|
12
|
+
websitary (formerly known as websitiary with an extra "i") monitors
|
13
|
+
webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff,
|
14
|
+
webdiff etc.) to do most of the actual work. By default, it works on an
|
15
|
+
ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or
|
16
|
+
lynx, links etc.) as the output can easily be post-processed. With the
|
17
|
+
help of some friends (see the section below on requirements), it can
|
18
|
+
also work with HTML. E.g., if you have websec installed, you can also
|
19
|
+
use its webdiff program to show colored diffs. This script was
|
20
|
+
originally planned as a ruby-based websec replacement. For HTML diffs,
|
21
|
+
it still relies on the webdiff perl script that comes with websec.
|
22
|
+
|
23
|
+
By default, this script will use w3m to dump HTML pages and then run
|
24
|
+
diff over the current page and the previous backup. Some pages are
|
25
|
+
better viewed with lynx or links. Downloaded documents (HTML or ASCII)
|
26
|
+
can be post-processed (e.g., filtered through some ruby block that
|
27
|
+
extracts elements via hpricot and the like). Please see the
|
28
|
+
configuration options below to find out how to change this globally or
|
29
|
+
for a single source.
|
30
|
+
|
31
|
+
|
32
|
+
== FEATURES/PROBLEMS:
|
33
|
+
* Handle webpages, rss feeds (optionally save attachments in podcasts
|
34
|
+
etc.)
|
35
|
+
* Compare webpages with previous backups
|
36
|
+
* Display differences between the current version and the backup
|
37
|
+
* Provide hooks to post-process the downloaded documents and the diff
|
38
|
+
* Display a one-page report summarizing all news
|
39
|
+
* Automatically open the report in your favourite web-browser
|
40
|
+
* Experimental: Download webpages at defined intervals and generate
|
41
|
+
incremental diffs.
|
42
|
+
|
43
|
+
ISSUES, TODO:
|
44
|
+
* With HTML output, changes are presented on one single page, which
|
45
|
+
means that pages with different encodings cause problems.
|
46
|
+
* Improved support for robots.txt (test it)
|
47
|
+
* The use of :website_below and :website is hardly tested (please
|
48
|
+
report errors).
|
49
|
+
* download => :body_html tries to rewrite references (a, img) which may
|
50
|
+
fail on certain kinds of urls (please report errors).
|
51
|
+
* When using :body_html for download, it may happen that some
|
52
|
+
JavaScript code is stripped, which breaks some JavaScript-generated
|
53
|
+
links.
|
54
|
+
* The --log command-line option will create a new instance of the logger and
|
55
|
+
thus reset any previous options related to the logging level.
|
56
|
+
|
57
|
+
NOTE: The script was previously called websitiary but was renamed (from
|
58
|
+
0.2 on) to websitary (without the superfluous i).
|
59
|
+
|
60
|
+
|
61
|
+
=== CAVEAT:
|
62
|
+
The script also includes experimental support for monitoring whole
|
63
|
+
websites. Basically, this script supports robots.txt directives (see
|
64
|
+
requirements) but this is hardly tested and may not work in some cases.
|
65
|
+
|
66
|
+
While it is okay for your own websites to ignore robots.txt, it is not
|
67
|
+
for others. Please make sure that the webpages you run this program on
|
68
|
+
allow such a use. Some webpages disallow the use of any automatic
|
69
|
+
downloader or offline reader in their user agreements.
|
70
|
+
|
71
|
+
|
72
|
+
== SYNOPSIS:
|
73
|
+
This manual is also available as
|
74
|
+
PDF[http://websitiary.rubyforge.org/websitary.pdf].
|
75
|
+
|
76
|
+
=== Usage
|
77
|
+
Example:
|
78
|
+
# Run "profile"
|
79
|
+
websitary profile
|
80
|
+
|
81
|
+
# Edit "~/.websitary/profile.rb"
|
82
|
+
websitary -eedit profile
|
83
|
+
|
84
|
+
# View the latest report
|
85
|
+
websitary -ereview
|
86
|
+
|
87
|
+
# Refetch all sources regardless of :days and :hours restrictions
|
88
|
+
websitary -signore_age=true
|
89
|
+
|
90
|
+
# Create html and rss reports for my websites
|
91
|
+
websitary -fhtml,rss mysites
|
92
|
+
|
93
|
+
# Add an url to the quicklist profile
|
94
|
+
websitary -eadd http://www.example.com
|
95
|
+
|
96
|
+
For example output see:
|
97
|
+
* html[http://deplate.sourceforge.net/websitary.html]
|
98
|
+
* rss[http://deplate.sourceforge.net/websitary.rss]
|
99
|
+
* text[http://deplate.sourceforge.net/websitary.txt]
|
100
|
+
|
101
|
+
|
102
|
+
=== Configuration
|
103
|
+
Profiles are plain ruby files (with the '.rb' suffix) stored in
|
104
|
+
~/.websitary/.
|
105
|
+
|
106
|
+
The profile "config" (~/.websitary/config.rb) is always loaded if
|
107
|
+
available.
|
108
|
+
|
109
|
+
There are two special profile names:
|
110
|
+
|
111
|
+
-::
|
112
|
+
Read URLs from STDIN.
|
113
|
+
<tt>__END__</tt>::
|
114
|
+
Read the profile contained in the script source after the __END__
|
115
|
+
line.
|
116
|
+
|
117
|
+
|
118
|
+
==== default 'PROFILE1', 'PROFILE2' ...
|
119
|
+
Set the default profile(s). The default is: quicklist
|
120
|
+
|
121
|
+
Example:
|
122
|
+
default 'my_profile'
|
123
|
+
|
124
|
+
|
125
|
+
==== diff 'CMD "%s" "%s"'
|
126
|
+
Use this shell command to make the diff.
|
127
|
+
The two %s placeholders will be replaced with the old and the new filename.
|
128
|
+
|
129
|
+
diff is used by default.
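
How the two placeholders get filled can be sketched with Ruby's String#% operator. This is only an illustration of the substitution step; the helper name and the file names are made up, and websitary's internal code may differ.

```ruby
# Sketch: fill the two "%s" placeholders of a diff command template.
# build_diff_command is a hypothetical helper, not part of websitary.
def build_diff_command(template, old_file, new_file)
  # String#% substitutes the array elements in order:
  # the old file first, the new file second.
  template % [old_file, new_file]
end

puts build_diff_command('diff -u "%s" "%s"',
                        '/tmp/cache/page.old', '/tmp/cache/page.new')
```

Keeping the double quotes around each %s in the template lets the command survive filenames that contain spaces.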
|
130
|
+
|
131
|
+
|
132
|
+
==== diffprocess lambda {|text| ...}
|
133
|
+
Use this ruby snippet to post-process the diff.
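
A diff post-processor is just a lambda that takes the diff text and returns the modified text. The following is a minimal sketch of such a lambda; the helper name emphasize and the choice of <b> tags are assumptions made for illustration.

```ruby
# Sketch of a diff post-processor: wrap every match of a pattern in <b> tags.
# "emphasize" is a hypothetical helper, not part of websitary.
def emphasize(pattern)
  lambda { |text| text.gsub(pattern) { |match| "<b>#{match}</b>" } }
end

process = emphasize(/ruby/i)
puts process.call("+ New Ruby release\n- old ruby release")
```

The lambda returned by emphasize has the shape expected by the diffprocess command, i.e. one text argument in, one text out.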
|
134
|
+
|
135
|
+
|
136
|
+
==== download 'CMD "%s"'
|
137
|
+
Use this shell command to download a page.
|
138
|
+
%s will be replaced with the url.
|
139
|
+
|
140
|
+
w3m is used by default.
|
141
|
+
|
142
|
+
Example:
|
143
|
+
download 'lynx -dump "%s"'
|
144
|
+
|
145
|
+
|
146
|
+
==== downloadprocess lambda {|text| ...}
|
147
|
+
Use this ruby snippet to post-process what was downloaded. Return the
|
148
|
+
new text.
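
As an illustration, a download post-processor might drop lines that change on every fetch so that they do not show up as differences. This is a sketch under the assumption that the page carries a "Last updated: ..." line; the constant name is made up.

```ruby
# Sketch of a download post-processor: remove volatile lines (here a
# hypothetical "Last updated: ..." footer) and return the remaining text.
DROP_VOLATILE = lambda do |text|
  text.lines.reject { |line| line =~ /^\s*Last updated:/i }.join
end

puts DROP_VOLATILE.call("News item\nLast updated: today\nAnother item\n")
```

Note that the lambda returns the new text as its last expression, as required above.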
|
149
|
+
|
150
|
+
|
151
|
+
==== edit 'CMD "%s"'
|
152
|
+
Use this shell command to edit a profile. %s will be replaced with the filename.
|
153
|
+
|
154
|
+
vi is used by default.
|
155
|
+
|
156
|
+
Example:
|
157
|
+
edit 'gvim "%s"&'
|
158
|
+
|
159
|
+
|
160
|
+
==== option TYPE, OPTION => VALUE
|
161
|
+
Set a global option.
|
162
|
+
|
163
|
+
TYPE can be one of:
|
164
|
+
<tt>:diff</tt>::
|
165
|
+
Generate a diff
|
166
|
+
<tt>:diffprocess</tt>::
|
167
|
+
Post-process a diff (if necessary)
|
168
|
+
<tt>:format</tt>::
|
169
|
+
Format the diff for output
|
170
|
+
<tt>:download</tt>::
|
171
|
+
Download webpages
|
172
|
+
<tt>:downloadprocess</tt>::
|
173
|
+
Post-process downloaded webpages
|
174
|
+
<tt>:page</tt>::
|
175
|
+
The :format field defines the format of the final report. Here VALUE
|
176
|
+
is a format string that takes 3 variables as arguments: report title,
|
177
|
+
toc, contents.
|
178
|
+
<tt>:global</tt>::
|
179
|
+
Set a "global" option.
|
180
|
+
|
181
|
+
OPTION is a symbol
|
182
|
+
|
183
|
+
VALUE is either a format string or a block of code (of class Proc).
|
184
|
+
|
185
|
+
Example:
|
186
|
+
set :download, :foo => lambda {|url| get_url(url)}
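
For the :page type, VALUE is a format string with three slots. The sketch below shows how such a string could be filled in the order report title, toc, contents; the markup and the helper name render_page are assumptions, not websitary's default template.

```ruby
# Sketch: a page template with three "%s" slots, filled in the order
# report title, table of contents, contents. The markup is a made-up example.
PAGE_TEMPLATE = "<html><head><title>%s</title></head>\n<body>\n%s\n%s\n</body></html>\n"

def render_page(title, toc, contents)
  PAGE_TEMPLATE % [title, toc, contents]
end

puts render_page('Report', '<ul><li>example.com</li></ul>', '<pre>+ new line</pre>')
```

A literal percent sign inside such a template (e.g. in a CSS width) has to be written as %%, because String#% would otherwise treat it as the start of a placeholder.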
|
187
|
+
|
188
|
+
|
189
|
+
==== global OPTION => VALUE
|
190
|
+
This is the same as <tt>option :global, OPTION => VALUE</tt>.
|
191
|
+
|
192
|
+
Known global options:
|
193
|
+
|
194
|
+
<tt>:filename_size => N</tt>::
|
195
|
+
The max filename size. If a filename becomes longer, md5 encoding will
|
196
|
+
be used for local copies in the cache.
|
197
|
+
|
198
|
+
<tt>:downloadhtml => SHORTCUT</tt>::
|
199
|
+
The default shortcut for downloading plain HTML.
|
200
|
+
|
201
|
+
<tt>:file_url => BLOCK(FILENAME)</tt>::
|
202
|
+
Rewrite a filename as it is used for creating file urls to local
|
203
|
+
copies in the output. This may be useful if you want to use the same
|
204
|
+
repository on several computers in different locations etc.
|
205
|
+
|
206
|
+
<tt>:canonic_filename => BLOCK(FILENAME)</tt>::
|
207
|
+
Rewrite filenames as they are stored in the mtimes register. This may be
|
208
|
+
useful if you want to use the same repository on several computers
|
209
|
+
in different locations etc.
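
The effect of :filename_size can be sketched as follows: a readable name is derived from the URL, and an MD5 hex digest is used instead once that name exceeds the limit. The helper name, the escaping scheme and the default limit are assumptions; only the MD5 fallback is taken from the description above.

```ruby
require 'digest/md5'
require 'uri'

# Sketch: derive a cache filename from a URL and fall back to an MD5 hex
# digest when the escaped name exceeds max_size. cache_filename is a
# hypothetical helper, not websitary's actual implementation.
def cache_filename(url, max_size = 255)
  name = URI.encode_www_form_component(url)
  name.size > max_size ? Digest::MD5.hexdigest(url) : name
end

puts cache_filename('http://www.example.com/')
puts cache_filename('http://www.example.com/' + 'x' * 300, 64)
```

An MD5 hex digest is always 32 characters long, so the fallback name stays short regardless of the length of the URL.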
|
210
|
+
|
211
|
+
|
212
|
+
==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
|
213
|
+
Set the output format.
|
214
|
+
Format can be one of:
|
215
|
+
|
216
|
+
* html
|
217
|
+
* text, txt (this only works with text based downloaders)
|
218
|
+
* rss (proof of concept only;
|
219
|
+
it requires :rss[:url] to be set to the url, where the rss feed will
|
220
|
+
be published, using the <tt>option :rss, :url => URL</tt>
|
221
|
+
configuration command; you either have to use a text-based downloader
|
222
|
+
or add <tt>:rss_format => 'html'</tt> to the url options)
|
223
|
+
|
224
|
+
|
225
|
+
==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
|
226
|
+
(Un)Set an option for the following source commands.
|
227
|
+
|
228
|
+
Example:
|
229
|
+
set :download, :foo => lambda {|url| get_url(url)}
|
230
|
+
set :days => 7, :sort => true
|
231
|
+
unset :days, :sort
|
232
|
+
|
233
|
+
|
234
|
+
==== source URL(S), [OPTIONS]
|
235
|
+
Options
|
236
|
+
|
237
|
+
<tt>:cols => FROM..TO</tt>::
|
238
|
+
Use only these columns from the output (used after applying the :lines
|
239
|
+
option)
|
240
|
+
|
241
|
+
<tt>:depth => INTEGER</tt>::
|
242
|
+
In conjunction with a :website type of :download option, fetch url up
|
243
|
+
to this depth.
|
244
|
+
|
245
|
+
<tt>:diff => "CMD", :diff => SHORTCUT</tt>::
|
246
|
+
Use this command to make the diff for this page. Possible values for
|
247
|
+
SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
|
248
|
+
:wget, or :body_html). :body_html, :website_below, :website and
|
249
|
+
:openuri are synonyms for :webdiff.
|
250
|
+
|
251
|
+
<tt>:diffprocess => lambda {|text| ...}</tt>::
|
252
|
+
Use this ruby snippet to post-process this diff
|
253
|
+
|
254
|
+
<tt>:download => "CMD", :download => SHORTCUT</tt>::
|
255
|
+
Use this command to download this page. For possible values for
|
256
|
+
SHORTCUT see the section on shortcuts below.
|
257
|
+
|
258
|
+
<tt>:downloadprocess => lambda {|text| ...}</tt>::
|
259
|
+
Use this ruby snippet to post-process what was downloaded. This is the
|
260
|
+
place where, e.g., hpricot can be used to extract certain elements
|
261
|
+
from the HTML code.
|
262
|
+
Example:
|
263
|
+
lambda {|text| Hpricot(text).at('div#content').inner_html}
|
264
|
+
|
265
|
+
<tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
|
266
|
+
The format string for the diff text. The default (the :diff shortcut)
|
267
|
+
wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
|
268
|
+
:website, and :openuri will simply add a newline character.
|
269
|
+
|
270
|
+
<tt>:hours => HOURS, :days => DAYS</tt>::
|
271
|
+
Don't download the file unless it's older than that
|
272
|
+
|
273
|
+
<tt>:days_of_month => DAY..DAY, :mdays => DAY..DAY</tt>::
|
274
|
+
Download only once per month within a certain range of days (e.g.,
|
275
|
+
15..31 ... Check once after the 15th). The argument can also be an
|
276
|
+
array (e.g., [1, 15]) or an integer.
|
277
|
+
|
278
|
+
<tt>:days_of_week => DAY..DAY, :wdays => DAY..DAY</tt>::
|
279
|
+
Download only once per week within a certain range of days (e.g., 1..2
|
280
|
+
... Check once on Monday or Tuesday; Sunday = 0). The argument can
|
281
|
+
also be an array (e.g., [1, 3]) or an integer.
|
282
|
+
|
283
|
+
<tt>:daily => true</tt>::
|
284
|
+
Download only once a day.
|
285
|
+
|
286
|
+
<tt>:ignore_age => true</tt>::
|
287
|
+
Ignore any :days and :hours settings. This is useful in some cases
|
288
|
+
when set on the command line.
|
289
|
+
|
290
|
+
<tt>:lines => FROM..TO</tt>::
|
291
|
+
Use only these lines from the output
|
292
|
+
|
293
|
+
<tt>:match => REGEXP</tt>::
|
294
|
+
When recursively walking a website, follow only links that match this
|
295
|
+
regexp.
|
296
|
+
|
297
|
+
<tt>:rss_rewrite_enclosed_urls => true</tt>::
|
298
|
+
If true, replace urls in the rss feed item description pointing to the
|
299
|
+
enclosure with a file url pointing to the local copy
|
300
|
+
|
301
|
+
<tt>:rss_enclosure => true|"DIRECTORY"</tt>::
|
302
|
+
If true, save rss feed enclosures in
|
303
|
+
"~/.websitary/attachments/RSS_FEED_NAME/". If a string, use this as
|
304
|
+
destination directory.
|
305
|
+
|
306
|
+
<tt>:rss_format (default: "plain_text")</tt>::
|
307
|
+
When output format is :rss, create rss item descriptions as plain text.
|
308
|
+
|
309
|
+
<tt>:show_initial => true</tt>::
|
310
|
+
Include initial copies in the report (may not always work properly).
|
311
|
+
This can also be set as a global option.
|
312
|
+
|
313
|
+
<tt>:sleep => SECS</tt>::
|
314
|
+
Wait SECS seconds (float or integer) before downloading the page.
|
315
|
+
|
316
|
+
<tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
|
317
|
+
Sort lines in output
|
318
|
+
|
319
|
+
<tt>:strip => true</tt>::
|
320
|
+
Strip empty lines
|
321
|
+
|
322
|
+
<tt>:title => "TEXT"</tt>::
|
323
|
+
Display TEXT instead of URL
|
324
|
+
|
325
|
+
<tt>:use => SYMBOL</tt>::
|
326
|
+
Use SYMBOL for any other option. I.e. <tt>:download => :body_html,
|
327
|
+
:diff => :webdiff</tt> can be abbreviated as <tt>:use =>
|
328
|
+
:body_html</tt> (because for :diff :body_html is a synonym for
|
329
|
+
:webdiff).
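
The text-oriented options :lines, :cols, :strip and :sort can be pictured as a small filter pipeline. The sketch below applies :lines before :cols, as stated above; where :strip and :sort sit in the pipeline is an assumption, and filter_text is a hypothetical helper.

```ruby
# Sketch: apply :lines, then :cols (the order stated above), then :strip
# and :sort (their position in the pipeline is an assumption).
def filter_text(text, opts = {})
  lines = text.split("\n")
  lines = lines[opts[:lines]] || [] if opts[:lines]              # keep a range of lines
  lines = lines.map { |l| l[opts[:cols]] || '' } if opts[:cols]  # keep a range of columns
  lines = lines.reject { |l| l.strip.empty? } if opts[:strip]    # drop empty lines
  lines = lines.sort if opts[:sort]                              # sort what is left
  lines.join("\n")
end

puts filter_text("header\nbeta\n\nalpha\n", :lines => 1..-1, :strip => true, :sort => true)
```

Ranges with negative ends, such as 10..-1, work here because they are passed straight to Array#[] and String#[].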
|
330
|
+
|
331
|
+
The order of age constraints is:
|
332
|
+
:hours > :daily > :wdays > :mdays > :days > :months.
|
333
|
+
I.e. if :wdays is set, :mdays, :days, or :months are ignored.
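
The precedence rule can be sketched as picking the first option, in the order given above, that is present in a source's options. Only the selection step is shown; the names are made up and the actual date checks are left out.

```ruby
# Sketch: pick the age constraint that takes effect, following the
# precedence listed above. effective_age_constraint is a hypothetical helper.
AGE_PRECEDENCE = [:hours, :daily, :wdays, :mdays, :days, :months]

def effective_age_constraint(opts)
  key = AGE_PRECEDENCE.find { |k| opts.key?(k) }
  key && [key, opts[key]]
end

# :wdays wins over :days because it comes earlier in the list.
p effective_age_constraint(:days => 7, :wdays => 1..2)
```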
|
334
|
+
|
335
|
+
|
336
|
+
==== view 'CMD "%s"'
|
337
|
+
Use this shell command to view the output (usually an HTML file).
|
338
|
+
%s will be replaced with the filename.
|
339
|
+
|
340
|
+
w3m is used by default.
|
341
|
+
|
342
|
+
Example:
|
343
|
+
view 'gnome-open "%s"' # Gnome Desktop
|
344
|
+
view 'kfmclient "%s"' # KDE
|
345
|
+
view 'cygstart "%s"' # Cygwin
|
346
|
+
view 'start "%s"' # Windows
|
347
|
+
view 'firefox "%s"'
|
348
|
+
|
349
|
+
|
350
|
+
=== Shortcuts for use with :use, :download and other options
|
351
|
+
<tt>:w3m</tt>::
|
352
|
+
Use w3m for downloading the source. Use diff for generating diffs.
|
353
|
+
|
354
|
+
<tt>:lynx</tt>::
|
355
|
+
Use lynx for downloading the source. Use diff for generating diffs.
|
356
|
+
Lynx doesn't try to recreate the layout of a page like w3m or links
|
357
|
+
do. As a result the output IMHO sometimes deviates from the original
|
358
|
+
design but is better suited for being post-processed in some
|
359
|
+
situations.
|
360
|
+
|
361
|
+
<tt>:links</tt>::
|
362
|
+
Use links for downloading the source. Use diff for generating diffs.
|
363
|
+
|
364
|
+
<tt>:curl</tt>::
|
365
|
+
Use curl for downloading the source. Use webdiff for generating diffs.
|
366
|
+
|
367
|
+
<tt>:wget</tt>::
|
368
|
+
Use wget for downloading the source. Use webdiff for generating diffs.
|
369
|
+
|
370
|
+
<tt>:openuri</tt>::
|
371
|
+
Use open-uri for downloading the source. Use webdiff for generating
|
372
|
+
diffs. This doesn't handle cookies and the like.
|
373
|
+
|
374
|
+
<tt>:text</tt>::
|
375
|
+
This requires hpricot to be installed. Use open-uri for downloading
|
376
|
+
and hpricot for converting HTML to plain text. This still requires
|
377
|
+
diff as external helper.
|
378
|
+
|
379
|
+
<tt>:body_html</tt>::
|
380
|
+
This requires hpricot to be installed. Use open-uri for downloading
|
381
|
+
the source, use only the body. Use webdiff for generating diffs. Try
|
382
|
+
to rewrite references (a, img) so that they point to the webpage. By
|
383
|
+
default, this will also strip tags like script, form, object ...
|
384
|
+
|
385
|
+
<tt>:website</tt>::
|
386
|
+
Use :body_html to download the source. Follow all links referring to
|
387
|
+
the same host with the same file suffix. Use webdiff for generating
|
388
|
+
diffs.
|
389
|
+
|
390
|
+
<tt>:website_below</tt>::
|
391
|
+
Use :body_html to download the source. Follow all links referring to
|
392
|
+
the same host and a file below the top directory with the same file
|
393
|
+
suffix. Use webdiff for generating diffs.
|
394
|
+
|
395
|
+
<tt>:website_txt</tt>::
|
396
|
+
Use :website to download the source but convert the output to plain
|
397
|
+
text.
|
398
|
+
|
399
|
+
<tt>:website_txt_below</tt>::
|
400
|
+
Use :website_below to download the source but convert the output to
|
401
|
+
plain text.
|
402
|
+
|
403
|
+
<tt>:rss</tt>::
|
404
|
+
Download an rss feed, show changed items.
|
405
|
+
|
406
|
+
<tt>:opml</tt>::
|
407
|
+
Experimental. Download the rss feeds registered in opml. No support
|
408
|
+
for atom yet.
|
409
|
+
|
410
|
+
<tt>:img</tt>::
|
411
|
+
Download an image and display it in the output if it has changed
|
412
|
+
(according to diff). You can use hpricot to extract an image from a
|
413
|
+
HTML source; see the example configuration file below.
|
414
|
+
|
415
|
+
Any shortcuts relying on :body_html will also try to rewrite any
|
416
|
+
references so that the links point to the webpage.
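
What rewriting a reference amounts to can be sketched with Ruby's standard uri library: a relative href or src is resolved against the URL of the page it was found on. The helper name absolutize is hypothetical; websitary's own rewriting may handle more cases.

```ruby
require 'uri'

# Sketch: resolve a (possibly relative) reference against the URL of the
# page it was found on. "absolutize" is a hypothetical name.
def absolutize(href, page_url)
  URI.join(page_url, href).to_s
end

puts absolutize('img/today.png', 'http://www.example.com/daily_image/')
puts absolutize('/about.html', 'http://www.example.com/foo/bar.html')
```

References that are already absolute pass through unchanged, so the helper can be applied to every a and img element without checking first.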
|
417
|
+
|
418
|
+
|
419
|
+
|
420
|
+
=== Example configuration file for demonstration purposes
|
421
|
+
|
422
|
+
# Daily
|
423
|
+
set :days => 1
|
424
|
+
|
425
|
+
# Use lynx instead of the default downloader (w3m).
|
426
|
+
source 'http://www.example.com', :days => 7, :download => :lynx
|
427
|
+
|
428
|
+
# Use the HTML body and process via webdiff.
|
429
|
+
source 'http://www.example.com', :use => :body_html,
|
430
|
+
:downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}
|
431
|
+
|
432
|
+
# Download a podcast
|
433
|
+
source 'http://www.example.com/podcast.xml', :title => 'Podcast',
|
434
|
+
:use => :rss,
|
435
|
+
:rss_enclosure => '/home/me/podcasts/example'
|
436
|
+
|
437
|
+
# Check an rss feed.
|
438
|
+
source 'http://www.example.com/news.xml', :title => 'News', :use => :rss
|
439
|
+
|
440
|
+
# Get rss feed info from an opml file (EXPERIMENTAL).
|
441
|
+
# @cfgdir is most likely '~/.websitary'.
|
442
|
+
source File.join(@cfgdir, 'news.opml'), :use => :opml
|
443
|
+
|
444
|
+
|
445
|
+
# Weekly
|
446
|
+
set :days => 7
|
447
|
+
|
448
|
+
# Consider the page body only from the 10th line downwards.
|
449
|
+
source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'
|
450
|
+
|
451
|
+
|
452
|
+
# Bi-weekly
|
453
|
+
set :days => 14
|
454
|
+
|
455
|
+
# Use these urls with the default options.
|
456
|
+
source <<URLS
|
457
|
+
http://www.example.com
|
458
|
+
http://www.example.com/page.html
|
459
|
+
URLS
|
460
|
+
|
461
|
+
# Make HTML diffs and highlight occurrences of a word
|
462
|
+
source 'http://www.example.com',
|
463
|
+
:title => 'Example',
|
464
|
+
:use => :body_html,
|
465
|
+
:diffprocess => highlighter(/word/i)
|
466
|
+
|
467
|
+
# Download the whole website below this path (only pages with
|
468
|
+
# html-suffix), wait 30 secs between downloads.
|
469
|
+
# Download only php and html pages
|
470
|
+
# Follow links 2 levels deep
|
471
|
+
source 'http://www.example.com/foo/bar.html',
|
472
|
+
:title => 'Example -- Bar',
|
473
|
+
:use => :website_below, :sleep => 30,
|
474
|
+
:match => /\.(php|html)\b/, :depth => 2
|
475
|
+
|
476
|
+
# Download images from some kind of daily-image site (check the user
|
477
|
+
# agreement first, if this is allowed). This may require some ruby
|
478
|
+
# hacking in order to extract the right url.
|
479
|
+
source 'http://www.example.com/daily_image/', :title => 'Daily Image',
|
480
|
+
:use => :img,
|
481
|
+
:download => lambda {|url|
|
482
|
+
# Read the HTML.
|
483
|
+
html = open(url) {|io| io.read}
|
484
|
+
# This check is probably unnecessary as the failure to read
|
485
|
+
# the HTML document would most likely result in an
|
486
|
+
# exception.
|
487
|
+
if html
|
488
|
+
rv = nil
|
489
|
+
# Parse the HTML document.
|
490
|
+
doc = Hpricot(html)
|
491
|
+
# The following could actually be simplified using xpath
|
492
|
+
# or css search expressions. This isn't the most elegant
|
493
|
+
# solution but it works with any value of ALT.
|
494
|
+
# This downloads the image <img src="..." alt="Current Image">
|
495
|
+
# Check all img tags in the HTML document.
|
496
|
+
for e in doc.search(%{//img})
|
497
|
+
# Is this the image we're looking for?
|
498
|
+
if e['alt'] == "Current Image"
|
499
|
+
# Make relative urls absolute
|
500
|
+
img = rewrite_href(e['src'], url)
|
501
|
+
# Get the actual image data
|
502
|
+
rv = open(img, 'rb') {|io| io.read}
|
503
|
+
# Exit the for loop
|
504
|
+
break
|
505
|
+
end
|
506
|
+
end
|
507
|
+
rv
|
508
|
+
end
|
509
|
+
}
|
510
|
+
|
511
|
+
|
512
|
+
unset :days
|
513
|
+
|
514
|
+
|
515
|
+
|
516
|
+
=== Commands for use with the -e command-line option
|
517
|
+
Most of these commands require you to name a profile on the command
|
518
|
+
line. You can define default profiles with the "default" configuration
|
519
|
+
command.
|
520
|
+
|
521
|
+
If no command is given, "downdiff" is executed.
|
522
|
+
|
523
|
+
add::
|
524
|
+
Add the URLs given on the command line to the quicklist profile.
|
525
|
+
ATTENTION: The following arguments on the command line are URLs, not
|
526
|
+
profile names.
|
527
|
+
|
528
|
+
aggregate::
|
529
|
+
Retrieve information and save changes for later review.
|
530
|
+
|
531
|
+
configuration::
|
532
|
+
Show the fully qualified configuration of each source.
|
533
|
+
|
534
|
+
downdiff::
|
535
|
+
Download and show differences (DEFAULT)
|
536
|
+
|
537
|
+
edit::
|
538
|
+
Edit the profile given on the command line (use vi by default)
|
539
|
+
|
540
|
+
latest::
|
541
|
+
Show the latest copies of the sources from the profiles given
|
542
|
+
on the command line.
|
543
|
+
|
544
|
+
rebuild::
|
545
|
+
Rebuild the latest report.
|
546
|
+
|
547
|
+
review::
|
548
|
+
Review the latest report (just show it with the browser)
|
549
|
+
|
550
|
+
show::
|
551
|
+
Show previously aggregated items. A typical use would be to
|
552
|
+
periodically run in the background a command like
|
553
|
+
websitary -eaggregate newsfeeds
|
554
|
+
and then
|
555
|
+
websitary -eshow newsfeeds
|
556
|
+
to review the changes.
|
557
|
+
|
558
|
+
unroll::
|
559
|
+
Undo the latest fetch.
|
560
|
+
|
561
|
+
|
562
|
+
|
563
|
+
== TIPS:
|
564
|
+
=== Ruby
|
565
|
+
The profiles are regular ruby sources that are evaluated in the context
|
566
|
+
of the configuration object (Websitary::Configuration). Find out more
|
567
|
+
about ruby at:
|
568
|
+
* http://www.ruby-lang.org/en/documentation/
|
569
|
+
* http://www.ruby-doc.org/docs/ProgrammingRuby/ (especially
|
570
|
+
the
|
571
|
+
language[http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html]
|
572
|
+
chapter)
|
573
|
+
|
574
|
+
|
575
|
+
=== Cygwin
|
576
|
+
Mixing native Windows apps and cygwin apps can cause problems. The
|
577
|
+
following settings (e.g. in ~/.websitary/config.rb) can be used to use
|
578
|
+
a native Windows editor and browser:
|
579
|
+
|
580
|
+
# Use the default Windows programs (as if double-clicked)
|
581
|
+
view '/usr/bin/cygstart "%s"'
|
582
|
+
|
583
|
+
# Translate the profile filename and edit it with a native Windows editor
|
584
|
+
edit 'notepad.exe $(cygpath -w -- "%s")'
|
585
|
+
|
586
|
+
# Rewrite cygwin filenames for use with a native Windows browser
|
587
|
+
option :global, :file_url => lambda {|f| f.sub(/\/cygdrive\/.+?\/.websitary\//, '')}
|
588
|
+
|
589
|
+
|
590
|
+
=== Windows
|
591
|
+
Backslashes usually have to be escaped by backslashes -- or use slashes.
|
592
|
+
I.e. instead of 'c:\foo\bar' write either 'c:\\foo\\bar' or
|
593
|
+
'c:/foo/bar'.
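
The following sketch compares three spellings of the same path in Ruby source. It only illustrates Ruby's string literal rules; how a path is treated once it is handed to a shell command is a separate matter. The constant names are made up.

```ruby
# Sketch: three spellings of a Windows path in Ruby source.
ESCAPED  = 'c:\\foo\\bar'  # doubled backslashes: c:\foo\bar
SLASHES  = 'c:/foo/bar'    # forward slashes
CARELESS = "c:\foo\bar"    # double quotes: \f and \b become control characters

puts ESCAPED.size    # 10 characters
puts CARELESS.size   # 8 characters, the backslashes are gone
puts ESCAPED.gsub('\\', '/') == SLASHES
```

The double-quoted variant silently turns \f into a form feed and \b into a backspace, which is why escaping the backslashes or using forward slashes is the safer habit.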
|
594
|
+
|
595
|
+
|
596
|
+
== REQUIREMENTS:
|
597
|
+
websitary is a ruby-based application. You thus need a ruby
|
598
|
+
interpreter.
|
599
|
+
|
600
|
+
It depends on how you use websitary whether you actually need the
|
601
|
+
following libraries and applications.
|
602
|
+
|
603
|
+
By default this script expects the following applications to be
|
604
|
+
present:
|
605
|
+
|
606
|
+
* diff
|
607
|
+
* vi (or some other editor)
|
608
|
+
|
609
|
+
and one of:
|
610
|
+
|
611
|
+
* w3m[http://w3m.sourceforge.net/] (default)
|
612
|
+
* lynx[http://lynx.isc.org/]
|
613
|
+
* links[http://links.twibright.com/]
|
614
|
+
* websec[http://baruch.ev-en.org/proj/websec/]
|
615
|
+
(or at Savannah[http://savannah.nongnu.org/projects/websec/])
|
616
|
+
|
617
|
+
The use of :webdiff as :diff application requires
|
618
|
+
websec[http://download.savannah.gnu.org/releases/websec/] to be
|
619
|
+
installed. In conjunction with :body_html, :openuri, or :curl, this
|
620
|
+
will give you colored HTML diffs.
|
621
|
+
Why not use +websec+ if I have to install it, you might ask. Well,
|
622
|
+
+websec+ is written in perl and I didn't quite manage to make it work
|
623
|
+
the way I want it to. websitary is meant to be easier to configure.
|
624
|
+
|
625
|
+
For downloading HTML, you need one of these:
|
626
|
+
|
627
|
+
* open-uri (should be part of ruby)
|
628
|
+
* hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
|
629
|
+
:body_html, :website, and :website_below)
|
630
|
+
* curl[http://curl.haxx.se/]
|
631
|
+
* wget[http://www.gnu.org/software/wget/]
|
632
|
+
|
633
|
+
The following ruby libraries are needed in conjunction with :body_html
|
634
|
+
and :website related shortcuts:
|
635
|
+
|
636
|
+
* hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
|
637
|
+
only the body etc.)
|
638
|
+
* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
|
639
|
+
for parsing robots.txt
|
640
|
+
|
641
|
+
I personally would suggest the following setup:
|
642
|
+
|
643
|
+
* w3m[http://w3m.sourceforge.net/]
|
644
|
+
* websec[http://baruch.ev-en.org/proj/websec/]
|
645
|
+
* hpricot[http://code.whytheluckystiff.net/hpricot]
|
646
|
+
* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
|
647
|
+
|
648
|
+
|
649
|
+
== INSTALL:
|
650
|
+
=== Use rubygems
|
651
|
+
Run
|
652
|
+
|
653
|
+
gem install websitary
|
654
|
+
|
655
|
+
This will download the package and install it.
|
656
|
+
|
657
|
+
|
658
|
+
=== Use the zip
|
659
|
+
The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
|
660
|
+
setup.rb that does the work. Run
|
661
|
+
|
662
|
+
ruby setup.rb
|
663
|
+
|
664
|
+
|
665
|
+
=== Initial Configuration
|
666
|
+
Please check the requirements section above and get the extra libraries
|
667
|
+
needed:
|
668
|
+
* hpricot
|
669
|
+
* robot_rules.rb
|
670
|
+
|
671
|
+
These could be installed by:
|
672
|
+
|
673
|
+
# Install hpricot
|
674
|
+
gem install hpricot
|
675
|
+
|
676
|
+
# Install robot_rules.rb
|
677
|
+
curl -O http://www.rubyquiz.com/quiz64_sols.zip
|
678
|
+
# Check the correct path to site_ruby first!
|
679
|
+
unzip -p quiz64_sols.zip "solutions/James Edward Gray II/robot_rules.rb" > /lib/ruby/site_ruby/1.8/robot_rules.rb
|
680
|
+
rm quiz64_sols.zip
|
681
|
+
|
682
|
+
You might then want to create a profile ~/.websitary/config.rb that is
|
683
|
+
loaded on every run. In this profile you could set the default output
|
684
|
+
viewer and profile editor, as well as a default profile.
|
685
|
+
|
686
|
+
Example:
|
687
|
+
|
688
|
+
# Load standard.rb if no profile is given on the command line.
|
689
|
+
default 'standard'
|
690
|
+
|
691
|
+
# Use cygwin's cygstart to view the output with the default HTML
|
692
|
+
# viewer
|
693
|
+
view '/usr/bin/cygstart "%s"'
|
694
|
+
|
695
|
+
# Use Windows gvim from cygwin ruby which is why we convert the path
|
696
|
+
# first
|
697
|
+
edit 'gvim $(cygpath -w -- "%s")'
|
698
|
+
|
699
|
+
Where these configuration files reside may differ. If the environment
|
700
|
+
variable $HOME is defined, the default is $HOME/.websitary/ unless one
|
701
|
+
of the following directories exists, which will then be used instead:
|
702
|
+
|
703
|
+
* $USERPROFILE/websitary (on Windows)
|
704
|
+
* SYSCONFDIR/websitary (where SYSCONFDIR usually is /etc but you can
|
705
|
+
run ruby to find out more:
|
706
|
+
<tt>ruby -rrbconfig -e "p RbConfig::CONFIG['sysconfdir']"</tt>)
|
707
|
+
|
708
|
+
If neither directory exists and no $HOME variable is defined, the
|
709
|
+
current directory will be used.
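
The lookup described above can be sketched as a small function. The method name, the injectable existence check and the order in which the two alternative directories are probed are assumptions made for illustration; websitary's own code may differ in detail.

```ruby
require 'rbconfig'

# Sketch of the lookup order described above. config_dir is a hypothetical
# helper; "exists" can be replaced to test the logic without touching disk.
def config_dir(env, sysconfdir, exists = File.method(:directory?))
  candidates = []
  candidates << File.join(env['USERPROFILE'], 'websitary') if env['USERPROFILE']
  candidates << File.join(sysconfdir, 'websitary') if sysconfdir
  found = candidates.find { |dir| exists.call(dir) }
  return found if found                                      # an existing alternative wins
  return File.join(env['HOME'], '.websitary') if env['HOME'] # default below $HOME
  '.'                                                        # last resort: current directory
end

puts config_dir(ENV, RbConfig::CONFIG['sysconfdir'])
```

Passing the environment and the existence check as arguments keeps the function free of global state, which makes the three cases easy to try out in irb.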
|
710
|
+
|
711
|
+
|
712
|
+
== LICENSE:
|
713
|
+
websitary Webpage Monitor
|
714
|
+
Copyright (C) 2007 Thomas Link
|
715
|
+
|
716
|
+
This program is free software; you can redistribute it and/or modify
|
717
|
+
it under the terms of the GNU General Public License as published by
|
718
|
+
the Free Software Foundation; either version 2 of the License, or
|
719
|
+
(at your option) any later version.
|
720
|
+
|
721
|
+
This program is distributed in the hope that it will be useful,
|
722
|
+
but WITHOUT ANY WARRANTY; without even the implied warranty of
|
723
|
+
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
|
724
|
+
GNU General Public License for more details.
|
725
|
+
|
726
|
+
You should have received a copy of the GNU General Public License
|
727
|
+
along with this program; if not, write to the Free Software
|
728
|
+
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
|
729
|
+
USA
|
730
|
+
|
731
|
+
|
732
|
+
% vi: ft=rd:tw=72:ts=4
|