websitary 0.2.0

data/History.txt ADDED
@@ -0,0 +1,57 @@
+ = 0.2.0
+
+ * Renamed the project from websitiary to websitary (without the
+   additional "i")
+ * The default output filename is now constructed on the basis of the
+   profile names joined with a comma.
+ * Apply rewrite-rules to URLs in text output.
+ * Set user-agent (:body_html)
+ * Exit with 1 if differences were found
+ * Command line options have slightly changed: -e now is the short form
+   for --execute
+ * Commands that can be triggered by the -e command-line switch: downdiff
+   (default), configuration (list currently configured urls), latest
+   (show the current version of all urls), review (show the latest
+   report)
+ * Protect against filenames being too long (max size can be configured
+   via: <tt>option :global, :filename_size => N</tt>)
+ * Try to migrate local copies from the older flat to the new
+   hierarchical cache layout
+ * Disabled -E/--edit, --review command-line options (use -e instead)
+ * Try to maintain file atime/mtime when copying/moving files
+ * FIX: Problem with loading robots.txt
+ * Respect meta tag: robots="nofollow" (noindex is only checked in
+   conjunction with :download => :website*)
+ * quicklist profile: register urls via the -eadd command-line switch;
+   see "Usage" for an example
+ * Temporarily save diffs so that they can be reused if websitary
+   exits ungracefully.
+ * Renamed :inner_html to :body_html
+ * New shortcuts: :ftp, :ftp_recursive, :img, :rss, :opml (rudimentary)
+ * New experimental commands: aggregate, show ... can be used to
+   periodically check for changes (e.g. of rss feeds) but to review these
+   changes only once in a while
+ * Experimental --timer command-line option to re-run websitary every X
+   seconds.
+ * The :rss differ has an option :rss_enclosure (true or directory name)
+   that will be used for automatically saving new enclosures (e.g. mp3
+   files in podcasts); in theory, one should thus be able to use
+   websitary as a podcatcher etc.
+ * Cache mtimes in order to reduce disk access.
+ * Special profile "__END__": The section in the script file after the
+   __END__ line. This seems useful in some situations when employing a
+   single script.
+ * Don't follow javascript links.
+ * New date constraints for sources:
+   :daily => true ... Once a day
+   :days_of_month => BEGIN..END ... download URL only once per month
+     within this range of days.
+   :days_of_week => BEGIN..END ... download URL only once per week
+     within this range of days.
+   :months => N (calculated on the basis of the calendar month, not the
+     number of days)
+
+ == 0.1.0 / 2007-07-16
+
+ * Initial release
+
data/Manifest.txt ADDED
@@ -0,0 +1,11 @@
+ History.txt
+ Manifest.txt
+ README.txt
+ Rakefile
+ setup.rb
+ bin/websitary
+ lib/websitary.rb
+ lib/websitary/applog.rb
+ lib/websitary/configuration.rb
+ lib/websitary/filemtimes.rb
+ lib/websitary/htmldiff.rb
data/README.txt ADDED
@@ -0,0 +1,732 @@
+ websitary by Thomas Link
+ http://rubyforge.org/projects/websitiary/
+
+ This script monitors webpages, rss feeds, podcasts etc. and reports
+ what's new. For many tasks, it reuses other programs to do the actual
+ work. By default, it works on an ASCII basis, i.e. with the output of
+ text-based webbrowsers. With the help of some friends, it can also work
+ with HTML.
+
+
+ == DESCRIPTION:
+ websitary (formerly known as websitiary with an extra "i") monitors
+ webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff,
+ webdiff etc.) to do most of the actual work. By default, it works on an
+ ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or
+ lynx, links etc.), as that output can easily be post-processed. With the
+ help of some friends (see the section below on requirements), it can
+ also work with HTML. E.g., if you have websec installed, you can also
+ use its webdiff program to show colored diffs. This script was
+ originally planned as a ruby-based websec replacement. For HTML diffs,
+ it still relies on the webdiff perl script that comes with websec.
+
+ By default, this script will use w3m to dump HTML pages and then run
+ diff over the current page and the previous backup. Some pages are
+ better viewed with lynx or links. Downloaded documents (HTML or ASCII)
+ can be post-processed (e.g., filtered through some ruby block that
+ extracts elements via hpricot and the like). Please see the
+ configuration options below to find out how to change this globally or
+ for a single source.
+
+
+ == FEATURES/PROBLEMS:
+ * Handle webpages, rss feeds (optionally save attachments in podcasts
+   etc.)
+ * Compare webpages with previous backups
+ * Display differences between the current version and the backup
+ * Provide hooks to post-process the downloaded documents and the diff
+ * Display a one-page report summarizing all news
+ * Automatically open the report in your favourite web-browser
+ * Experimental: Download webpages at defined intervals and generate
+   incremental diffs.
+
+ ISSUES, TODO:
+ * With HTML output, changes are presented on a single page, which
+   means that pages with different encodings cause problems.
+ * Improved support for robots.txt (test it)
+ * The use of :website_below and :website is hardly tested (please
+   report errors).
+ * :download => :body_html tries to rewrite references (a, img), which
+   may fail on certain kinds of urls (please report errors).
+ * When using :body_html for download, it may happen that some
+   JavaScript code is stripped, which breaks some JavaScript-generated
+   links.
+ * The --log command-line option will create a new instance of the
+   logger and thus reset any previous options related to the logging
+   level.
+
+ NOTE: The script was previously called websitiary but was renamed (from
+ 0.2 on) to websitary (without the superfluous i).
+
+
+ === CAVEAT:
+ The script also includes experimental support for monitoring whole
+ websites. Basically, this script supports robots.txt directives (see
+ requirements) but this is hardly tested and may not work in some cases.
+
+ While it is okay to ignore robots.txt for your own websites, it is not
+ for others. Please make sure that the webpages you run this program on
+ allow such use. Some webpages disallow the use of any automatic
+ downloader or offline reader in their user agreements.
+
+
+ == SYNOPSIS:
+ This manual is also available as
+ PDF[http://websitiary.rubyforge.org/websitary.pdf].
+
+ === Usage
+ Example:
+   # Run "profile"
+   websitary profile
+
+   # Edit "~/.websitary/profile.rb"
+   websitary --edit=profile
+
+   # View the latest report
+   websitary -ereview
+
+   # Refetch all sources regardless of :days and :hours restrictions
+   websitary -signore_age=true
+
+   # Create html and rss reports for my websites
+   websitary -fhtml,rss mysites
+
+   # Add a url to the quicklist profile
+   websitary -eadd http://www.example.com
+
+ For example output see:
+ * html[http://deplate.sourceforge.net/websitary.html]
+ * rss[http://deplate.sourceforge.net/websitary.rss]
+ * text[http://deplate.sourceforge.net/websitary.txt]
+
+
+ === Configuration
+ Profiles are plain ruby files (with the '.rb' suffix) stored in
+ ~/.websitary/.
+
+ The profile "config" (~/.websitary/config.rb) is always loaded if
+ available.
+
+ There are two special profile names:
+
+ -::
+   Read URLs from STDIN.
+ <tt>__END__</tt>::
+   Read the profile contained in the script source after the __END__
+   line.
+
+
+ ==== default 'PROFILE1', 'PROFILE2' ...
+ Set the default profile(s). The default is: quicklist
+
+ Example:
+   default 'my_profile'
+
+
+ ==== diff 'CMD "%s" "%s"'
+ Use this shell command to make the diff.
+ The two %s placeholders will be replaced with the old and the new
+ filename.
+
+ diff is used by default.
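+
+ Example (assuming GNU diff is installed; the flags are only an
+ illustration):
+   # Unified diffs, ignoring changes in the amount of whitespace
+   diff 'diff -u -w "%s" "%s"'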
+
+
+ ==== diffprocess lambda {|text| ...}
+ Use this ruby snippet to post-process the diff.
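+
+ Example (a sketch that simply truncates very long diffs and returns the
+ new text):
+   diffprocess lambda {|text|
+       text.length > 10000 ? text[0, 10000] + "\n[...]" : text
+   }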
+
+
+ ==== download 'CMD "%s"'
+ Use this shell command to download a page.
+ %s will be replaced with the url.
+
+ w3m is used by default.
+
+ Example:
+   download 'lynx -dump "%s"'
+
+
+ ==== downloadprocess lambda {|text| ...}
+ Use this ruby snippet to post-process what was downloaded. Return the
+ new text.
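+
+ Example (a sketch; it assumes hpricot is installed and that the page
+ has a div with the id "content"):
+   downloadprocess lambda {|text| Hpricot(text).at('div#content').inner_html}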
+
+
+ ==== edit 'CMD "%s"'
+ Use this shell command to edit a profile. %s will be replaced with the
+ filename.
+
+ vi is used by default.
+
+ Example:
+   edit 'gvim "%s"&'
+
+
+ ==== option TYPE, OPTION => VALUE
+ Set a global option.
+
+ TYPE can be one of:
+ <tt>:diff</tt>::
+   Generate a diff
+ <tt>:diffprocess</tt>::
+   Post-process a diff (if necessary)
+ <tt>:format</tt>::
+   Format the diff for output
+ <tt>:download</tt>::
+   Download webpages
+ <tt>:downloadprocess</tt>::
+   Post-process downloaded webpages
+ <tt>:page</tt>::
+   The :format field defines the format of the final report. Here VALUE
+   is a format string that takes 3 variables as arguments: report title,
+   toc, contents.
+ <tt>:global</tt>::
+   Set a "global" option.
+
+ OPTION is a symbol.
+
+ VALUE is either a format string or a block of code (of class Proc).
+
+ Example:
+   option :download, :foo => lambda {|url| get_url(url)}
+
+
+ ==== global OPTION => VALUE
+ This is the same as <tt>option :global, OPTION => VALUE</tt>.
+
+ Known global options:
+
+ <tt>:filename_size => N</tt>::
+   The max filename size. If a filename becomes longer, md5 encoding will
+   be used for local copies in the cache.
+
+ <tt>:downloadhtml => SHORTCUT</tt>::
+   The default shortcut for downloading plain HTML.
+
+ <tt>:file_url => BLOCK(FILENAME)</tt>::
+   Rewrite a filename as it is used for creating file urls to local
+   copies in the output. This may be useful if you want to use the same
+   repository on several computers in different locations etc.
+
+ <tt>:canonic_filename => BLOCK(FILENAME)</tt>::
+   Rewrite filenames as they are stored in the mtimes register. This may
+   be useful if you want to use the same repository on several computers
+   in different locations etc.
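+
+ Example (the values and paths are only an illustration):
+   # Use md5-encoded names for cached copies longer than 64 characters
+   global :filename_size => 64
+   # Rewrite file urls in the report, e.g. when the cache is shared
+   # between machines mounted at different locations
+   global :file_url => lambda {|filename| filename.sub(/^\/mnt\/share/, '/home/me')}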
+
+
+ ==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
+ Set the output format.
+ Format can be one of:
+
+ * html
+ * text, txt (this only works with text-based downloaders)
+ * rss (proof of concept only;
+   it requires :rss[:url] to be set to the url where the rss feed will
+   be published, using the <tt>option :rss, :url => URL</tt>
+   configuration command; you either have to use a text-based downloader
+   or add <tt>:rss_format => 'html'</tt> to the url options; see the
+   example below)
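+
+ Example (a sketch; the feed url is only an illustration):
+   output_format ['html', 'rss']
+   option :rss, :url => 'http://www.example.com/websitary.rss'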
+
+
+ ==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
+ (Un)Set an option for the following source commands.
+
+ Example:
+   set :download, :foo => lambda {|url| get_url(url)}
+   set :days => 7, :sort => true
+   unset :days, :sort
+
+
+ ==== source URL(S), [OPTIONS]
+ Options:
+
+ <tt>:cols => FROM..TO</tt>::
+   Use only these columns from the output (used after applying the
+   :lines option)
+
+ <tt>:depth => INTEGER</tt>::
+   In conjunction with a :website type of :download option, fetch urls
+   up to this depth.
+
+ <tt>:diff => "CMD", :diff => SHORTCUT</tt>::
+   Use this command to make the diff for this page. Possible values for
+   SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
+   :wget, or :body_html). :body_html, :website_below, :website and
+   :openuri are synonyms for :webdiff.
+
+ <tt>:diffprocess => lambda {|text| ...}</tt>::
+   Use this ruby snippet to post-process this diff.
+
+ <tt>:download => "CMD", :download => SHORTCUT</tt>::
+   Use this command to download this page. For possible values for
+   SHORTCUT see the section on shortcuts below.
+
+ <tt>:downloadprocess => lambda {|text| ...}</tt>::
+   Use this ruby snippet to post-process what was downloaded. This is the
+   place where, e.g., hpricot can be used to extract certain elements
+   from the HTML code.
+   Example:
+     lambda {|text| Hpricot(text).at('div#content').inner_html}
+
+ <tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
+   The format string for the diff text. The default (the :diff shortcut)
+   wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
+   :website, and :openuri will simply add a newline character.
+
+ <tt>:hours => HOURS, :days => DAYS</tt>::
+   Don't download the file unless the local copy is older than that.
+
+ <tt>:days_of_month => DAY..DAY, :mdays => DAY..DAY</tt>::
+   Download only once per month within a certain range of days (e.g.,
+   15..31 ... Check once after the 15th). The argument can also be an
+   array (e.g., [1, 15]) or an integer.
+
+ <tt>:days_of_week => DAY..DAY, :wdays => DAY..DAY</tt>::
+   Download only once per week within a certain range of days (e.g., 1..2
+   ... Check once on monday or tuesday; sunday = 0). The argument can
+   also be an array (e.g., [1, 5]) or an integer.
+
+ <tt>:daily => true</tt>::
+   Download only once a day.
+
+ <tt>:ignore_age => true</tt>::
+   Ignore any :days and :hours settings. This is useful in some cases
+   when set on the command line.
+
+ <tt>:lines => FROM..TO</tt>::
+   Use only these lines from the output
+
+ <tt>:match => REGEXP</tt>::
+   When recursively walking a website, follow only links that match this
+   regexp.
+
+ <tt>:rss_rewrite_enclosed_urls => true</tt>::
+   If true, replace urls in the rss feed item description pointing to the
+   enclosure with a file url pointing to the local copy.
+
+ <tt>:rss_enclosure => true|"DIRECTORY"</tt>::
+   If true, save rss feed enclosures in
+   "~/.websitary/attachments/RSS_FEED_NAME/". If a string, use this as
+   destination directory.
+
+ <tt>:rss_format (default: "plain_text")</tt>::
+   When the output format is :rss, create rss item descriptions as plain
+   text.
+
+ <tt>:show_initial => true</tt>::
+   Include initial copies in the report (may not always work properly).
+   This can also be set as a global option.
+
+ <tt>:sleep => SECS</tt>::
+   Wait SECS seconds (float or integer) before downloading the page.
+
+ <tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
+   Sort lines in output
+
+ <tt>:strip => true</tt>::
+   Strip empty lines
+
+ <tt>:title => "TEXT"</tt>::
+   Display TEXT instead of URL
+
+ <tt>:use => SYMBOL</tt>::
+   Use SYMBOL for any other option. I.e. <tt>:download => :body_html,
+   :diff => :webdiff</tt> can be abbreviated as <tt>:use =>
+   :body_html</tt> (because for :diff, :body_html is a synonym for
+   :webdiff).
+
+ The order of age constraints is:
+ :hours > :daily > :wdays > :mdays > :days > :months.
+ I.e. if :wdays is set, :mdays, :days, or :months are ignored.
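+
+ Example (a sketch; the urls are only an illustration):
+   # Check only on monday or tuesday (sunday = 0)
+   source 'http://www.example.com/weekly.html', :days_of_week => 1..2
+   # Check once per month, some time after the 15th
+   source 'http://www.example.com/monthly.html', :days_of_month => 15..31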
+
+
+ ==== view 'CMD "%s"'
+ Use this shell command to view the output (usually an HTML file).
+ %s will be replaced with the filename.
+
+ w3m is used by default.
+
+ Example:
+   view 'gnome-open "%s"'  # Gnome Desktop
+   view 'kfmclient "%s"'   # KDE
+   view 'cygstart "%s"'    # Cygwin
+   view 'start "%s"'       # Windows
+   view 'firefox "%s"'
+
+
+ === Shortcuts for use with :use, :download and other options
+ <tt>:w3m</tt>::
+   Use w3m for downloading the source. Use diff for generating diffs.
+
+ <tt>:lynx</tt>::
+   Use lynx for downloading the source. Use diff for generating diffs.
+   Lynx doesn't try to recreate the layout of a page like w3m or links
+   do. As a result, the output IMHO sometimes deviates from the original
+   design but is better suited for being post-processed in some
+   situations.
+
+ <tt>:links</tt>::
+   Use links for downloading the source. Use diff for generating diffs.
+
+ <tt>:curl</tt>::
+   Use curl for downloading the source. Use webdiff for generating diffs.
+
+ <tt>:wget</tt>::
+   Use wget for downloading the source. Use webdiff for generating diffs.
+
+ <tt>:openuri</tt>::
+   Use open-uri for downloading the source. Use webdiff for generating
+   diffs. This doesn't handle cookies and the like.
+
+ <tt>:text</tt>::
+   This requires hpricot to be installed. Use open-uri for downloading
+   and hpricot for converting HTML to plain text. This still requires
+   diff as an external helper.
+
+ <tt>:body_html</tt>::
+   This requires hpricot to be installed. Use open-uri for downloading
+   the source, use only the body. Use webdiff for generating diffs. Try
+   to rewrite references (a, img) so that they point to the webpage. By
+   default, this will also strip tags like script, form, object ...
+
+ <tt>:website</tt>::
+   Use :body_html to download the source. Follow all links referring to
+   the same host with the same file suffix. Use webdiff for generating
+   diffs.
+
+ <tt>:website_below</tt>::
+   Use :body_html to download the source. Follow all links referring to
+   the same host and a file below the top directory with the same file
+   suffix. Use webdiff for generating diffs.
+
+ <tt>:website_txt</tt>::
+   Use :website to download the source but convert the output to plain
+   text.
+
+ <tt>:website_txt_below</tt>::
+   Use :website_below to download the source but convert the output to
+   plain text.
+
+ <tt>:rss</tt>::
+   Download an rss feed, show changed items.
+
+ <tt>:opml</tt>::
+   Experimental. Download the rss feeds registered in opml. No support
+   for atom yet.
+
+ <tt>:img</tt>::
+   Download an image and display it in the output if it has changed
+   (according to diff). You can use hpricot to extract an image from an
+   HTML source. Example:
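+   (A sketch only; the url and the alt text are made up. See also the
+   daily-image entry in the example configuration below.)
+
+     source 'http://www.example.com/image-of-the-day/', :use => :img,
+         :download => lambda {|url|
+             # Read and parse the HTML page.
+             html = open(url) {|io| io.read}
+             # Find the <img> tag with the (made-up) alt text.
+             img = Hpricot(html).search(%{//img}).find {|e|
+                 e['alt'] == 'Image of the day'
+             }
+             # Make the relative url absolute and fetch the image data.
+             img && open(rewrite_href(img['src'], url), 'rb') {|io| io.read}
+         }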
+
+ Any shortcuts relying on :body_html will also try to rewrite any
+ references so that the links point to the webpage.
+
+
+
+ === Example configuration file for demonstration purposes
+
+   # Daily
+   set :days => 1
+
+   # Use lynx instead of the default downloader (w3m).
+   source 'http://www.example.com', :days => 7, :download => :lynx
+
+   # Use the HTML body and process via webdiff.
+   source 'http://www.example.com', :use => :body_html,
+       :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}
+
+   # Download a podcast
+   source 'http://www.example.com/podcast.xml', :title => 'Podcast',
+       :use => :rss,
+       :rss_enclosure => '/home/me/podcasts/example'
+
+   # Check an rss feed.
+   source 'http://www.example.com/news.xml', :title => 'News', :use => :rss
+
+   # Get rss feed info from an opml file (EXPERIMENTAL).
+   # @cfgdir is most likely '~/.websitary'.
+   source File.join(@cfgdir, 'news.opml'), :use => :opml
+
+
+   # Weekly
+   set :days => 7
+
+   # Consider the page body only from the 10th line downwards.
+   source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'
+
+
+   # Bi-weekly
+   set :days => 14
+
+   # Use these urls with the default options.
+   source <<-URLS
+   http://www.example.com
+   http://www.example.com/page.html
+   URLS
+
+   # Make HTML diffs and highlight occurrences of a word
+   source 'http://www.example.com',
+       :title => 'Example',
+       :use => :body_html,
+       :diffprocess => highlighter(/word/i)
+
+   # Download the whole website below this path (only pages with
+   # html-suffix), wait 30 secs between downloads.
+   # Download only php and html pages
+   # Follow links 2 levels deep
+   source 'http://www.example.com/foo/bar.html',
+       :title => 'Example -- Bar',
+       :use => :website_below, :sleep => 30,
+       :match => /\.(php|html)\b/, :depth => 2
+
+   # Download images from some kind of daily-image site (check the user
+   # agreement first, if this is allowed). This may require some ruby
+   # hacking in order to extract the right url.
+   source 'http://www.example.com/daily_image/', :title => 'Daily Image',
+       :use => :img,
+       :download => lambda {|url|
+           # Read the HTML.
+           html = open(url) {|io| io.read}
+           # This check is probably unnecessary as the failure to read
+           # the HTML document would most likely result in an
+           # exception.
+           if html
+               rv = nil
+               # Parse the HTML document.
+               doc = Hpricot(html)
+               # The following could actually be simplified using xpath
+               # or css search expressions. This isn't the most elegant
+               # solution but it works with any value of ALT.
+               # This downloads the image <img src="..." alt="Current Image">
+               # Check all img tags in the HTML document.
+               for e in doc.search(%{//img})
+                   # Is this the image we're looking for?
+                   if e['alt'] == "Current Image"
+                       # Make relative urls absolute
+                       img = rewrite_href(e['src'], url)
+                       # Get the actual image data
+                       rv = open(img, 'rb') {|io| io.read}
+                       # Exit the for loop
+                       break
+                   end
+               end
+               rv
+           end
+       }
+
+
+   unset :days
+
+
+
+ === Commands for use with the -e command-line option
+ Most of these commands require you to name a profile on the command
+ line. You can define default profiles with the "default" configuration
+ command.
+
+ If no command is given, "downdiff" is executed.
+
+ add::
+   Add the URLs given on the command line to the quicklist profile.
+   ATTENTION: The following arguments on the command line are URLs, not
+   profile names.
+
+ aggregate::
+   Retrieve information and save changes for later review.
+
+ configuration::
+   Show the fully qualified configuration of each source.
+
+ downdiff::
+   Download and show differences (DEFAULT)
+
+ edit::
+   Edit the profile given on the command line (use vi by default)
+
+ latest::
+   Show the latest copies of the sources from the profiles given
+   on the command line.
+
+ rebuild::
+   Rebuild the latest report.
+
+ review::
+   Review the latest report (just show it with the browser)
+
+ show::
+   Show previously aggregated items. A typical use would be to
+   periodically run in the background a command like
+     websitary -eaggregate newsfeeds
+   and then
+     websitary -eshow newsfeeds
+   to review the changes.
+
+ unroll::
+   Undo the latest fetch.
+
+
+
+ == TIPS:
+ === Ruby
+ The profiles are regular ruby sources that are evaluated in the context
+ of the configuration object (Websitary::Configuration). Find out more
+ about ruby at:
+ * http://www.ruby-lang.org/en/documentation/
+ * http://www.ruby-doc.org/docs/ProgrammingRuby/ (especially the
+   language[http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html]
+   chapter)
+
+
+ === Cygwin
+ Mixing native Windows apps and cygwin apps can cause problems. The
+ following settings (e.g. in ~/.websitary/config.rb) can be used to run
+ a native Windows editor and browser:
+
+   # Use the default Windows programs (as if double-clicked)
+   view '/usr/bin/cygstart "%s"'
+
+   # Translate the profile filename and edit it with a native Windows editor
+   edit 'notepad.exe $(cygpath -w -- "%s")'
+
+   # Rewrite cygwin filenames for use with a native Windows browser
+   option :global, :file_url => lambda {|f| f.sub(/\/cygdrive\/.+?\/.websitary\//, '')}
+
+
+ === Windows
+ Backslashes usually have to be escaped by backslashes -- or use slashes.
+ I.e. instead of 'c:\foo\bar' write either 'c:\\foo\\bar' or
+ 'c:/foo/bar'.
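+
+ Example (the path to the browser is only an illustration):
+   view '"c:\\Program Files\\Mozilla Firefox\\firefox.exe" "%s"'
+   # ... or, using slashes:
+   view '"c:/Program Files/Mozilla Firefox/firefox.exe" "%s"'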
+
+
+ == REQUIREMENTS:
+ websitary is a ruby-based application. You thus need a ruby
+ interpreter.
+
+ Whether you actually need the following libraries and applications
+ depends on how you use websitary.
+
+ By default, this script expects the following applications to be
+ present:
+
+ * diff
+ * vi (or some other editor)
+
+ and one of:
+
+ * w3m[http://w3m.sourceforge.net/] (default)
+ * lynx[http://lynx.isc.org/]
+ * links[http://links.twibright.com/]
+ * websec[http://baruch.ev-en.org/proj/websec/]
+   (or at Savannah[http://savannah.nongnu.org/projects/websec/])
+
+ The use of :webdiff as the :diff application requires
+ websec[http://download.savannah.gnu.org/releases/websec/] to be
+ installed. In conjunction with :body_html, :openuri, or :curl, this
+ will give you colored HTML diffs.
+ You might ask why not simply use +websec+ if it has to be installed
+ anyway. Well, +websec+ is written in perl and I didn't quite manage to
+ make it work the way I wanted. websitary aims to be easier to
+ configure.
+
+ For downloading HTML, you need one of these:
+
+ * open-uri (should be part of ruby)
+ * hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
+   :body_html, :website, and :website_below)
+ * curl[http://curl.haxx.se/]
+ * wget[http://www.gnu.org/software/wget/]
+
+ The following ruby libraries are needed in conjunction with :body_html
+ and :website-related shortcuts:
+
+ * hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
+   only the body etc.)
+ * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+   for parsing robots.txt
+
+ I would personally suggest the following setup:
+
+ * w3m[http://w3m.sourceforge.net/]
+ * websec[http://baruch.ev-en.org/proj/websec/]
+ * hpricot[http://code.whytheluckystiff.net/hpricot]
+ * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+
+
+ == INSTALL:
+ === Use rubygems
+ Run
+
+   gem install websitary
+
+ This will download the package and install it.
+
+
+ === Use the zip
+ The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
+ setup.rb that does the work. Run
+
+   ruby setup.rb
+
+
+ === Initial Configuration
+ Please check the requirements section above and get the extra libraries
+ needed:
+ * hpricot
+ * robot_rules.rb
+
+ These can be installed as follows:
+
+   # Install hpricot
+   gem install hpricot
+
+   # Install robot_rules.rb
+   curl -O http://www.rubyquiz.com/quiz64_sols.zip
+   # Check the correct path to site_ruby first!
+   unzip -p quiz64_sols.zip "solutions/James Edward Gray II/robot_rules.rb" > /lib/ruby/site_ruby/1.8/robot_rules.rb
+   rm quiz64_sols.zip
+
+ You might then want to create a profile ~/.websitary/config.rb that is
+ loaded on every run. In this profile you could set the default output
+ viewer and profile editor, as well as a default profile.
+
+ Example:
+
+   # Load standard.rb if no profile is given on the command line.
+   default 'standard'
+
+   # Use cygwin's cygstart to view the output with the default HTML
+   # viewer
+   view '/usr/bin/cygstart "%s"'
+
+   # Use Windows gvim from cygwin ruby, which is why we convert the path
+   # first
+   edit 'gvim $(cygpath -w -- "%s")'
+
+ Where these configuration files reside may differ. If the environment
+ variable $HOME is defined, the default is $HOME/.websitary/ unless one
+ of the following directories exists, which will then be used instead:
+
+ * $USERPROFILE/websitary (on Windows)
+ * SYSCONFDIR/websitary (where SYSCONFDIR usually is /etc but you can
+   run ruby to find out more:
+   <tt>ruby -e "p Config::CONFIG['sysconfdir']"</tt>)
+
+ If neither directory exists and no $HOME variable is defined, the
+ current directory will be used.
+
+
+ == LICENSE:
+ websitary Webpage Monitor
+ Copyright (C) 2007 Thomas Link
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
+ USA
+
+
+ % vi: ft=rd:tw=72:ts=4