websitary 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
data/History.txt ADDED
@@ -0,0 +1,57 @@
+ == 0.2.0
+
+ * Renamed the project from websitiary to websitary (without the
+   additional "i")
+ * The default output filename is now constructed on the basis of the
+   profile names joined with a comma.
+ * Apply rewrite-rules to URLs in text output.
+ * Set user-agent (:body_html)
+ * Exit with 1 if differences were found
+ * Command line options have slightly changed: -e is now the short form
+   for --execute
+ * Commands that can be triggered by the -e command-line switch: downdiff
+   (default), configuration (list currently configured urls), latest
+   (show the current version of all urls), review (show the latest
+   report)
+ * Protect against filenames being too long (max size can be configured
+   via: <tt>option :global, :filename_size => N</tt>)
+ * Try to migrate local copies from the older flat to the new
+   hierarchical cache layout
+ * Disabled the -E/--edit and --review command-line options (use -e instead)
+ * Try to maintain file atime/mtime when copying/moving files
+ * FIX: Problem with loading robots.txt
+ * Respect meta tag: robots="nofollow" (noindex is only checked in
+   conjunction with :download => :website*)
+ * quicklist profile: register urls via the -eadd command-line switch;
+   see "Usage" for an example
+ * Temporarily save diffs, so that we can reuse them if websitary
+   exits ungracefully.
+ * Renamed :inner_html to :body_html
+ * New shortcuts: :ftp, :ftp_recursive, :img, :rss, :opml (rudimentary)
+ * New experimental commands: aggregate, show ... can be used to
+   periodically check for changes (e.g. of rss feeds) but to review these
+   changes only once in a while
+ * Experimental --timer command-line option to re-run websitary every X
+   seconds.
+ * The :rss differ has an option :rss_enclosure (true or a directory name)
+   that will be used for automatically saving new enclosures (e.g. mp3
+   files in podcasts); in theory, one should thus be able to use
+   websitary as a podcatcher etc.
+ * Cache mtimes in order to reduce disk access.
+ * Special profile "__END__": The section in the script file after the
+   __END__ line. This seems useful in some situations when employing a
+   single script.
+ * Don't follow javascript links.
+ * New date constraints for sources:
+   :daily => true ... Once a day
+   :days_of_month => BEGIN..END ... download URL only once per month
+   within this range of days.
+   :days_of_week => BEGIN..END ... download URL only once per week
+   within this range of days.
+   :months => N (calculated on the basis of the calendar month, not the
+   number of days)
+
+ == 0.1.0 / 2007-07-16
+
+ * Initial release
+
data/Manifest.txt ADDED
@@ -0,0 +1,11 @@
+ History.txt
+ Manifest.txt
+ README.txt
+ Rakefile
+ setup.rb
+ bin/websitary
+ lib/websitary.rb
+ lib/websitary/applog.rb
+ lib/websitary/configuration.rb
+ lib/websitary/filemtimes.rb
+ lib/websitary/htmldiff.rb
data/README.txt ADDED
@@ -0,0 +1,732 @@
+ websitary by Thomas Link
+ http://rubyforge.org/projects/websitiary/
+
+ This script monitors webpages, rss feeds, podcasts etc. and reports
+ what's new. For many tasks, it reuses other programs to do the actual
+ work. By default, it works on an ASCII basis, i.e. with the output of
+ text-based webbrowsers. With the help of some friends, it can also work
+ with HTML.
+
+
+ == DESCRIPTION:
+ websitary (formerly known as websitiary with an extra "i") monitors
+ webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff,
+ webdiff etc.) to do most of the actual work. By default, it works on an
+ ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or
+ lynx, links etc.) as the output can easily be post-processed. With the
+ help of some friends (see the section below on requirements), it can
+ also work with HTML. E.g., if you have websec installed, you can also
+ use its webdiff program to show colored diffs. This script was
+ originally planned as a ruby-based websec replacement. For HTML diffs,
+ it still relies on the webdiff perl script that comes with websec.
+
+ By default, this script will use w3m to dump HTML pages and then run
+ diff over the current page and the previous backup. Some pages are
+ better viewed with lynx or links. Downloaded documents (HTML or ASCII)
+ can be post-processed (e.g., filtered through some ruby block that
+ extracts elements via hpricot and the like). Please see the
+ configuration options below to find out how to change this globally or
+ for a single source.
+
+
+ == FEATURES/PROBLEMS:
+ * Handle webpages, rss feeds (optionally save attachments in podcasts
+   etc.)
+ * Compare webpages with previous backups
+ * Display differences between the current version and the backup
+ * Provide hooks to post-process the downloaded documents and the diff
+ * Display a one-page report summarizing all news
+ * Automatically open the report in your favourite web-browser
+ * Experimental: Download webpages at defined intervals and generate
+   incremental diffs.
+
+ ISSUES, TODO:
+ * With HTML output, changes are presented on one single page, which
+   means that pages with different encodings cause problems.
+ * Improved support for robots.txt (test it)
+ * The use of :website_below and :website is hardly tested (please
+   report errors).
+ * :download => :body_html tries to rewrite references (a, img), which
+   may fail on certain kinds of urls (please report errors).
+ * When using :body_html for download, it may happen that some
+   JavaScript code is stripped, which breaks some JavaScript-generated
+   links.
+ * The --log command-line option will create a new instance of the
+   logger and thus reset any previous options related to the logging
+   level.
+
+ NOTE: The script was previously called websitiary but was renamed (from
+ 0.2 on) to websitary (without the superfluous i).
+
+
+ === CAVEAT:
+ The script also includes experimental support for monitoring whole
+ websites. Basically, this script supports robots.txt directives (see
+ requirements) but this is hardly tested and may not work in some cases.
+
+ While it is okay to ignore robots.txt for your own websites, it is not
+ okay for other people's. Please make sure that the webpages you run
+ this program on allow such use. Some websites disallow the use of any
+ automatic downloader or offline reader in their user agreements.
+
+
+ == SYNOPSIS:
+ This manual is also available as
+ PDF[http://websitiary.rubyforge.org/websitary.pdf].
+
+ === Usage
+ Example:
+   # Run "profile"
+   websitary profile
+
+   # Edit "~/.websitary/profile.rb"
+   websitary --edit=profile
+
+   # View the latest report
+   websitary -ereview
+
+   # Refetch all sources regardless of :days and :hours restrictions
+   websitary -signore_age=true
+
+   # Create html and rss reports for my websites
+   websitary -fhtml,rss mysites
+
+   # Add an url to the quicklist profile
+   websitary -eadd http://www.example.com
+
+ For example output see:
+ * html[http://deplate.sourceforge.net/websitary.html]
+ * rss[http://deplate.sourceforge.net/websitary.rss]
+ * text[http://deplate.sourceforge.net/websitary.txt]
+
+
+ === Configuration
+ Profiles are plain ruby files (with the '.rb' suffix) stored in
+ ~/.websitary/.
+
+ The profile "config" (~/.websitary/config.rb) is always loaded if
+ available.
+
+ There are two special profile names:
+
+ -::
+   Read URLs from STDIN.
+ <tt>__END__</tt>::
+   Read the profile contained in the script source after the __END__
+   line.
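+
+ For instance, the "-" profile can be fed a plain list of URLs on STDIN
+ (a minimal sketch; urls.txt is just a placeholder for any file with one
+ URL per line):
+
+   cat urls.txt | websitary -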
+
+
+ ==== default 'PROFILE1', 'PROFILE2' ...
+ Set the default profile(s). The default is: quicklist
+
+ Example:
+   default 'my_profile'
+
+
+ ==== diff 'CMD "%s" "%s"'
+ Use this shell command to make the diff.
+ The two %s placeholders will be replaced with the old and the new
+ filename.
+
+ diff is used by default.
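+
+ Example (a sketch; any diff-like command that accepts two filenames
+ will do):
+   diff 'diff -b -u "%s" "%s"'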
+
+
+ ==== diffprocess lambda {|text| ...}
+ Use this ruby snippet to post-process the diff.
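+
+ Example (a minimal sketch that drops diff lines containing a
+ timestamp-like string, so that pages which merely embed the current
+ time don't show up as changed):
+   diffprocess lambda {|text|
+     text.split("\n").reject {|line| line =~ /\d\d:\d\d:\d\d/}.join("\n")
+   }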
+
+
+ ==== download 'CMD "%s"'
+ Use this shell command to download a page.
+ %s will be replaced with the url.
+
+ w3m is used by default.
+
+ Example:
+   download 'lynx -dump "%s"'
+
+
+ ==== downloadprocess lambda {|text| ...}
+ Use this ruby snippet to post-process what was downloaded. Return the
+ new text.
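+
+ Example (a sketch that keeps only a page's main content via hpricot;
+ the div#content selector is just a placeholder):
+   downloadprocess lambda {|text| Hpricot(text).at('div#content').inner_html}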
+
+
+ ==== edit 'CMD "%s"'
+ Use this shell command to edit a profile. %s will be replaced with the
+ filename.
+
+ vi is used by default.
+
+ Example:
+   edit 'gvim "%s"&'
+
+
+ ==== option TYPE, OPTION => VALUE
+ Set a global option.
+
+ TYPE can be one of:
+ <tt>:diff</tt>::
+   Generate a diff
+ <tt>:diffprocess</tt>::
+   Post-process a diff (if necessary)
+ <tt>:format</tt>::
+   Format the diff for output
+ <tt>:download</tt>::
+   Download webpages
+ <tt>:downloadprocess</tt>::
+   Post-process downloaded webpages
+ <tt>:page</tt>::
+   The :format field defines the format of the final report. Here VALUE
+   is a format string that takes 3 variables as arguments: report title,
+   toc, contents.
+ <tt>:global</tt>::
+   Set a "global" option.
+
+ OPTION is a symbol.
+
+ VALUE is either a format string or a block of code (of class Proc).
+
+ Example:
+   option :download, :foo => lambda {|url| get_url(url)}
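+
+ A rough sketch for the :page type described above (treating the
+ :format field as the option name and using three plain %s placeholders
+ for the report title, the toc and the contents -- both are assumptions):
+   option :page, :format => '<html><head><title>%s</title></head><body>%s %s</body></html>'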
+
+
+ ==== global OPTION => VALUE
+ This is the same as <tt>option :global, OPTION => VALUE</tt>.
+
+ Known global options:
+
+ <tt>:filename_size => N</tt>::
+   The max filename size. If a filename becomes longer, md5 encoding
+   will be used for local copies in the cache.
+
+ <tt>:downloadhtml => SHORTCUT</tt>::
+   The default shortcut for downloading plain HTML.
+
+ <tt>:file_url => BLOCK(FILENAME)</tt>::
+   Rewrite a filename as it is used for creating file urls to local
+   copies in the output. This may be useful if you want to use the same
+   repository on several computers in different locations etc.
+
+ <tt>:canonic_filename => BLOCK(FILENAME)</tt>::
+   Rewrite filenames as they are stored in the mtimes register. This may
+   be useful if you want to use the same repository on several computers
+   in different locations etc.
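+
+ Example (a sketch; the size limit and the rewrite rule are only
+ placeholders):
+   global :filename_size => 255
+   global :file_url => lambda {|f| f.sub(/^\/home\/me/, '')}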
+
+
+ ==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
+ Set the output format.
+ FORMAT can be one of:
+
+ * html
+ * text, txt (this only works with text-based downloaders)
+ * rss (proof of concept only; it requires :rss[:url] to be set to the
+   url where the rss feed will be published, using the
+   <tt>option :rss, :url => URL</tt> configuration command; you either
+   have to use a text-based downloader or add
+   <tt>:rss_format => 'html'</tt> to the url options)
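+
+ Example (a sketch using the array form described above; the feed url
+ is a placeholder):
+   output_format ['html', 'rss']
+   option :rss, :url => 'http://www.example.com/websitary.rss'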
+
+
+ ==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
+ (Un)Set an option for the following source commands.
+
+ Example:
+   set :download, :foo => lambda {|url| get_url(url)}
+   set :days => 7, :sort => true
+   unset :days, :sort
+
+
+ ==== source URL(S), [OPTIONS]
+ Options:
+
+ <tt>:cols => FROM..TO</tt>::
+   Use only these columns from the output (used after applying the
+   :lines option)
+
+ <tt>:depth => INTEGER</tt>::
+   In conjunction with a :website type of :download option, fetch urls
+   up to this depth.
+
+ <tt>:diff => "CMD", :diff => SHORTCUT</tt>::
+   Use this command to make the diff for this page. Possible values for
+   SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
+   :wget, or :body_html). :body_html, :website_below, :website and
+   :openuri are synonyms for :webdiff.
+
+ <tt>:diffprocess => lambda {|text| ...}</tt>::
+   Use this ruby snippet to post-process this diff.
+
+ <tt>:download => "CMD", :download => SHORTCUT</tt>::
+   Use this command to download this page. For possible values for
+   SHORTCUT see the section on shortcuts below.
+
+ <tt>:downloadprocess => lambda {|text| ...}</tt>::
+   Use this ruby snippet to post-process what was downloaded. This is
+   the place where, e.g., hpricot can be used to extract certain
+   elements from the HTML code.
+   Example:
+     lambda {|text| Hpricot(text).at('div#content').inner_html}
+
+ <tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
+   The format string for the diff text. The default (the :diff shortcut)
+   wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
+   :website, and :openuri will simply add a newline character.
+
+ <tt>:hours => HOURS, :days => DAYS</tt>::
+   Don't download the file unless it's older than that.
+
+ <tt>:days_of_month => DAY..DAY, :mdays => DAY..DAY</tt>::
+   Download only once per month within a certain range of days (e.g.,
+   15..31 ... Check once after the 15th). The argument can also be an
+   array (e.g., [1, 15]) or an integer.
+
+ <tt>:days_of_week => DAY..DAY, :wdays => DAY..DAY</tt>::
+   Download only once per week within a certain range of days (e.g.,
+   1..2 ... Check once on monday or tuesday; sunday = 0). The argument
+   can also be an array (e.g., [1, 2]) or an integer.
+
+ <tt>:daily => true</tt>::
+   Download only once a day.
+
+ <tt>:ignore_age => true</tt>::
+   Ignore any :days and :hours settings. This is useful in some cases
+   when set on the command line.
+
+ <tt>:lines => FROM..TO</tt>::
+   Use only these lines from the output
+
+ <tt>:match => REGEXP</tt>::
+   When recursively walking a website, follow only links that match this
+   regexp.
+
+ <tt>:rss_rewrite_enclosed_urls => true</tt>::
+   If true, replace urls in the rss feed item description pointing to
+   the enclosure with a file url pointing to the local copy.
+
+ <tt>:rss_enclosure => true|"DIRECTORY"</tt>::
+   If true, save rss feed enclosures in
+   "~/.websitary/attachments/RSS_FEED_NAME/". If a string, use this as
+   the destination directory.
+
+ <tt>:rss_format (default: "plain_text")</tt>::
+   When the output format is :rss, create rss item descriptions as
+   plain text.
+
+ <tt>:show_initial => true</tt>::
+   Include initial copies in the report (may not always work properly).
+   This can also be set as a global option.
+
+ <tt>:sleep => SECS</tt>::
+   Wait SECS seconds (float or integer) before downloading the page.
+
+ <tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
+   Sort lines in the output
+
+ <tt>:strip => true</tt>::
+   Strip empty lines
+
+ <tt>:title => "TEXT"</tt>::
+   Display TEXT instead of the URL
+
+ <tt>:use => SYMBOL</tt>::
+   Use SYMBOL for any other option. I.e. <tt>:download => :body_html,
+   :diff => :webdiff</tt> can be abbreviated as <tt>:use =>
+   :body_html</tt> (because for :diff, :body_html is a synonym for
+   :webdiff).
+
+ The order of age constraints is:
+ :hours > :daily > :wdays > :mdays > :days > :months.
+ I.e. if :wdays is set, :mdays, :days, and :months are ignored.
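+
+ Example (a sketch combining several of the options above; the url and
+ the values are placeholders):
+   source 'http://www.example.com/news.html',
+     :title => 'Example News',
+     :use => :body_html,
+     :days => 1,
+     :sleep => 10,
+     :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}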
+
+
+ ==== view 'CMD "%s"'
+ Use this shell command to view the output (usually an HTML file).
+ %s will be replaced with the filename.
+
+ w3m is used by default.
+
+ Example:
+   view 'gnome-open "%s"' # Gnome Desktop
+   view 'kfmclient "%s"'  # KDE
+   view 'cygstart "%s"'   # Cygwin
+   view 'start "%s"'      # Windows
+   view 'firefox "%s"'
+
+
+ === Shortcuts for use with :use, :download and other options
+ <tt>:w3m</tt>::
+   Use w3m for downloading the source. Use diff for generating diffs.
+
+ <tt>:lynx</tt>::
+   Use lynx for downloading the source. Use diff for generating diffs.
+   Lynx doesn't try to recreate the layout of a page like w3m or links
+   do. As a result the output IMHO sometimes deviates from the original
+   design but is better suited for being post-processed in some
+   situations.
+
+ <tt>:links</tt>::
+   Use links for downloading the source. Use diff for generating diffs.
+
+ <tt>:curl</tt>::
+   Use curl for downloading the source. Use webdiff for generating
+   diffs.
+
+ <tt>:wget</tt>::
+   Use wget for downloading the source. Use webdiff for generating
+   diffs.
+
+ <tt>:openuri</tt>::
+   Use open-uri for downloading the source. Use webdiff for generating
+   diffs. This doesn't handle cookies and the like.
+
+ <tt>:text</tt>::
+   This requires hpricot to be installed. Use open-uri for downloading
+   and hpricot for converting HTML to plain text. This still requires
+   diff as an external helper.
+
+ <tt>:body_html</tt>::
+   This requires hpricot to be installed. Use open-uri for downloading
+   the source, use only the body. Use webdiff for generating diffs. Try
+   to rewrite references (a, img) so that they point to the webpage. By
+   default, this will also strip tags like script, form, object ...
+
+ <tt>:website</tt>::
+   Use :body_html to download the source. Follow all links referring to
+   the same host with the same file suffix. Use webdiff for generating
+   diffs.
+
+ <tt>:website_below</tt>::
+   Use :body_html to download the source. Follow all links referring to
+   the same host and a file below the top directory with the same file
+   suffix. Use webdiff for generating diffs.
+
+ <tt>:website_txt</tt>::
+   Use :website to download the source but convert the output to plain
+   text.
+
+ <tt>:website_txt_below</tt>::
+   Use :website_below to download the source but convert the output to
+   plain text.
+
+ <tt>:rss</tt>::
+   Download an rss feed, show changed items.
+
+ <tt>:opml</tt>::
+   Experimental. Download the rss feeds registered in an opml file. No
+   support for atom yet.
+
+ <tt>:img</tt>::
+   Download an image and display it in the output if it has changed
+   (according to diff). You can use hpricot to extract an image from an
+   HTML source, as sketched in the example below.
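+
+   A minimal sketch (the url and the alt text are placeholders; the
+   daily-image entry in the example configuration below shows a fully
+   commented version):
+     source 'http://www.example.com/daily_image/', :use => :img,
+       :download => lambda {|url|
+         doc = Hpricot(open(url) {|io| io.read})
+         img = doc.at('img[@alt="Current Image"]')
+         # Fetch the image data itself if a matching img tag was found
+         img && open(rewrite_href(img['src'], url)) {|io| io.read}
+       }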
+
+ Any shortcuts relying on :body_html will also try to rewrite any
+ references so that the links point to the webpage.
+
+
+
+ === Example configuration file for demonstration purposes
+
+   # Daily
+   set :days => 1
+
+   # Use lynx instead of the default downloader (w3m).
+   source 'http://www.example.com', :days => 7, :download => :lynx
+
+   # Use the HTML body and process via webdiff.
+   source 'http://www.example.com', :use => :body_html,
+     :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}
+
+   # Download a podcast
+   source 'http://www.example.com/podcast.xml', :title => 'Podcast',
+     :use => :rss,
+     :rss_enclosure => '/home/me/podcasts/example'
+
+   # Check an rss feed.
+   source 'http://www.example.com/news.xml', :title => 'News', :use => :rss
+
+   # Get rss feed info from an opml file (EXPERIMENTAL).
+   # @cfgdir is most likely '~/.websitary'.
+   source File.join(@cfgdir, 'news.opml'), :use => :opml
+
+
+   # Weekly
+   set :days => 7
+
+   # Consider the page body only from the 10th line downwards.
+   source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'
+
+
+   # Bi-weekly
+   set :days => 14
+
+   # Use these urls with the default options.
+   source <<URLS
+   http://www.example.com
+   http://www.example.com/page.html
+   URLS
+
+   # Make HTML diffs and highlight occurrences of a word
+   source 'http://www.example.com',
+     :title => 'Example',
+     :use => :body_html,
+     :diffprocess => highlighter(/word/i)
+
+   # Download the whole website below this path (only pages with
+   # html-suffix), wait 30 secs between downloads.
+   # Download only php and html pages
+   # Follow links 2 levels deep
+   source 'http://www.example.com/foo/bar.html',
+     :title => 'Example -- Bar',
+     :use => :website_below, :sleep => 30,
+     :match => /\.(php|html)\b/, :depth => 2
+
+   # Download images from some kind of daily-image site (check the user
+   # agreement first to see if this is allowed). This may require some
+   # ruby hacking in order to extract the right url.
+   source 'http://www.example.com/daily_image/', :title => 'Daily Image',
+     :use => :img,
+     :download => lambda {|url|
+       # Read the HTML.
+       html = open(url) {|io| io.read}
+       # This check is probably unnecessary as the failure to read
+       # the HTML document would most likely result in an
+       # exception.
+       if html
+         rv = nil
+         # Parse the HTML document.
+         doc = Hpricot(html)
+         # The following could actually be simplified using xpath
+         # or css search expressions. This isn't the most elegant
+         # solution but it works with any value of ALT.
+         # This downloads the image <img src="..." alt="Current Image">
+         # Check all img tags in the HTML document.
+         for e in doc.search(%{//img})
+           # Is this the image we're looking for?
+           if e['alt'] == "Current Image"
+             # Make relative urls absolute
+             img = rewrite_href(e['src'], url)
+             # Get the actual image data
+             rv = open(img, 'rb') {|io| io.read}
+             # Exit the for loop
+             break
+           end
+         end
+         rv
+       end
+     }
+
+
+   unset :days
+
+
+
+ === Commands for use with the -e command-line option
+ Most of these commands require you to name a profile on the command
+ line. You can define default profiles with the "default" configuration
+ command.
+
+ If no command is given, "downdiff" is executed.
+
+ add::
+   Add the URLs given on the command line to the quicklist profile.
+   ATTENTION: The following arguments on the command line are URLs, not
+   profile names.
+
+ aggregate::
+   Retrieve information and save changes for later review.
+
+ configuration::
+   Show the fully qualified configuration of each source.
+
+ downdiff::
+   Download and show differences (DEFAULT)
+
+ edit::
+   Edit the profile given on the command line (vi is used by default)
+
+ latest::
+   Show the latest copies of the sources from the profiles given
+   on the command line.
+
+ rebuild::
+   Rebuild the latest report.
+
+ review::
+   Review the latest report (just show it with the browser)
+
+ show::
+   Show previously aggregated items. A typical use would be to
+   periodically run in the background a command like
+     websitary -eaggregate newsfeeds
+   and then
+     websitary -eshow newsfeeds
+   to review the changes.
+
+ unroll::
+   Undo the latest fetch.
+
+
+
+ == TIPS:
+ === Ruby
+ The profiles are regular ruby sources that are evaluated in the context
+ of the configuration object (Websitary::Configuration). Find out more
+ about ruby at:
+ * http://www.ruby-lang.org/en/documentation/
+ * http://www.ruby-doc.org/docs/ProgrammingRuby/ (especially the
+   language[http://www.ruby-doc.org/docs/ProgrammingRuby/html/language.html]
+   chapter)
+
+
+ === Cygwin
+ Mixing native Windows apps and cygwin apps can cause problems. The
+ following settings (e.g. in ~/.websitary/config.rb) can be used to use
+ a native Windows editor and browser:
+
+   # Use the default Windows programs (as if double-clicked)
+   view '/usr/bin/cygstart "%s"'
+
+   # Translate the profile filename and edit it with a native Windows editor
+   edit 'notepad.exe $(cygpath -w -- "%s")'
+
+   # Rewrite cygwin filenames for use with a native Windows browser
+   option :global, :file_url => lambda {|f| f.sub(/\/cygdrive\/.+?\/.websitary\//, '')}
+
+
+ === Windows
+ Backslashes usually have to be escaped by backslashes -- or use slashes.
+ I.e. instead of 'c:\foo\bar' write either 'c:\\foo\\bar' or
+ 'c:/foo/bar'.
+
+
+ == REQUIREMENTS:
+ websitary is a ruby-based application. You thus need a ruby
+ interpreter.
+
+ It depends on how you use websitary whether you actually need the
+ following libraries and applications.
+
+ By default this script expects the following applications to be
+ present:
+
+ * diff
+ * vi (or some other editor)
+
+ and one of:
+
+ * w3m[http://w3m.sourceforge.net/] (default)
+ * lynx[http://lynx.isc.org/]
+ * links[http://links.twibright.com/]
+ * websec[http://baruch.ev-en.org/proj/websec/]
+   (or at Savannah[http://savannah.nongnu.org/projects/websec/])
+
+ The use of :webdiff as the :diff application requires
+ websec[http://download.savannah.gnu.org/releases/websec/] to be
+ installed. In conjunction with :body_html, :openuri, or :curl, this
+ will give you colored HTML diffs.
+ Why not use +websec+ if I have to install it, you might ask. Well,
+ +websec+ is written in perl and I didn't quite manage to make it work
+ the way I want it to. websitary is meant to be easier to configure.
+
+ For downloading HTML, you need one of these:
+
+ * open-uri (should be part of ruby)
+ * hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
+   :body_html, :website, and :website_below)
+ * curl[http://curl.haxx.se/]
+ * wget[http://www.gnu.org/software/wget/]
+
+ The following ruby libraries are needed in conjunction with :body_html
+ and the :website-related shortcuts:
+
+ * hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
+   only the body etc.)
+ * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+   for parsing robots.txt
+
+ I would personally suggest the following setup:
+
+ * w3m[http://w3m.sourceforge.net/]
+ * websec[http://baruch.ev-en.org/proj/websec/]
+ * hpricot[http://code.whytheluckystiff.net/hpricot]
+ * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+
+
+ == INSTALL:
+ === Use rubygems
+ Run
+
+   gem install websitary
+
+ This will download the package and install it.
+
+
+ === Use the zip
+ The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
+ setup.rb that does the work. Run
+
+   ruby setup.rb
+
+
+ === Initial Configuration
+ Please check the requirements section above and get the extra libraries
+ needed:
+ * hpricot
+ * robot_rules.rb
+
+ These can be installed like this:
+
+   # Install hpricot
+   gem install hpricot
+
+   # Install robot_rules.rb
+   curl -O http://www.rubyquiz.com/quiz64_sols.zip
+   # Check the correct path to site_ruby first!
+   unzip -p quiz64_sols.zip "solutions/James Edward Gray II/robot_rules.rb" > /lib/ruby/site_ruby/1.8/robot_rules.rb
+   rm quiz64_sols.zip
+
+ You might then want to create a profile ~/.websitary/config.rb that is
+ loaded on every run. In this profile you could set the default output
+ viewer and profile editor, as well as a default profile.
+
+ Example:
+
+   # Load standard.rb if no profile is given on the command line.
+   default 'standard'
+
+   # Use cygwin's cygstart to view the output with the default HTML
+   # viewer
+   view '/usr/bin/cygstart "%s"'
+
+   # Use Windows gvim from cygwin ruby, which is why we convert the path
+   # first
+   edit 'gvim $(cygpath -w -- "%s")'
+
+ Where these configuration files reside may differ. If the environment
+ variable $HOME is defined, the default is $HOME/.websitary/ unless one
+ of the following directories exists, which will then be used instead:
+
+ * $USERPROFILE/websitary (on Windows)
+ * SYSCONFDIR/websitary (where SYSCONFDIR usually is /etc but you can
+   run ruby to find out more:
+   <tt>ruby -e "p Config::CONFIG['sysconfdir']"</tt>)
+
+ If neither directory exists and no $HOME variable is defined, the
+ current directory will be used.
+
+
+ == LICENSE:
+ websitary Webpage Monitor
+ Copyright (C) 2007 Thomas Link
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
+ USA
+
+
+ % vi: ft=rd:tw=72:ts=4