websitiary 0.1.0

Files changed (7)
  1. data/History.txt +4 -0
  2. data/Manifest.txt +6 -0
  3. data/README.txt +474 -0
  4. data/Rakefile +20 -0
  5. data/bin/websitiary +1351 -0
  6. data/setup.rb +1585 -0
  7. metadata +71 -0
data/History.txt ADDED
@@ -0,0 +1,4 @@
1
+ == 0.1.0 / 2007-07-10
2
+
3
+ * Initial release
4
+
data/Manifest.txt ADDED
@@ -0,0 +1,6 @@
1
+ History.txt
2
+ Manifest.txt
3
+ README.txt
4
+ Rakefile
5
+ setup.rb
6
+ bin/websitiary
data/README.txt ADDED
@@ -0,0 +1,474 @@
1
+ websitiary by Thomas Link
2
+ http://rubyforge.org/projects/websitiary/
3
+
4
+ This is a script for monitoring webpages that reuses other programs to
5
+ do the actual work. By default, it works on an ASCII basis, i.e. with
6
+ the output of text-based webbrowsers. With the help of some friends, it
7
+ can also work with HTML.
8
+
9
+
10
+ == DESCRIPTION:
11
+ This is a script for monitoring webpages that reuses other programs
12
+ (w3m, diff, webdiff etc.) to do most of the actual work. By default, it
13
+ works on an ASCII basis, i.e. with the output of text-based webbrowsers
14
+ like w3m (or lynx, links etc.) as the output can easily be
15
+ post-processed. With the help of some friends (see the section below on
16
+ requirements), it can also work with HTML. E.g., if you have websec
17
+ installed, you can also use its webdiff program to show colored diffs.
18
+
19
+ By default, this script will use w3m to dump HTML pages and then run
20
+ diff over the current page and the previous backup. Some pages are
21
+ better viewed with lynx or links. Downloaded documents (HTML or ASCII)
22
+ can be post-processed (e.g., filtered through some ruby block that
23
+ extracts elements via hpricot and the like). Please see the
24
+ configuration options below to find out how to change this globally or
25
+ for a single source.
26
+
27
+ === CAVEAT:
28
+ The script also includes experimental support for monitoring whole
29
+ websites. Basically, this script supports robots.txt directives (see
30
+ requirements) but this is hardly tested and may not work in some cases.
31
+
32
+ While it is okay for your own websites to ignore robots.txt, it is not
33
+ for others. Please make sure that the webpages you run this program on
34
+ allow such a use. Some webpages disallow the use of any automatic
35
+ downloader or offline reader in their user agreements.
36
+
37
+
38
+ == FEATURES/PROBLEMS:
39
+ * Download webpages at defined intervals
40
+ * Compare webpages with previous backups
41
+ * Display differences between the current version and the backup
42
+ * Provide hooks to post-process the downloaded documents and the diff
43
+ * Display a one page report summarizing all news
44
+ * Automatically open the report in your favourite web-browser
45
+ * Quite customizable
46
+
47
+ ISSUES, TODO:
48
+ * Improved support for robots.txt (test it)
49
+ * The use of :website_below and :website is hardly tested (please
50
+ report errors).
51
+ * download => :body_html tries to rewrite references (a, img) which may
52
+ fail on certain kinds of URLs (please report errors).
53
+ * When using :body_html for download, it may happen that some
54
+ JavaScript code is stripped, which breaks some JavaScript-generated
55
+ links.
56
+
57
+
58
+ == SYNOPSIS:
59
+
60
+ === Usage
61
+ Example:
62
+ # Run "profile"
63
+ websitiary profile
64
+
65
+ # Edit "~/.websitiary/profile.rb"
66
+ websitiary --edit=profile
67
+
68
+ # View the latest report
69
+ websitiary --review
70
+
71
+ # Refetch all sources regardless of :days and :hours restrictions
72
+ websitiary -signore_age=true
73
+
74
+ # Create html and rss reports for my websites
75
+ websitiary -fhtml,rss mysites
76
+
77
+ For example output see:
78
+ * html[http://deplate.sourceforge.net/websitiary.html]
79
+ * rss[http://deplate.sourceforge.net/websitiary.rss]
80
+ * text[http://deplate.sourceforge.net/websitiary.txt]
81
+
82
+
83
+ === Configuration
84
+ Profiles are plain ruby files (with the '.rb' suffix) stored in
85
+ ~/.websitiary/.
86
+
87
+ The profile config.rb is always loaded if available.
88
+
89
+
90
+ ==== default 'PROFILE1', 'PROFILE2' ...
91
+ Set the default profile(s).
92
+
93
+ Example:
94
+ default 'my_profile'
95
+
96
+
97
+ ==== diff 'CMD "%s" "%s"'
98
+ Use this shell command to make the diff.
99
+ The two %s placeholders will be replaced with the old and new filenames.
100
+
101
+ diff is used by default.
102
+
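As a rough illustration of the placeholder substitution described for the diff command, here is a minimal Ruby sketch. It assumes the two %s placeholders are filled in the way Kernel#format would fill them, old file first and new file second; the file names are made up for the example, and this is not taken from the websitiary source.

```ruby
# Sketch only: assumes websitiary fills the placeholders like Kernel#format.
diff_cmd = 'diff -u "%s" "%s"'

# Hypothetical file names, used purely for illustration.
old_file = 'backup/page.txt'
new_file = 'latest/page.txt'

# The first %s receives the old file, the second %s the new one.
command = format(diff_cmd, old_file, new_file)
puts command
```

Quoting each %s in the command string, as the README examples do, keeps file names with spaces intact when the command is passed to a shell.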
103
+
104
+ ==== diffprocess lambda {|text| ...}
105
+ Use this ruby snippet to post-process the diff.
106
+
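A diffprocess block could look roughly like the following sketch. It assumes the block is called with the diff as a single string and that its return value replaces the diff text in the report. The helper mark_matches is hypothetical and is not websitiary's own highlighter.

```ruby
# Hypothetical helper (not part of websitiary): builds a lambda that wraps
# every match of rx in <b> tags. Assumes the diff arrives as one string.
def mark_matches(rx)
  lambda { |text| text.gsub(rx) { |m| "<b>#{m}</b>" } }
end

process = mark_matches(/word/i)
puts process.call('A Word and a word')
```

In a profile, such a lambda would presumably be passed as `diffprocess mark_matches(/word/i)` or via the :diffprocess source option.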
107
+
108
+ ==== download 'CMD "%s"'
109
+ Use this shell command to download a page.
110
+ %s will be replaced with the url.
111
+
112
+ w3m is used by default.
113
+
114
+ Example:
115
+ download 'lynx -dump "%s"'
116
+
117
+
118
+ ==== downloadprocess lambda {|text| ...}
119
+ Use this ruby snippet to post-process what was downloaded.
120
+
121
+
122
+ ==== edit 'CMD "%s"'
123
+ Use this shell command to edit a profile. %s will be replaced with the filename.
124
+
125
+ vi is used by default.
126
+
127
+ Example:
128
+ edit 'gvim "%s"&'
129
+
130
+
131
+ ==== option TYPE, OPTION => VALUE
132
+ Set a global option.
133
+
134
+ TYPE can be one of:
135
+ <tt>:diff</tt>::
136
+ Generate a diff
137
+ <tt>:diffprocess</tt>::
138
+ Post-process a diff (if necessary)
139
+ <tt>:format</tt>::
140
+ Format the diff for output
141
+ <tt>:download</tt>::
142
+ Download webpages
143
+ <tt>:downloadprocess</tt>::
144
+ Post-process downloaded webpages
145
+ <tt>:page</tt>::
146
+ The :format field defines the format of the final report. Here VALUE
147
+ is a format string that takes 3 variables as arguments: report title,
148
+ toc, contents.
149
+
150
+ OPTION is a symbol.
151
+
152
+ VALUE is either a format string or a block of code (of class Proc).
153
+
154
+ Example:
155
+ set :download, :foo => lambda {|url| get_url(url)}
156
+
157
+
158
+ ==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
159
+ Set the output format.
160
+ Format can be one of:
161
+
162
+ * html
163
+ * text, txt (this only works with text based downloaders)
164
+ * rss (proof of concept only;
165
+ it requires :rss[:url] to be set to the url, where the rss feed will
166
+ be published, using the <tt>option :rss, :url => URL</tt>
167
+ configuration command; you either have to use a text-based downloader
168
+ or add <tt>:rss_format => 'html'</tt> to the url options)
169
+
170
+
171
+ ==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
172
+ (Un)Set an option for the following source commands.
173
+
174
+ Example:
175
+ set :download, :foo => lambda {|url| get_url(url)}
176
+ set :days => 7, :sort => true
177
+ unset :days, :sort
178
+
179
+
180
+ ==== source URL(S), [OPTIONS]
181
+ Options
182
+
183
+ <tt>:cols => FROM..TO</tt>::
184
+ Use only these columns from the output (used after applying the :lines
185
+ option)
186
+
187
+ <tt>:depth => INTEGER</tt>::
188
+ In conjunction with a :website type of :download option, fetch url up
189
+ to this depth.
190
+
191
+ <tt>:diff => "CMD", :diff => SHORTCUT</tt>::
192
+ Use this command to make the diff for this page. Possible values for
193
+ SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
194
+ :wget, or :body_html). :body_html, :website_below, :website and
195
+ :openuri are synonyms for :webdiff.
196
+
197
+ <tt>:diffprocess => lambda {|text| ...}</tt>::
198
+ Use this ruby snippet to post-process this diff
199
+
200
+ <tt>:download => "CMD", :download => SHORTCUT</tt>::
201
+ Use this command to download this page. For possible values for
202
+ SHORTCUT see the section on shortcuts below.
203
+
204
+ <tt>:downloadprocess => lambda {|text| ...}</tt>::
205
+ Use this ruby snippet to post-process what was downloaded. This is the
206
+ place where, e.g., hpricot can be used to extract certain elements
207
+ from the HTML code.
208
+ Example:
209
+ lambda {|text| Hpricot(text).at('div#content').inner_html}
210
+
211
+ <tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
212
+ The format string for the diff text. The default (the :diff shortcut)
213
+ wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
214
+ :website, and :openuri will simply add a newline character.
215
+
216
+ <tt>:hours => HOURS, :days => DAYS</tt>::
217
+ Don't download the file unless it's older than that
218
+
219
+ <tt>:ignore_age => true</tt>::
220
+ Ignore any :days and :hours settings. This is useful in some cases
221
+ when set on the command line.
222
+
223
+ <tt>:lines => FROM..TO</tt>::
224
+ Use only these lines from the output
225
+
226
+ <tt>:match => REGEXP</tt>::
227
+ When recursively walking a website, follow only links that match this
228
+ regexp.
229
+
230
+ <tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
231
+ Sort lines in output
232
+
233
+ <tt>:strip => true</tt>::
234
+ Strip empty lines
235
+
236
+ <tt>:title => "TEXT"</tt>::
237
+ Display TEXT instead of URL
238
+
239
+ <tt>:use => SYMBOL</tt>::
240
+ Use SYMBOL for any other option. I.e. <tt>:download => :body_html,
241
+ :diff => :webdiff</tt> can be abbreviated as <tt>:use =>
242
+ :body_html</tt> (because for :diff :body_html is a synonym for
243
+ :webdiff).
244
+
245
+
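The :lines and :cols options above take Ruby ranges, so negative indices count from the end (as in <tt>:lines => 10..-1</tt>). The following sketch shows how such ranges select text, assuming the output is split into lines first and :cols is applied after :lines, as the option description says. The method clip is a hypothetical stand-in, not websitiary's implementation.

```ruby
# Hypothetical stand-in for the :lines/:cols handling (not websitiary code).
def clip(text, lines: nil, cols: nil)
  rows = text.split("\n")
  # A range such as 1..-1 keeps everything from the second line onwards.
  rows = rows[lines] || [] if lines
  # Applied after :lines: keep only the given character columns of each row.
  rows = rows.map { |row| row[cols] || '' } if cols
  rows.join("\n")
end

page = "abcdef\nghijkl\nmnopqr"
puts clip(page, lines: 1..-1, cols: 0..2)
```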
246
+ Example configuration file extract:
247
+ source 'URL', :days => 7, :download => :lynx
248
+
249
+ # Daily
250
+ set :days => 1
251
+ source 'http://www.example.com', :use => :body_html,
252
+ :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}
253
+
254
+ # Weekly
255
+ set :days => 7
256
+ source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'
257
+
258
+ # Bi-weekly
259
+ set :days => 14
260
+ source <<URLS
261
+ http://www.example.com
262
+ http://www.example.com/page.html
263
+ URLS
264
+
265
+ # Make HTML diffs and highlight occurrences of a word
266
+ source 'http://www.example.com',
267
+ :title => 'Example',
268
+ :use => :body_html,
269
+ :diffprocess => highlighter(/word/i)
270
+
271
+ # Download the whole website below this path (only pages with
272
+ # html-suffix)
273
+ # Download only php and html pages
274
+ # Follow links 2 levels deep
275
+ source 'http://www.example.com/foo/bar.html',
276
+ :title => 'Example -- Bar', :use => :website_below,
277
+ :match => /\.(php|html)\b/, :depth => 2
278
+
279
+ unset :days
280
+
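To see which links the <tt>:match => /\.(php|html)\b/</tt> pattern from the example above would let through, the regexp can be tried on a few URLs in plain Ruby. The URLs are invented for illustration, and the filtering via select is only a sketch of the idea, not the way websitiary walks a site.

```ruby
match = /\.(php|html)\b/

# Invented URLs, for illustration only.
urls = %w[
  http://www.example.com/foo/bar.html
  http://www.example.com/foo/list.php?page=2
  http://www.example.com/foo/style.css
  http://www.example.com/foo/notes.htmlx
]

# Keep only the URLs the pattern matches. The \b rejects "notes.htmlx",
# because no word boundary follows "html" there.
followed = urls.select { |url| url =~ match }
puts followed
```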
281
+
282
+ ==== view 'CMD "%s"'
283
+ Use this shell command to view the output (usually an HTML file).
284
+ %s will be replaced with the filename.
285
+
286
+ w3m is used by default.
287
+
288
+ Example:
289
+ view 'gnome-open "%s"' # Gnome Desktop
290
+ view 'kfmclient "%s"' # KDE
291
+ view 'cygstart "%s"' # Cygwin
292
+ view 'start "%s"' # Windows
293
+ view 'firefox "%s"'
294
+
295
+
296
+ === Shortcuts for use with :use, :download and other options
297
+ <tt>:w3m</tt>::
298
+ Use w3m for downloading the source. Use diff for generating diffs.
299
+
300
+ <tt>:lynx</tt>::
301
+ Use lynx for downloading the source. Use diff for generating diffs.
302
+
303
+ <tt>:links</tt>::
304
+ Use links for downloading the source. Use diff for generating diffs.
305
+
306
+ <tt>:curl</tt>::
307
+ Use curl for downloading the source. Use webdiff for generating diffs.
308
+
309
+ <tt>:wget</tt>::
310
+ Use wget for downloading the source. Use webdiff for generating diffs.
311
+
312
+ <tt>:openuri</tt>::
313
+ Use open-uri for downloading the source. Use webdiff for generating
314
+ diffs.
315
+
316
+ <tt>:body_html</tt>::
317
+ This requires hpricot to be installed. Use open-uri for downloading
318
+ the source, use only the body. Use webdiff for generating diffs. Try
319
+ to rewrite references (a, img) so that they point to the webpage. By
320
+ default, this will also strip tags like script, form, object ...
321
+
322
+ <tt>:website</tt>::
323
+ Use :body_html to download the source. Follow all links referring to
324
+ the same host with the same file suffix. Use webdiff for generating
325
+ diffs.
326
+
327
+ <tt>:website_below</tt>::
328
+ Use :body_html to download the source. Follow all links referring to
329
+ the same host and a file below the top directory with the same file
330
+ suffix. Use webdiff for generating diffs.
331
+
332
+ <tt>:website_txt</tt>::
333
+ Use :website to download the source but convert the output to plain
334
+ text.
335
+
336
+ <tt>:website_txt_below</tt>::
337
+ Use :website_below to download the source but convert the output to
338
+ plain text.
339
+
340
+ Any shortcuts relying on :body_html will also try to rewrite any
341
+ references so that the links point to the webpage.
342
+
343
+
344
+ == REQUIREMENTS:
345
+ websitiary is a ruby-based application. You thus need a ruby
346
+ interpreter.
347
+
348
+ It depends on how you use websitiary whether you actually need the
349
+ following libraries and applications.
350
+
351
+ By default this script expects the following applications to be
352
+ present:
353
+
354
+ * diff
355
+ * vi (or some other editor)
356
+
357
+ and one of:
358
+
359
+ * w3m[http://w3m.sourceforge.net/] (default)
360
+ * lynx[http://lynx.isc.org/]
361
+ * links[http://links.twibright.com/]
362
+ * websec[http://baruch.ev-en.org/proj/websec/]
363
+ (or at Savannah[http://savannah.nongnu.org/projects/websec/])
364
+
365
+ The use of :webdiff as :diff application requires
366
+ websec[http://download.savannah.gnu.org/releases/websec/] to be
367
+ installed. In conjunction with :body_html, :openuri, or :curl, this
368
+ will give you colored HTML diffs.
369
+ Why not use +websec+ if I have to install it, you might ask. Well,
370
+ +websec+ is written in perl and I didn't quite manage to make it work
371
+ the way I want it to. websitiary is meant to be easier to configure.
372
+
373
+ For downloading HTML, you need one of these:
374
+
375
+ * open-uri (should be part of ruby)
376
+ * hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
377
+ :body_html, :website, and :website_below)
378
+ * curl[http://curl.haxx.se/]
379
+ * wget[http://www.gnu.org/software/wget/]
380
+
381
+ The following ruby libraries are needed in conjunction with :body_html
382
+ and :website related shortcuts:
383
+
384
+ * hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
385
+ only the body etc.)
386
+ * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
387
+ for parsing robots.txt
388
+
389
+ I would personally suggest the following setup:
390
+
391
+ * w3m[http://w3m.sourceforge.net/]
392
+ * websec[http://baruch.ev-en.org/proj/websec/]
393
+ * hpricot[http://code.whytheluckystiff.net/hpricot]
394
+ * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
395
+
396
+
397
+ == INSTALL:
398
+ === Use rubygems
399
+ Run
400
+
401
+ gem install websitiary
402
+
403
+ This will download the package and install it.
404
+
405
+
406
+ === Use the zip
407
+ The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
408
+ setup.rb that does the work. Run
409
+
410
+ ruby setup.rb
411
+
412
+
413
+ === Copy Manually
414
+ Get the single file websitiary[http://rubyforge.org/frs/?group_id=4030]
415
+ script and copy it to some directory in $PATH.
416
+
417
+
418
+ === Initial Configuration
419
+ Please check the requirements section above and get the extra libraries
420
+ needed:
421
+ * hpricot
422
+ * robot_rules.rb
423
+
424
+ You might then want to create a profile ~/.websitiary/config.rb that is
425
+ loaded on every run. In this profile you could set the default output
426
+ viewer and profile editor, as well as a default profile.
427
+
428
+ Example:
429
+
430
+ # Load standard.rb if no profile is given on the command line.
431
+ default 'standard'
432
+
433
+ # Use cygwin's cygstart to view the output with the default HTML
434
+ # viewer
435
+ view '/usr/bin/cygstart "%s"'
436
+
437
+ # Use Windows gvim from cygwin ruby which is why we convert the path
438
+ # first
439
+ edit 'gvim $(cygpath -w -- "%s")'
440
+
441
+ Where these configuration files reside may differ. If the environment
442
+ variable $HOME is defined, the default is $HOME/.websitiary/ unless one
443
+ of the following directories exists, which will then be used instead:
444
+
445
+ * $USERPROFILE/websitiary (on Windows)
446
+ * SYSCONFDIR/websitiary (where SYSCONFDIR usually is /etc but you can
447
+ run ruby to find out more:
448
+ <tt>ruby -e "p Config::CONFIG['sysconfdir']"</tt>)
449
+
450
+ If neither directory exists and no $HOME variable is defined, the
451
+ current directory will be used.
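
The lookup order described above can be sketched as a small Ruby method. This is an illustration under assumptions, not the actual websitiary code: the method name config_dir, the precedence of $USERPROFILE over SYSCONFDIR, and the injectable existence check are all made up for the example. Note that current Ruby versions expose the build configuration as RbConfig::CONFIG rather than Config::CONFIG.

```ruby
require 'rbconfig'

# Hypothetical sketch of the directory lookup (not websitiary's own code).
# env:        hash-like object holding environment variables
# sysconfdir: usually /etc
# exists:     callable that tells whether a directory exists
def config_dir(env = ENV,
               sysconfdir = RbConfig::CONFIG['sysconfdir'],
               exists = File.method(:directory?))
  candidates = []
  # Assumption: $USERPROFILE is checked before SYSCONFDIR.
  candidates << File.join(env['USERPROFILE'], 'websitiary') if env['USERPROFILE']
  candidates << File.join(sysconfdir, 'websitiary') if sysconfdir

  # An existing special directory takes precedence over $HOME/.websitiary.
  found = candidates.find { |dir| exists.call(dir) }
  return found if found

  return File.join(env['HOME'], '.websitiary') if env['HOME']

  # Neither directory exists and $HOME is undefined: use the current directory.
  Dir.pwd
end

puts config_dir
```

Passing the environment and the existence check as arguments keeps the sketch testable without touching the real file system.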
452
+
453
+
454
+ == LICENSE:
455
+ websitiary Webpage Monitor
456
+ Copyright (C) 2007 Thomas Link
457
+
458
+ This program is free software; you can redistribute it and/or modify
459
+ it under the terms of the GNU General Public License as published by
460
+ the Free Software Foundation; either version 2 of the License, or
461
+ (at your option) any later version.
462
+
463
+ This program is distributed in the hope that it will be useful,
464
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
465
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
466
+ GNU General Public License for more details.
467
+
468
+ You should have received a copy of the GNU General Public License
469
+ along with this program; if not, write to the Free Software
470
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
471
+ USA
472
+
473
+
474
+ # vi: ft=rd:tw=72:ts=4