websitiary 0.1.0

Files changed (7)
  1. data/History.txt +4 -0
  2. data/Manifest.txt +6 -0
  3. data/README.txt +474 -0
  4. data/Rakefile +20 -0
  5. data/bin/websitiary +1351 -0
  6. data/setup.rb +1585 -0
  7. metadata +71 -0
data/History.txt ADDED
@@ -0,0 +1,4 @@
1
+ == 0.1.0 / 2007-07-10
2
+
3
+ * Initial release
4
+
data/Manifest.txt ADDED
@@ -0,0 +1,6 @@
1
+ History.txt
2
+ Manifest.txt
3
+ README.txt
4
+ Rakefile
5
+ setup.rb
6
+ bin/websitiary
data/README.txt ADDED
@@ -0,0 +1,474 @@
1
+ websitiary by Thomas Link
2
+ http://rubyforge.org/projects/websitiary/
3
+
4
+ This is a script for monitoring webpages that reuses other programs to
5
+ do the actual work. By default, it works on an ASCII basis, i.e. with
6
+ the output of text-based webbrowsers. With the help of some friends, it
7
+ can also work with HTML.
8
+
9
+
10
+ == DESCRIPTION:
11
+ This is a script for monitoring webpages that reuses other programs
12
+ (w3m, diff, webdiff etc.) to do most of the actual work. By default, it
13
+ works on an ASCII basis, i.e. with the output of text-based webbrowsers
14
+ like w3m (or lynx, links etc.) as the output can easily be
15
+ post-processed. With the help of some friends (see the section below on
16
+ requirements), it can also work with HTML. E.g., if you have websec
17
+ installed, you can also use its webdiff program to show colored diffs.
18
+
19
+ By default, this script will use w3m to dump HTML pages and then run
20
+ diff over the current page and the previous backup. Some pages are
21
+ better viewed with lynx or links. Downloaded documents (HTML or ASCII)
22
+ can be post-processed (e.g., filtered through some ruby block that
23
+ extracts elements via hpricot and the like). Please see the
24
+ configuration options below to find out how to change this globally or
25
+ for a single source.
26
+
27
+ === CAVEAT:
28
+ The script also includes experimental support for monitoring whole
29
+ websites. Basically, this script supports robots.txt directives (see
30
+ requirements) but this is hardly tested and may not work in some cases.
31
+
32
+ While it is okay for your own websites to ignore robots.txt, it is not
33
+ for others. Please make sure that the webpages you run this program on
34
+ allow such a use. Some webpages disallow the use of any automatic
35
+ downloader or offline reader in their user agreements.
36
+
37
+
38
+ == FEATURES/PROBLEMS:
39
+ * Download webpages at defined intervals
40
+ * Compare webpages with previous backups
41
+ * Display differences between the current version and the backup
42
+ * Provide hooks to post-process the downloaded documents and the diff
43
+ * Display a one page report summarizing all news
44
+ * Automatically open the report in your favourite web-browser
45
+ * Quite customizable
46
+
47
+ ISSUES, TODO:
48
+ * Improved support for robots.txt (test it)
49
+ * The use of :website_below and :website is hardly tested (please
50
+ report errors).
51
+ * download => :body_html tries to rewrite references (a, img) which may
52
+ fail on certain kinds of URLs (please report errors).
53
+ * When using :body_html for download, it may happen that some
54
+ JavaScript code is stripped, which breaks some JavaScript-generated
55
+ links.
56
+
57
+
58
+ == SYNOPSIS:
59
+
60
+ === Usage
61
+ Example:
62
+ # Run "profile"
63
+ websitiary profile
64
+
65
+ # Edit "~/.websitiary/profile.rb"
66
+ websitiary --edit=profile
67
+
68
+ # View the latest report
69
+ websitiary --review
70
+
71
+ # Refetch all sources regardless of :days and :hours restrictions
72
+ websitiary -signore_age=true
73
+
74
+ # Create html and rss reports for my websites
75
+ websitiary -fhtml,rss mysites
76
+
77
+ For example output see:
78
+ * html[http://deplate.sourceforge.net/websitiary.html]
79
+ * rss[http://deplate.sourceforge.net/websitiary.rss]
80
+ * text[http://deplate.sourceforge.net/websitiary.txt]
81
+
82
+
83
+ === Configuration
84
+ Profiles are plain ruby files (with the '.rb' suffix) stored in
85
+ ~/.websitiary/.
86
+
87
+ The profile config.rb is always loaded if available.
88
+
89
+
90
+ ==== default 'PROFILE1', 'PROFILE2' ...
91
+ Set the default profile(s).
92
+
93
+ Example:
94
+ default 'my_profile'
95
+
96
+
97
+ ==== diff 'CMD "%s" "%s"'
98
+ Use this shell command to make the diff.
99
+ %s %s will be replaced with the old and new filename.
100
+
101
+ diff is used by default.
102
+
103
+
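The placeholder substitution described above can be sketched in plain Ruby. This is only an illustration of how a two-placeholder template might be filled; the helper name build_diff_command and the file names are hypothetical and not taken from the websitiary source.

```ruby
# Hypothetical helper (not part of websitiary): fill a diff command
# template with the old and the new file name, in that order.
def build_diff_command(template, old_file, new_file)
  # Kernel#format replaces the two %s placeholders from left to right.
  format(template, old_file, new_file)
end

cmd = build_diff_command('diff -u "%s" "%s"', 'page.old.txt', 'page.new.txt')
puts cmd
```

Note that a literal percent sign in such a template would have to be written as %% when Kernel#format is used.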
104
+ ==== diffprocess lambda {|text| ...}
105
+ Use this ruby snippet to post-process the diff.
106
+
107
+
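As a rough sketch of what a diffprocess block could look like: the lambda below keeps only the added and removed lines of a classic diff and marks a keyword. It assumes that the block receives the diff as one string and that its return value replaces the diff text; the name keep_changes and the sample input are made up for illustration.

```ruby
# Hypothetical diffprocess block: keep only the "<" and ">" lines of a
# classic diff and mark every occurrence of a keyword.
keep_changes = lambda do |text|
  text.lines
      .select { |line| line.start_with?('<', '>') }
      .map { |line| line.gsub(/release/i) { |word| "**#{word}**" } }
      .join
end

sample = "3c3\n< old release notes\n---\n> new Release notes\n"
puts keep_changes.call(sample)
```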
108
+ ==== download 'CMD "%s"'
109
+ Use this shell command to download a page.
110
+ %s will be replaced with the url.
111
+
112
+ w3m is used by default.
113
+
114
+ Example:
115
+ download 'lynx -dump "%s"'
116
+
117
+
118
+ ==== downloadprocess lambda {|text| ...}
119
+ Use this ruby snippet to post-process what was downloaded.
120
+
121
+
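A downloadprocess block does not need an HTML parser. The following sketch removes lines that change on every fetch so that they do not show up as differences. It assumes the block is called with the downloaded text as a string; the name drop_volatile and the patterns are illustrative only.

```ruby
# Hypothetical downloadprocess block: drop lines that differ on every
# fetch (timestamps and similar) before the text is compared.
drop_volatile = lambda do |text|
  text.lines
      .reject { |line| line =~ /\A\s*(Last updated|Page generated)/i }
      .join
end

sample = "Welcome\nLast updated: 2007-07-10 12:00\nNews item\n"
puts drop_volatile.call(sample)
```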
122
+ ==== edit 'CMD "%s"'
123
+ Use this shell command to edit a profile. %s will be replaced with the filename.
124
+
125
+ vi is used by default.
126
+
127
+ Example:
128
+ edit 'gvim "%s"&'
129
+
130
+
131
+ ==== option TYPE, OPTION => VALUE
132
+ Set a global option.
133
+
134
+ TYPE can be one of:
135
+ <tt>:diff</tt>::
136
+ Generate a diff
137
+ <tt>:diffprocess</tt>::
138
+ Post-process a diff (if necessary)
139
+ <tt>:format</tt>::
140
+ Format the diff for output
141
+ <tt>:download</tt>::
142
+ Download webpages
143
+ <tt>:downloadprocess</tt>::
144
+ Post-process downloaded webpages
145
+ <tt>:page</tt>::
146
+ The :format field defines the format of the final report. Here VALUE
147
+ is a format string that takes 3 variables as arguments: report title,
148
+ toc, contents.
149
+
150
+ OPTION is a symbol.
151
+
152
+ VALUE is either a format string or a block of code (of class Proc).
153
+
154
+ Example:
155
+ set :download, :foo => lambda {|url| get_url(url)}
156
+
157
+
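To illustrate the three-variable format string mentioned for :page, here is a minimal sketch in plain Ruby. The template, the variable names and the sample values are assumptions; only the order of the arguments (report title, toc, contents) comes from the description above.

```ruby
# Hypothetical page template with three %s placeholders:
# report title, table of contents, report contents.
page_template = "<html><head><title>%s</title></head>\n" \
                "<body>\n<ul>%s</ul>\n%s\n</body></html>"

title    = 'websitiary report'
toc      = '<li><a href="#s1">Example</a></li>'
contents = '<h2 id="s1">Example</h2><pre>diff text</pre>'

html = format(page_template, title, toc, contents)
puts html
```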
158
+ ==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
159
+ Set the output format.
160
+ Format can be one of:
161
+
162
+ * html
163
+ * text, txt (this only works with text based downloaders)
164
+ * rss (proof of concept only;
165
+ it requires :rss[:url] to be set to the url, where the rss feed will
166
+ be published, using the <tt>option :rss, :url => URL</tt>
167
+ configuration command; you either have to use a text-based downloader
168
+ or add <tt>:rss_format => 'html'</tt> to the url options)
169
+
170
+
171
+ ==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
172
+ (Un)Set an option for the following source commands.
173
+
174
+ Example:
175
+ set :download, :foo => lambda {|url| get_url(url)}
176
+ set :days => 7, :sort => true
177
+ unset :days, :sort
178
+
179
+
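The way set and unset affect the source commands that follow them can be modelled with a small class. This is a sketch of the described behaviour, not the actual implementation: the class ProfileSketch is hypothetical, it covers only the one-argument form of set, and it assumes that options given directly to source override the ones set before.

```ruby
# Hypothetical model of "sticky" options: set/unset change the defaults
# that every later source command inherits.
class ProfileSketch
  attr_reader :sources

  def initialize
    @defaults = {}
    @sources  = []
  end

  def set(opts)
    @defaults.update(opts)
  end

  def unset(*keys)
    keys.each { |key| @defaults.delete(key) }
  end

  # Per-source options win over the current defaults (an assumption).
  def source(url, opts = {})
    @sources << [url, @defaults.merge(opts)]
  end
end

profile = ProfileSketch.new
profile.set(:days => 7, :sort => true)
profile.source('http://www.example.com', :title => 'My Page')
profile.unset(:days, :sort)
profile.source('http://www.example.com/page.html')
profile.sources.each { |url, opts| puts "#{url} #{opts.inspect}" }
```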
180
+ ==== source URL(S), [OPTIONS]
181
+ Options
182
+
183
+ <tt>:cols => FROM..TO</tt>::
184
+ Use only these columns from the output (used after applying the :lines
185
+ option)
186
+
187
+ <tt>:depth => INTEGER</tt>::
188
+ In conjunction with a :website type of :download option, fetch url up
189
+ to this depth.
190
+
191
+ <tt>:diff => "CMD", :diff => SHORTCUT</tt>::
192
+ Use this command to make the diff for this page. Possible values for
193
+ SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
194
+ :wget, or :body_html). :body_html, :website_below, :website and
195
+ :openuri are synonyms for :webdiff.
196
+
197
+ <tt>:diffprocess => lambda {|text| ...}</tt>::
198
+ Use this ruby snippet to post-process this diff
199
+
200
+ <tt>:download => "CMD", :download => SHORTCUT</tt>::
201
+ Use this command to download this page. For possible values for
202
+ SHORTCUT see the section on shortcuts below.
203
+
204
+ <tt>:downloadprocess => lambda {|text| ...}</tt>::
205
+ Use this ruby snippet to post-process what was downloaded. This is the
206
+ place where, e.g., hpricot can be used to extract certain elements
207
+ from the HTML code.
208
+ Example:
209
+ lambda {|text| Hpricot(text).at('div#content').inner_html}
210
+
211
+ <tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
212
+ The format string for the diff text. The default (the :diff shortcut)
213
+ wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
214
+ :website, and :openuri will simply add a newline character.
215
+
216
+ <tt>:hours => HOURS, :days => DAYS</tt>::
217
+ Don't download the file unless it's older than that
218
+
219
+ <tt>:ignore_age => true</tt>::
220
+ Ignore any :days and :hours settings. This is useful in some cases
221
+ when set on the command line.
222
+
223
+ <tt>:lines => FROM..TO</tt>::
224
+ Use only these lines from the output
225
+
226
+ <tt>:match => REGEXP</tt>::
227
+ When recursively walking a website, follow only links that match this
228
+ regexp.
229
+
230
+ <tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
231
+ Sort lines in output
232
+
233
+ <tt>:strip => true</tt>::
234
+ Strip empty lines
235
+
236
+ <tt>:title => "TEXT"</tt>::
237
+ Display TEXT instead of URL
238
+
239
+ <tt>:use => SYMBOL</tt>::
240
+ Use SYMBOL for any other option. I.e. <tt>:download => :body_html,
241
+ :diff => :webdiff</tt> can be abbreviated as <tt>:use =>
242
+ :body_html</tt> (because for :diff :body_html is a synonym for
243
+ :webdiff).
244
+
245
+
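The text-oriented options above (:lines, :cols, :strip, :sort) can be pictured as a small pipeline over the downloaded text. The sketch below is a hypothetical re-implementation: the function name apply_text_options is made up, and while the README states that :cols is applied after :lines, the position of :strip and :sort in the chain is an assumption.

```ruby
# Hypothetical re-implementation of some source options on plain text.
def apply_text_options(text, opts = {})
  lines = text.split("\n")
  # :lines => FROM..TO selects a range of lines (e.g. 10..-1).
  lines = lines[opts[:lines]] || [] if opts[:lines]
  # :cols => FROM..TO is applied to each remaining line.
  lines = lines.map { |line| line[opts[:cols]] || '' } if opts[:cols]
  # :strip => true removes empty lines.
  lines = lines.reject { |line| line.strip.empty? } if opts[:strip]
  # :sort => true sorts plainly, :sort => lambda {|a,b| ...} uses the block.
  if opts[:sort].respond_to?(:call)
    lines = lines.sort(&opts[:sort])
  elsif opts[:sort]
    lines = lines.sort
  end
  lines.join("\n")
end

text = "header\nzebra  1\n\napple  2\n"
puts apply_text_options(text, :lines => 1..-1, :cols => 0..4,
                              :strip => true, :sort => true)
```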
246
+ Example configuration file extract:
247
+ source 'URL', :days => 7, :download => :lynx
248
+
249
+ # Daily
250
+ set :days => 1
251
+ source 'http://www.example.com', :use => :body_html,
252
+ :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}
253
+
254
+ # Weekly
255
+ set :days => 7
256
+ source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'
257
+
258
+ # Bi-weekly
259
+ set :days => 14
260
+ source <<URLS
261
+ http://www.example.com
262
+ http://www.example.com/page.html
263
+ URLS
264
+
265
+ # Make HTML diffs and highlight occurrences of a word
266
+ source 'http://www.example.com',
267
+ :title => 'Example',
268
+ :use => :body_html,
269
+ :diffprocess => highlighter(/word/i)
270
+
271
+ # Download the whole website below this path (only pages with
272
+ # html-suffix)
273
+ # Download only php and html pages
274
+ # Follow links 2 levels deep
275
+ source 'http://www.example.com/foo/bar.html',
276
+ :title => 'Example -- Bar', :use => :website_below,
277
+ :match => /\.(php|html)\b/, :depth => 2
278
+
279
+ unset :days
280
+
281
+
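The interplay of :days, :hours and :ignore_age used in the extract above might be decided roughly as follows. This is a sketch under stated assumptions: the function name refetch? is hypothetical, and treating :days and :hours as additive is a guess, not something the README specifies.

```ruby
# Hypothetical age check: should a source be downloaded again?
def refetch?(last_fetch, now, opts = {})
  # No backup yet, or age restrictions explicitly ignored.
  return true if opts[:ignore_age] || last_fetch.nil?

  # Assumption: :days and :hours add up to one minimum age in seconds.
  min_age = opts.fetch(:days, 0) * 86_400 + opts.fetch(:hours, 0) * 3_600
  (now - last_fetch) >= min_age
end

now = Time.at(1_000_000)
puts refetch?(now - 3 * 86_400, now, :days => 7)
puts refetch?(now - 8 * 86_400, now, :days => 7)
puts refetch?(now - 3 * 86_400, now, :days => 7, :ignore_age => true)
```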
282
+ ==== view 'CMD "%s"'
283
+ Use this shell command to view the output (usually an HTML file).
284
+ %s will be replaced with the filename.
285
+
286
+ w3m is used by default.
287
+
288
+ Example:
289
+ view 'gnome-open "%s"' # Gnome Desktop
290
+ view 'kfmclient "%s"' # KDE
291
+ view 'cygstart "%s"' # Cygwin
292
+ view 'start "%s"' # Windows
293
+ view 'firefox "%s"'
294
+
295
+
296
+ === Shortcuts for use with :use, :download and other options
297
+ <tt>:w3m</tt>::
298
+ Use w3m for downloading the source. Use diff for generating diffs.
299
+
300
+ <tt>:lynx</tt>::
301
+ Use lynx for downloading the source. Use diff for generating diffs.
302
+
303
+ <tt>:links</tt>::
304
+ Use links for downloading the source. Use diff for generating diffs.
305
+
306
+ <tt>:curl</tt>::
307
+ Use curl for downloading the source. Use webdiff for generating diffs.
308
+
309
+ <tt>:wget</tt>::
310
+ Use wget for downloading the source. Use webdiff for generating diffs.
311
+
312
+ <tt>:openuri</tt>::
313
+ Use open-uri for downloading the source. Use webdiff for generating
314
+ diffs.
315
+
316
+ <tt>:body_html</tt>::
317
+ This requires hpricot to be installed. Use open-uri for downloading
318
+ the source, use only the body. Use webdiff for generating diffs. Try
319
+ to rewrite references (a, img) so that they point to the webpage. By
320
+ default, this will also strip tags like script, form, object ...
321
+
322
+ <tt>:website</tt>::
323
+ Use :body_html to download the source. Follow all links referring to
324
+ the same host with the same file suffix. Use webdiff for generating
325
+ diffs.
326
+
327
+ <tt>:website_below</tt>::
328
+ Use :body_html to download the source. Follow all links referring to
329
+ the same host and a file below the top directory with the same file
330
+ suffix. Use webdiff for generating diffs.
331
+
332
+ <tt>:website_txt</tt>::
333
+ Use :website to download the source but convert the output to plain
334
+ text.
335
+
336
+ <tt>:website_txt_below</tt>::
337
+ Use :website_below to download the source but convert the output to
338
+ plain text.
339
+
340
+ Any shortcuts relying on :body_html will also try to rewrite any
341
+ references so that the links point to the webpage.
342
+
343
+
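The link selection described for :website_below (same host, below the top directory, same suffix) can be sketched with Ruby's uri library. This is not the actual implementation: the function name follow_link? is hypothetical, and letting a :match regexp replace the suffix check is an assumption.

```ruby
require 'uri'

# Hypothetical link filter in the spirit of :website_below.
def follow_link?(start_url, link, match = nil)
  start  = URI.parse(start_url)
  target = URI.join(start_url, link)
  return false unless target.host == start.host

  # Directory of the start page, e.g. "/foo/" for "/foo/bar.html".
  base    = start.path.empty? ? '/' : start.path
  top_dir = base.sub(%r{[^/]*\z}, '')
  return false unless target.path.start_with?(top_dir)

  # Assumption: a :match regexp replaces the "same suffix" rule.
  return !(target.to_s =~ match).nil? if match
  File.extname(target.path) == File.extname(start.path)
end

start = 'http://www.example.com/foo/bar.html'
puts follow_link?(start, 'baz.html')
puts follow_link?(start, '../other.html')
puts follow_link?(start, 'sub/page.php', /\.(php|html)\b/)
```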
344
+ == REQUIREMENTS:
345
+ websitiary is a ruby-based application. You thus need a ruby
346
+ interpreter.
347
+
348
+ It depends on how you use websitiary whether you actually need the
349
+ following libraries and applications.
350
+
351
+ By default this script expects the following applications to be
352
+ present:
353
+
354
+ * diff
355
+ * vi (or some other editor)
356
+
357
+ and one of:
358
+
359
+ * w3m[http://w3m.sourceforge.net/] (default)
360
+ * lynx[http://lynx.isc.org/]
361
+ * links[http://links.twibright.com/]
362
+ * websec[http://baruch.ev-en.org/proj/websec/]
363
+ (or at Savannah[http://savannah.nongnu.org/projects/websec/])
364
+
365
+ The use of :webdiff as :diff application requires
366
+ websec[http://download.savannah.gnu.org/releases/websec/] to be
367
+ installed. In conjunction with :body_html, :openuri, or :curl, this
368
+ will give you colored HTML diffs.
369
+ Why not use +websec+ if I have to install it, you might ask. Well,
370
+ +websec+ is written in perl and I didn't quite manage to make it work
371
+ the way I want it to. websitiary is meant to be easier to configure.
372
+
373
+ For downloading HTML, you need one of these:
374
+
375
+ * open-uri (should be part of ruby)
376
+ * hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
377
+ :body_html, :website, and :website_below)
378
+ * curl[http://curl.haxx.se/]
379
+ * wget[http://www.gnu.org/software/wget/]
380
+
381
+ The following ruby libraries are needed in conjunction with :body_html
382
+ and :website related shortcuts:
383
+
384
+ * hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
385
+ only the body etc.)
386
+ * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
387
+ for parsing robots.txt
388
+
389
+ I personally suggest the following setup:
390
+
391
+ * w3m[http://w3m.sourceforge.net/]
392
+ * websec[http://baruch.ev-en.org/proj/websec/]
393
+ * hpricot[http://code.whytheluckystiff.net/hpricot]
394
+ * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
395
+
396
+
397
+ == INSTALL:
398
+ === Use rubygems
399
+ Run
400
+
401
+ gem install websitiary
402
+
403
+ This will download the package and install it.
404
+
405
+
406
+ === Use the zip
407
+ The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
408
+ setup.rb that does the work. Run
409
+
410
+ ruby setup.rb
411
+
412
+
413
+ === Copy Manually
414
+ Get the single file websitiary[http://rubyforge.org/frs/?group_id=4030]
415
+ script and copy it to some directory in $PATH.
416
+
417
+
418
+ === Initial Configuration
419
+ Please check the requirements section above and get the extra libraries
420
+ needed:
421
+ * hpricot
422
+ * robot_rules.rb
423
+
424
+ You might then want to create a profile ~/.websitiary/config.rb that is
425
+ loaded on every run. In this profile you could set the default output
426
+ viewer and profile editor, as well as a default profile.
427
+
428
+ Example:
429
+
430
+ # Load standard.rb if no profile is given on the command line.
431
+ default 'standard'
432
+
433
+ # Use cygwin's cygstart to view the output with the default HTML
434
+ # viewer
435
+ view '/usr/bin/cygstart "%s"'
436
+
437
+ # Use Windows gvim from cygwin ruby which is why we convert the path
438
+ # first
439
+ edit 'gvim $(cygpath -w -- "%s")'
440
+
441
+ Where these configuration files reside may differ. If the environment
442
+ variable $HOME is defined, the default is $HOME/.websitiary/ unless one
443
+ of the following directories exists, which will then be used instead:
444
+
445
+ * $USERPROFILE/websitiary (on Windows)
446
+ * SYSCONFDIR/websitiary (where SYSCONFDIR usually is /etc but you can
447
+ run ruby to find out more:
448
+ <tt>ruby -e "p Config::CONFIG['sysconfdir']"</tt>)
449
+
450
+ If neither directory exists and no $HOME variable is defined, the
451
+ current directory will be used.
452
+
453
+
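The lookup rules above can be summarized in a short sketch. It is an interpretation of the prose, not the script's own code: the function name config_dir, its parameters and the order in which the two alternative directories are tried are assumptions. It uses RbConfig, the current name of the module that older Ruby versions exposed as Config.

```ruby
require 'rbconfig'

# Hypothetical resolution of the configuration directory.
# env and exists are parameters so the logic can be tried without
# touching the real environment or file system.
def config_dir(env = ENV, exists = File.method(:directory?),
               sysconfdir = RbConfig::CONFIG['sysconfdir'])
  candidates = []
  candidates << File.join(env['USERPROFILE'], 'websitiary') if env['USERPROFILE']
  candidates << File.join(sysconfdir, 'websitiary') if sysconfdir

  # An existing alternative directory wins over $HOME/.websitiary.
  found = candidates.find { |dir| exists.call(dir) }
  return found if found
  return File.join(env['HOME'], '.websitiary') if env['HOME']

  # Neither directory exists and $HOME is not defined.
  Dir.pwd
end

puts config_dir
```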
454
+ == LICENSE:
455
+ websitiary Webpage Monitor
456
+ Copyright (C) 2007 Thomas Link
457
+
458
+ This program is free software; you can redistribute it and/or modify
459
+ it under the terms of the GNU General Public License as published by
460
+ the Free Software Foundation; either version 2 of the License, or
461
+ (at your option) any later version.
462
+
463
+ This program is distributed in the hope that it will be useful,
464
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
465
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
466
+ GNU General Public License for more details.
467
+
468
+ You should have received a copy of the GNU General Public License
469
+ along with this program; if not, write to the Free Software
470
+ Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
471
+ USA
472
+
473
+
474
+ # vi: ft=rd:tw=72:ts=4