RubyGems - websitiary - Versions diffs - 0.1.0 - Mend

websitiary 0.1.0

Files changed (7)

data/History.txt ADDED

@@ -0,0 +1,4 @@
+== 0.1.0 / 2007-07-10
+* Initial release

data/Manifest.txt ADDED

@@ -0,0 +1,6 @@
+History.txt
+Manifest.txt
+README.txt
+Rakefile
+setup.rb
+bin/websitiary

data/README.txt ADDED

@@ -0,0 +1,474 @@
+websitiary by Thomas Link
+http://rubyforge.org/projects/websitiary/
+This is a script for monitoring webpages that reuses other programs to
+do the actual work. By default, it works on an ASCII basis, i.e. with
+the output of text-based webbrowsers. With the help of some friends, it
+can also work with HTML.
+== DESCRIPTION:
+This is a script for monitoring webpages that reuses other programs
+(w3m, diff, webdiff etc.) to do most of the actual work. By default, it
+works on an ASCII basis, i.e. with the output of text-based webbrowsers
+like w3m (or lynx, links etc.) as the output can easily be
+post-processed. With the help of some friends (see the section below on
+requirements), it can also work with HTML. E.g., if you have websec
+installed, you can also use its webdiff program to show colored diffs.
+By default, this script will use w3m to dump HTML pages and then run
+diff over the current page and the previous backup. Some pages are
+better viewed with lynx or links. Downloaded documents (HTML or ASCII)
+can be post-processed (e.g., filtered through some ruby block that
+extracts elements via hpricot and the like). Please see the
+configuration options below to find out how to change this globally or
+for a single source.
+=== CAVEAT:
+The script also includes experimental support for monitoring whole
+websites. Basically, this script supports robots.txt directives (see
+requirements) but this is hardly tested and may not work in some cases.
+While it is okay to ignore robots.txt on your own websites, it is not
+okay on other people's sites. Please make sure that the webpages you
+run this program on allow such a use.  Some webpages disallow the use
+of any automatic downloader or offline reader in their user agreements.
+== FEATURES/PROBLEMS:
+* Download webpages on defined intervals
+* Compare webpages with previous backups
+* Display differences between the current version and the backup
+* Provide hooks to post-process the downloaded documents and the diff
+* Display a one page report summarizing all news
+* Automatically open the report in your favourite web-browser
+* Quite customizable
+ISSUES, TODO:
+* Improved support for robots.txt (test it)
+* The use of :website_below and :website is hardly tested (please
+  report errors).
+* download => :body_html tries to rewrite references (a, img) which may
+  fail on certain kind of urls (please report errors).
+* When using :body_html for download, it may happen that some
+  JavaScript code is stripped, which breaks some JavaScript-generated
+  links.
+== SYNOPSIS:
+=== Usage
+Example:
+  # Run "profile"
+  websitiary profile
+  # Edit "~/.websitiary/profile.rb"
+  websitiary --edit=profile
+  # View the latest report
+  websitiary --review
+  # Refetch all sources regardless of :days and :hours restrictions
+  websitiary -signore_age=true
+  # Create html and rss reports for my websites
+  websitiary -fhtml,rss mysites
+For example output see:
+* html[http://deplate.sourceforge.net/websitiary.html]
+* rss[http://deplate.sourceforge.net/websitiary.rss]
+* text[http://deplate.sourceforge.net/websitiary.txt]
+=== Configuration
+Profiles are plain ruby files (with the '.rb' suffix) stored in
+~/.websitiary/.
+The profile config.rb is always loaded if available.
+==== default 'PROFILE1', 'PROFILE2' ...
+Set the default profile(s).
+Example:
+  default 'my_profile'
+==== diff 'CMD "%s" "%s"'
+Use this shell command to make the diff.
+The two %s placeholders will be replaced with the old and the new
+filename, in that order.
+diff is used by default.
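The placeholder substitution can be pictured with Ruby's own format operator. This is only a sketch: that websitiary fills the template exactly this way is an assumption, and the helper name `diff_command` and the file names are made up for illustration.

```ruby
# Hypothetical helper: fill a 'CMD "%s" "%s"' template with two file names.
# String#% substitutes the %s placeholders in order: old file, then new file.
def diff_command(template, old_file, new_file)
  template % [old_file, new_file]
end

# Invented paths, used only to show the resulting command line.
puts diff_command('diff -u "%s" "%s"', '/tmp/page.old.txt', '/tmp/page.new.txt')
```

Quoting the placeholders in the template, as the examples in this README do, keeps file names with spaces in one shell argument.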
+==== diffprocess lambda {|text| ...}
+Use this ruby snippet to post-process the diff.
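As an illustration of what such a snippet could look like, the following lambda keeps only the lines that are new in the current version. It is a sketch under two assumptions: the diff is passed in as a single string, and it is in the classic diff format, where lines from the new file start with "> ". The constant name `KEEP_ADDED` is hypothetical.

```ruby
# Sketch of a diffprocess lambda (assumed input: classic diff output as one
# string). It keeps the added lines and drops the "> " marker in front.
KEEP_ADDED = lambda do |text|
  text.lines
      .select { |line| line.start_with?('> ') }
      .map { |line| line.sub(/\A> /, '') }
      .join
end

# Invented sample diff, only for demonstration.
sample = <<~DIFF
  2c2
  < Old headline
  ---
  > New headline
  4a5
  > One more paragraph
DIFF

puts KEEP_ADDED.call(sample)
```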
+==== download 'CMD "%s"'
+Use this shell command to download a page.
+%s will be replaced with the url.
+w3m is used by default.
+Example:
+  download 'lynx -dump "%s"'
+==== downloadprocess lambda {|text| ...}
+Use this ruby snippet to post-process what was downloaded.
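A typical reason to post-process a download is to drop volatile content that would otherwise show up in every diff. The lambda below is a minimal sketch, assuming the dumped page arrives as one string and the returned string is what gets stored and compared. The "Last updated:" marker and the name `DROP_VOLATILE` are invented for the example.

```ruby
# Sketch of a downloadprocess lambda: remove lines that change on every
# fetch (here: a hypothetical "Last updated:" line) before diffing.
DROP_VOLATILE = lambda do |text|
  text.lines.reject { |line| line =~ /\ALast updated:/ }.join
end

# Invented page dump, only for demonstration.
page = "Welcome\nLast updated: 2007-07-10 12:00\nNews item\n"
puts DROP_VOLATILE.call(page)
```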
+==== edit 'CMD "%s"'
+Use this shell command to edit a profile. %s will be replaced with the filename.
+vi is used by default.
+Example:
+  edit 'gvim "%s"&'
+==== option TYPE, OPTION => VALUE
+Set a global option.
+TYPE can be one of:
+<tt>:diff</tt>::
+  Generate a diff
+<tt>:diffprocess</tt>::
+  Post-process a diff (if necessary)
+<tt>:format</tt>::
+  Format the diff for output
+<tt>:download</tt>::
+  Download webpages
+<tt>:downloadprocess</tt>::
+  Post-process downloaded webpages
+<tt>:page</tt>::
+  The :format field defines the format of the final report. Here VALUE
+  is a format string that takes 3 variables as arguments: report title,
+  toc, contents.
+OPTION is a symbol.
+VALUE is either a format string or a block of code (of class Proc).
+Example:
+  set :download, :foo => lambda {|url| get_url(url)}
+==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
+Set the output format.
+Format can be one of:
+* html
+* text, txt (this only works with text based downloaders)
+* rss (proof of concept only;
+  it requires :rss[:url] to be set to the url, where the rss feed will
+  be published, using the <tt>option :rss, :url => URL</tt>
+  configuration command; you either have to use a text-based downloader
+  or include <tt>:rss_format => 'html'</tt> to the url options)
+==== set OPTION => VALUE; set TYPE, OPTION => VALUE; unset OPTIONS
+(Un)Set an option for the following source commands.
+Example:
+  set :download, :foo => lambda {|url| get_url(url)}
+  set :days => 7, :sort => true
+  unset :days, :sort
+==== source URL(S), [OPTIONS]
+Options
+<tt>:cols => FROM..TO</tt>::
+  Use only these columns from the output (used after applying the :lines
+  option)
+<tt>:depth => INTEGER</tt>::
+  In conjunction with a :website type of :download option, fetch url up
+  to this depth.
+<tt>:diff => "CMD", :diff => SHORTCUT</tt>::
+  Use this command to make the diff for this page. Possible values for
+  SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
+  :wget, or :body_html). :body_html, :website_below, :website and
+  :openuri are synonyms for :webdiff.
+<tt>:diffprocess => lambda {|text| ...}</tt>::
+  Use this ruby snippet to post-process this diff
+<tt>:download => "CMD", :download => SHORTCUT</tt>::
+  Use this command to download this page. For possible values for
+  SHORTCUT see the section on shortcuts below.
+<tt>:downloadprocess => lambda {|text| ...}</tt>::
+  Use this ruby snippet to post-process what was downloaded. This is the
+  place where, e.g., hpricot can be used to extract certain elements
+  from the HTML code.
+  Example:
+    lambda {|text| Hpricot(text).at('div#content').inner_html}
+<tt>:format => "FORMAT %s STRING", :format => SHORTCUT</tt>::
+  The format string for the diff text. The default (the :diff shortcut)
+  wraps the output in +pre+ tags. :webdiff, :body_html, :website_below,
+  :website, and :openuri will simply add a newline character.
+<tt>:hours => HOURS, :days => DAYS</tt>::
+  Don't download the file unless it's older than that
+<tt>:ignore_age => true</tt>::
+  Ignore any :days and :hours settings. This is useful in some cases
+  when set on the command line.
+<tt>:lines => FROM..TO</tt>::
+  Use only these lines from the output
+<tt>:match => REGEXP</tt>::
+  When recursively walking a website, follow only links that match this
+  regexp.
+<tt>:sort => true, :sort => lambda {|a,b| ...}</tt>::
+  Sort lines in output
+<tt>:strip => true</tt>::
+  Strip empty lines
+<tt>:title => "TEXT"</tt>::
+  Display TEXT instead of URL
+<tt>:use => SYMBOL</tt>::
+  Use SYMBOL for any other option. I.e. <tt>:download => :body_html,
+  :diff => :webdiff</tt> can be abbreviated as <tt>:use =>
+  :body_html</tt> (because for :diff :body_html is a synonym for
+  :webdiff).
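The text-trimming options (:lines, :cols, :strip, :sort) use ordinary Ruby ranges, so a negative end such as <tt>10..-1</tt> counts from the end of the text. The helper below mimics their effect on a text dump. It is a sketch, not websitiary's code: the name `trim_output` is hypothetical, and the order of the steps is an assumption except that :cols is applied after :lines, as stated above.

```ruby
# Hypothetical helper mimicking :lines, :cols, :strip and :sort.
def trim_output(text, lines: nil, cols: nil, strip: false, sort: false)
  rows = text.lines.map(&:chomp)
  rows = rows[lines] || [] if lines                   # e.g. 1..-1 skips line 0
  rows = rows.map { |row| row[cols] || '' } if cols   # same range rules per row
  rows = rows.reject { |row| row.strip.empty? } if strip
  rows = rows.sort if sort
  rows.join("\n")
end

# Invented dump, only for demonstration.
dump = "HEADER\nzebra  123\n\napple  456\n"
puts trim_output(dump, lines: 1..-1, cols: 0..4, strip: true, sort: true)
```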
+Example configuration file extract:
+  source 'URL', :days => 7, :download => :lynx
+  # Daily
+  set :days => 1
+  source 'http://www.example.com', :use => :body_html,
+    :downloadprocess => lambda {|text| Hpricot(text).at('div#content').inner_html}
+  # Weekly
+  set :days => 7
+  source 'http://www.example.com', :lines => 10..-1, :title => 'My Page'
+  # Bi-weekly
+  set :days => 14
+  source <<URLS
+  http://www.example.com
+  http://www.example.com/page.html
+  URLS
+  # Make HTML diffs and highlight occurrences of a word
+  source 'http://www.example.com',
+    :title => 'Example',
+    :use => :body_html,
+    :diffprocess => highlighter(/word/i)
+  # Download the whole website below this path (only pages with
+  # html-suffix)
+  # Download only php and html pages
+  # Follow links 2 levels deep
+  source 'http://www.example.com/foo/bar.html',
+    :title => 'Example -- Bar', :use => :website_below,
+    :match => /\.(php|html)\b/, :depth => 2
+  unset :days
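To see which links the last source in the extract would follow, the rule described for :website_below (same host, below the start page's directory) can be combined with the :match regexp. The function below is a sketch of that description only. The name `follow?` is hypothetical and the real crawler may decide differently, for example regarding robots.txt or file suffixes.

```ruby
require 'uri'

# Hypothetical link filter: same host, below the directory of the start
# page, and matching the :match regexp.
def follow?(start_url, link, match)
  start = URI.parse(start_url)
  target = URI.join(start_url, link)          # resolves relative links
  top_dir = start.path[%r{\A.*/}] || '/'      # "/foo/bar.html" -> "/foo/"
  target.host == start.host &&
    target.path.start_with?(top_dir) &&
    match.match?(target.to_s)
end

start = 'http://www.example.com/foo/bar.html'
pattern = /\.(php|html)\b/
%w[baz.html sub/page.php /other/page.html logo.png].each do |link|
  puts "#{link}: #{follow?(start, link, pattern)}"
end
```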
+==== view 'CMD "%s"'
+Use this shell command to view the output (usually an HTML file).
+%s will be replaced with the filename.
+w3m is used by default.
+Example:
+  view 'gnome-open "%s"' # Gnome Desktop
+  view 'kfmclient "%s"'  # KDE
+  view 'cygstart "%s"'   # Cygwin
+  view 'start "%s"'      # Windows
+  view 'firefox "%s"'
+=== Shortcuts for use with :use, :download and other options
+<tt>:w3m</tt>::
+  Use w3m for downloading the source. Use diff for generating diffs.
+<tt>:lynx</tt>::
+  Use lynx for downloading the source. Use diff for generating diffs.
+<tt>:links</tt>::
+  Use links for downloading the source. Use diff for generating diffs.
+<tt>:curl</tt>::
+  Use curl for downloading the source. Use webdiff for generating diffs.
+<tt>:wget</tt>::
+  Use wget for downloading the source. Use webdiff for generating diffs.
+<tt>:openuri</tt>::
+  Use open-uri for downloading the source. Use webdiff for generating
+  diffs.
+<tt>:body_html</tt>::
+  This requires hpricot to be installed. Use open-uri for downloading
+  the source, use only the body. Use webdiff for generating diffs. Try
+  to rewrite references (a, img) so that they point to the webpage. By
+  default, this will also strip tags like script, form, object ...
+<tt>:website</tt>::
+  Use :body_html to download the source. Follow all links referring to
+  the same host with the same file suffix. Use webdiff for generating
+  diffs.
+<tt>:website_below</tt>::
+  Use :body_html to download the source. Follow all links referring to
+  the same host and a file below the top directory with the same file
+  suffix. Use webdiff for generating diffs.
+<tt>:website_txt</tt>::
+  Use :website to download the source but convert the output to plain
+  text.
+<tt>:website_txt_below</tt>::
+  Use :website_below to download the source but convert the output to
+  plain text.
+Any shortcuts relying on :body_html will also try to rewrite any
+references so that the links point to the webpage.
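Rewriting a reference essentially means resolving it against the URL of the page it was found on, so that it still works when the diff is shown in a local report. The sketch below does this with Ruby's standard <tt>URI.join</tt>. Whether websitiary uses this call is an assumption, and `absolute_ref` is a hypothetical helper name.

```ruby
require 'uri'

# Hypothetical helper: turn a (possibly relative) href or src value into an
# absolute URL, based on the page that contained it.
def absolute_ref(page_url, ref)
  URI.join(page_url, ref).to_s
end

page = 'http://www.example.com/foo/bar.html'
puts absolute_ref(page, 'img/logo.png')               # relative to /foo/
puts absolute_ref(page, '/index.html')                # relative to the host
puts absolute_ref(page, 'http://other.example.org/')  # already absolute
```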
+== REQUIREMENTS:
+websitiary is a ruby-based application. You thus need a ruby
+interpreter.
+Whether you actually need the following libraries and applications
+depends on how you use websitiary.
+By default this script expects the following applications to be
+present:
+* diff
+* vi (or some other editor)
+and one of:
+* w3m[http://w3m.sourceforge.net/] (default)
+* lynx[http://lynx.isc.org/]
+* links[http://links.twibright.com/]
+* websec[http://baruch.ev-en.org/proj/websec/]
+  (or at Savannah[http://savannah.nongnu.org/projects/websec/])
+The use of :webdiff as :diff application requires
+websec[http://download.savannah.gnu.org/releases/websec/] to be
+installed. In conjunction with :body_html, :openuri, or :curl, this
+will give you colored HTML diffs.
+Why not use +websec+ if I have to install it, you might ask.  Well,
++websec+ is written in perl and I didn't quite manage to make it work
+the way I want it to. websitiary is meant to be easier to configure.
+For downloading HTML, you need one of these:
+* open-uri (should be part of ruby)
+* hpricot[http://code.whytheluckystiff.net/hpricot] (used e.g. by
+  :body_html, :website, and :website_below)
+* curl[http://curl.haxx.se/]
+* wget[http://www.gnu.org/software/wget/]
+The following ruby libraries are needed in conjunction with :body_html
+and :website related shortcuts:
+* hpricot[http://code.whytheluckystiff.net/hpricot] (parse HTML, use
+  only the body etc.)
+* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+  for parsing robots.txt
+I personally suggest the following setup:
+* w3m[http://w3m.sourceforge.net/]
+* websec[http://baruch.ev-en.org/proj/websec/]
+* hpricot[http://code.whytheluckystiff.net/hpricot]
+* robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
+== INSTALL:
+=== Use rubygems
+Run
+    gem install websitiary
+This will download the package and install it.
+=== Use the zip
+The zip[http://rubyforge.org/frs/?group_id=4030] contains a file
+setup.rb that does the work. Run
+    ruby setup.rb
+=== Copy Manually
+Get the single file websitiary[http://rubyforge.org/frs/?group_id=4030]
+script and copy it to some directory in $PATH.
+=== Initial Configuration
+Please check the requirements section above and get the extra libraries
+needed:
+* hpricot
+* robot_rules.rb
+You might then want to create a profile ~/.websitiary/config.rb that is
+loaded on every run. In this profile you could set the default output
+viewer and profile editor, as well as a default profile.
+Example:
+    # Load standard.rb if no profile is given on the command line.
+    default 'standard'
+    # Use cygwin's cygstart to view the output with the default HTML
+    # viewer
+    view '/usr/bin/cygstart "%s"'
+    # Use Windows gvim from cygwin ruby which is why we convert the path
+    # first
+    edit 'gvim $(cygpath -w -- "%s")'
+The location of these configuration files may differ. If the environment
+variable $HOME is defined, the default is $HOME/.websitiary/ unless one
+of the following directories exists, in which case it is used instead:
+* $USERPROFILE/websitiary (on Windows)
+* SYSCONFDIR/websitiary (where SYSCONFDIR usually is /etc but you can
+  run ruby to find out more:
+  <tt>ruby -e "p Config::CONFIG['sysconfdir']"</tt>)
+If neither directory exists and no $HOME variable is defined, the
+current directory will be used.
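The lookup described above can be sketched as follows. This is not the script's actual code: the function name `config_dir`, the injectable `exists` check (there only to make the sketch testable), and the order in which the two alternative directories are tried are all assumptions. The sketch uses <tt>RbConfig::CONFIG</tt>, the constant current Ruby versions provide for the configuration hash.

```ruby
require 'rbconfig'

# Sketch of the configuration directory lookup.
# env: a Hash-like object such as ENV; exists: callable that tests a path.
def config_dir(env = ENV, exists = File.method(:directory?))
  candidates = []
  candidates << File.join(env['USERPROFILE'], 'websitiary') if env['USERPROFILE']
  sysconfdir = RbConfig::CONFIG['sysconfdir']
  candidates << File.join(sysconfdir, 'websitiary') if sysconfdir

  # An existing alternative directory wins over the $HOME default.
  found = candidates.find { |dir| exists.call(dir) }
  return found if found
  return File.join(env['HOME'], '.websitiary') if env['HOME']

  Dir.pwd # neither directory exists and $HOME is undefined
end

puts config_dir
```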
+== LICENSE:
+websitiary Webpage Monitor
+Copyright (C) 2007 Thomas Link
+This program is free software; you can redistribute it and/or modify
+it under the terms of the GNU General Public License as published by
+the Free Software Foundation; either version 2 of the License, or
+(at your option) any later version.
+This program is distributed in the hope that it will be useful,
+but WITHOUT ANY WARRANTY; without even the implied warranty of
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+GNU General Public License for more details.
+You should have received a copy of the GNU General Public License
+along with this program; if not, write to the Free Software
+Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307
+USA
+# vi: ft=rd:tw=72:ts=4