websitary 0.2.1 → 0.3
- data/History.txt +16 -0
- data/README.txt +32 -27
- data/Rakefile +1 -1
- data/lib/websitary.rb +39 -18
- data/lib/websitary/configuration.rb +96 -40
- data/lib/websitary/htmldiff.rb +89 -40
- metadata +4 -4
data/History.txt
CHANGED
@@ -1,3 +1,19 @@
+= 0.3
+
+* Renamed the global option :downloadhtml to :download_html.
+* The downloader for robots and rss enclosures should now be properly
+  configurable via the global options :download_robots and
+  :download_rss_enclosure (default: :openuri).
+* Respect rel="nofollow" on hyperreferences.
+* :wdays, :mdays didn't work.
+* --exclude command line options, exclude configuration command
+* Check for robots.txt-compliance after testing if the URL is
+  appropriate.
+* htmldiff.rb can now also highlight differences à la websec's webdiff.
+* configuration.rb: Ignore pubDate and certain other non-essential fields (tags
+  etc.) when constructing rss item IDs.
+
+
 = 0.2.1
 
 * Use URI.merge for constructing robots.txt uri.
data/README.txt
CHANGED
@@ -4,21 +4,18 @@ http://rubyforge.org/projects/websitiary/
 This script monitors webpages, rss feeds, podcasts etc. and reports
 what's new. For many tasks, it reuses other programs to do the actual
 work. By default, it works on an ASCII basis, i.e. with the output of
-text-based webbrowsers. With the help of some friends, it
+text-based webbrowsers. With the help of some friends, it works also
 with HTML.
 
 
 == DESCRIPTION:
 websitary (formerly known as websitiary with an extra "i") monitors
-webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
-
-
-
-
-
-use its webdiff program to show colored diffs. This script was
-originally planned as a ruby-based websec replacement. For HTML diffs,
-it stills relies on the webdiff perl script that comes with websec.
+webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
+etc.) to do most of the actual work. By default, it works on an ASCII
+basis, i.e. with the output of text-based webbrowsers like w3m (or lynx,
+links etc.) as the output can easily be post-processed. It can also work
+with HTML and highlight new items. This script was originally planned as
+a ruby-based websec replacement.
 
 By default, this script will use w3m to dump HTML pages and then run
 diff over the current page and the previous backup. Some pages are
@@ -28,6 +25,9 @@ extracts elements via hpricot and the like). Please see the
 configuration options below to find out how to change this globally or
 for a single source.
 
+This user manual is also available as
+PDF[http://websitiary.rubyforge.org/websitary.pdf].
+
 
 == FEATURES/PROBLEMS:
 * Handle webpages, rss feeds (optionally save attachments in podcasts
@@ -58,7 +58,7 @@ NOTE: The script was previously called websitiary but was renamed (from
 0.2 on) to websitary (without the superfluous i).
 
 
-===
+=== Caveat
 The script also includes experimental support for monitoring whole
 websites. Basically, this script supports robots.txt directives (see
 requirements) but this is hardly tested and may not work in some cases.
@@ -70,8 +70,6 @@ downloader or offline reader in their user agreements.
 
 
 == SYNOPSIS:
-This manual is also available as
-PDF[http://websitiary.rubyforge.org/websitary.pdf].
 
 === Usage
 Example:
@@ -245,8 +243,13 @@ Options
 <tt>:diff => "CMD", :diff => SHORTCUT</tt>::
     Use this command to make the diff for this page. Possible values for
     SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
-    :wget, or :body_html)
-    :openuri are synonyms for
+    :wget, or :body_html), :websec_webdiff (use websec's webdiff tool),
+    :body_html, :website_below, :website and :openuri are synonyms for
+    :webdiff.
+    NOTE: Since version 0.3, :webdiff is mapped to websitary's own
+    htmldiff class (which can also be used as stand-alone script). Before
+    0.3, websitary used websec's webdiff script, which is now mapped to
+    :websec_webdiff.
 
 <tt>:diffprocess => lambda {|text| ...}</tt>::
     Use this ruby snippet to post-process this diff
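For orientation, the synonym relation described in the option text above can be sketched as a small dispatch table. This is illustrative only: websitary's real resolution happens through its shortcut/:delegate machinery, and the table here is not the project's code.

```ruby
# Illustrative sketch only: per the README text, :body_html, :website_below,
# :website and :openuri all resolve to the :webdiff diff shortcut, while
# :websec_webdiff stays its own entry (it shells out to websec's webdiff).
DIFF_SYNONYMS = {
  :body_html     => :webdiff,
  :website_below => :webdiff,
  :website       => :webdiff,
  :openuri       => :webdiff,
}

def resolve_diff(shortcut)
  # Unknown shortcuts pass through unchanged.
  DIFF_SYNONYMS.fetch(shortcut, shortcut)
end

resolve_diff(:body_html)       # => :webdiff
resolve_diff(:websec_webdiff)  # => :websec_webdiff
```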
@@ -479,13 +482,13 @@ references so that the links point to the webpage.
 source 'http://www.example.com/daily_image/', :title => 'Daily Image',
     :use => :img,
     :download => lambda {|url|
+        rv = nil
         # Read the HTML.
         html = open(url) {|io| io.read}
         # This check is probably unnecessary as the failure to read
         # the HTML document would most likely result in an
         # exception.
         if html
-            rv = nil
             # Parse the HTML document.
             doc = Hpricot(html)
             # The following could actually be simplified using xpath
@@ -541,6 +544,9 @@ latest::
     Show the latest copies of the sources from the profiles given
     on the command line.
 
+ls::
+    List number of aggregated diffs.
+
 rebuild::
     Rebuild the latest report.
 
@@ -611,16 +617,14 @@ and one of:
 * w3m[http://w3m.sourceforge.net/] (default)
 * lynx[http://lynx.isc.org/]
 * links[http://links.twibright.com/]
-* websec[http://baruch.ev-en.org/proj/websec/]
-  (or at Savannah[http://savannah.nongnu.org/projects/websec/])
 
-The use of :
-websec[http://
-
-
-
-
-
+The use of :websec_webdiff as :diff application requires
+websec[http://baruch.ev-en.org/proj/websec/] (or at
+Savannah[http://savannah.nongnu.org/projects/websec/]) to be installed.
+By default, websitary uses it's own htmldiff class/script, which is less
+well tested and may return inferior results in comparison with websec's
+webdiff. In conjunction with :body_html, :openuri, or :curl, this will
+give you colored HTML diffs.
 
 For downloading HTML, you need one of these:
 
@@ -641,7 +645,6 @@ and :website related shortcuts:
 I personally would suggest to choose the following setup:
 
 * w3m[http://w3m.sourceforge.net/]
-* websec[http://baruch.ev-en.org/proj/websec/]
 * hpricot[http://code.whytheluckystiff.net/hpricot]
 * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
 
@@ -674,7 +677,7 @@ These could be installed by:
     gem install hpricot
 
     # Install robot_rules.rb
-
+    wget http://www.rubyquiz.com/quiz64_sols.zip
     # Check the correct path to site_ruby first!
     unzip -p quiz64_sols.zip "solutions/James Edward Gray II/robot_rules.rb" > /lib/ruby/site_ruby/1.8/robot_rules.rb
     rm quiz64_sols.zip
@@ -708,6 +711,8 @@ of the following directories exist, which will then be used instead:
 If neither directory exists and no $HOME variable is defined, the
 current directory will be used.
 
+Now check out the configuration commands in the Synopsis section.
+
 
 == LICENSE:
 websitary Webpage Monitor
data/Rakefile
CHANGED
data/lib/websitary.rb
CHANGED
@@ -1,13 +1,8 @@
 # websitary.rb
-# @Last Change: 2007-
+# @Last Change: 2007-10-26.
 # Author:: Thomas Link (micathom AT gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-09-08.
-#
-# = TODO
-# * Built-in support for robots.txt
-# * Option to append to output files (e.g. rss)
-# * Option to trim output files (when appending items)
 
 
 require 'cgi'
@@ -37,8 +32,8 @@ end
 
 module Websitary
     APPNAME = 'websitary'
-    VERSION = '0.
-    REVISION = '
+    VERSION = '0.3'
+    REVISION = '2437'
 end
 
 require 'websitary/applog'
@@ -48,7 +43,7 @@ require 'websitary/htmldiff'
 
 
 # Basic usage:
-# Websitary.new(ARGV).process
+# Websitary::App.new(ARGV).process
 class Websitary::App
     MINUTE_SECS = 60
     HOUR_SECS = MINUTE_SECS * 60
@@ -207,7 +202,7 @@ CSS
 
 
     def cmdline_arg_add(configuration, url)
-        configuration.
+        configuration.to_do url
     end
 
 
@@ -290,6 +285,24 @@ CSS
     end
 
 
+    def execute_ls
+        rv = 0
+        @configuration.todo.each do |url|
+            opts = @configuration.urls[url]
+            name = @configuration.get(url, :title, url)
+            $logger.debug "Source: #{name}"
+            aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+            aggrfiles = Dir["#{aggrbase}_*"]
+            aggrn = aggrfiles.size
+            if aggrn > 0
+                puts "%3d - %s" % [aggrn, name]
+                rv = 1
+            end
+        end
+        rv
+    end
+
+
     # Show data collected by #execute_aggregate
     def execute_show
         @configuration.todo.each do |url|
@@ -320,6 +333,10 @@ CSS
     # and command-line options. The differences are stored in @difftext (a Hash).
     # show_output:: If true, show the output with the defined viewer.
     def execute_downdiff(show_output=true, rebuild=false, &accumulator)
+        if @configuration.todo.empty?
+            $logger.error 'Nothing to do'
+            return 5
+        end
         @configuration.todo.each do |url|
             opts = @configuration.urls[url]
             $logger.debug "Source: #{@configuration.get(url, :title, url)}"
@@ -464,15 +481,19 @@ CSS
             # $logger.debug text #DBG#
         end
 
-        if
-        if
-
-
-
+        if text and !text.empty?
+            if older
+                if File.exist?(latest)
+                    move(latest, older)
+                elsif !File.exist?(older)
+                    $logger.warn "Initial copy: #{latest.inspect}"
+                end
+            end
+            @configuration.write_file(latest) {|io| io.puts(text)}
+            return true
+        else
+            return false
         end
-        @configuration.write_file(latest) {|io| io.puts(text)}
-        return true
     end
 
 
@@ -566,7 +587,7 @@ CSS
         if parent_eligible == parent_now
             return true
         else
-            case
+            case eligible
             when Array, Range
                 return !eligible.include?(now)
             when Integer
data/lib/websitary/configuration.rb
CHANGED
@@ -1,5 +1,5 @@
 # configuration.rb
-# @Last Change: 2007-
+# @Last Change: 2007-10-21.
 # Author:: Thomas Link (micathom AT gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-09-08.
@@ -12,7 +12,7 @@ class Websitary::Configuration
     # Hash (key = URL, value = Hash of options)
     attr_accessor :urls
     # Array of urls to be downloaded.
-
+    attr_reader :todo
     # Array of downloaded urls.
     attr_accessor :done
     # The user configuration directory
@@ -60,6 +60,7 @@ class Websitary::Configuration
         @profiles = []
         @robots = {}
         @todo = []
+        @exclude = []
         @urlencmap = {}
         @urls = {}
 
@@ -127,10 +128,9 @@ class Websitary::Configuration
             global(:timer => value)
         end
 
-
-
-
-        # end
+        opts.on('-x', '--exclude=N', Regexp, 'Exclude URLs matching this pattern') do |value|
+            exclude(value)
+        end
 
         opts.separator ''
         opts.separator "Available commands (default: #@execute):"
@@ -304,6 +304,8 @@ class Websitary::Configuration
                 $logger.debug "Profile: #{fn}"
                 contents = File.read(fn)
                 return eval_profile(contents, fn)
+            else
+                $logger.error "Unknown profile: #{profile_name}"
             end
         end
         return false
@@ -334,6 +336,13 @@ class Websitary::Configuration
     end
 
 
+    def to_do(url)
+        unless @exclude.any? {|p| url =~ p}
+            @todo << url
+        end
+    end
+
+
     # Set the output format.
     def output_format(*format)
         unless format.all? {|e| ['text', 'html', 'rss'].include?(e)}
@@ -396,7 +405,7 @@ class Websitary::Configuration
     def source(urls, opts={})
         urls.split("\n").flatten.compact.each do |url|
             @urls[url] = @default_options.dup.update(opts)
-
+            to_do url
         end
     end
 
@@ -424,6 +433,13 @@ class Websitary::Configuration
     end
 
 
+    # Configuration command:
+    # Add URL-exclusion patterns (REGEXPs).
+    def exclude(*urls)
+        @exclude += urls
+    end
+
+
     # Configuration command:
     # Set the viewer.
     def view(view)
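The exclude and to_do methods added in this file work as a pair: exclude registers patterns, and to_do consults them before queueing a URL. A stand-alone sketch of that interaction (TodoList is a hypothetical minimal stand-in, not websitary's Configuration class):

```ruby
# Hypothetical minimal stand-in for the exclude/to_do pair shown in the diff:
# URLs matching any registered pattern never reach the todo list.
class TodoList
  attr_reader :todo

  def initialize
    @exclude = []
    @todo = []
  end

  # Add URL-exclusion patterns (Regexps), like the exclude configuration command.
  def exclude(*patterns)
    @exclude += patterns
  end

  # Queue a url unless an exclusion pattern matches.
  def to_do(url)
    @todo << url unless @exclude.any? {|p| url =~ p}
  end
end

list = TodoList.new
list.exclude(/\.pdf$/)
list.to_do 'http://example.com/index.html'
list.to_do 'http://example.com/manual.pdf'
list.todo  # => ["http://example.com/index.html"]
```

The same predicate backs the new `-x`/`--exclude` command-line option, which simply calls the exclude configuration command with the given Regexp.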
@@ -786,6 +802,7 @@ HTML
         # pn0 = Pathname.new(guess_dir(File.expand_path(uri0.path)))
         pn0 = Pathname.new(guess_dir(uri0.path))
         (hpricot / 'a').each do |a|
+            next if a['rel'] == 'nofollow'
             href = a['href']
             next if href.nil? or href == url or href =~ /^\s*javascript:/
             uri = URI.parse(href)
@@ -793,18 +810,18 @@ HTML
             href = rewrite_href(href, url, uri0, pn0, true)
             curl = canonic_url(href)
             next if !href or href.nil? or @done.include?(curl) or @todo.include?(curl)
-            next unless robots_allowed?(curl, uri)
             # pn = Pathname.new(guess_dir(File.expand_path(uri.path)))
             uri = URI.parse(href)
             pn = Pathname.new(guess_dir(uri.path))
-
-
-
-
-
-
-
-
+            next unless condition.call(uri0, pn0, uri, pn)
+            next unless robots_allowed?(curl, uri)
+            opts = @urls[url].dup
+            # opts[:title] = File.basename(curl)
+            opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
+            opts[:depth] = depth - 1 if depth and depth >= 0
+            # opts[:sleep] = delay if delay
+            @urls[curl] = opts
+            to_do curl
         end
     rescue Exception => e
         # $logger.error e #DBG#
@@ -900,7 +917,7 @@ HTML
     # group:: A number (default: 0)
     # tag:: The HTML tag to use (default: "span")
     def highlighter(rx, color=nil, group=nil, tag='span')
-        lambda {|text| text.gsub(rx, %{<#{tag} class="highlight-#{color || '
+        lambda {|text| text.gsub(rx, %{<#{tag} class="highlight-#{color || 'red'}">\\#{group || 0}</#{tag}>})}
     end
 
 
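The repaired highlighter line can be read in isolation; a self-contained sketch with the same defaults as in the diff (color 'red', group 0, tag 'span'):

```ruby
# Self-contained copy of the highlighter factory from the diff above:
# wraps every match of rx in a tag carrying a highlight-COLOR class.
def highlighter(rx, color=nil, group=nil, tag='span')
  # \0 in the replacement refers to the whole match (or group N if given).
  lambda {|text| text.gsub(rx, %{<#{tag} class="highlight-#{color || 'red'}">\\#{group || 0}</#{tag}>})}
end

hl = highlighter(/Ruby/)
hl.call("I like Ruby a lot")
# => %{I like <span class="highlight-red">Ruby</span> a lot}
```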
@@ -952,14 +969,14 @@ HTML
     def initialize_options
         @options = {
             :global => {
-                :
+                :download_html => :openuri,
             },
         }
 
         @options[:diff] = {
             :default => :diff,
 
-            :diff
+            :diff => lambda {|old, new, *args|
                 opts, _ = args
                 opts ||= '-d -w'
                 difftext = call_cmd('diff %s -u2 "%s" "%s"', [opts, old, new])
@@ -978,7 +995,22 @@ HTML
 
             :raw => :new,
 
+            :htmldiff => lambda {|old, new|
+                oldhtml = File.read(old)
+                newhtml = File.read(new)
+                difftext = Websitary::Htmldiff.new(:oldtext => oldhtml, :newtext => newhtml).diff
+                difftext
+            },
+
             :webdiff => lambda {|old, new|
+                oldhtml = File.read(old)
+                newhtml = File.read(new)
+                difftext = Websitary::Htmldiff.new(:highlight => 'highlight', :oldtext => oldhtml, :newtext => newhtml).diff
+                difftext
+            },
+
+            :websec_webdiff => lambda {|old, new|
+            # :webdiff => lambda {|old, new|
                 $logger.debug "webdiff: #{File.basename(new)}"
                 $logger.debug %{webdiff --hicolor=yellow -archive "#{old}" -current "#{new}" -out -}
                 difftext = `webdiff --hicolor=yellow -archive "#{old}" -current "#{new}" -out -`
@@ -1027,25 +1059,25 @@ HTML
         # :download => 'w3m -no-cookie -S -F -dump "%s"'
 
         shortcut :lynx, :delegate => :diff,
-
+            :download => 'lynx -dump "%s"'
 
         shortcut :links, :delegate => :diff,
-
+            :download => 'links -dump "%s"'
 
         shortcut :curl, :delegate => :webdiff,
-
+            :download => 'curl --silent "%s"'
 
         shortcut :wget, :delegate => :webdiff,
-
+            :download => 'wget -q -O - "%s"'
 
         shortcut :text, :delegate => :diff,
-
+            :download => lambda {|url| html_to_text(read_url(url, 'html'))}
 
         shortcut :body_html, :delegate => :webdiff,
             :strip_tags => :default,
             :download => lambda {|url|
                 begin
-                    doc = Hpricot(
+                    doc = Hpricot(read_url(url, 'html'))
                     doc = doc.at('body')
                     if doc
                         doc = rewrite_urls(url, doc)
@@ -1068,7 +1100,7 @@ HTML
         shortcut :openuri, :delegate => :webdiff,
             :download => lambda {|url|
                 begin
-
+                    read_url_openuri(url)
                 rescue Exception => e
                     # $logger.error e #DBG#
                     $logger.error e.message
@@ -1085,17 +1117,17 @@ HTML
                 if ro
                     rh = {}
                     ro.items.each do |item|
-                        rh[
+                        rh[rss_item_id(item)] = item
                         rh[item.link] = item
                     end
                     rnew = []
                     rn = RSS::Parser.parse(File.read(new), false)
                     if rn
                         rn.items.each do |item|
-                            rid =
+                            rid = rss_item_id(item)
                             if !rh[rid]
                                 if (olditem = rh[item.link])
-                                    rss_diff = Websitary::Htmldiff.new(:oldtext => olditem.description, :newtext => item.description).process
+                                    rss_diff = Websitary::Htmldiff.new(:highlight => 'highlight', :oldtext => olditem.description, :newtext => item.description).process
                                     rnew << format_rss_item(item, rss_diff)
                                 else
                                     if item.enclosure and (curl = item.enclosure.url)
@@ -1111,7 +1143,7 @@ HTML
                                         $logger.debug "Enclosure URL: #{curl}"
                                         fname = File.join(dir, encode(File.basename(curl) || item.title || item.pubDate.to_s || Time.now.to_s))
                                         $logger.debug "Enclosure save to: #{fname}"
-                                        enc =
+                                        enc = read_url(curl, 'rss_enclosure')
                                         write_file(fname, 'wb') {|io| io.puts enc}
                                         furl = file_url(fname)
                                         enclosure = %{<p class="enclosure"><a href="%s" class="enclosure" />Enclosure (local copy)</a></p>} % furl
@@ -1146,7 +1178,7 @@ HTML
                 opts[:download] = :rss
                 opts[:title] = elt['title'] || elt['text'] || elt['htmlurl'] || curl
                 @urls[curl] = opts
-
+                to_do curl
             else
                 $logger.warn "Unsupported type in OPML: #{elt.to_s}"
             end
@@ -1162,10 +1194,10 @@ HTML
             :download => lambda {|url| get_website_below(:body_html, url)}
 
         shortcut :website_txt, :delegate => :default,
-            :download => lambda {|url| html_to_text(get_website(get(url, :
+            :download => lambda {|url| html_to_text(get_website(get(url, :download_html, :openuri), url))}
 
         shortcut :website_txt_below, :delegate => :default,
-            :download => lambda {|url| html_to_text(get_website_below(get(url, :
+            :download => lambda {|url| html_to_text(get_website_below(get(url, :download_html, :openuri), url))}
 
         shortcut :ftp, :delegate => :default,
             :download => lambda {|url| get_ftp(url).join("\n")}
@@ -1184,7 +1216,7 @@ HTML
                 opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
                 opts[:depth] = depth - 1 if depth and depth >= 0
                 @urls[curl] = opts
-
+                to_do curl
             end
         end
         list.join("\n")
@@ -1284,7 +1316,8 @@ OUT
         if doc
             return if robots?(doc, 'noindex')
             push_hrefs(url, doc) do |uri0, pn0, uri, pn|
-                uri.host
+                (uri.host || uri.is_a?(URI::Generic)) &&
+                    (uri0.host || uri0.is_a?(URI::Generic)) &&
                     eligible_path?(url, uri0.path, uri.path) &&
                     uri.host == uri0.host &&
                     (pn.to_s == '.' || pn.relative_path_from(pn0).to_s == '.')
@@ -1337,7 +1370,17 @@ OUT
     end
 
 
-    def
+    def read_url(url, type='html')
+        downloader = get(url, "download_#{type}".intern)
+        if downloader
+            call_cmd(downloader, [url])
+        else
+            read_url_openuri(url)
+        end
+    end
+
+
+    def read_url_openuri(url)
         if url.nil? or url.empty?
             $logger.fatal "Internal error: url is nil"
             puts caller.join("\n")
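The new read_url dispatch is what makes the :download_html, :download_robots and :download_rss_enclosure options from the changelog work: it looks up a per-type downloader and falls back to open-uri. A hypothetical stand-alone sketch of that fallback logic (the options hash and the fallback string are illustrative; websitary's real get() consults per-source and global options):

```ruby
# Hypothetical stand-in for the per-URL downloader lookup: a type such as
# 'html', 'robots' or 'rss_enclosure' selects a "download_TYPE" option.
def read_url(options, url, type='html')
  downloader = options["download_#{type}".intern]
  if downloader
    downloader.call(url)            # a configured downloader (lambda here)
  else
    "fallback-openuri:#{url}"       # stands in for read_url_openuri(url)
  end
end

opts = {:download_robots => lambda {|u| "curl:#{u}"}}
read_url(opts, 'http://example.com/robots.txt', 'robots')
# => "curl:http://example.com/robots.txt"
read_url(opts, 'http://example.com/', 'html')
# => "fallback-openuri:http://example.com/"
```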
@@ -1346,11 +1389,11 @@ OUT
         $logger.debug "Open URL: #{url}"
         uri = URI.parse(url)
         if uri.instance_of?(URI::Generic) or uri.scheme == 'file'
-            open(url)
+            open(url).read
         else
             header = {"User-Agent" => @user_agent}
             header.merge!(get(url, :header, {}))
-            open(url, header)
+            open(url, header).read
         end
     end
 
@@ -1369,6 +1412,14 @@ OUT
     end
 
 
+    def rss_item_id(item)
+        return Digest::MD5.hexdigest(item.to_s)
+        # i = [item.author, item.title, item.link, item.description, item.enclosure].inspect
+        # # p "DBG", i.inspect, Digest::MD5.hexdigest(i.inspect)
+        # return Digest::MD5.hexdigest(i)
+    end
+
+
     def format_rss_item(item, body, enclosure='')
         hd = [item.title]
         hd << " (#{item.author})" if item.author
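The rss_item_id added above reduces an item to an MD5 digest of its serialization, so identical serializations always yield the same ID while any field change yields a new one. A minimal demonstration of that property:

```ruby
require 'digest/md5'

# Identical serializations agree; any change in the hashed fields produces
# a different ID (the item strings here are illustrative, not real feed data).
id_a = Digest::MD5.hexdigest('<item><title>News</title></item>')
id_b = Digest::MD5.hexdigest('<item><title>News</title></item>')
id_c = Digest::MD5.hexdigest('<item><title>Other</title></item>')
id_a == id_b  # => true
id_a == id_c  # => false
```

This is also why the changelog entry about ignoring pubDate matters: volatile fields must be kept out of whatever string gets hashed, or every feed refresh would mint new IDs.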
@@ -1395,12 +1446,17 @@ EOT
 
     # Retrieve any robots meta directives from the hpricot document.
     def robots?(hpricot, *what)
-        (hpricot
+        meta(hpricot, 'robots').any? do |e|
             what.any? {|w| e['content'].split(/,\s*/).include?(w)}
         end
     end
 
 
+    def meta(hpricot, name)
+        hpricot / %{//meta[@name="#{name}"]}
+    end
+
+
     # Check whether robots are allowed to retrieve an url.
     def robots_allowed?(url, uri)
         if @allow.has_key?(url)
@@ -1414,7 +1470,7 @@ EOT
         rurl = robots_uri(uri).to_s
         return true if rurl.nil? or rurl.empty?
         begin
-            robots_txt =
+            robots_txt = read_url(rurl, 'robots')
             rules = RobotRules.new(@user_agent)
             rules.parse(rurl, robots_txt)
             @robots[host] = rules
data/lib/websitary/htmldiff.rb
CHANGED
@@ -1,29 +1,72 @@
 #!/usr/bin/env ruby
 # htmldiff.rb
-# @Last Change: 2007-
+# @Last Change: 2007-10-08.
 # Author:: Thomas Link (micathom at gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-08-17.
-#
+#
 # == Basic Use
-# htmldiff OLD NEW > DIFF
+# htmldiff OLD NEW [HIGHLIGHT-COLOR] > DIFF
 
 require 'hpricot'
 
 
-# TODO:
-# * Option: Don't extract but highlight.
 module Websitary
+    # A simple class to generate diffs for html files using hpricot.
+    # It's quite likely that it will miss certain details and yields
+    # wrong results (especially wrong-negative) in certain occasions.
     class Htmldiff
         VERSION = '0.1'
-        REVISION = '
-
+        REVISION = '164'
+
+        # args:: A hash
+        # Fields:
+        # :oldtext:: The old version
+        # :newtext:: The new version
+        # :highlight:: Don't strip old content but highlight new one with this color
+        # :args:: Command-line arguments
         def initialize(args)
             @args = args
+            @high = args[:highlight] || args[:highlightcolor]
             @old = explode(args[:olddoc] || Hpricot(args[:oldtext] || File.read(args[:oldfile])))
             @new = args[:newdoc] || Hpricot(args[:newtext] || File.read(args[:newfile]))
+            @changed = false
         end
 
+
+        # Do the diff. Return an empty string if nothing has changed.
+        def diff
+            rv = process.to_s
+            @changed ? rv : ''
+        end
+
+
+        # It goes like this: if a node isn't in the list of old nodes either
+        # the node or its content has changed. If the content is a single
+        # node, the whole node has changed. If only some sub-nodes have
+        # changed, collect those.
+        def process(node=@new)
+            acc = []
+            node.each_child do |child|
+                ch = child.to_html.strip
+                next if ch.nil? or ch.empty?
+                if @old.include?(ch)
+                    if @high
+                        acc << child
+                    end
+                else
+                    if child.respond_to?(:each_child)
+                        acc << process(child)
+                    else
+                        acc << highlight(child).to_s
+                        acc << '<br />' unless @high
+                    end
+                end
+            end
+            replace_inner(node, acc.join("\n"))
+        end
+
+
         # Collect all nodes and subnodes in a hpricot document.
         def explode(node)
             if node.respond_to?(:each_child)
@@ -37,40 +80,44 @@ module Websitary
             end
         end
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-            # Hpricot(child.inner_html.strip).to_html
-            if ap.empty?
-                # p "DBG add child"
-                acc << child
-            else
-                # p "DBG add inner"
-                acc += ap
-            end
+
+        def highlight(child)
+            @changed = true
+            if @high
+                if child.respond_to?(:each_child)
+                    acc = []
+                    child.each_child do |ch|
+                        acc << replace_inner(ch, highlight(ch).to_s)
+                    end
+                    replace_inner(child, acc.join("\n"))
+                else
+                    case @args[:highlight]
+                    when String
+                        opts = %{class="#{@args[:highlight]}"}
+                    when true, Numeric
+                        opts = %{class="highlight"}
                     else
-
-                acc << [child, '<br />']
-                single = true
+                        opts = %{style="background-color: #{@args[:highlightcolor]};"}
                     end
+                    ihtml = %{<span #{opts}>#{child.to_s}</span>}
+                    replace_inner(child, ihtml)
                 end
+            else
+                child
+            end
+        end
+
+
+        def replace_inner(child, ihtml)
+            case child
+            when Hpricot::Comment
+                child
+            when Hpricot::Text
+                Hpricot(ihtml)
+            else
+                child.inner_html = ihtml
+                child
             end
-            # p "DBG n=#{acc.size}"
-            acc.size == 1 && single ? [node] : acc
-            # puts acc.map {|c| c.to_html}.join("\n")
         end
 
     end
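The process/explode pair reconstructed above can be illustrated without hpricot: the old document is exploded into a list of node serializations, and any new chunk missing from that list counts as changed. This is a pure-string simplification of the real recursive algorithm, which also descends into changed nodes and can highlight instead of extract:

```ruby
# Pure-string simplification of Htmldiff's core idea (illustrative only):
# chunks present in the old "exploded" list are unchanged; the rest is new.
old_chunks = ['<p>unchanged</p>', '<p>stale</p>']
new_chunks = ['<p>unchanged</p>', '<p>fresh</p>']

changed = new_chunks.reject {|ch| old_chunks.include?(ch)}
changed  # => ["<p>fresh</p>"]
```

With :highlight set, the real class keeps the unchanged chunks too and wraps only the changed ones in a highlighting span, which is what the new :webdiff diff shortcut relies on.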
@@ -78,12 +125,14 @@ end
 
 
 if __FILE__ == $0
-    old, new,
+    old, new, aargs = ARGV
     if old and new
-
+        args = {:args => aargs, :oldfile => old, :newfile => new}
+        args[:highlightcolor], _ = aargs
+        acc = Websitary::Htmldiff.new(args).diff
         puts acc
     else
-        puts "#{File.basename($0)} OLD NEW > DIFF"
+        puts "#{File.basename($0)} OLD NEW [HIGHLIGHT-COLOR] > DIFF"
     end
 end
 
metadata
CHANGED
@@ -3,15 +3,15 @@ rubygems_version: 0.9.4
 specification_version: 1
 name: websitary
 version: !ruby/object:Gem::Version
-  version: 0.
-date: 2007-
+  version: "0.3"
+date: 2007-10-26 00:00:00 +02:00
 summary: A unified website news, rss feed, podcast monitor
 require_paths:
 - lib
 email: micathom at gmail com
 homepage: http://rubyforge.org/projects/websitiary/
 rubyforge_project: websitiary
-description: "== DESCRIPTION: websitary (formerly known as websitiary with an extra \"i\") monitors webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
+description: "== DESCRIPTION: websitary (formerly known as websitiary with an extra \"i\") monitors webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff etc.) to do most of the actual work. By default, it works on an ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or lynx, links etc.) as the output can easily be post-processed. It can also work with HTML and highlight new items. This script was originally planned as a ruby-based websec replacement. By default, this script will use w3m to dump HTML pages and then run diff over the current page and the previous backup. Some pages are better viewed with lynx or links. Downloaded documents (HTML or ASCII) can be post-processed (e.g., filtered through some ruby block that extracts elements via hpricot and the like). Please see the configuration options below to find out how to change this globally or for a single source. This user manual is also available as PDF[http://websitiary.rubyforge.org/websitary.pdf]. == FEATURES/PROBLEMS: * Handle webpages, rss feeds (optionally save attachments in podcasts etc.) * Compare webpages with previous backups * Display differences between the current version and the backup * Provide hooks to post-process the downloaded documents and the diff * Display a one-page report summarizing all news * Automatically open the report in your favourite web-browser * Experimental: Download webpages on defined intervalls and generate incremental diffs."
 autorequire:
 default_executable:
 bindir: bin
@@ -72,5 +72,5 @@ dependencies:
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
-    version: 1.
+    version: 1.3.0
 version: