websitary 0.2.1 → 0.3

data/History.txt CHANGED
@@ -1,3 +1,19 @@
+= 0.3
+
+* Renamed the global option :downloadhtml to :download_html.
+* The downloader for robots and rss enclosures should now be properly
+  configurable via the global options :download_robots and
+  :download_rss_enclosure (default: :openuri).
+* Respect rel="nofollow" on hyperreferences.
+* :wdays, :mdays didn't work.
+* --exclude command line option, exclude configuration command
+* Check for robots.txt-compliance after testing if the URL is
+  appropriate.
+* htmldiff.rb can now also highlight differences à la websec's webdiff.
+* configuration.rb: Ignore pubDate and certain other non-essential
+  fields (tags etc.) when constructing rss item IDs.
+
+
 = 0.2.1
 
 * Use URI.merge for constructing robots.txt uri.
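The renamed option from the first entry is set like any other global option. A minimal profile sketch, assuming the global configuration command that configuration.rb uses internally is also available in profiles:

    # websitary 0.2.x
    global(:downloadhtml => :openuri)
    # websitary 0.3
    global(:download_html => :openuri)
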
data/README.txt CHANGED
@@ -4,21 +4,18 @@ http://rubyforge.org/projects/websitiary/
 This script monitors webpages, rss feeds, podcasts etc. and reports
 what's new. For many tasks, it reuses other programs to do the actual
 work. By default, it works on an ASCII basis, i.e. with the output of
-text-based webbrowsers. With the help of some friends, it can also work
+text-based webbrowsers. With the help of some friends, it also works
 with HTML.
 
 
 == DESCRIPTION:
 websitary (formerly known as websitiary with an extra "i") monitors
-webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff,
-webdiff etc.) to do most of the actual work. By default, it works on an
-ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or
-lynx, links etc.) as the output can easily be post-processed. With the
-help of some friends (see the section below on requirements), it can
-also work with HTML. E.g., if you have websec installed, you can also
-use its webdiff program to show colored diffs. This script was
-originally planned as a ruby-based websec replacement. For HTML diffs,
-it stills relies on the webdiff perl script that comes with websec.
+webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
+etc.) to do most of the actual work. By default, it works on an ASCII
+basis, i.e. with the output of text-based webbrowsers like w3m (or lynx,
+links etc.) as the output can easily be post-processed. It can also work
+with HTML and highlight new items. This script was originally planned as
+a ruby-based websec replacement.
 
 By default, this script will use w3m to dump HTML pages and then run
 diff over the current page and the previous backup. Some pages are
@@ -28,6 +25,9 @@ extracts elements via hpricot and the like). Please see the
 configuration options below to find out how to change this globally or
 for a single source.
 
+This user manual is also available as
+PDF[http://websitiary.rubyforge.org/websitary.pdf].
+
 
 == FEATURES/PROBLEMS:
 * Handle webpages, rss feeds (optionally save attachments in podcasts
@@ -58,7 +58,7 @@ NOTE: The script was previously called websitiary but was renamed (from
 0.2 on) to websitary (without the superfluous i).
 
 
-=== CAVEAT:
+=== Caveat
 The script also includes experimental support for monitoring whole
 websites. Basically, this script supports robots.txt directives (see
 requirements) but this is hardly tested and may not work in some cases.
@@ -70,8 +70,6 @@ downloader or offline reader in their user agreements.
 
 
 == SYNOPSIS:
-This manual is also available as
-PDF[http://websitiary.rubyforge.org/websitary.pdf].
 
 === Usage
 Example:
@@ -245,8 +243,13 @@ Options
 <tt>:diff => "CMD", :diff => SHORTCUT</tt>::
     Use this command to make the diff for this page. Possible values for
     SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
-    :wget, or :body_html). :body_html, :website_below, :website and
-    :openuri are synonyms for :webdiff.
+    :wget, or :body_html) and :websec_webdiff (use websec's webdiff tool).
+    :body_html, :website_below, :website and :openuri are synonyms for
+    :webdiff.
+    NOTE: Since version 0.3, :webdiff is mapped to websitary's own
+    htmldiff class (which can also be used as a stand-alone script). Before
+    0.3, websitary used websec's webdiff script, which is now mapped to
+    :websec_webdiff.
 
 <tt>:diffprocess => lambda {|text| ...}</tt>::
     Use this ruby snippet to post-process this diff
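For illustration, a profile sketch that picks a diff method per source; the URLs are placeholders, :webdiff resolves to the built-in htmldiff class since 0.3, and :websec_webdiff calls websec's webdiff:

    source 'http://www.example.com/news.html',
        :download => :curl, :diff => :webdiff
    source 'http://www.example.com/blog.html',
        :download => :wget, :diff => :websec_webdiff
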
@@ -479,13 +482,13 @@ references so that the links point to the webpage.
 source 'http://www.example.com/daily_image/', :title => 'Daily Image',
     :use => :img,
     :download => lambda {|url|
+        rv = nil
         # Read the HTML.
         html = open(url) {|io| io.read}
         # This check is probably unnecessary as the failure to read
         # the HTML document would most likely result in an
         # exception.
         if html
-            rv = nil
             # Parse the HTML document.
             doc = Hpricot(html)
             # The following could actually be simplified using xpath
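The README's example is cut off above; a sketch of a complete :download lambda in the same spirit (the img extraction is illustrative, not the README's actual continuation):

    :download => lambda {|url|
        rv = nil
        html = open(url) {|io| io.read}
        if html
            doc = Hpricot(html)
            # Take the source of the first image as the
            # document to monitor.
            img = doc.at('img')
            rv = img ? img['src'] : nil
        end
        rv
    }
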
@@ -541,6 +544,9 @@ latest::
     Show the latest copies of the sources from the profiles given
     on the command line.
 
+ls::
+    List the number of aggregated diffs.
+
 rebuild::
     Rebuild the latest report.
 
@@ -611,16 +617,14 @@ and one of:
 * w3m[http://w3m.sourceforge.net/] (default)
 * lynx[http://lynx.isc.org/]
 * links[http://links.twibright.com/]
-* websec[http://baruch.ev-en.org/proj/websec/]
-  (or at Savannah[http://savannah.nongnu.org/projects/websec/])
 
-The use of :webdiff as :diff application requires
-websec[http://download.savannah.gnu.org/releases/websec/] to be
-installed. In conjunction with :body_html, :openuri, or :curl, this
-will give you colored HTML diffs.
-Why not use +websec+ if I have to install it, you might ask. Well,
-+websec+ is written in perl and I didn't quite manage to make it work
-the way I want it to. websitary is made to be better to configure.
+The use of :websec_webdiff as :diff application requires
+websec[http://baruch.ev-en.org/proj/websec/] (or at
+Savannah[http://savannah.nongnu.org/projects/websec/]) to be installed.
+By default, websitary uses its own htmldiff class/script, which is less
+well tested and may return inferior results in comparison with websec's
+webdiff. In conjunction with :body_html, :openuri, or :curl, this will
+give you colored HTML diffs.
 
 For downloading HTML, you need one of these:
 
@@ -641,7 +645,6 @@ and :website related shortcuts:
 I personally would suggest to choose the following setup:
 
 * w3m[http://w3m.sourceforge.net/]
-* websec[http://baruch.ev-en.org/proj/websec/]
 * hpricot[http://code.whytheluckystiff.net/hpricot]
 * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
 
@@ -674,7 +677,7 @@ These could be installed by:
     gem install hpricot
 
     # Install robot_rules.rb
-    curl http://www.rubyquiz.com/quiz64_sols.zip
+    wget http://www.rubyquiz.com/quiz64_sols.zip
     # Check the correct path to site_ruby first!
     unzip -p quiz64_sols.zip "solutions/James Edward Gray II/robot_rules.rb" > /lib/ruby/site_ruby/1.8/robot_rules.rb
     rm quiz64_sols.zip
@@ -708,6 +711,8 @@ of the following directories exist, which will then be used instead:
 If neither directory exists and no $HOME variable is defined, the
 current directory will be used.
 
+Now check out the configuration commands in the Synopsis section.
+
 
 == LICENSE:
 websitary Webpage Monitor
data/Rakefile CHANGED
@@ -21,7 +21,7 @@ require 'rtagstask'
 RTagsTask.new
 
 task :ctags do
-    `ctags --extra=+q -R bin lib`
+    `ctags --extra=+q --fields=+i -R bin lib`
 end
 
 # vim: syntax=Ruby
data/lib/websitary.rb CHANGED
@@ -1,13 +1,8 @@
 # websitary.rb
-# @Last Change: 2007-09-16.
+# @Last Change: 2007-10-26.
 # Author:: Thomas Link (micathom AT gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-09-08.
-#
-# = TODO
-# * Built-in support for robots.txt
-# * Option to append to output files (e.g. rss)
-# * Option to trim output files (when appending items)
 
 
 require 'cgi'
@@ -37,8 +32,8 @@ end
 
 module Websitary
     APPNAME = 'websitary'
-    VERSION = '0.2.1'
-    REVISION = '2405'
+    VERSION = '0.3'
+    REVISION = '2437'
 end
 
 require 'websitary/applog'
@@ -48,7 +43,7 @@ require 'websitary/htmldiff'
 
 
 # Basic usage:
-#   Websitary.new(ARGV).process
+#   Websitary::App.new(ARGV).process
 class Websitary::App
     MINUTE_SECS = 60
     HOUR_SECS = MINUTE_SECS * 60
@@ -207,7 +202,7 @@ CSS
 
 
     def cmdline_arg_add(configuration, url)
-        configuration.todo << url
+        configuration.to_do url
     end
 
 
@@ -290,6 +285,24 @@ CSS
     end
 
 
+    def execute_ls
+        rv = 0
+        @configuration.todo.each do |url|
+            opts = @configuration.urls[url]
+            name = @configuration.get(url, :title, url)
+            $logger.debug "Source: #{name}"
+            aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+            aggrfiles = Dir["#{aggrbase}_*"]
+            aggrn = aggrfiles.size
+            if aggrn > 0
+                puts "%3d - %s" % [aggrn, name]
+                rv = 1
+            end
+        end
+        rv
+    end
+
+
     # Show data collected by #execute_aggregate
     def execute_show
         @configuration.todo.each do |url|
@@ -320,6 +333,10 @@ CSS
     # and command-line options. The differences are stored in @difftext (a Hash).
     # show_output:: If true, show the output with the defined viewer.
     def execute_downdiff(show_output=true, rebuild=false, &accumulator)
+        if @configuration.todo.empty?
+            $logger.error 'Nothing to do'
+            return 5
+        end
         @configuration.todo.each do |url|
             opts = @configuration.urls[url]
             $logger.debug "Source: #{@configuration.get(url, :title, url)}"
@@ -464,15 +481,19 @@ CSS
             # $logger.debug text #DBG#
         end
 
-        if older
-            if File.exist?(latest)
-                move(latest, older)
-            elsif !File.exist?(older)
-                $logger.warn "Initial copy: #{latest.inspect}"
+        if text and !text.empty?
+            if older
+                if File.exist?(latest)
+                    move(latest, older)
+                elsif !File.exist?(older)
+                    $logger.warn "Initial copy: #{latest.inspect}"
+                end
             end
+            @configuration.write_file(latest) {|io| io.puts(text)}
+            return true
+        else
+            return false
         end
-        @configuration.write_file(latest) {|io| io.puts(text)}
-        return true
     end
 
 
@@ -566,7 +587,7 @@ CSS
         if parent_eligible == parent_now
             return true
         else
-            case now
+            case eligible
             when Array, Range
                 return !eligible.include?(now)
             when Integer
data/lib/websitary/configuration.rb CHANGED
@@ -1,5 +1,5 @@
 # configuration.rb
-# @Last Change: 2007-09-16.
+# @Last Change: 2007-10-21.
 # Author:: Thomas Link (micathom AT gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-09-08.
@@ -12,7 +12,7 @@ class Websitary::Configuration
     # Hash (key = URL, value = Hash of options)
     attr_accessor :urls
     # Array of urls to be downloaded.
-    attr_accessor :todo
+    attr_reader :todo
     # Array of downloaded urls.
     attr_accessor :done
     # The user configuration directory
@@ -60,6 +60,7 @@ class Websitary::Configuration
         @profiles = []
         @robots = {}
         @todo = []
+        @exclude = []
         @urlencmap = {}
         @urls = {}
 
@@ -127,10 +128,9 @@ class Websitary::Configuration
             global(:timer => value)
         end
 
-        # opts.on('--review', 'View last diff') do |value|
-        #     view_output
-        #     exit 0
-        # end
+        opts.on('-x', '--exclude=N', Regexp, 'Exclude URLs matching this pattern') do |value|
+            exclude(value)
+        end
 
         opts.separator ''
         opts.separator "Available commands (default: #@execute):"
@@ -304,6 +304,8 @@ class Websitary::Configuration
             $logger.debug "Profile: #{fn}"
             contents = File.read(fn)
             return eval_profile(contents, fn)
+        else
+            $logger.error "Unknown profile: #{profile_name}"
         end
     end
     return false
@@ -334,6 +336,13 @@ class Websitary::Configuration
     end
 
 
+    def to_do(url)
+        unless @exclude.any? {|p| url =~ p}
+            @todo << url
+        end
+    end
+
+
     # Set the output format.
     def output_format(*format)
         unless format.all? {|e| ['text', 'html', 'rss'].include?(e)}
@@ -396,7 +405,7 @@ class Websitary::Configuration
     def source(urls, opts={})
         urls.split("\n").flatten.compact.each do |url|
             @urls[url] = @default_options.dup.update(opts)
-            @todo << url
+            to_do url
         end
     end
 
@@ -424,6 +433,13 @@ class Websitary::Configuration
     end
 
 
+    # Configuration command:
+    # Add URL-exclusion patterns (REGEXPs).
+    def exclude(*urls)
+        @exclude += urls
+    end
+
+
     # Configuration command:
     # Set the viewer.
     def view(view)
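Taken together, the new pieces work like this: patterns registered with exclude (or via -x/--exclude on the command line) are consulted by to_do before a URL is queued, and source and the website crawlers now enqueue through to_do. A profile sketch with made-up patterns:

    # Never queue ad servers or logout links.
    exclude(/doubleclick\.net/, /\/logout\b/)
    source 'http://www.example.com/', :use => :website_below
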
@@ -786,6 +802,7 @@ HTML
     # pn0 = Pathname.new(guess_dir(File.expand_path(uri0.path)))
     pn0 = Pathname.new(guess_dir(uri0.path))
     (hpricot / 'a').each do |a|
+        next if a['rel'] == 'nofollow'
         href = a['href']
         next if href.nil? or href == url or href =~ /^\s*javascript:/
         uri = URI.parse(href)
@@ -793,18 +810,18 @@ HTML
         href = rewrite_href(href, url, uri0, pn0, true)
         curl = canonic_url(href)
         next if !href or href.nil? or @done.include?(curl) or @todo.include?(curl)
-        next unless robots_allowed?(curl, uri)
         # pn = Pathname.new(guess_dir(File.expand_path(uri.path)))
         uri = URI.parse(href)
         pn = Pathname.new(guess_dir(uri.path))
-        if condition.call(uri0, pn0, uri, pn)
-            opts = @urls[url].dup
-            # opts[:title] = File.basename(curl)
-            opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
-            opts[:depth] = depth - 1 if depth and depth >= 0
-            @urls[curl] = opts
-            @todo << curl
-        end
+        next unless condition.call(uri0, pn0, uri, pn)
+        next unless robots_allowed?(curl, uri)
+        opts = @urls[url].dup
+        # opts[:title] = File.basename(curl)
+        opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
+        opts[:depth] = depth - 1 if depth and depth >= 0
+        # opts[:sleep] = delay if delay
+        @urls[curl] = opts
+        to_do curl
     end
 rescue Exception => e
     # $logger.error e #DBG#
@@ -900,7 +917,7 @@ HTML
     # group:: A number (default: 0)
     # tag:: The HTML tag to use (default: "span")
     def highlighter(rx, color=nil, group=nil, tag='span')
-        lambda {|text| text.gsub(rx, %{<#{tag} class="highlight-#{color || 'yellow'}">\\#{group || 0}</#{tag}>})}
+        lambda {|text| text.gsub(rx, %{<#{tag} class="highlight-#{color || 'red'}">\\#{group || 0}</#{tag}>})}
     end
 
 
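The default color for this helper thus changes from yellow to red. A usage sketch (URL and regexp are placeholders; :diffprocess is the post-processing hook documented in the README):

    source 'http://www.example.com/status.html',
        :download => :body_html,
        :diffprocess => highlighter(/ERROR|WARNING/, 'green')
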
@@ -952,14 +969,14 @@ HTML
     def initialize_options
         @options = {
             :global => {
-                :downloadhtml => :openuri,
+                :download_html => :openuri,
             },
         }
 
         @options[:diff] = {
             :default => :diff,
 
-            :diff => lambda {|old, new, *args|
+            :diff => lambda {|old, new, *args|
                 opts, _ = args
                 opts ||= '-d -w'
                 difftext = call_cmd('diff %s -u2 "%s" "%s"', [opts, old, new])
@@ -978,7 +995,22 @@ HTML
 
             :raw => :new,
 
+            :htmldiff => lambda {|old, new|
+                oldhtml = File.read(old)
+                newhtml = File.read(new)
+                difftext = Websitary::Htmldiff.new(:oldtext => oldhtml, :newtext => newhtml).diff
+                difftext
+            },
+
             :webdiff => lambda {|old, new|
+                oldhtml = File.read(old)
+                newhtml = File.read(new)
+                difftext = Websitary::Htmldiff.new(:highlight => 'highlight', :oldtext => oldhtml, :newtext => newhtml).diff
+                difftext
+            },
+
+            :websec_webdiff => lambda {|old, new|
+            # :webdiff => lambda {|old, new|
                 $logger.debug "webdiff: #{File.basename(new)}"
                 $logger.debug %{webdiff --hicolor=yellow -archive "#{old}" -current "#{new}" -out -}
                 difftext = `webdiff --hicolor=yellow -archive "#{old}" -current "#{new}" -out -`
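Both new entries delegate to Websitary::Htmldiff, shown further down in this diff. The same call can be made directly; a sketch with placeholder file names:

    oldhtml = File.read('old.html')
    newhtml = File.read('new.html')
    # Returns '' when nothing changed, the highlighted HTML otherwise.
    puts Websitary::Htmldiff.new(:highlight => 'highlight',
                                 :oldtext => oldhtml,
                                 :newtext => newhtml).diff
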
@@ -1027,25 +1059,25 @@ HTML
         # :download => 'w3m -no-cookie -S -F -dump "%s"'
 
         shortcut :lynx, :delegate => :diff,
-            :download => 'lynx -dump "%s"'
+            :download => 'lynx -dump "%s"'
 
         shortcut :links, :delegate => :diff,
-            :download => 'links -dump "%s"'
+            :download => 'links -dump "%s"'
 
         shortcut :curl, :delegate => :webdiff,
-            :download => 'curl --silent "%s"'
+            :download => 'curl --silent "%s"'
 
         shortcut :wget, :delegate => :webdiff,
-            :download => 'wget -q -O - "%s"'
+            :download => 'wget -q -O - "%s"'
 
         shortcut :text, :delegate => :diff,
-            :download => lambda {|url| html_to_text(open_url(url).read)}
+            :download => lambda {|url| html_to_text(read_url(url, 'html'))}
 
         shortcut :body_html, :delegate => :webdiff,
             :strip_tags => :default,
             :download => lambda {|url|
                 begin
-                    doc = Hpricot(open_url(url).read)
+                    doc = Hpricot(read_url(url, 'html'))
                     doc = doc.at('body')
                     if doc
                         doc = rewrite_urls(url, doc)
@@ -1068,7 +1100,7 @@ HTML
         shortcut :openuri, :delegate => :webdiff,
             :download => lambda {|url|
                 begin
-                    open_url(url).read
+                    read_url_openuri(url)
                 rescue Exception => e
                     # $logger.error e #DBG#
                     $logger.error e.message
@@ -1085,17 +1117,17 @@ HTML
                 if ro
                     rh = {}
                     ro.items.each do |item|
-                        rh[Digest::MD5.hexdigest(item.to_s)] = item
+                        rh[rss_item_id(item)] = item
                         rh[item.link] = item
                     end
                     rnew = []
                     rn = RSS::Parser.parse(File.read(new), false)
                     if rn
                         rn.items.each do |item|
-                            rid = Digest::MD5.hexdigest(item.to_s)
+                            rid = rss_item_id(item)
                             if !rh[rid]
                                 if (olditem = rh[item.link])
-                                    rss_diff = Websitary::Htmldiff.new(:oldtext => olditem.description, :newtext => item.description).process
+                                    rss_diff = Websitary::Htmldiff.new(:highlight => 'highlight', :oldtext => olditem.description, :newtext => item.description).process
                                     rnew << format_rss_item(item, rss_diff)
                                 else
                                     if item.enclosure and (curl = item.enclosure.url)
@@ -1111,7 +1143,7 @@ HTML
                                         $logger.debug "Enclosure URL: #{curl}"
                                         fname = File.join(dir, encode(File.basename(curl) || item.title || item.pubDate.to_s || Time.now.to_s))
                                         $logger.debug "Enclosure save to: #{fname}"
-                                        enc = open_url(curl).read
+                                        enc = read_url(curl, 'rss_enclosure')
                                         write_file(fname, 'wb') {|io| io.puts enc}
                                         furl = file_url(fname)
                                         enclosure = %{<p class="enclosure"><a href="%s" class="enclosure" />Enclosure (local copy)</a></p>} % furl
@@ -1146,7 +1178,7 @@ HTML
                 opts[:download] = :rss
                 opts[:title] = elt['title'] || elt['text'] || elt['htmlurl'] || curl
                 @urls[curl] = opts
-                @todo << curl
+                to_do curl
             else
                 $logger.warn "Unsupported type in OPML: #{elt.to_s}"
             end
@@ -1162,10 +1194,10 @@ HTML
             :download => lambda {|url| get_website_below(:body_html, url)}
 
         shortcut :website_txt, :delegate => :default,
-            :download => lambda {|url| html_to_text(get_website(get(url, :downloadhtml, :openuri), url))}
+            :download => lambda {|url| html_to_text(get_website(get(url, :download_html, :openuri), url))}
 
         shortcut :website_txt_below, :delegate => :default,
-            :download => lambda {|url| html_to_text(get_website_below(get(url, :downloadhtml, :openuri), url))}
+            :download => lambda {|url| html_to_text(get_website_below(get(url, :download_html, :openuri), url))}
 
         shortcut :ftp, :delegate => :default,
             :download => lambda {|url| get_ftp(url).join("\n")}
@@ -1184,7 +1216,7 @@ HTML
             opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
             opts[:depth] = depth - 1 if depth and depth >= 0
             @urls[curl] = opts
-            @todo << curl
+            to_do curl
         end
     end
     list.join("\n")
@@ -1284,7 +1316,8 @@ OUT
         if doc
             return if robots?(doc, 'noindex')
             push_hrefs(url, doc) do |uri0, pn0, uri, pn|
-                uri.host && uri0.host &&
+                (uri.host || uri.is_a?(URI::Generic)) &&
+                (uri0.host || uri0.is_a?(URI::Generic)) &&
                 eligible_path?(url, uri0.path, uri.path) &&
                 uri.host == uri0.host &&
                 (pn.to_s == '.' || pn.relative_path_from(pn0).to_s == '.')
@@ -1337,7 +1370,17 @@ OUT
     end
 
 
-    def open_url(url)
+    def read_url(url, type='html')
+        downloader = get(url, "download_#{type}".intern)
+        if downloader
+            call_cmd(downloader, [url])
+        else
+            read_url_openuri(url)
+        end
+    end
+
+
+    def read_url_openuri(url)
         if url.nil? or url.empty?
             $logger.fatal "Internal error: url is nil"
             puts caller.join("\n")
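read_url first consults a per-type option ("download_" plus the type), which is what makes the :download_robots and :download_rss_enclosure options from History.txt take effect. A sketch, assuming call_cmd accepts the same command strings as the :download option; without such an option the method falls back to open-uri:

    global(:download_robots => 'curl --silent "%s"')
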
@@ -1346,11 +1389,11 @@ OUT
         $logger.debug "Open URL: #{url}"
         uri = URI.parse(url)
         if uri.instance_of?(URI::Generic) or uri.scheme == 'file'
-            open(url)
+            open(url).read
         else
             header = {"User-Agent" => @user_agent}
             header.merge!(get(url, :header, {}))
-            open(url, header)
+            open(url, header).read
         end
     end
 
@@ -1369,6 +1412,14 @@ OUT
     end
 
 
+    def rss_item_id(item)
+        return Digest::MD5.hexdigest(item.to_s)
+        # i = [item.author, item.title, item.link, item.description, item.enclosure].inspect
+        # # p "DBG", i.inspect, Digest::MD5.hexdigest(i.inspect)
+        # return Digest::MD5.hexdigest(i)
+    end
+
+
     def format_rss_item(item, body, enclosure='')
         hd = [item.title]
         hd << " (#{item.author})" if item.author
@@ -1395,12 +1446,17 @@ EOT
 
     # Retrieve any robots meta directives from the hpricot document.
     def robots?(hpricot, *what)
-        (hpricot / '//meta[@name="robots"]').any? do |e|
+        meta(hpricot, 'robots').any? do |e|
             what.any? {|w| e['content'].split(/,\s*/).include?(w)}
         end
     end
 
 
+    def meta(hpricot, name)
+        hpricot / %{//meta[@name="#{name}"]}
+    end
+
+
     # Check whether robots are allowed to retrieve an url.
     def robots_allowed?(url, uri)
         if @allow.has_key?(url)
@@ -1414,7 +1470,7 @@ EOT
         rurl = robots_uri(uri).to_s
         return true if rurl.nil? or rurl.empty?
         begin
-            robots_txt = open_url(rurl).read
+            robots_txt = read_url(rurl, 'robots')
             rules = RobotRules.new(@user_agent)
             rules.parse(rurl, robots_txt)
             @robots[host] = rules
data/lib/websitary/htmldiff.rb CHANGED
@@ -1,29 +1,72 @@
 #!/usr/bin/env ruby
 # htmldiff.rb
-# @Last Change: 2007-09-09.
+# @Last Change: 2007-10-08.
 # Author:: Thomas Link (micathom at gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-08-17.
-#
+#
 # == Basic Use
-# htmldiff OLD NEW > DIFF
+# htmldiff OLD NEW [HIGHLIGHT-COLOR] > DIFF
 
 require 'hpricot'
 
 
-# TODO:
-# * Option: Don't extract but highlight.
 module Websitary
+    # A simple class to generate diffs for html files using hpricot.
+    # It's quite likely that it will miss certain details and yield
+    # wrong results (especially wrong-negative) on certain occasions.
     class Htmldiff
         VERSION = '0.1'
-        REVISION = '39'
-
+        REVISION = '164'
+
+        # args:: A hash
+        # Fields:
+        # :oldtext:: The old version
+        # :newtext:: The new version
+        # :highlight:: Don't strip old content but highlight new one with this color
+        # :args:: Command-line arguments
         def initialize(args)
             @args = args
+            @high = args[:highlight] || args[:highlightcolor]
             @old = explode(args[:olddoc] || Hpricot(args[:oldtext] || File.read(args[:oldfile])))
             @new = args[:newdoc] || Hpricot(args[:newtext] || File.read(args[:newfile]))
+            @changed = false
         end
 
+
+        # Do the diff. Return an empty string if nothing has changed.
+        def diff
+            rv = process.to_s
+            @changed ? rv : ''
+        end
+
+
+        # It goes like this: if a node isn't in the list of old nodes either
+        # the node or its content has changed. If the content is a single
+        # node, the whole node has changed. If only some sub-nodes have
+        # changed, collect those.
+        def process(node=@new)
+            acc = []
+            node.each_child do |child|
+                ch = child.to_html.strip
+                next if ch.nil? or ch.empty?
+                if @old.include?(ch)
+                    if @high
+                        acc << child
+                    end
+                else
+                    if child.respond_to?(:each_child)
+                        acc << process(child)
+                    else
+                        acc << highlight(child).to_s
+                        acc << '<br />' unless @high
+                    end
+                end
+            end
+            replace_inner(node, acc.join("\n"))
+        end
+
+
         # Collect all nodes and subnodes in a hpricot document.
         def explode(node)
             if node.respond_to?(:each_child)
@@ -37,40 +80,44 @@ module Websitary
             end
         end
 
-        # It goes like this: if a node isn't in the list of old nodes either
-        # the node or its content has changed. If the content is a single
-        # node, the whole node has changed. If only some sub-nodes have
-        # changed, collect those.
-        def process(node=@new)
-            acc = []
-            single = false
-            node.each_child do |child|
-                ch = child.to_html.strip
-                next if ch.nil? or ch.empty?
-                # p "DBG ch=#{ch}"
-                unless @old.include?(ch)
-                    if child.respond_to?(:each_child)
-                        # p "DBG each_child"
-                        ap = process(child)
-                        # if ap.empty? or Hpricot(ap.join.strip).to_html ==
-                        #     Hpricot(child.inner_html.strip).to_html
-                        if ap.empty?
-                            # p "DBG add child"
-                            acc << child
-                        else
-                            # p "DBG add inner"
-                            acc += ap
-                        end
+
+        def highlight(child)
+            @changed = true
+            if @high
+                if child.respond_to?(:each_child)
+                    acc = []
+                    child.each_child do |ch|
+                        acc << replace_inner(ch, highlight(ch).to_s)
+                    end
+                    replace_inner(child, acc.join("\n"))
+                else
+                    case @args[:highlight]
+                    when String
+                        opts = %{class="#{@args[:highlight]}"}
+                    when true, Numeric
+                        opts = %{class="highlight"}
                     else
-                        # p "DBG add single child"
-                        acc << [child, '<br />']
-                        single = true
+                        opts = %{style="background-color: #{@args[:highlightcolor]};"}
                     end
+                    ihtml = %{<span #{opts}>#{child.to_s}</span>}
+                    replace_inner(child, ihtml)
                 end
+            else
+                child
+            end
+        end
+
+
+        def replace_inner(child, ihtml)
+            case child
+            when Hpricot::Comment
+                child
+            when Hpricot::Text
+                Hpricot(ihtml)
+            else
+                child.inner_html = ihtml
+                child
             end
-            # p "DBG n=#{acc.size}"
-            acc.size == 1 && single ? [node] : acc
-            # puts acc.map {|c| c.to_html}.join("\n")
         end
 
     end
@@ -78,12 +125,14 @@ end
 
 
 if __FILE__ == $0
-    old, new, args = ARGV
+    old, new, aargs = ARGV
     if old and new
-        acc = Websitary::Htmldiff.new(:args => args, :oldfile => old, :newfile => new).process.join('\n')
+        args = {:args => aargs, :oldfile => old, :newfile => new}
+        args[:highlightcolor], _ = aargs
+        acc = Websitary::Htmldiff.new(args).diff
         puts acc
     else
-        puts "#{File.basename($0)} OLD NEW > DIFF"
+        puts "#{File.basename($0)} OLD NEW [HIGHLIGHT-COLOR] > DIFF"
     end
 end
 
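Besides the command-line form (htmldiff OLD NEW [HIGHLIGHT-COLOR] > DIFF), the class can be driven directly; a sketch mirroring what the script does, with placeholder file names:

    require 'websitary/htmldiff'
    htmldiff = Websitary::Htmldiff.new(:oldfile => 'old.html',
                                       :newfile => 'new.html',
                                       :highlightcolor => 'yellow')
    puts htmldiff.diff
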
metadata CHANGED
@@ -3,15 +3,15 @@ rubygems_version: 0.9.4
 specification_version: 1
 name: websitary
 version: !ruby/object:Gem::Version
-  version: 0.2.1
-date: 2007-09-16 00:00:00 +02:00
+  version: "0.3"
+date: 2007-10-26 00:00:00 +02:00
 summary: A unified website news, rss feed, podcast monitor
 require_paths:
 - lib
 email: micathom at gmail com
 homepage: http://rubyforge.org/projects/websitiary/
 rubyforge_project: websitiary
-description: "== DESCRIPTION: websitary (formerly known as websitiary with an extra \"i\") monitors webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff, webdiff etc.) to do most of the actual work. By default, it works on an ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or lynx, links etc.) as the output can easily be post-processed. With the help of some friends (see the section below on requirements), it can also work with HTML. E.g., if you have websec installed, you can also use its webdiff program to show colored diffs. This script was originally planned as a ruby-based websec replacement. For HTML diffs, it stills relies on the webdiff perl script that comes with websec. By default, this script will use w3m to dump HTML pages and then run diff over the current page and the previous backup. Some pages are better viewed with lynx or links. Downloaded documents (HTML or ASCII) can be post-processed (e.g., filtered through some ruby block that extracts elements via hpricot and the like). Please see the configuration options below to find out how to change this globally or for a single source. == FEATURES/PROBLEMS: * Handle webpages, rss feeds (optionally save attachments in podcasts etc.) * Compare webpages with previous backups * Display differences between the current version and the backup * Provide hooks to post-process the downloaded documents and the diff * Display a one-page report summarizing all news * Automatically open the report in your favourite web-browser * Experimental: Download webpages on defined intervalls and generate incremental diffs. ISSUES, TODO: * With HTML output, changes are presented on one single page, which means that pages with different encodings cause problems. * Improved support for robots.txt (test it) * The use of :website_below and :website is hardly tested (please report errors). * download => :body_html tries to rewrite references (a, img) which may fail on certain kind of urls (please report errors). * When using :body_html for download, it may happen that some JavaScript code is stripped, which breaks some JavaScript-generated links. * The --log command line will create a new instance of the logger and thus reset any previous options related to the logging level."
+description: "== DESCRIPTION: websitary (formerly known as websitiary with an extra \"i\") monitors webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff etc.) to do most of the actual work. By default, it works on an ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or lynx, links etc.) as the output can easily be post-processed. It can also work with HTML and highlight new items. This script was originally planned as a ruby-based websec replacement. By default, this script will use w3m to dump HTML pages and then run diff over the current page and the previous backup. Some pages are better viewed with lynx or links. Downloaded documents (HTML or ASCII) can be post-processed (e.g., filtered through some ruby block that extracts elements via hpricot and the like). Please see the configuration options below to find out how to change this globally or for a single source. This user manual is also available as PDF[http://websitiary.rubyforge.org/websitary.pdf]. == FEATURES/PROBLEMS: * Handle webpages, rss feeds (optionally save attachments in podcasts etc.) * Compare webpages with previous backups * Display differences between the current version and the backup * Provide hooks to post-process the downloaded documents and the diff * Display a one-page report summarizing all news * Automatically open the report in your favourite web-browser * Experimental: Download webpages on defined intervalls and generate incremental diffs."
 autorequire:
 default_executable:
 bindir: bin
@@ -72,5 +72,5 @@ dependencies:
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: 1.2.2
+      version: 1.3.0
   version: