websitary 0.2.1 → 0.3
- data/History.txt +16 -0
- data/README.txt +32 -27
- data/Rakefile +1 -1
- data/lib/websitary.rb +39 -18
- data/lib/websitary/configuration.rb +96 -40
- data/lib/websitary/htmldiff.rb +89 -40
- metadata +4 -4
data/History.txt
CHANGED
@@ -1,3 +1,19 @@
+= 0.3
+
+* Renamed the global option :downloadhtml to :download_html.
+* The downloader for robots and rss enclosures should now be properly
+  configurable via the global options :download_robots and
+  :download_rss_enclosure (default: :openuri).
+* Respect rel="nofollow" on hyperreferences.
+* :wdays, :mdays didn't work.
+* --exclude command line options, exclude configuration command
+* Check for robots.txt-compliance after testing if the URL is
+  appropriate.
+* htmldiff.rb can now also highlight differences à la websec's webdiff.
+* configuration.rb: Ignore pubDate and certain other non-essential fields (tags
+  etc.) when constructing rss item IDs.
+
+
 = 0.2.1
 
 * Use URI.merge for constructing robots.txt uri.
data/README.txt
CHANGED
@@ -4,21 +4,18 @@ http://rubyforge.org/projects/websitiary/
 This script monitors webpages, rss feeds, podcasts etc. and reports
 what's new. For many tasks, it reuses other programs to do the actual
 work. By default, it works on an ASCII basis, i.e. with the output of
-text-based webbrowsers. With the help of some friends, it
+text-based webbrowsers. With the help of some friends, it works also
 with HTML.
 
 
 == DESCRIPTION:
 websitary (formerly known as websitiary with an extra "i") monitors
-webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
-
-
-
-
-
-use its webdiff program to show colored diffs. This script was
-originally planned as a ruby-based websec replacement. For HTML diffs,
-it stills relies on the webdiff perl script that comes with websec.
+webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
+etc.) to do most of the actual work. By default, it works on an ASCII
+basis, i.e. with the output of text-based webbrowsers like w3m (or lynx,
+links etc.) as the output can easily be post-processed. It can also work
+with HTML and highlight new items. This script was originally planned as
+a ruby-based websec replacement.
 
 By default, this script will use w3m to dump HTML pages and then run
 diff over the current page and the previous backup. Some pages are
@@ -28,6 +25,9 @@ extracts elements via hpricot and the like). Please see the
 configuration options below to find out how to change this globally or
 for a single source.
 
+This user manual is also available as
+PDF[http://websitiary.rubyforge.org/websitary.pdf].
+
 
 == FEATURES/PROBLEMS:
 * Handle webpages, rss feeds (optionally save attachments in podcasts
@@ -58,7 +58,7 @@ NOTE: The script was previously called websitiary but was renamed (from
 0.2 on) to websitary (without the superfluous i).
 
 
-===
+=== Caveat
 The script also includes experimental support for monitoring whole
 websites. Basically, this script supports robots.txt directives (see
 requirements) but this is hardly tested and may not work in some cases.
@@ -70,8 +70,6 @@ downloader or offline reader in their user agreements.
 
 
 == SYNOPSIS:
-This manual is also available as
-PDF[http://websitiary.rubyforge.org/websitary.pdf].
 
 === Usage
 Example:
@@ -245,8 +243,13 @@ Options
 <tt>:diff => "CMD", :diff => SHORTCUT</tt>::
     Use this command to make the diff for this page. Possible values for
     SHORTCUT are: :webdiff (useful in conjunction with :download => :curl,
-    :wget, or :body_html)
-    :openuri are synonyms for
+    :wget, or :body_html), :websec_webdiff (use websec's webdiff tool),
+    :body_html, :website_below, :website and :openuri are synonyms for
+    :webdiff.
+    NOTE: Since version 0.3, :webdiff is mapped to websitary's own
+    htmldiff class (which can also be used as stand-alone script). Before
+    0.3, websitary used websec's webdiff script, which is now mapped to
+    :websec_webdiff.
 
 <tt>:diffprocess => lambda {|text| ...}</tt>::
     Use this ruby snippet to post-process this diff
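For orientation, the synonym relation described in the option text above can be sketched as a small dispatch table. This is illustrative only: websitary's real resolution happens through its shortcut/:delegate machinery, and the table here is not the project's code.

```ruby
# Illustrative sketch only: per the README text, :body_html, :website_below,
# :website and :openuri all resolve to the :webdiff diff shortcut, while
# :websec_webdiff stays its own entry (it shells out to websec's webdiff).
DIFF_SYNONYMS = {
  :body_html     => :webdiff,
  :website_below => :webdiff,
  :website       => :webdiff,
  :openuri       => :webdiff,
}

def resolve_diff(shortcut)
  # Unknown shortcuts pass through unchanged.
  DIFF_SYNONYMS.fetch(shortcut, shortcut)
end

resolve_diff(:body_html)       # => :webdiff
resolve_diff(:websec_webdiff)  # => :websec_webdiff
```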
@@ -479,13 +482,13 @@ references so that the links point to the webpage.
 source 'http://www.example.com/daily_image/', :title => 'Daily Image',
     :use => :img,
     :download => lambda {|url|
+        rv = nil
         # Read the HTML.
         html = open(url) {|io| io.read}
         # This check is probably unnecessary as the failure to read
         # the HTML document would most likely result in an
         # exception.
         if html
-            rv = nil
             # Parse the HTML document.
             doc = Hpricot(html)
             # The following could actually be simplified using xpath
@@ -541,6 +544,9 @@ latest::
     Show the latest copies of the sources from the profiles given
     on the command line.
 
+ls::
+    List number of aggregated diffs.
+
 rebuild::
     Rebuild the latest report.
 
@@ -611,16 +617,14 @@ and one of:
 * w3m[http://w3m.sourceforge.net/] (default)
 * lynx[http://lynx.isc.org/]
 * links[http://links.twibright.com/]
-* websec[http://baruch.ev-en.org/proj/websec/]
-  (or at Savannah[http://savannah.nongnu.org/projects/websec/])
 
-The use of :
-websec[http://
-
-
-
-
-
+The use of :websec_webdiff as :diff application requires
+websec[http://baruch.ev-en.org/proj/websec/] (or at
+Savannah[http://savannah.nongnu.org/projects/websec/]) to be installed.
+By default, websitary uses it's own htmldiff class/script, which is less
+well tested and may return inferior results in comparison with websec's
+webdiff. In conjunction with :body_html, :openuri, or :curl, this will
+give you colored HTML diffs.
 
 For downloading HTML, you need one of these:
 
@@ -641,7 +645,6 @@ and :website related shortcuts:
 I personally would suggest to choose the following setup:
 
 * w3m[http://w3m.sourceforge.net/]
-* websec[http://baruch.ev-en.org/proj/websec/]
 * hpricot[http://code.whytheluckystiff.net/hpricot]
 * robot_rules.rb[http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/177589]
 
@@ -674,7 +677,7 @@ These could be installed by:
     gem install hpricot
 
     # Install robot_rules.rb
-
+    wget http://www.rubyquiz.com/quiz64_sols.zip
     # Check the correct path to site_ruby first!
     unzip -p quiz64_sols.zip "solutions/James Edward Gray II/robot_rules.rb" > /lib/ruby/site_ruby/1.8/robot_rules.rb
     rm quiz64_sols.zip
@@ -708,6 +711,8 @@ of the following directories exist, which will then be used instead:
 If neither directory exists and no $HOME variable is defined, the
 current directory will be used.
 
+Now check out the configuration commands in the Synopsis section.
+
 
 == LICENSE:
 websitary Webpage Monitor
data/Rakefile
CHANGED
data/lib/websitary.rb
CHANGED
@@ -1,13 +1,8 @@
 # websitary.rb
-# @Last Change: 2007-
+# @Last Change: 2007-10-26.
 # Author:: Thomas Link (micathom AT gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-09-08.
-#
-# = TODO
-# * Built-in support for robots.txt
-# * Option to append to output files (e.g. rss)
-# * Option to trim output files (when appending items)
 
 
 require 'cgi'
@@ -37,8 +32,8 @@ end
 
 module Websitary
     APPNAME = 'websitary'
-    VERSION = '0.
-    REVISION = '
+    VERSION = '0.3'
+    REVISION = '2437'
 end
 
 require 'websitary/applog'
@@ -48,7 +43,7 @@ require 'websitary/htmldiff'
 
 
 # Basic usage:
-# Websitary.new(ARGV).process
+# Websitary::App.new(ARGV).process
 class Websitary::App
     MINUTE_SECS = 60
     HOUR_SECS = MINUTE_SECS * 60
@@ -207,7 +202,7 @@ CSS
 
 
     def cmdline_arg_add(configuration, url)
-        configuration.
+        configuration.to_do url
     end
 
 
@@ -290,6 +285,24 @@ CSS
     end
 
 
+    def execute_ls
+        rv = 0
+        @configuration.todo.each do |url|
+            opts = @configuration.urls[url]
+            name = @configuration.get(url, :title, url)
+            $logger.debug "Source: #{name}"
+            aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
+            aggrfiles = Dir["#{aggrbase}_*"]
+            aggrn = aggrfiles.size
+            if aggrn > 0
+                puts "%3d - %s" % [aggrn, name]
+                rv = 1
+            end
+        end
+        rv
+    end
+
+
     # Show data collected by #execute_aggregate
     def execute_show
         @configuration.todo.each do |url|
@@ -320,6 +333,10 @@ CSS
     # and command-line options. The differences are stored in @difftext (a Hash).
     # show_output:: If true, show the output with the defined viewer.
     def execute_downdiff(show_output=true, rebuild=false, &accumulator)
+        if @configuration.todo.empty?
+            $logger.error 'Nothing to do'
+            return 5
+        end
         @configuration.todo.each do |url|
             opts = @configuration.urls[url]
             $logger.debug "Source: #{@configuration.get(url, :title, url)}"
@@ -464,15 +481,19 @@ CSS
             # $logger.debug text #DBG#
         end
 
-        if
-        if
-
-
-
+        if text and !text.empty?
+            if older
+                if File.exist?(latest)
+                    move(latest, older)
+                elsif !File.exist?(older)
+                    $logger.warn "Initial copy: #{latest.inspect}"
+                end
+            end
+            @configuration.write_file(latest) {|io| io.puts(text)}
+            return true
+        else
+            return false
         end
-        @configuration.write_file(latest) {|io| io.puts(text)}
-        return true
     end
 
 
@@ -566,7 +587,7 @@ CSS
         if parent_eligible == parent_now
             return true
         else
-            case
+            case eligible
             when Array, Range
                 return !eligible.include?(now)
             when Integer
data/lib/websitary/configuration.rb
CHANGED
@@ -1,5 +1,5 @@
 # configuration.rb
-# @Last Change: 2007-
+# @Last Change: 2007-10-21.
 # Author:: Thomas Link (micathom AT gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-09-08.
@@ -12,7 +12,7 @@ class Websitary::Configuration
     # Hash (key = URL, value = Hash of options)
     attr_accessor :urls
     # Array of urls to be downloaded.
-
+    attr_reader :todo
     # Array of downloaded urls.
     attr_accessor :done
     # The user configuration directory
@@ -60,6 +60,7 @@ class Websitary::Configuration
         @profiles = []
         @robots = {}
         @todo = []
+        @exclude = []
         @urlencmap = {}
         @urls = {}
 
@@ -127,10 +128,9 @@ class Websitary::Configuration
             global(:timer => value)
         end
 
-
-
-
-        # end
+        opts.on('-x', '--exclude=N', Regexp, 'Exclude URLs matching this pattern') do |value|
+            exclude(value)
+        end
 
         opts.separator ''
         opts.separator "Available commands (default: #@execute):"
@@ -304,6 +304,8 @@ class Websitary::Configuration
                 $logger.debug "Profile: #{fn}"
                 contents = File.read(fn)
                 return eval_profile(contents, fn)
+            else
+                $logger.error "Unknown profile: #{profile_name}"
             end
         end
         return false
@@ -334,6 +336,13 @@ class Websitary::Configuration
     end
 
 
+    def to_do(url)
+        unless @exclude.any? {|p| url =~ p}
+            @todo << url
+        end
+    end
+
+
     # Set the output format.
     def output_format(*format)
         unless format.all? {|e| ['text', 'html', 'rss'].include?(e)}
@@ -396,7 +405,7 @@ class Websitary::Configuration
     def source(urls, opts={})
         urls.split("\n").flatten.compact.each do |url|
             @urls[url] = @default_options.dup.update(opts)
-
+            to_do url
         end
     end
 
@@ -424,6 +433,13 @@ class Websitary::Configuration
     end
 
 
+    # Configuration command:
+    # Add URL-exclusion patterns (REGEXPs).
+    def exclude(*urls)
+        @exclude += urls
+    end
+
+
     # Configuration command:
     # Set the viewer.
     def view(view)
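The exclude and to_do methods added in this file work as a pair: exclude registers patterns, and to_do consults them before queueing a URL. A stand-alone sketch of that interaction (TodoList is a hypothetical minimal stand-in, not websitary's Configuration class):

```ruby
# Hypothetical minimal stand-in for the exclude/to_do pair shown in the diff:
# URLs matching any registered pattern never reach the todo list.
class TodoList
  attr_reader :todo

  def initialize
    @exclude = []
    @todo = []
  end

  # Add URL-exclusion patterns (Regexps), like the exclude configuration command.
  def exclude(*patterns)
    @exclude += patterns
  end

  # Queue a url unless an exclusion pattern matches.
  def to_do(url)
    @todo << url unless @exclude.any? {|p| url =~ p}
  end
end

list = TodoList.new
list.exclude(/\.pdf$/)
list.to_do 'http://example.com/index.html'
list.to_do 'http://example.com/manual.pdf'
list.todo  # => ["http://example.com/index.html"]
```

The same predicate backs the new `-x`/`--exclude` command-line option, which simply calls the exclude configuration command with the given Regexp.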
@@ -786,6 +802,7 @@ HTML
         # pn0 = Pathname.new(guess_dir(File.expand_path(uri0.path)))
         pn0 = Pathname.new(guess_dir(uri0.path))
         (hpricot / 'a').each do |a|
+            next if a['rel'] == 'nofollow'
             href = a['href']
             next if href.nil? or href == url or href =~ /^\s*javascript:/
             uri = URI.parse(href)
@@ -793,18 +810,18 @@ HTML
             href = rewrite_href(href, url, uri0, pn0, true)
             curl = canonic_url(href)
             next if !href or href.nil? or @done.include?(curl) or @todo.include?(curl)
-            next unless robots_allowed?(curl, uri)
             # pn = Pathname.new(guess_dir(File.expand_path(uri.path)))
             uri = URI.parse(href)
             pn = Pathname.new(guess_dir(uri.path))
-
-
-
-
-
-
-
-
+            next unless condition.call(uri0, pn0, uri, pn)
+            next unless robots_allowed?(curl, uri)
+            opts = @urls[url].dup
+            # opts[:title] = File.basename(curl)
+            opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
+            opts[:depth] = depth - 1 if depth and depth >= 0
+            # opts[:sleep] = delay if delay
+            @urls[curl] = opts
+            to_do curl
         end
     rescue Exception => e
         # $logger.error e #DBG#
@@ -900,7 +917,7 @@ HTML
     # group:: A number (default: 0)
     # tag:: The HTML tag to use (default: "span")
     def highlighter(rx, color=nil, group=nil, tag='span')
-        lambda {|text| text.gsub(rx, %{<#{tag} class="highlight-#{color || '
+        lambda {|text| text.gsub(rx, %{<#{tag} class="highlight-#{color || 'red'}">\\#{group || 0}</#{tag}>})}
     end
 
 
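The repaired highlighter line can be read in isolation; a self-contained sketch with the same defaults as in the diff (color 'red', group 0, tag 'span'):

```ruby
# Self-contained copy of the highlighter factory from the diff above:
# wraps every match of rx in a tag carrying a highlight-COLOR class.
def highlighter(rx, color=nil, group=nil, tag='span')
  # \0 in the replacement refers to the whole match (or group N if given).
  lambda {|text| text.gsub(rx, %{<#{tag} class="highlight-#{color || 'red'}">\\#{group || 0}</#{tag}>})}
end

hl = highlighter(/Ruby/)
hl.call("I like Ruby a lot")
# => %{I like <span class="highlight-red">Ruby</span> a lot}
```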
@@ -952,14 +969,14 @@ HTML
     def initialize_options
         @options = {
             :global => {
-                :
+                :download_html => :openuri,
             },
         }
 
         @options[:diff] = {
             :default => :diff,
 
-            :diff
+            :diff => lambda {|old, new, *args|
                 opts, _ = args
                 opts ||= '-d -w'
                 difftext = call_cmd('diff %s -u2 "%s" "%s"', [opts, old, new])
@@ -978,7 +995,22 @@ HTML
 
             :raw => :new,
 
+            :htmldiff => lambda {|old, new|
+                oldhtml = File.read(old)
+                newhtml = File.read(new)
+                difftext = Websitary::Htmldiff.new(:oldtext => oldhtml, :newtext => newhtml).diff
+                difftext
+            },
+
             :webdiff => lambda {|old, new|
+                oldhtml = File.read(old)
+                newhtml = File.read(new)
+                difftext = Websitary::Htmldiff.new(:highlight => 'highlight', :oldtext => oldhtml, :newtext => newhtml).diff
+                difftext
+            },
+
+            :websec_webdiff => lambda {|old, new|
+            # :webdiff => lambda {|old, new|
                 $logger.debug "webdiff: #{File.basename(new)}"
                 $logger.debug %{webdiff --hicolor=yellow -archive "#{old}" -current "#{new}" -out -}
                 difftext = `webdiff --hicolor=yellow -archive "#{old}" -current "#{new}" -out -`
@@ -1027,25 +1059,25 @@ HTML
         # :download => 'w3m -no-cookie -S -F -dump "%s"'
 
         shortcut :lynx, :delegate => :diff,
-
+            :download => 'lynx -dump "%s"'
 
         shortcut :links, :delegate => :diff,
-
+            :download => 'links -dump "%s"'
 
         shortcut :curl, :delegate => :webdiff,
-
+            :download => 'curl --silent "%s"'
 
         shortcut :wget, :delegate => :webdiff,
-
+            :download => 'wget -q -O - "%s"'
 
         shortcut :text, :delegate => :diff,
-
+            :download => lambda {|url| html_to_text(read_url(url, 'html'))}
 
         shortcut :body_html, :delegate => :webdiff,
             :strip_tags => :default,
             :download => lambda {|url|
                 begin
-                    doc = Hpricot(
+                    doc = Hpricot(read_url(url, 'html'))
                     doc = doc.at('body')
                     if doc
                         doc = rewrite_urls(url, doc)
@@ -1068,7 +1100,7 @@ HTML
         shortcut :openuri, :delegate => :webdiff,
             :download => lambda {|url|
                 begin
-
+                    read_url_openuri(url)
                 rescue Exception => e
                     # $logger.error e #DBG#
                     $logger.error e.message
@@ -1085,17 +1117,17 @@ HTML
                 if ro
                     rh = {}
                     ro.items.each do |item|
-                        rh[
+                        rh[rss_item_id(item)] = item
                         rh[item.link] = item
                     end
                     rnew = []
                     rn = RSS::Parser.parse(File.read(new), false)
                     if rn
                         rn.items.each do |item|
-                            rid =
+                            rid = rss_item_id(item)
                             if !rh[rid]
                                 if (olditem = rh[item.link])
-                                    rss_diff = Websitary::Htmldiff.new(:oldtext => olditem.description, :newtext => item.description).process
+                                    rss_diff = Websitary::Htmldiff.new(:highlight => 'highlight', :oldtext => olditem.description, :newtext => item.description).process
                                     rnew << format_rss_item(item, rss_diff)
                                 else
                                     if item.enclosure and (curl = item.enclosure.url)
@@ -1111,7 +1143,7 @@ HTML
                                         $logger.debug "Enclosure URL: #{curl}"
                                         fname = File.join(dir, encode(File.basename(curl) || item.title || item.pubDate.to_s || Time.now.to_s))
                                         $logger.debug "Enclosure save to: #{fname}"
-                                        enc =
+                                        enc = read_url(curl, 'rss_enclosure')
                                         write_file(fname, 'wb') {|io| io.puts enc}
                                         furl = file_url(fname)
                                         enclosure = %{<p class="enclosure"><a href="%s" class="enclosure" />Enclosure (local copy)</a></p>} % furl
@@ -1146,7 +1178,7 @@ HTML
                 opts[:download] = :rss
                 opts[:title] = elt['title'] || elt['text'] || elt['htmlurl'] || curl
                 @urls[curl] = opts
-
+                to_do curl
             else
                 $logger.warn "Unsupported type in OPML: #{elt.to_s}"
             end
@@ -1162,10 +1194,10 @@ HTML
             :download => lambda {|url| get_website_below(:body_html, url)}
 
         shortcut :website_txt, :delegate => :default,
-            :download => lambda {|url| html_to_text(get_website(get(url, :
+            :download => lambda {|url| html_to_text(get_website(get(url, :download_html, :openuri), url))}
 
         shortcut :website_txt_below, :delegate => :default,
-            :download => lambda {|url| html_to_text(get_website_below(get(url, :
+            :download => lambda {|url| html_to_text(get_website_below(get(url, :download_html, :openuri), url))}
 
         shortcut :ftp, :delegate => :default,
             :download => lambda {|url| get_ftp(url).join("\n")}
@@ -1184,7 +1216,7 @@ HTML
                 opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
                 opts[:depth] = depth - 1 if depth and depth >= 0
                 @urls[curl] = opts
-
+                to_do curl
             end
         end
         list.join("\n")
@@ -1284,7 +1316,8 @@ OUT
         if doc
             return if robots?(doc, 'noindex')
             push_hrefs(url, doc) do |uri0, pn0, uri, pn|
-                uri.host
+                (uri.host || uri.is_a?(URI::Generic)) &&
+                    (uri0.host || uri0.is_a?(URI::Generic)) &&
                     eligible_path?(url, uri0.path, uri.path) &&
                     uri.host == uri0.host &&
                     (pn.to_s == '.' || pn.relative_path_from(pn0).to_s == '.')
@@ -1337,7 +1370,17 @@ OUT
     end
 
 
-    def
+    def read_url(url, type='html')
+        downloader = get(url, "download_#{type}".intern)
+        if downloader
+            call_cmd(downloader, [url])
+        else
+            read_url_openuri(url)
+        end
+    end
+
+
+    def read_url_openuri(url)
         if url.nil? or url.empty?
             $logger.fatal "Internal error: url is nil"
             puts caller.join("\n")
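The new read_url dispatch is what makes the :download_html, :download_robots and :download_rss_enclosure options from the changelog work: it looks up a per-type downloader and falls back to open-uri. A hypothetical stand-alone sketch of that fallback logic (the options hash and the fallback string are illustrative; websitary's real get() consults per-source and global options):

```ruby
# Hypothetical stand-in for the per-URL downloader lookup: a type such as
# 'html', 'robots' or 'rss_enclosure' selects a "download_TYPE" option.
def read_url(options, url, type='html')
  downloader = options["download_#{type}".intern]
  if downloader
    downloader.call(url)            # a configured downloader (lambda here)
  else
    "fallback-openuri:#{url}"       # stands in for read_url_openuri(url)
  end
end

opts = {:download_robots => lambda {|u| "curl:#{u}"}}
read_url(opts, 'http://example.com/robots.txt', 'robots')
# => "curl:http://example.com/robots.txt"
read_url(opts, 'http://example.com/', 'html')
# => "fallback-openuri:http://example.com/"
```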
@@ -1346,11 +1389,11 @@ OUT
         $logger.debug "Open URL: #{url}"
         uri = URI.parse(url)
         if uri.instance_of?(URI::Generic) or uri.scheme == 'file'
-            open(url)
+            open(url).read
         else
             header = {"User-Agent" => @user_agent}
             header.merge!(get(url, :header, {}))
-            open(url, header)
+            open(url, header).read
         end
     end
 
@@ -1369,6 +1412,14 @@ OUT
     end
 
 
+    def rss_item_id(item)
+        return Digest::MD5.hexdigest(item.to_s)
+        # i = [item.author, item.title, item.link, item.description, item.enclosure].inspect
+        # # p "DBG", i.inspect, Digest::MD5.hexdigest(i.inspect)
+        # return Digest::MD5.hexdigest(i)
+    end
+
+
     def format_rss_item(item, body, enclosure='')
         hd = [item.title]
         hd << " (#{item.author})" if item.author
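The rss_item_id added above reduces an item to an MD5 digest of its serialization, so identical serializations always yield the same ID while any field change yields a new one. A minimal demonstration of that property:

```ruby
require 'digest/md5'

# Identical serializations agree; any change in the hashed fields produces
# a different ID (the item strings here are illustrative, not real feed data).
id_a = Digest::MD5.hexdigest('<item><title>News</title></item>')
id_b = Digest::MD5.hexdigest('<item><title>News</title></item>')
id_c = Digest::MD5.hexdigest('<item><title>Other</title></item>')
id_a == id_b  # => true
id_a == id_c  # => false
```

This is also why the changelog entry about ignoring pubDate matters: volatile fields must be kept out of whatever string gets hashed, or every feed refresh would mint new IDs.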
@@ -1395,12 +1446,17 @@ EOT
 
     # Retrieve any robots meta directives from the hpricot document.
     def robots?(hpricot, *what)
-        (hpricot
+        meta(hpricot, 'robots').any? do |e|
             what.any? {|w| e['content'].split(/,\s*/).include?(w)}
         end
     end
 
 
+    def meta(hpricot, name)
+        hpricot / %{//meta[@name="#{name}"]}
+    end
+
+
     # Check whether robots are allowed to retrieve an url.
     def robots_allowed?(url, uri)
         if @allow.has_key?(url)
@@ -1414,7 +1470,7 @@ EOT
         rurl = robots_uri(uri).to_s
         return true if rurl.nil? or rurl.empty?
         begin
-            robots_txt =
+            robots_txt = read_url(rurl, 'robots')
             rules = RobotRules.new(@user_agent)
             rules.parse(rurl, robots_txt)
             @robots[host] = rules
data/lib/websitary/htmldiff.rb
CHANGED
@@ -1,29 +1,72 @@
 #!/usr/bin/env ruby
 # htmldiff.rb
-# @Last Change: 2007-
+# @Last Change: 2007-10-08.
 # Author:: Thomas Link (micathom at gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-08-17.
-#
+#
 # == Basic Use
-# htmldiff OLD NEW > DIFF
+# htmldiff OLD NEW [HIGHLIGHT-COLOR] > DIFF
 
 require 'hpricot'
 
 
-# TODO:
-# * Option: Don't extract but highlight.
 module Websitary
+    # A simple class to generate diffs for html files using hpricot.
+    # It's quite likely that it will miss certain details and yields
+    # wrong results (especially wrong-negative) in certain occasions.
     class Htmldiff
         VERSION = '0.1'
-        REVISION = '
-
+        REVISION = '164'
+
+        # args:: A hash
+        # Fields:
+        # :oldtext:: The old version
+        # :newtext:: The new version
+        # :highlight:: Don't strip old content but highlight new one with this color
+        # :args:: Command-line arguments
         def initialize(args)
             @args = args
+            @high = args[:highlight] || args[:highlightcolor]
             @old = explode(args[:olddoc] || Hpricot(args[:oldtext] || File.read(args[:oldfile])))
             @new = args[:newdoc] || Hpricot(args[:newtext] || File.read(args[:newfile]))
+            @changed = false
         end
 
+
+        # Do the diff. Return an empty string if nothing has changed.
+        def diff
+            rv = process.to_s
+            @changed ? rv : ''
+        end
+
+
+        # It goes like this: if a node isn't in the list of old nodes either
+        # the node or its content has changed. If the content is a single
+        # node, the whole node has changed. If only some sub-nodes have
+        # changed, collect those.
+        def process(node=@new)
+            acc = []
+            node.each_child do |child|
+                ch = child.to_html.strip
+                next if ch.nil? or ch.empty?
+                if @old.include?(ch)
+                    if @high
+                        acc << child
+                    end
+                else
+                    if child.respond_to?(:each_child)
+                        acc << process(child)
+                    else
+                        acc << highlight(child).to_s
+                        acc << '<br />' unless @high
+                    end
+                end
+            end
+            replace_inner(node, acc.join("\n"))
+        end
+
+
         # Collect all nodes and subnodes in a hpricot document.
         def explode(node)
             if node.respond_to?(:each_child)
@@ -37,40 +80,44 @@ module Websitary
             end
         end
 
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-            # Hpricot(child.inner_html.strip).to_html
-            if ap.empty?
-                # p "DBG add child"
-                acc << child
-            else
-                # p "DBG add inner"
-                acc += ap
-            end
+
+        def highlight(child)
+            @changed = true
+            if @high
+                if child.respond_to?(:each_child)
+                    acc = []
+                    child.each_child do |ch|
+                        acc << replace_inner(ch, highlight(ch).to_s)
+                    end
+                    replace_inner(child, acc.join("\n"))
+                else
+                    case @args[:highlight]
+                    when String
+                        opts = %{class="#{@args[:highlight]}"}
+                    when true, Numeric
+                        opts = %{class="highlight"}
                     else
-
-                acc << [child, '<br />']
-                single = true
+                        opts = %{style="background-color: #{@args[:highlightcolor]};"}
                     end
+                    ihtml = %{<span #{opts}>#{child.to_s}</span>}
+                    replace_inner(child, ihtml)
                 end
+            else
+                child
+            end
+        end
+
+
+        def replace_inner(child, ihtml)
+            case child
+            when Hpricot::Comment
+                child
+            when Hpricot::Text
+                Hpricot(ihtml)
+            else
+                child.inner_html = ihtml
+                child
             end
-            # p "DBG n=#{acc.size}"
-            acc.size == 1 && single ? [node] : acc
-            # puts acc.map {|c| c.to_html}.join("\n")
         end
 
     end
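The process/explode pair reconstructed above can be illustrated without hpricot: the old document is exploded into a list of node serializations, and any new chunk missing from that list counts as changed. This is a pure-string simplification of the real recursive algorithm, which also descends into changed nodes and can highlight instead of extract:

```ruby
# Pure-string simplification of Htmldiff's core idea (illustrative only):
# chunks present in the old "exploded" list are unchanged; the rest is new.
old_chunks = ['<p>unchanged</p>', '<p>stale</p>']
new_chunks = ['<p>unchanged</p>', '<p>fresh</p>']

changed = new_chunks.reject {|ch| old_chunks.include?(ch)}
changed  # => ["<p>fresh</p>"]
```

With :highlight set, the real class keeps the unchanged chunks too and wraps only the changed ones in a highlighting span, which is what the new :webdiff diff shortcut relies on.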
@@ -78,12 +125,14 @@ end
 
 
 if __FILE__ == $0
-    old, new,
+    old, new, aargs = ARGV
     if old and new
-
+        args = {:args => aargs, :oldfile => old, :newfile => new}
+        args[:highlightcolor], _ = aargs
+        acc = Websitary::Htmldiff.new(args).diff
         puts acc
     else
-        puts "#{File.basename($0)} OLD NEW > DIFF"
+        puts "#{File.basename($0)} OLD NEW [HIGHLIGHT-COLOR] > DIFF"
     end
 end
 
metadata
CHANGED
@@ -3,15 +3,15 @@ rubygems_version: 0.9.4
 specification_version: 1
 name: websitary
 version: !ruby/object:Gem::Version
-  version: 0.
-date: 2007-
+  version: "0.3"
+date: 2007-10-26 00:00:00 +02:00
 summary: A unified website news, rss feed, podcast monitor
 require_paths:
 - lib
 email: micathom at gmail com
 homepage: http://rubyforge.org/projects/websitiary/
 rubyforge_project: websitiary
-description: "== DESCRIPTION: websitary (formerly known as websitiary with an extra \"i\") monitors webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff
+description: "== DESCRIPTION: websitary (formerly known as websitiary with an extra \"i\") monitors webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff etc.) to do most of the actual work. By default, it works on an ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or lynx, links etc.) as the output can easily be post-processed. It can also work with HTML and highlight new items. This script was originally planned as a ruby-based websec replacement. By default, this script will use w3m to dump HTML pages and then run diff over the current page and the previous backup. Some pages are better viewed with lynx or links. Downloaded documents (HTML or ASCII) can be post-processed (e.g., filtered through some ruby block that extracts elements via hpricot and the like). Please see the configuration options below to find out how to change this globally or for a single source. This user manual is also available as PDF[http://websitiary.rubyforge.org/websitary.pdf]. == FEATURES/PROBLEMS: * Handle webpages, rss feeds (optionally save attachments in podcasts etc.) * Compare webpages with previous backups * Display differences between the current version and the backup * Provide hooks to post-process the downloaded documents and the diff * Display a one-page report summarizing all news * Automatically open the report in your favourite web-browser * Experimental: Download webpages on defined intervalls and generate incremental diffs."
 autorequire:
 default_executable:
 bindir: bin
@@ -72,5 +72,5 @@ dependencies:
 requirements:
 - - ">="
 - !ruby/object:Gem::Version
-    version: 1.
+    version: 1.3.0
 version: