websitary 0.4 → 0.5
- data/History.txt +15 -0
- data/README.txt +33 -2
- data/Rakefile +2 -2
- data/bin/websitary +2 -2
- data/lib/websitary.rb +47 -39
- data/lib/websitary/applog.rb +0 -0
- data/lib/websitary/configuration.rb +230 -96
- data/lib/websitary/filemtimes.rb +0 -0
- data/lib/websitary/htmldiff.rb +0 -0
- metadata +4 -4
data/History.txt
CHANGED
@@ -1,3 +1,18 @@
+= 0.5
+
+* mailto: and javascript: hrefs are now handled via the exclude option
+* rewrite absolute URLs sans host correctly
+* strip href and image src tags in order to prevent parser errors
+* some scaffolding for mechanize
+* global proxy option (currently only used for mechanize)
+* use -nolist for lynx
+* catch errors in Websitary::App#execute_downdiff
+* :rss_find_enclosure => LAMBDA: Extract the enclosure URL from the item
+  description
+* :rss_format_local_copy => STRING|BLOCK/2: Format the display of the
+  local copy.
+
+
 = 0.4
 
 * Sources may have a :timeout option.
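Two of the 0.5 entries above can be illustrated outside websitary. The following sketch uses plain Ruby stdlib, not websitary's actual rewrite_href implementation: host-relative hrefs such as /images/logo.png are resolved against the page's URI, and mailto:/javascript: hrefs are matched by the new default exclude pattern (taken verbatim from the configuration.rb diff below; the example URLs are made up).

```ruby
require 'uri'

# Resolve an absolute URL sans host against the page it appeared on.
base = URI.parse('http://www.example.com/news/index.html')
abs  = base.merge('/images/logo.png').to_s
# => "http://www.example.com/images/logo.png"

# The 0.5 default exclude pattern for non-HTTP hrefs.
exclude = /^\s*(javascript|mailto):/
['mailto:user@example.com', 'javascript:void(0)', '/a.html'].each do |href|
  puts "#{href.inspect} excluded=#{!!(href =~ exclude)}"
end
```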
data/README.txt
CHANGED
@@ -212,6 +212,12 @@ Known global options:
 <tt>:toggle_body => BOOLEAN</tt>::
     If true, make a news body collabsable on mouse-clicks (sort of).
 
+<tt>:proxy => STRING</tt>, <tt>:proxy => ARRAY</tt>::
+    The proxy. (currently only supported by mechanize)
+
+<tt>:user_agent => STRING</tt>::
+    Set the user agent (only for certain queries).
+
 
 ==== output_format FORMAT, output_format [FORMAT1, FORMAT2, ...]
 Set the output format.
@@ -318,11 +324,29 @@ Options
 <tt>:rss_enclosure => true|"DIRECTORY"</tt>::
     If true, save rss feed enclosures in
     "~/.websitary/attachments/RSS_FEED_NAME/". If a string, use this as
-    destination directory.
+    destination directory. Only enclosures of new items will be saved --
+    i.e. when downloading a feed for the first time, no enclosures will be
+    saved.
+
+<tt>:rss_find_enclosure => BLOCK</tt>::
+    Certain RSS-feeds embed enclosures in the description. Use this option
+    to scan the description (a Hpricot document) for an URL that is then saved
+    as enclosure if the :rss_enclosure option is set.
+    Example:
+        source 'http://www.example.com/rss',
+            :title => 'Example',
+            :use => :rss, :rss_enclosure => true,
+            :rss_find_enclosure => lambda {|item, doc| (doc / 'img').map {|e| e['src']}[0]}
 
 <tt>:rss_format (default: "plain_text")</tt>::
     When output format is :rss, create rss item descriptios as plain text.
 
+<tt>:rss_format_local_copy => FORMAT_STRING | BLOCK</tt>::
+    By default a hypertext reference to the local copy of an RSS
+    enclosure is added to entry. Sometimes you may want to display
+    something inline (e.g. an image). You can then use this option to
+    define a format string (one field = the local copy's file url).
+
 <tt>:show_initial => true</tt>::
     Include initial copies in the report (may not always work properly).
     This can also be set as a global option.
@@ -388,6 +412,13 @@ Example:
     Use open-uri for downloading the source. Use webdiff for generating
     diffs. This doesn't handle cookies and the like.
 
+<tt>:mechanize</tt>::
+    Use mechanize (must be installed) for downloading the source. Use
+    webdiff for generating diffs. This calls the URL's :mechanize property
+    (a lambda that takes 3 arguments: URL, agent, page => HTML as string)
+    to post-process the page (or if not available, use the page body's
+    HTML).
+
 <tt>:text</tt>::
     This requires hpricot to be installed. Use open-uri for downloading
     and hpricot for converting HTML to plain text. This still requires
@@ -730,7 +761,7 @@ Now check out the configuration commands in the Synopsis section.
 
 == LICENSE:
 websitary Webpage Monitor
-Copyright (C) 2007 Thomas Link
+Copyright (C) 2007-2008 Thomas Link
 
 This program is free software; you can redistribute it and/or modify
 it under the terms of the GNU General Public License as published by
data/Rakefile
CHANGED
@@ -21,11 +21,11 @@ require 'rtagstask'
 RTagsTask.new
 
 task :ctags do
-    `ctags --extra=+q --fields=+i+S -R bin lib`
+    puts `ctags --extra=+q --fields=+i+S -R bin lib`
 end
 
 task :files do
-    `find bin lib -name "*.rb" > files.lst`
+    puts `find bin lib -name "*.rb" > files.lst`
 end
 
 # vim: syntax=Ruby
data/bin/websitary
CHANGED
@@ -1,6 +1,6 @@
 #! /usr/bin/env ruby
 # websitary.rb -- The website news, rss feed, podcast catching monitor
-# @Last Change:
+# @Last Change: 2008-02-12.
 # Author:: Thomas Link (micathom at gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-06-09.
@@ -11,7 +11,7 @@ require 'websitary'
 
 if __FILE__ == $0
     w = Websitary::App.new(ARGV)
-    t = w.configuration.
+    t = w.configuration.optval_get(:global, :timer)
     if t
         exit_code = 0
         while exit_code <= 1
data/lib/websitary.rb
CHANGED
@@ -1,5 +1,5 @@
 # websitary.rb
-# @Last Change: 2008-
+# @Last Change: 2008-03-11.
 # Author:: Thomas Link (micathom AT gmail com)
 # License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
 # Created:: 2007-09-08.
@@ -7,7 +7,8 @@
 
 require 'cgi'
 require 'digest/md5'
-require 'ftools'
+# require 'ftools'
+require 'fileutils'
 require 'net/ftp'
 require 'optparse'
 require 'pathname'
@@ -33,8 +34,8 @@ end
 
 module Websitary
     APPNAME = 'websitary'
-    VERSION = '0.
-    REVISION = '
+    VERSION = '0.5'
+    REVISION = '2476'
 end
 
 require 'websitary/applog'
@@ -72,7 +73,7 @@ class Websitary::App
         unless File.exists?(css)
             $logger.info "Copying default css file: #{css}"
             @configuration.write_file(css, 'w') do |io|
-                io.puts @configuration.
+                io.puts @configuration.opt_get(:page, :css)
             end
         end
     end
@@ -99,7 +100,7 @@ class Websitary::App
     def execute_configuration
         keys = @configuration.options.keys
         urls = @configuration.todo
-        # urls = @configuration.todo..sort {|a,b| @configuration.
+        # urls = @configuration.todo..sort {|a,b| @configuration.url_get(a, :title, a) <=> @configuration.url_get(b, :title, b)}
         urls.each_with_index do |url, i|
             data = @configuration.urls[url]
             text = [
@@ -107,7 +108,7 @@ class Websitary::App
                 "<b>current</b><br/>#{CGI.escapeHTML(@configuration.latestname(url, true))}<br/>",
                 "<b>backup</b><br/>#{CGI.escapeHTML(@configuration.oldname(url, true))}<br/>",
                 *((data.keys | keys).map do |k|
-                    v = @configuration.
+                    v = @configuration.url_get(url, k).inspect
                     "<b>:#{k}</b><br/>#{CGI.escapeHTML(v)}<br/>"
                 end)
             ]
@@ -205,7 +206,7 @@ class Websitary::App
         rv = 0
         @configuration.todo.each do |url|
             opts = @configuration.urls[url]
-            name = @configuration.
+            name = @configuration.url_get(url, :title, url)
             $logger.debug "Source: #{name}"
             aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
             aggrfiles = Dir["#{aggrbase}_*"]
@@ -223,7 +224,7 @@ class Websitary::App
     def execute_show
         @configuration.todo.each do |url|
             opts = @configuration.urls[url]
-            $logger.debug "Source: #{@configuration.
+            $logger.debug "Source: #{@configuration.url_get(url, :title, url)}"
             aggrbase = @configuration.encoded_filename('aggregate', url, true, 'md5')
             difftext = []
             aggrfiles = Dir["#{aggrbase}_*"]
@@ -233,7 +234,7 @@ class Websitary::App
             difftext.compact!
             difftext.delete('')
             unless difftext.empty?
-                joindiffs = @configuration.
+                joindiffs = @configuration.url_get(url, :joindiffs, lambda {|t| t.join("\n")})
                 difftext = @configuration.call_cmd(joindiffs, [difftext], :url => url) if joindiffs
                 accumulate(url, difftext, opts)
             end
@@ -255,13 +256,13 @@ class Websitary::App
         end
         @configuration.todo.each do |url|
             opts = @configuration.urls[url]
-            $logger.debug "Source: #{@configuration.
+            $logger.debug "Source: #{@configuration.url_get(url, :title, url)}"
 
             diffed = @configuration.diffname(url, true)
             $logger.debug "diffname: #{diffed}"
 
             if File.exists?(diffed)
-                $logger.warn "Reuse old diff: #{@configuration.
+                $logger.warn "Reuse old diff: #{@configuration.url_get(url, :title, url)} => #{diffed}"
                 difftext = File.read(diffed)
                 accumulate(url, difftext, opts)
             else
@@ -272,17 +273,22 @@ class Websitary::App
                 older = @configuration.oldname(url, true)
                 $logger.debug "older: #{older}"
 
-
-
-
-
-
-                accumulator
-
-
+                begin
+                    if rebuild or download(url, opts, latest, older)
+                        difftext = diff(url, opts, latest, older)
+                        if difftext
+                            @configuration.write_file(diffed, 'wb') {|io| io.puts difftext}
+                            # $logger.debug "difftext: #{difftext}" #DBG#
+                            if accumulator
+                                accumulator.call(url, difftext, opts)
+                            else
+                                accumulate(url, difftext, opts)
+                            end
                         end
                     end
+                rescue Exception => e
+                    $logger.error e.to_s
+                    $logger.info e.backtrace.join("\n")
                 end
             end
         end
@@ -291,20 +297,22 @@ class Websitary::App
 
 
     def move(from, to)
-        copy_move(:rename, from, to)
+        # copy_move(:rename, from, to) # ftools
+        copy_move(:mv, from, to) # FileUtils
     end
 
 
     def copy(from, to)
-        copy_move(:copy, from, to)
+        # copy_move(:copy, from, to)
+        copy_move(:cp, from, to)
     end
 
 
     def copy_move(method, from, to)
         if File.exists?(from)
-            $logger.debug "
+            $logger.debug "Overwrite: #{from} -> #{to}" if File.exists?(to)
             lst = File.lstat(from)
-
+            FileUtils.send(method, from, to)
             File.utime(lst.atime, lst.mtime, to)
             @configuration.mtimes.set(from, lst.mtime)
             @configuration.mtimes.set(to, lst.mtime)
@@ -347,16 +355,16 @@ class Websitary::App
 
     def download(url, opts, latest, older=nil)
         if @configuration.done.include?(url)
-            $logger.info "Already downloaded: #{@configuration.
+            $logger.info "Already downloaded: #{@configuration.url_get(url, :title, url).inspect}"
             return false
         end
 
-        $logger.warn "Download: #{@configuration.
+        $logger.warn "Download: #{@configuration.url_get(url, :title, url).inspect}"
         @configuration.done << url
-        text = @configuration.call_cmd(@configuration.
+        text = @configuration.call_cmd(@configuration.url_get(url, :download), [url], :url => url)
        # $logger.debug text #DBG#
         unless text
-            $logger.warn "no contents: #{@configuration.
+            $logger.warn "no contents: #{@configuration.url_get(url, :title, url)}"
             return false
         end
 
@@ -390,7 +398,7 @@ class Websitary::App
             text = text.join("\n")
         end
 
-        pprc = @configuration.
+        pprc = @configuration.url_get(url, :downloadprocess)
         if pprc
             $logger.debug "download process: #{pprc}"
             text = @configuration.call_cmd(pprc, [text], :url => url)
@@ -416,25 +424,25 @@ class Websitary::App
     def diff(url, opts, new, old)
         if File.exists?(old)
             $logger.debug "diff: #{old} <-> #{new}"
-            difftext = @configuration.call_cmd(@configuration.
+            difftext = @configuration.call_cmd(@configuration.url_get(url, :diff), [old, new], :url => url)
             # $logger.debug "diff: #{difftext}" #DBG#
 
             if difftext =~ /\S/
-                if (pprc = @configuration.
+                if (pprc = @configuration.url_get(url, :diffprocess))
                     $logger.debug "diff process: #{pprc}"
                     difftext = @configuration.call_cmd(pprc, [difftext], :url => url)
                 end
                 # $logger.debug "difftext: #{difftext}" #DBG#
                 if difftext =~ /\S/
-                    $logger.warn "Changed: #{@configuration.
+                    $logger.warn "Changed: #{@configuration.url_get(url, :title, url).inspect}"
                     return difftext
                 end
             end
 
-            $logger.debug "Unchanged: #{@configuration.
+            $logger.debug "Unchanged: #{@configuration.url_get(url, :title, url).inspect}"
 
         elsif File.exist?(new) and
-            (@configuration.
+            (@configuration.url_get(url, :show_initial) or @configuration.optval_get(:global, :show_initial))
 
             return File.read(new)
 
@@ -451,16 +459,16 @@ class Websitary::App
         tdiff = tdiff_with(opts, tn, tl)
         case tdiff
         when nil, false
-            $logger.debug "Age requirement fulfilled: #{@configuration.
+            $logger.debug "Age requirement fulfilled: #{@configuration.url_get(url, :title, url).inspect}: #{format_tdiff(td)} old"
             return false
         when :skip, true
-            $logger.info "Skip #{@configuration.
+            $logger.info "Skip #{@configuration.url_get(url, :title, url).inspect}: Only #{format_tdiff(td)} old"
             return true
         when Numeric
             if td < tdiff
                 tdd = tdiff - td
                 @tdiff_min = tdd if @tdiff_min.nil? or tdd < @tdiff_min
-                $logger.info "Skip #{@configuration.
+                $logger.info "Skip #{@configuration.url_get(url, :title, url).inspect}: Only #{format_tdiff(td)} old (#{format_tdiff(tdiff)})"
                 return true
             end
         else
@@ -509,7 +517,7 @@ class Websitary::App
         when Integer
             return eligible != now
         else
-            $logger.error "#{@configuration.
+            $logger.error "#{@configuration.url_get(url, :title, url)}: Wrong type for :days_of_week=#{dweek.inspect}"
             return :skip
         end
     end
data/lib/websitary/applog.rb
CHANGED
File without changes
|
@@ -1,5 +1,5 @@
|
|
1
1
|
# configuration.rb
|
2
|
-
# @Last Change: 2008-
|
2
|
+
# @Last Change: 2008-05-23.
|
3
3
|
# Author:: Thomas Link (micathom AT gmail com)
|
4
4
|
# License:: GPL (see http://www.gnu.org/licenses/gpl.txt)
|
5
5
|
# Created:: 2007-09-08.
|
@@ -47,7 +47,6 @@ class Websitary::Configuration
|
|
47
47
|
@cmd_edit = 'vi "%s"'
|
48
48
|
@execute = 'downdiff'
|
49
49
|
@quicklist_profile = 'quicklist'
|
50
|
-
@user_agent = "websitary/#{Websitary::VERSION}"
|
51
50
|
@view = 'w3m "%s"'
|
52
51
|
|
53
52
|
@allow = {}
|
@@ -60,7 +59,7 @@ class Websitary::Configuration
|
|
60
59
|
@profiles = []
|
61
60
|
@robots = {}
|
62
61
|
@todo = []
|
63
|
-
@exclude = []
|
62
|
+
@exclude = [/^\s*(javascript|mailto):/]
|
64
63
|
@urlencmap = {}
|
65
64
|
@urls = {}
|
66
65
|
|
@@ -190,10 +189,16 @@ class Websitary::Configuration
|
|
190
189
|
end
|
191
190
|
|
192
191
|
|
192
|
+
def url_set(url, items)
|
193
|
+
opts = @urls[url] ||= {}
|
194
|
+
opts.merge!(items)
|
195
|
+
end
|
196
|
+
|
197
|
+
|
193
198
|
# Retrieve an option for an url
|
194
199
|
# url:: String
|
195
200
|
# opt:: Symbol
|
196
|
-
def
|
201
|
+
def url_get(url, opt, default=nil)
|
197
202
|
opts = @urls[url]
|
198
203
|
unless opts
|
199
204
|
$logger.debug "Non-registered URL: #{url}"
|
@@ -221,7 +226,7 @@ class Websitary::Configuration
|
|
221
226
|
when nil
|
222
227
|
when Symbol
|
223
228
|
$logger.debug "get: val=#{val}"
|
224
|
-
success, rv =
|
229
|
+
success, rv = opt_get(opt, val)
|
225
230
|
$logger.debug "get: #{success}, #{rv}"
|
226
231
|
if success
|
227
232
|
return rv
|
@@ -231,7 +236,7 @@ class Websitary::Configuration
|
|
231
236
|
return val
|
232
237
|
end
|
233
238
|
unless default
|
234
|
-
success, default1 =
|
239
|
+
success, default1 = opt_get(opt, :default)
|
235
240
|
default = default1 if success
|
236
241
|
end
|
237
242
|
|
@@ -240,10 +245,10 @@ class Websitary::Configuration
|
|
240
245
|
end
|
241
246
|
|
242
247
|
|
243
|
-
def
|
248
|
+
def optval_get(opt, val, default=nil)
|
244
249
|
case val
|
245
250
|
when Symbol
|
246
|
-
ok, val =
|
251
|
+
ok, val = opt_get(opt, val)
|
247
252
|
if ok
|
248
253
|
val
|
249
254
|
else
|
@@ -255,22 +260,22 @@ class Websitary::Configuration
|
|
255
260
|
end
|
256
261
|
|
257
262
|
|
258
|
-
def
|
263
|
+
def opt_get(opt, val)
|
259
264
|
vals = @options[opt]
|
260
265
|
$logger.debug "val=#{val} vals=#{vals.inspect}"
|
261
266
|
if vals and vals.has_key?(val)
|
262
267
|
rv = vals[val]
|
263
|
-
$logger.debug "
|
268
|
+
$logger.debug "opt_get ok: #{opt} => #{rv.inspect}"
|
264
269
|
case rv
|
265
270
|
when Symbol
|
266
|
-
$logger.debug "
|
267
|
-
return
|
271
|
+
$logger.debug "opt_get re: #{rv}"
|
272
|
+
return opt_get(opt, rv)
|
268
273
|
else
|
269
|
-
$logger.debug "
|
274
|
+
$logger.debug "opt_get true, #{rv}"
|
270
275
|
return [true, rv]
|
271
276
|
end
|
272
277
|
else
|
273
|
-
$logger.debug "
|
278
|
+
$logger.debug "opt_get no: #{opt} => #{val.inspect}"
|
274
279
|
return [false, val]
|
275
280
|
end
|
276
281
|
end
|
@@ -409,7 +414,7 @@ class Websitary::Configuration
|
|
409
414
|
# urls:: String
|
410
415
|
def source(urls, opts={})
|
411
416
|
urls.split("\n").flatten.compact.each do |url|
|
412
|
-
|
417
|
+
url_set(url, @default_options.dup.update(opts))
|
413
418
|
to_do url
|
414
419
|
end
|
415
420
|
end
|
@@ -477,9 +482,9 @@ class Websitary::Configuration
|
|
477
482
|
|
478
483
|
|
479
484
|
def format_text(url, text)
|
480
|
-
enc =
|
485
|
+
enc = url_get(url, :iconv)
|
481
486
|
if enc
|
482
|
-
denc =
|
487
|
+
denc = optval_get(:global, :encoding)
|
483
488
|
begin
|
484
489
|
require 'iconv'
|
485
490
|
text = Iconv.conv(denc, enc, text)
|
@@ -493,7 +498,7 @@ class Websitary::Configuration
|
|
493
498
|
|
494
499
|
# Format a diff according to URL's source options.
|
495
500
|
def format(url, difftext)
|
496
|
-
fmt =
|
501
|
+
fmt = url_get(url, :format)
|
497
502
|
text = format_text(url, difftext)
|
498
503
|
eval_arg(fmt, [text], text)
|
499
504
|
end
|
@@ -527,7 +532,7 @@ class Websitary::Configuration
|
|
527
532
|
def call_cmd(cmd, cmdargs, args={})
|
528
533
|
default = args[:default]
|
529
534
|
url = args[:url]
|
530
|
-
timeout = url ?
|
535
|
+
timeout = url ? url_get(url, :timeout) : nil
|
531
536
|
if timeout
|
532
537
|
begin
|
533
538
|
Timeout::timeout(timeout) do |timeout_length|
|
@@ -583,7 +588,7 @@ class Websitary::Configuration
|
|
583
588
|
if difftext
|
584
589
|
difftext = html_to_text(difftext) if is_html?(difftext)
|
585
590
|
!difftext.empty? && [
|
586
|
-
eval_arg(
|
591
|
+
eval_arg(url_get(url, :rewrite_link, '%s'), [url]),
|
587
592
|
difftext_annotation(url),
|
588
593
|
nil,
|
589
594
|
difftext
|
@@ -594,32 +599,32 @@ class Websitary::Configuration
|
|
594
599
|
|
595
600
|
|
596
601
|
def get_output_rss(difftext)
|
597
|
-
success, rss_url =
|
602
|
+
success, rss_url = opt_get(:rss, :url)
|
598
603
|
if success
|
599
|
-
success, rss_version =
|
604
|
+
success, rss_version = opt_get(:rss, :version)
|
600
605
|
# require "rss/#{rss_version}"
|
601
606
|
|
602
607
|
rss = RSS::Rss.new(rss_version)
|
603
608
|
chan = RSS::Rss::Channel.new
|
604
609
|
chan.title = @output_title
|
605
610
|
[:description, :copyright, :category, :language, :image, :webMaster, :pubDate].each do |field|
|
606
|
-
ok, val =
|
611
|
+
ok, val = opt_get(:rss, field)
|
607
612
|
item.send(format_symbol(field, '%s='), val) if ok
|
608
613
|
end
|
609
614
|
chan.link = rss_url
|
610
615
|
rss.channel = chan
|
611
616
|
|
612
617
|
cnt = difftext.map do |url, text|
|
613
|
-
rss_format =
|
618
|
+
rss_format = url_get(url, :rss_format, 'plain_text')
|
614
619
|
text = strip_tags(text, :format => rss_format)
|
615
620
|
next if text.empty?
|
616
621
|
|
617
622
|
item = RSS::Rss::Channel::Item.new
|
618
623
|
item.date = Time.now
|
619
|
-
item.title =
|
620
|
-
item.link = eval_arg(
|
624
|
+
item.title = url_get(url, :title, File.basename(url))
|
625
|
+
item.link = eval_arg(url_get(url, :rewrite_link, '%s'), [url])
|
621
626
|
[:author, :date, :enclosure, :category, :pubDate].each do |field|
|
622
|
-
val =
|
627
|
+
val = url_get(url, format_symbol(field, 'rss_%s'))
|
623
628
|
item.send(format_symbol(field, '%s='), val) if val
|
624
629
|
end
|
625
630
|
|
@@ -647,7 +652,7 @@ class Websitary::Configuration
|
|
647
652
|
|
648
653
|
def get_output_html(difftext)
|
649
654
|
difftext = difftext.map do |url, text|
|
650
|
-
tags =
|
655
|
+
tags = url_get(url, :strip_tags)
|
651
656
|
text = strip_tags(text, :tags => tags) if tags
|
652
657
|
text.empty? ? nil : [url, text]
|
653
658
|
end
|
@@ -655,7 +660,7 @@ class Websitary::Configuration
|
|
655
660
|
sort_difftext!(difftext)
|
656
661
|
|
657
662
|
toc = difftext.map do |url, text|
|
658
|
-
ti =
|
663
|
+
ti = url_get(url, :title, File.basename(url))
|
659
664
|
tid = html_toc_id(url)
|
660
665
|
bid = html_body_id(url)
|
661
666
|
%{<li id="#{tid}" class="toc"><a class="toc" href="\##{bid}">#{ti}</a></li>}
|
@@ -664,9 +669,9 @@ class Websitary::Configuration
|
|
664
669
|
idx = 0
|
665
670
|
cnt = difftext.map do |url, text|
|
666
671
|
idx += 1
|
667
|
-
ti =
|
672
|
+
ti = url_get(url, :title, File.basename(url))
|
668
673
|
bid = html_body_id(url)
|
669
|
-
if (rewrite =
|
674
|
+
if (rewrite = url_get(url, :rewrite_link))
|
670
675
|
urlr = eval_arg(rewrite, [url])
|
671
676
|
ext = ''
|
672
677
|
else
|
@@ -676,7 +681,7 @@ class Websitary::Configuration
|
|
676
681
|
urlr = url
|
677
682
|
end
|
678
683
|
note = difftext_annotation(url)
|
679
|
-
onclick =
|
684
|
+
onclick = optval_get(:global, :toggle_body) ? 'onclick="ToggleBody(this)"' : ''
|
680
685
|
<<HTML
|
681
686
|
<div id="#{bid}" class="webpage" #{onclick}>
|
682
687
|
<div class="count">
|
@@ -697,9 +702,9 @@ class Websitary::Configuration
|
|
697
702
|
HTML
|
698
703
|
end.join(('<hr class="separator"/>') + "\n")
|
699
704
|
|
700
|
-
success, template =
|
705
|
+
success, template = opt_get(:page, :format)
|
701
706
|
unless success
|
702
|
-
success, template =
|
707
|
+
success, template = opt_get(:page, :simple)
|
703
708
|
end
|
704
709
|
return eval_arg(template, [@output_title, toc, cnt])
|
705
710
|
end
|
@@ -735,12 +740,12 @@ HTML
|
|
735
740
|
|
736
741
|
|
737
742
|
def encoded_filename(dir, url, ensure_dir=false, type=nil)
|
738
|
-
type ||=
|
743
|
+
type ||= url_get(url, :cachetype, 'tree')
|
739
744
|
$logger.debug "encoded_filename: type=#{type} url=#{url}"
|
740
745
|
rv = File.join(@cfgdir, dir, encoded_basename(url, type))
|
741
746
|
rd = File.dirname(rv)
|
742
747
|
$logger.debug "encoded_filename: rv0=#{rv}"
|
743
|
-
fm =
|
748
|
+
fm = optval_get(:global, :filename_size, 255)
|
744
749
|
rdok = !ensure_dir || @app.ensure_dir(rd, false)
|
745
750
|
if !rdok or rv.size > fm or File.directory?(rv)
|
746
751
|
# $logger.debug "Filename too long (:global=>:filename_size = #{fm}), try md5 encoded filename instead: #{url}"
|
@@ -796,6 +801,24 @@ HTML
|
|
796
801
|
end
|
797
802
|
|
798
803
|
|
804
|
+
def save_dir(url, dir, title=nil)
|
805
|
+
case dir
|
806
|
+
when true
|
807
|
+
title ||= url_get(url, :title)
|
808
|
+
dir = File.join(@cfgdir, 'attachments', encode(title))
|
809
|
+
when Proc
|
810
|
+
dir = dir.call(url)
|
811
|
+
end
|
812
|
+
@app.ensure_dir(dir) if dir
|
813
|
+
return dir
|
814
|
+
end
|
815
|
+
|
816
|
+
|
817
|
+
def clean_url(url)
|
818
|
+
url && url.strip
|
819
|
+
end
|
820
|
+
|
821
|
+
|
799
822
|
# Strip the url's last part (after #).
|
800
823
|
def canonic_url(url)
|
801
824
|
url.sub(/#.*$/, '')
|
@@ -803,7 +826,7 @@ HTML
|
|
803
826
|
|
804
827
|
|
805
828
|
def strip_tags_default
|
806
|
-
success, tags =
|
829
|
+
success, tags = opt_get(:strip_tags, :default)
|
807
830
|
tags.dup if success
|
808
831
|
end
|
809
832
|
|
@@ -830,7 +853,7 @@ HTML
|
|
830
853
|
# This checks either for a :match option for url or the extensions
|
831
854
|
# of path0 and path.
|
832
855
|
def eligible_path?(url, path0, path)
|
833
|
-
rx =
|
856
|
+
rx = url_get(url, :match)
|
834
857
|
if rx
|
835
858
|
return path =~ rx
|
836
859
|
else
|
@@ -845,15 +868,15 @@ HTML
|
|
845
868
|
begin
|
846
869
|
$logger.debug "push_refs: #{url}"
|
847
870
|
return if robots?(hpricot, 'nofollow') or is_excluded?(url)
|
848
|
-
depth =
|
871
|
+
depth = url_get(url, :depth)
|
849
872
|
return if depth and depth <= 0
|
850
873
|
uri0 = URI.parse(url)
|
851
874
|
# pn0 = Pathname.new(guess_dir(File.expand_path(uri0.path)))
|
852
875
|
pn0 = Pathname.new(guess_dir(uri0.path))
|
853
876
|
(hpricot / 'a').each do |a|
|
854
877
|
next if a['rel'] == 'nofollow'
|
855
|
-
href = a['href']
|
856
|
-
next if href.nil? or href == url or
|
878
|
+
href = clean_url(a['href'])
|
879
|
+
next if href.nil? or href == url or is_excluded?(href)
|
857
880
|
uri = URI.parse(href)
|
858
881
|
pn = guess_dir(uri.path)
|
859
882
|
href = rewrite_href(href, url, uri0, pn0, true)
|
@@ -869,7 +892,7 @@ HTML
|
|
869
892
|
opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
|
870
893
|
opts[:depth] = depth - 1 if depth and depth >= 0
|
871
894
|
# opts[:sleep] = delay if delay
|
872
|
-
|
895
|
+
url_set(curl, opts)
|
873
896
|
to_do curl
|
874
897
|
end
|
875
898
|
rescue Exception => e
|
@@ -887,7 +910,7 @@ HTML
|
|
887
910
|
uri = URI.parse(url)
|
888
911
|
urd = guess_dir(uri.path)
|
889
912
|
(doc / 'a').each do |a|
|
890
|
-
href = a['href']
|
913
|
+
href = clean_url(a['href'])
|
891
914
|
if is_excluded?(href)
|
892
915
|
comment_element(doc, a)
|
893
916
|
else
|
@@ -896,7 +919,7 @@ HTML
|
|
896
919
|
end
|
897
920
|
end
|
898
921
|
(doc / 'img').each do |a|
|
899
|
-
href = a['src']
|
922
|
+
href = clean_url(a['src'])
|
900
923
|
if is_excluded?(href)
|
901
924
|
comment_element(doc, a)
|
902
925
|
else
|
@@ -917,12 +940,14 @@ HTML
|
|
917
940
|
# Try to make href an absolute url.
|
918
941
|
def rewrite_href(href, url, uri=nil, urd=nil, local=false)
|
919
942
|
begin
|
920
|
-
return if !href or href
|
921
|
-
urh = URI.parse(href)
|
943
|
+
return nil if !href or is_excluded?(href)
|
922
944
|
uri ||= URI.parse(url)
|
945
|
+
if href =~ /^\s*\//
|
946
|
+
return uri.merge(href).to_s
|
947
|
+
end
|
948
|
+
urh = URI.parse(href)
|
923
949
|
urd ||= guess_dir(uri.path)
|
924
950
|
rv = nil
|
925
|
-
href = href.strip
|
926
951
|
|
927
952
|
# $logger.debug "DBG", uri, urh, #DBG#
|
928
953
|
if href =~ /\w+:/
|
@@ -1026,7 +1051,7 @@ HTML
|
|
1026
1051
|
|
1027
1052
|
|
1028
1053
|
def canonic_filename(filename)
|
1029
|
-
call_cmd(
|
1054
|
+
call_cmd(optval_get(:global, :canonic_filename), [filename], :default => filename)
|
1030
1055
|
end
|
1031
1056
|
|
1032
1057
|
|
@@ -1037,6 +1062,7 @@ HTML
|
|
1037
1062
|
:download_html => :openuri,
|
1038
1063
|
:encoding => 'ISO-8859-1',
|
1039
1064
|
:toggle_body => false,
|
1065
|
+
:user_agent => "websitary/#{Websitary::VERSION}",
|
1040
1066
|
},
|
1041
1067
|
}
|
1042
1068
|
|
@@ -1052,11 +1078,11 @@ HTML
|
|
1052
1078
|
},
|
1053
1079
|
|
1054
1080
|
:binary => lambda {|old, new|
|
1055
|
-
call_cmd(
|
1081
|
+
call_cmd(optval_get(:diff, :diff), [old, new, '--binary -d -w'])
|
1056
1082
|
},
|
1057
1083
|
|
1058
1084
|
:new => lambda {|old, new|
|
1059
|
-
difftext = call_cmd(
|
1085
|
+
difftext = call_cmd(optval_get(:diff, :binary), [old, new])
|
1060
1086
|
difftext.empty? ? '' : new
|
1061
1087
|
},
|
1062
1088
|
|
@@ -1067,7 +1093,7 @@ HTML
|
|
1067
1093
|
args = {
|
1068
1094
|
:oldhtml => File.read(old),
|
1069
1095
|
:newhtml => File.read(new),
|
1070
|
-
:ignore =>
|
1096
|
+
:ignore => url_get(url, :ignore),
|
1071
1097
|
}
|
1072
1098
|
difftext = Websitary::Htmldiff.new(args).diff
|
1073
1099
|
difftext
|
@@ -1130,7 +1156,7 @@ HTML
|
|
1130
1156
|
# :download => 'w3m -no-cookie -S -F -dump "%s"'
|
1131
1157
|
|
1132
1158
|
shortcut :lynx, :delegate => :diff,
|
1133
|
-
:download => 'lynx -dump "%s"'
|
1159
|
+
:download => 'lynx -dump -nolist "%s"'
|
1134
1160
|
|
1135
1161
|
shortcut :links, :delegate => :diff,
|
1136
1162
|
:download => 'links -dump "%s"'
|
@@ -1142,24 +1168,14 @@ HTML
|
|
1142
1168
|
:download => 'wget -q -O - "%s"'
|
1143
1169
|
|
1144
1170
|
shortcut :text, :delegate => :diff,
|
1145
|
-
:download => lambda {|url|
|
1171
|
+
:download => lambda {|url| doc_to_text(read_document(url))}
|
1146
1172
|
|
1147
1173
|
shortcut :body_html, :delegate => :webdiff,
|
1148
1174
|
:strip_tags => :default,
|
1149
1175
|
:download => lambda {|url|
|
1150
1176
|
begin
|
1151
|
-
doc =
|
1152
|
-
|
1153
|
-
if doc
|
1154
|
-
doc = rewrite_urls(url, doc)
|
1155
|
-
doc = doc.inner_html
|
1156
|
-
if (tags = get(url, :strip_tags))
|
1157
|
-
doc = strip_tags(doc, :format => :hpricot, :tags => tags)
|
1158
|
-
end
|
1159
|
-
else
|
1160
|
-
$logger.warn 'inner html: No body'
|
1161
|
-
end
|
1162
|
-
doc.to_s
|
1177
|
+
doc = read_document(url)
|
1178
|
+
body_html(url, doc).to_s
|
1163
1179
|
rescue Exception => e
|
1164
1180
|
# $logger.error e #DBG#
|
1165
1181
|
$logger.error e.message
|
@@ -1180,10 +1196,37 @@ HTML
|
|
1180
1196
|
end
|
1181
1197
|
}
|
1182
1198
|
|
1199
|
+
shortcut :mechanize, :delegate => :webdiff,
|
1200
|
+
:download => lambda {|url|
|
1201
|
+
require 'mechanize'
|
1202
|
+
agent = WWW::Mechanize.new
|
1203
|
+
proxy = get_proxy
|
1204
|
+
if proxy
|
1205
|
+
agent.set_proxy(*proxy)
|
1206
|
+
end
|
1207
|
+
page = agent.get(url)
|
1208
|
+
process = url_get(url, :mechanize)
|
1209
|
+
if process
|
1210
|
+
uri = URI.parse(url)
|
1211
|
+
urd = guess_dir(uri.path)
|
1212
|
+
page.links.each {|link|
|
1213
|
+
href = link.node['href']
|
1214
|
+
if href
|
1215
|
+
href = rewrite_href(href, url, uri, urd, true)
|
1216
|
+
link.node['href'] = href if href
|
1217
|
+
end
|
1218
|
+
}
|
1219
|
+
process.call(url, agent, page)
|
1220
|
+
else
|
1221
|
+
doc = url_document(url, page.content)
|
1222
|
+
body_html(url, doc).to_s
|
1223
|
+
end
|
1224
|
+
}
|
1225
|
+
|
1183
1226
|
shortcut :rss,
|
1184
1227
|
:delegate => :openuri,
|
1185
1228
|
:diff => lambda {|old, new|
|
1186
|
-
success, rss_version =
|
1229
|
+
success, rss_version = opt_get(:rss, :version)
|
1187
1230
|
ro = RSS::Parser.parse(File.read(old), false)
|
1188
1231
|
if ro
|
1189
1232
|
rh = {}
|
@@ -1202,24 +1245,35 @@ HTML
                 rnew << format_rss_item(item, rss_diff)
             else
                 enc = item.respond_to?(:enclosure) && item.enclosure
-
-
-
+                url = url_from_filename(new)
+                if !enc and item.description
+                    scanner = url_get(url, :rss_find_enclosure)
+                    if scanner
+                        ddoc = Hpricot(item.description)
+                        enc = scanner.call(item, ddoc)
+                        if enc
+                            def enc.url
+                                self
+                            end
+                        else
+                            $logger.warn "No embedded enclosure URL found: #{item.description}"
+                        end
+                    end
+                end
+                if enc and (curl = clean_url(enc.url))
+                    dir = url_get(url, :rss_enclosure)
                     curl = rewrite_href(curl, url, nil, nil, true)
                     next unless curl
                     if dir
-
-
-                        end
-                        @app.ensure_dir(dir)
-                        $logger.debug "Enclosure URL: #{curl}"
+                        dir = save_dir(url, dir, encode(rn.channel.title))
+                        $logger.info "Enclosure: #{curl}"
                         fname = File.join(dir, encode(File.basename(curl) || item.title || item.pubDate.to_s || Time.now.to_s))
-                        $logger.debug "
+                        $logger.debug "Save enclosure: #{fname}"
                         enc = read_url(curl, 'rss_enclosure')
                         write_file(fname, 'wb') {|io| io.puts enc}
                         furl = file_url(fname)
-                        enclosure =
-                        if
+                        enclosure = rss_enclosure_local_copy(url, furl)
+                        if url_get(url, :rss_rewrite_enclosed_urls)
                             item.description.gsub!(Regexp.new(Regexp.escape(curl))) {|t| furl}
                         end
                     else
@@ -1249,7 +1303,7 @@ HTML
                 opts = @urls[url].dup
                 opts[:download] = :rss
                 opts[:title] = elt['title'] || elt['text'] || elt['htmlurl'] || curl
-
+                url_set(curl, opts)
                 to_do curl
             else
                 $logger.warn "Unsupported type in OPML: #{elt.to_s}"
@@ -1266,10 +1320,10 @@ HTML
         :download => lambda {|url| get_website_below(:body_html, url)}

     shortcut :website_txt, :delegate => :default,
-        :download => lambda {|url| html_to_text(get_website(
+        :download => lambda {|url| html_to_text(get_website(url_get(url, :download_html, :openuri), url))}

     shortcut :website_txt_below, :delegate => :default,
-        :download => lambda {|url| html_to_text(get_website_below(
+        :download => lambda {|url| html_to_text(get_website_below(url_get(url, :download_html, :openuri), url))}

     shortcut :ftp, :delegate => :default,
         :download => lambda {|url| get_ftp(url).join("\n")}
@@ -1277,7 +1331,7 @@ HTML
     shortcut :ftp_recursive, :delegate => :default,
         :download => lambda {|url|
             list = get_ftp(url)
-            depth =
+            depth = url_get(url, :depth)
             if !depth or depth >= 0
                 dirs = list.find_all {|e| e =~ /^d/}
                 dirs.each do |l|
@@ -1287,7 +1341,7 @@ HTML
                     opts = @urls[url].dup
                     opts[:title] = [opts[:title], File.basename(curl)].join(' - ')
                     opts[:depth] = depth - 1 if depth and depth >= 0
-
+                    url_set(curl, opts)
                     to_do curl
                 end
             end
@@ -1307,7 +1361,7 @@ HTML
 <html>
 <head>
 <title>%s</title>
-<meta http-equiv="Content-Type" content="text/html; charset=#{
+<meta http-equiv="Content-Type" content="text/html; charset=#{optval_get(:global, :encoding)}">
 <link rel="stylesheet" href="websitary.css" type="text/css">
 <link rel="alternate" href="websitary.rss" type="application/rss+xml" title="%s">
 </head>
@@ -1463,6 +1517,9 @@ CSS
         begin
             self.instance_eval(contents)
             return true
+        rescue Exception => e
+            $logger.fatal "Error when reading profile: #{profile_file}\n#{e}"
+            exit 5
         ensure
             @current_profile = nil
         end
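The hunk above wraps the `instance_eval` of a profile file in a rescue so a broken configuration aborts cleanly instead of raising out of the loader. A standalone sketch of that pattern (the `Config` class, `load_profile` name, and the `false` return are stand-ins for illustration; websitary itself logs fatally and exits with status 5):

```ruby
require 'tempfile'

# Evaluate a Ruby configuration file inside an object, catching any
# error the profile raises and reporting which file was at fault.
class Config
  def load_profile(filename)
    contents = File.read(filename)
    instance_eval(contents)
    true
  rescue Exception => e
    warn "Error when reading profile: #{filename}\n#{e}"
    false
  end
end
```

A valid profile returns true; one that raises (or fails to parse) is reported and returns false, so the caller can decide how to bail out.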
@@ -1470,9 +1527,9 @@ CSS


     def get_website(download, url)
-        html = call_cmd(
+        html = call_cmd(optval_get(:download, download), [url], :url => url)
         if html
-            doc =
+            doc = url_document(url, html)
             if doc
                 return if robots?(doc, 'noindex')
                 push_hrefs(url, doc) do |uri0, pn0, uri, pn|
@@ -1486,10 +1543,10 @@ CSS


     def get_website_below(download, url)
-        dwnl =
+        dwnl = optval_get(:download, download)
         html = call_cmd(dwnl, [url], :url => url)
         if html
-            doc =
+            doc = url_document(url, html)
             if doc
                 return if robots?(doc, 'noindex')
                 push_hrefs(url, doc) do |uri0, pn0, uri, pn|
@@ -1547,8 +1604,26 @@ CSS
     end


+    def url_document(url, html)
+        doc = html && Hpricot(html)
+        if doc
+            unless url_get(url, :title)
+                ti = (doc / 'head > title').inner_html
+                url_set(url, :title => ti) unless ti.empty?
+            end
+        end
+        doc
+    end
+
+
+    def read_document(url)
+        html = read_url(url, 'html')
+        html && url_document(url, html)
+    end
+
+
     def read_url(url, type='html')
-        downloader =
+        downloader = url_get(url, "download_#{type}".intern)
         if downloader
             call_cmd(downloader, [url], :url => url)
         else
@@ -1568,9 +1643,13 @@ CSS
         if uri.instance_of?(URI::Generic) or uri.scheme == 'file'
             open(url).read
         else
-
-
-
+            args = {"User-Agent" => optval_get(:global, :user_agent)}
+            args.merge!(url_get(url, :header, {}))
+            # proxy = get_proxy
+            # if proxy
+            #     args[:proxy] = proxy[0,2].join(':')
+            # end
+            open(url, args).read
         end
     end

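With this change, `read_url` hands open-uri a header hash: a global User-Agent merged with any per-URL `:header` entries. The hash-building step can be sketched on its own (the agent string and the extra header below are illustrative placeholders, as is the stand-in for `url_get(url, :header, {})`; no network request is made here):

```ruby
require 'open-uri'

# Assemble the option hash the way the new read_url branch does:
# start from a global User-Agent, then merge per-URL header entries.
# open(url, args).read would then fetch the page with these headers.
args = { 'User-Agent' => 'websitary/0.5' }
per_url_headers = { 'Accept-Language' => 'en' }   # stand-in for url_get(url, :header, {})
args.merge!(per_url_headers)
```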
@@ -1579,7 +1658,7 @@ CSS
         bak = oldname(url)
         lst = latestname(url)
         if File.exist?(bak) and File.exist?(lst)
-            eval_arg(
+            eval_arg(url_get(url, :format_annotation, '%s >>> %s'), [@mtimes.mtime(bak), @mtimes.mtime(lst)])
         end
     end

@@ -1597,6 +1676,21 @@ CSS
     end


+    def rss_enclosure_local_copy(url, furl)
+        t = url_get(url, :rss_format_local_copy) ||
+            %{<p class="enclosure"><a href="%s" class="enclosure" />Enclosure (local copy)</a></p>}
+        case t
+        when Proc
+            t.call(url, furl)
+        when String
+            t % furl
+        else
+            $logger.fatal 'Argument for :rss_format_local_copy must be String or Proc: %s' % t.inspect
+            exit 5
+        end
+    end
+
+
     def format_rss_item(item, body, enclosure='')
         ti = rss_field(item, :title)
         au = rss_field(item, :author)
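The new `:rss_format_local_copy` option accepts either a format String (applied to the local file URL with `%`) or a Proc of two arguments. A minimal sketch of that dispatch, decoupled from websitary (`format_local_copy` is an illustrative name, not part of the library):

```ruby
# Render a local-copy link the way rss_enclosure_local_copy dispatches:
# Strings act as % templates, Procs are called with (url, furl).
def format_local_copy(t, url, furl)
  case t
  when Proc   then t.call(url, furl)
  when String then t % furl
  else raise ArgumentError, ':rss_format_local_copy must be String or Proc'
  end
end

tmpl = %{<a href="%s">Enclosure (local copy)</a>}
html = format_local_copy(tmpl, 'http://example.com/feed', 'file:///home/u/.websitary/a.mp3')
```

Either form receives the feed URL and the `file://` URL of the saved enclosure; the String form only interpolates the latter.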
@@ -1627,6 +1721,46 @@ EOT
     end


+    def get_proxy
+        proxy = optval_get(:global, :proxy)
+        if proxy
+            case proxy
+            when String
+                proxy = proxy.split(':', 2)
+                if proxy.size == 1
+                    proxy << 8080
+                else
+                    proxy[1] = proxy[1].to_i
+                end
+            when Array
+            else
+                raise ArgumentError, 'proxy must be String or Array'
+            end
+        end
+        proxy
+    end
+
+
+    def body_html(url, doc)
+        doc &&= doc.at('body')
+        if doc
+            doc = rewrite_urls(url, doc)
+            doc = doc.inner_html
+            if (tags = url_get(url, :strip_tags))
+                doc = strip_tags(doc, :format => :hpricot, :tags => tags)
+            end
+        else
+            $logger.warn 'inner html: No body'
+        end
+        doc
+    end
+
+
+    def doc_to_text(doc)
+        doc && doc.to_plain_text
+    end
+
+
     # Convert html to plain text using hpricot.
     def html_to_text(text)
         text && Hpricot(text).to_plain_text
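`get_proxy` normalizes the global `:proxy` option into a `[host, port]` pair that mechanize's `set_proxy` can splat: a `"host[:port]"` string is split, with 8080 as the default port, while an Array passes through untouched. The same normalization as a standalone function (`normalize_proxy` is an illustrative stand-in name):

```ruby
# Normalize a proxy spec the way get_proxy does: "host[:port]" strings
# become [host, port] (default port 8080), arrays pass through, nil
# means "no proxy", anything else is rejected.
def normalize_proxy(proxy)
  case proxy
  when String
    host, port = proxy.split(':', 2)
    [host, port ? port.to_i : 8080]
  when Array, nil
    proxy
  else
    raise ArgumentError, 'proxy must be String or Array'
  end
end
```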
@@ -1660,10 +1794,10 @@ EOT
         return true if rurl.nil? or rurl.empty?
         begin
             robots_txt = read_url(rurl, 'robots')
-            rules = RobotRules.new(
+            rules = RobotRules.new(optval_get(:global, :user_agent))
             rules.parse(rurl, robots_txt)
             @robots[host] = rules
-            $logger.info "Loaded #{rurl} for #{
+            $logger.info "Loaded #{rurl} for #{optval_get(:global, :user_agent)}"
             $logger.debug robots_txt
         rescue Exception => e
             $logger.info "#{rurl}: #{e}"
@@ -1705,7 +1839,7 @@ EOT
         difftext.sort! do |a, b|
             aa = a[0]
             bb = b[0]
-
+            url_get(aa, :title, aa).downcase <=> url_get(bb, :title, bb).downcase
         end
     end

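The comparator above orders report entries by display title, case-insensitively, falling back to the URL itself when no `:title` was recorded. The same idea on plain pairs (the sample data is made up for illustration):

```ruby
# Sort [url, title] pairs by title, case-insensitively, using the URL
# as the sort key whenever title is nil, mirroring difftext.sort!.
entries = [
  ['http://zzz.example.com', nil],
  ['http://aaa.example.com', 'Some Feed'],
]
sorted = entries.sort do |a, b|
  ta = (a[1] || a[0]).downcase
  tb = (b[1] || b[0]).downcase
  ta <=> tb
end
```

Note the untitled URL sorts under "h" (its own address), not under "z", which is exactly the fallback behavior the change introduces.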
@@ -1713,7 +1847,7 @@ EOT
     def file_url(filename)
         # filename = File.join(File.basename(File.dirname(filename)), File.basename(filename))
         # "file://#{encode(filename, ':/')}"
-        filename = call_cmd(
+        filename = call_cmd(optval_get(:global, :file_url), [filename], :default => filename)
         encode(filename, ':/')
     end

data/lib/websitary/filemtimes.rb CHANGED
File without changes

data/lib/websitary/htmldiff.rb CHANGED
File without changes

metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: websitary
 version: !ruby/object:Gem::Version
-  version: "0.
+  version: "0.5"
 platform: ruby
 authors:
 - Thomas Link
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []

-date: 2008-
+date: 2008-05-24 00:00:00 +02:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -28,7 +28,7 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.
+        version: 1.5.1
     version:
 description: "== DESCRIPTION: websitary (formerly known as websitiary with an extra \"i\") monitors webpages, rss feeds, podcasts etc. It reuses other programs (w3m, diff etc.) to do most of the actual work. By default, it works on an ASCII basis, i.e. with the output of text-based webbrowsers like w3m (or lynx, links etc.) as the output can easily be post-processed. It can also work with HTML and highlight new items. This script was originally planned as a ruby-based websec replacement. By default, this script will use w3m to dump HTML pages and then run diff over the current page and the previous backup. Some pages are better viewed with lynx or links. Downloaded documents (HTML or ASCII) can be post-processed (e.g., filtered through some ruby block that extracts elements via hpricot and the like). Please see the configuration options below to find out how to change this globally or for a single source. This user manual is also available as PDF[http://websitiary.rubyforge.org/websitary.pdf]. == FEATURES/PROBLEMS: * Handle webpages, rss feeds (optionally save attachments in podcasts etc.) * Compare webpages with previous backups * Display differences between the current version and the backup * Provide hooks to post-process the downloaded documents and the diff * Display a one-page report summarizing all news * Automatically open the report in your favourite web-browser * Experimental: Download webpages on defined intervalls and generate incremental diffs."
 email: micathom at gmail com
@@ -75,7 +75,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
 requirements: []

 rubyforge_project: websitiary
-rubygems_version: 1.
+rubygems_version: 1.1.1
 signing_key:
 specification_version: 2
 summary: A unified website news, rss feed, podcast monitor