get_them_all 1.0.0

data/LICENSE ADDED
@@ -0,0 +1,19 @@
1
+ Copyright (c) 2009-2011 Julien Ammous
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy
4
+ of this software and associated documentation files (the "Software"), to deal
5
+ in the Software without restriction, including without limitation the rights
6
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
7
+ copies of the Software, and to permit persons to whom the Software is
8
+ furnished to do so, subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in
11
+ all copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
15
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
16
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
17
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
18
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
19
+ THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,77 @@
1
+ # What is it ?
2
+
3
+ Get Them All is my personal attempt at building a versatile and powerful web downloader. Its goal is pretty simple:
4
+ download all the targets and keep up to date with new content by remembering what was downloaded.
5
+
6
+ It should be able to download any file type and tries as much as possible not to make any assumptions about how the
7
+ targeted website is built.
8
+
9
+ EventMachine is used to power the core, Hpricot is used to parse the HTML.
10
+
11
+ # Why ?
12
+
13
+ I simply never found any tool fulfilling my needs so I made mine ;)
14
+
15
+
16
+ # What can it do for you
17
+
18
+ First let's start by what is currently supported:
19
+
20
+ - authentication (partially by hand)
21
+ - the referer is passed from one page to another so any leecher detection
22
+ by referer will fail
23
+ - cookies are passed too
24
+ - parallel download, you decide how many parallel tasks are executed
25
+ you can go as high as you want but don't be stupid !
26
+ - multiple storage backend, currently the files can be saved in:
27
+ - local disk
28
+ - dropbox
29
+ - javascript parsing with therubyracer, yes you read that right,
30
+ if you are crawling a javascript powered site and need to read javascript
31
+ you can use this to extract the information you need.
32
+
33
+ Any website is considered as a reversed pyramid, let's take a gallery website as an example:
34
+
35
+ - the first level would be the page containing all the thumbnails
36
+ - the second level would be a page showing the picture (each link collected on level 0
37
+ will lead to a different page on level 1)
38
+ - the third level would be the link to the picture itself
39
+
40
+ I decided on this model after some testing and until now I never found a
41
+ website where this cannot be applied (a website with files to download)
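
To make the level model concrete, here is a minimal sketch that walks a tiny in-memory "site" level by level. The `SITE` hash, its URLs and the `crawl` helper are made up for illustration and are not part of the gem; the real crawler fetches pages over HTTP and lets the user script decide which links to follow.

```ruby
# Hypothetical in-memory site: each page URL maps to the links found on it.
SITE = {
  "/gallery" => ["/view/1", "/view/2"],   # level 0: page with all the thumbnails
  "/view/1"  => ["/img/1.jpg"],           # level 1: page showing one picture
  "/view/2"  => ["/img/2.jpg"]
}.freeze

# Walk `depth` levels down from the start page and return one array of
# URLs per level; the last array holds the links to the files themselves.
def crawl(start_url, depth)
  levels = [[start_url]]
  depth.times do
    levels << levels.last.flat_map { |url| SITE.fetch(url, []) }
  end
  levels
end

levels = crawl("/gallery", 2)
levels.each_with_index { |urls, i| puts "level #{i}: #{urls.join(', ')}" }
```

In this sketch only the URLs collected at the deepest level would be handed to a downloader; everything above them is examined for links.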
42
+
43
+
44
+ # Current state
45
+
46
+ The application is already ready for my needs and maybe for someone else's.
47
+ Currently, not all connection errors may be handled correctly, especially if
48
+ the web server really has trouble keeping connections alive to serve the clients
49
+ (like for the example above).
50
+
51
+
52
+ # Usage
53
+
54
+ Look at the examples folder, there are two ways of using this gem:
55
+
56
+ As an application, try running:
57
+
58
+ ```bash
59
+ ./bin/gta exec examples/wallpaper -s data
60
+ ```
61
+
62
+ Or as a library, try this:
63
+
64
+ ```bash
65
+ ruby examples/standalone.rb
66
+ ```
67
+
68
+
69
+
70
+ # Disclaimer
71
+
72
+ As with most open source projects you are responsible for your actions, if you start
73
+ a crawler with a lot of parallel tasks and manage to get banned from your favorite
74
+ wallpaper site I have nothing to do with this ok ?
75
+ Don't be stupid and everything will be fine, for my needs I rarely need more than
76
+ 2 examiners and 1 or 2 downloaders.
77
+
data/bin/gta ADDED
@@ -0,0 +1,54 @@
1
+ #!/usr/bin/env ruby
2
+ require "rubygems"
3
+
4
+ $LOAD_PATH.unshift( File.expand_path('../../lib', __FILE__) )
5
+ require "get_them_all"
6
+ require "thor"
7
+
8
+
9
+ class GtaRunner < Thor
10
+
11
+ desc "exec [-s <path>] <script_path>", "run a user script"
12
+ method_option :storage_path, :aliases => '-s', :desc => "path where the data will be saved"
13
+ def exec(script_path)
14
+
15
+ storage_path = options[:storage_path]
16
+ raise("storage_path required") unless storage_path
17
+
18
+ if storage_path[0,1] != '/'
19
+ # relative path
20
+ storage_path = File.join(Dir.pwd, storage_path)
21
+ end
22
+
23
+ if script_path[0,1] != '/'
24
+ script_path = File.join(Dir.pwd, script_path)
25
+ end
26
+
27
+ # load the user script
28
+ require script_path
29
+
30
+ # check that the class exists
31
+ class_name = File.basename(script_path, ".rb").camelize + "Downloader"
32
+ fail("file #{script_path} should define class #{class_name} !") unless Object.const_defined?( class_name.to_sym )
33
+
34
+ info("Started with config file #{File.basename(script_path)}")
35
+
36
+ # CTRL+C
37
+ trap("INT") do
38
+ EM::stop_event_loop()
39
+ end
40
+
41
+ # create the instance (and start download)
42
+ class_name.constantize.new(
43
+ :storage => {
44
+ :type => 'file',
45
+ :params => {
46
+ :root => storage_path
47
+ }
48
+ },
49
+ :extensions => [GetThemAll::ActionLogger.new]
50
+ ).start()
51
+ end
52
+ end
53
+
54
+ GtaRunner.start
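
The runner above expects the user script to define a class whose name is derived from the script's file name. Below is a minimal sketch of that naming rule using only the standard library. `simple_camelize` is a hypothetical stand-in for ActiveSupport's `camelize` that only handles plain snake_case names, and `examples/my_site.rb` is an invented path.

```ruby
# Stand-in for ActiveSupport's String#camelize, limited to snake_case names.
def simple_camelize(name)
  name.split("_").map(&:capitalize).join
end

# Name of the class a user script is expected to define.
def expected_class_name(script_path)
  simple_camelize(File.basename(script_path, ".rb")) + "Downloader"
end

puts expected_class_name("examples/wallpaper")
puts expected_class_name("examples/my_site.rb")
```

So a script named `wallpaper.rb` would need to define `WallpaperDownloader`, otherwise the runner aborts before starting any download.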
@@ -0,0 +1,28 @@
1
+ $:.push File.expand_path("../lib", __FILE__)
2
+ require "get_them_all/version"
3
+
4
+ Gem::Specification.new do |s|
5
+ s.name = "get_them_all"
6
+ s.version = GetThemAll::VERSION
7
+ s.authors = ["Julien Ammous"]
8
+ s.email = []
9
+ s.homepage = ""
10
+ s.summary = %q{Mass downloader}
11
+ s.description = %q{Mass downloader useable as standalone or as a library}
12
+
13
+ s.rubyforge_project = "get_them_all"
14
+
15
+ s.files = `git ls-files lib/* *.gemspec README.* LICENSE`.split("\n")
16
+ s.executables = `git ls-files -- bin/*`.split("\n").map{ |f| File.basename(f) }
17
+ s.require_paths = ["lib"]
18
+
19
+ s.add_runtime_dependency 'thor'
20
+ s.add_runtime_dependency 'em-http-request', '~> 1.0.0'
21
+ s.add_runtime_dependency 'em-priority-queue', '~> 0.0.2'
22
+ s.add_runtime_dependency 'hpricot', '~> 0.8.1'
23
+ s.add_runtime_dependency 'i18n'
24
+ s.add_runtime_dependency 'activesupport', '~> 3.1.0'
25
+ s.add_runtime_dependency 'therubyracer', '~> 0.9.8'
26
+ s.add_runtime_dependency 'dropbox'
27
+ s.add_runtime_dependency 'girl_friday'
28
+ end
@@ -0,0 +1,47 @@
1
+ require File.expand_path('../get_them_all/version', __FILE__)
2
+
3
+ # system libraries
4
+ require 'logger'
5
+ require 'fileutils'
6
+ require 'zlib'
7
+
8
+ # gems
9
+ require 'eventmachine'
10
+ require 'active_support/core_ext/object/duplicable'
11
+ require 'active_support/core_ext/class'
12
+ require 'active_support/core_ext/string'
13
+ require 'active_support/core_ext/array'
14
+
15
+ # local files
16
+ Dir.chdir( File.join(File.dirname(__FILE__), "get_them_all") ) do
17
+ require './logger'
18
+
19
+ # libraries
20
+ require './notifier'
21
+ require './javascript_loader'
22
+ require './history'
23
+
24
+ # Storage
25
+ require './storage'
26
+ require './storage/file_storage'
27
+ require './storage/dropbox_storage'
28
+
29
+ # extensions
30
+ require './extension'
31
+ require './extensions/graph_builder'
32
+ require './extensions/action_logger'
33
+ require './extensions/gauge_display'
34
+
35
+ # main files
36
+ require './site_downloader'
37
+ require './worker'
38
+ require './action'
39
+ require './actions/examine_action'
40
+ require './actions/download_action'
41
+ end
42
+
43
+
44
+
45
+ module GetThemAll
46
+
47
+ end
@@ -0,0 +1,55 @@
1
+ module GetThemAll
2
+ class Action
3
+ include Notifier
4
+
5
+ attr_accessor :url, :level, :destination_folder, :params, :referer
6
+ attr_accessor :parent_url
7
+
8
+ include EM::Deferrable
9
+
10
+ def initialize(downloader, h, params = {})
11
+ @downloader = downloader
12
+
13
+ @storage = @downloader.storage
14
+
15
+ @level= 0
16
+ @params= h.delete(:params)
17
+ @destination_folder= nil
18
+
19
+ h.each do |key, val|
20
+ raise("unknown property #{key} !") unless respond_to?("#{key}=")
21
+ send("#{key}=", val) unless val.nil?
22
+ end
23
+ end
24
+
25
+ def inspect
26
+ "{#{self.class}[#{level}] #{url} }"
27
+ end
28
+
29
+ def uri
30
+ URI.parse(@url)
31
+ end
32
+
33
+
34
+ def already_visited?(url)
35
+ @downloader.history.include?(url)
36
+ end
37
+
38
+
39
+ # internals
40
+ def queue_action(action)
41
+ action.parent_url = @url
42
+ action.destination_folder ||= @destination_folder
43
+
44
+ queue = action.is_a?(ExamineAction) ? "@examine_queue" : "@download_queue"
45
+ @downloader.instance_variable_get(queue).push(action, action.priority)
46
+ end
47
+
48
+ # return a number between 0.001 and 1
49
+ def retry_time
50
+ 0.1 * (rand(1000)+1)/100
51
+ end
52
+ protected :retry_time
53
+
54
+ end
55
+ end
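
A quick way to check which range `retry_time` can actually produce is to replay its arithmetic with the random draw passed in explicitly. This is only a sketch: `retry_time_for` is a hypothetical helper that is not in the gem, and it copies the expression from `retry_time` as written.

```ruby
# Same arithmetic as retry_time, with the random draw passed in so the
# bounds can be checked deterministically.
def retry_time_for(draw)
  0.1 * (draw + 1) / 100
end

min = retry_time_for(0)     # smallest value rand(1000) can return
max = retry_time_for(999)   # largest value rand(1000) can return
puts "min=#{min.round(3)} max=#{max.round(3)}"
```

The lower bound comes out around one millisecond rather than a tenth of a second, which matters if the value is used as a back-off delay.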
@@ -0,0 +1,80 @@
1
+ module GetThemAll
2
+ class DownloadAction < Action
3
+ def priority
4
+ 10
5
+ end
6
+
7
+ def do_action(worker = nil)
8
+ notify('action.download.started', worker, self)
9
+
10
+ if already_visited?(@url)
11
+ notify('action.download.skipped', worker, self)
12
+ set_deferred_status(:succeeded)
13
+ else
14
+
15
+ req = @downloader.open_url(@url, "GET", nil, @referer)
16
+ req.callback do |req|
17
+
18
+ destpath = compute_filename(worker)
19
+ download = @storage.write(destpath, req.response)
20
+
21
+ download.callback do
22
+ add_to_history()
23
+ set_deferred_status(:succeeded)
24
+
25
+ notify('action.download.success', worker, self, destpath)
26
+ end
27
+
28
+ download.errback do
29
+ notify('action.download.failure', worker, self)
30
+ end
31
+ end
32
+
33
+ req.timeout(5)
34
+
35
+ req.errback do |*args|
36
+ status = (args.size == 1) ? args.first : 0
37
+
38
+ # remove file if created
39
+ path = compute_filename(worker)
40
+ File.delete(path) if File.exist?(path)
41
+
42
+ notify('action.download.failure', worker, self)
43
+
44
+ set_deferred_status(:failed)
45
+ end
46
+ end
47
+
48
+ end
49
+
50
+ private
51
+ def random_string(len=5)
52
+ ret= ""
53
+ chars= ("a".."z").to_a
54
+ len.times { ret << chars[rand(chars.size)] }
55
+ ret
56
+ end
57
+
58
+ def add_to_history()
59
+ if @downloader.class.history_tracking == :default
60
+ @downloader.history.add(@parent_url)
61
+ else
62
+ @downloader.history.add(@url)
63
+ end
64
+ end
65
+
66
+ def compute_filename(worker)
67
+ destpath= @downloader.get_file_destpath_from_action(self)
68
+
69
+ # find an unused filename
70
+ while @storage.exist?(destpath)
71
+ path, ext = File.dirname(destpath), File.extname(destpath)
72
+ filename = "#{File.basename(destpath, ext)}_#{random_string(2)}#{ext}"
73
+ destpath= File.join(path, filename)
74
+ notify('action.download.renamed', worker, self, destpath)
75
+ end
76
+
77
+ destpath
78
+ end
79
+ end
80
+ end
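
The renaming loop in `compute_filename` can be sketched in isolation as follows. This is an illustration under assumptions, not the gem's code: a `Set` of paths stands in for the storage backend, `unused_path` is a hypothetical helper, and the extension is split off with `File.extname` so that names with several dots or with no extension keep their shape.

```ruby
require "set"

# Stand-in for the storage backend: just a set of paths already taken.
taken = Set.new(["pics/cat.jpg"])

# Return destpath, or a variant with a random two-letter suffix if taken.
def unused_path(destpath, taken)
  ext  = File.extname(destpath)          # ".jpg", or "" when there is none
  base = File.basename(destpath, ext)
  dir  = File.dirname(destpath)
  candidate = destpath
  while taken.include?(candidate)
    suffix = Array.new(2) { ("a".."z").to_a.sample }.join
    candidate = File.join(dir, "#{base}_#{suffix}#{ext}")
  end
  candidate
end

puts unused_path("pics/dog.jpg", taken)   # free: returned unchanged
puts unused_path("pics/cat.jpg", taken)   # taken: gets a random suffix
```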
@@ -0,0 +1,47 @@
1
+
2
+ require 'hpricot'
3
+
4
+ module GetThemAll
5
+ class ExamineAction < Action
6
+
7
+ def priority
8
+ @level
9
+ end
10
+
11
+ def do_action(worker = nil)
12
+ notify('action.examine.started', worker, self)
13
+
14
+ if already_visited?(@url)
15
+ notify('action.examine.skipped', worker, self)
16
+ set_deferred_status(:succeeded)
17
+
18
+ else
19
+ req = @downloader.open_url(@url, "GET", nil, @referer)
20
+ req.callback do |req|
21
+ doc = Hpricot( req.response )
22
+
23
+ actions = @downloader.examine_page(doc, @level, self)
24
+ actions.each do |action|
25
+ action.level = @level + 1
26
+ # action.params = @params.merge(action.params)
27
+ queue_action(action)
28
+ end
29
+
30
+ notify('action.examine.success', worker, self, actions)
31
+ set_deferred_status(:succeeded)
32
+ end
33
+
34
+ req.timeout(5)
35
+
36
+ req.errback do |*args|
37
+ status = (args.size == 1) ? args.first : 0
38
+ notify('action.examine.failure', worker, self, status)
39
+ set_deferred_status(:failed)
40
+
41
+ end
42
+
43
+ end
44
+ end
45
+
46
+ end
47
+ end
@@ -0,0 +1,17 @@
1
+ require 'active_support/notifications'
2
+
3
+ module GetThemAll
4
+ class Extension
5
+
6
+ ##
7
+ # Register a handler to call when this notification
8
+ # is sent
9
+ #
10
+ # @param [String] name notification identifier
11
+ #
12
+ def register_handler(name, &block)
13
+ ActiveSupport::Notifications.subscribe(name, &block)
14
+ end
15
+
16
+ end
17
+ end
@@ -0,0 +1,58 @@
1
+
2
+ module GetThemAll
3
+ ##
4
+ # This extension can be considered as a verbose mode, it
5
+ # logs nearly everything that happens.
6
+ #
7
+ class ActionLogger < Extension
8
+ def initialize
9
+ register_handler('downloader.started') do |name, downloader|
10
+ @skipped_files = 0
11
+ @download_files = 0
12
+ end
13
+
14
+ register_handler('action.examine.started') do |name, worker, action|
15
+ log("Examining[#{action.level}] #{action.url}")
16
+ end
17
+
18
+ register_handler('action.examine.skipped') do |name, worker, action|
19
+ @skipped_files += 1
20
+ log("Skipping #{action.url}")
21
+ end
22
+
23
+ register_handler('action.examine.success') do |name, worker, action|
24
+ # do nothing
25
+ end
26
+
27
+
28
+ register_handler('action.download.started') do |name, worker, action|
29
+ log("Downloading #{action.url}")
30
+ end
31
+
32
+ register_handler('action.download.renamed') do |name, worker, action, new_path|
33
+ log("Renamed as #{File.basename(new_path)}")
34
+ end
35
+
36
+ register_handler('action.download.skipped') do |name, worker, action|
37
+ log("url Skipped: #{action.url}")
38
+ end
39
+
40
+ register_handler('action.download.success') do |name, worker, action, destpath|
41
+ @download_files += 1
42
+ log("File downloaded: #{destpath}")
43
+ end
44
+
45
+ register_handler('downloader.completed') do |name, worker, downloader|
46
+ log ""
47
+ log "Downloaded #{@download_files} files"
48
+ log "Skipped: #{@skipped_files}"
49
+ end
50
+
51
+ end
52
+
53
+ def log(str)
54
+ puts "[log] #{str}"
55
+ end
56
+
57
+ end
58
+ end
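
The extension above only registers handlers; the code that emits the events is not shown here. As a rough mental model, the pattern can be sketched with a plain Ruby stand-in. `TinyNotifier` is hypothetical and does not reproduce how the gem's `Notifier` module or `ActiveSupport::Notifications` work internally; it only mirrors the handler signature seen above, where the event name comes first, followed by the event's arguments.

```ruby
# Hypothetical stand-in for the notification plumbing: handlers are stored
# per event name and called with the event name followed by its arguments.
class TinyNotifier
  def initialize
    @handlers = Hash.new { |hash, key| hash[key] = [] }
  end

  def register_handler(name, &block)
    @handlers[name] << block
  end

  def notify(name, *args)
    @handlers[name].each { |handler| handler.call(name, *args) }
  end
end

notifier = TinyNotifier.new
downloaded = 0
notifier.register_handler("action.download.success") do |_name, _worker, _action, destpath|
  downloaded += 1
  puts "[log] File downloaded: #{destpath}"
end

notifier.notify("action.download.success", nil, nil, "data/cat.jpg")
notifier.notify("action.examine.started", nil, nil)   # no handler registered: ignored
```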