cobweb 0.0.67 → 0.0.68
- data/README.textile +59 -12
- data/lib/cobweb_version.rb +1 -1
- data/lib/crawl.rb +51 -0
- data/lib/crawl_job.rb +10 -9
- data/lib/stats.rb +8 -4
- data/spec/cobweb/cobweb_job_spec.rb +110 -78
- data/spec/cobweb/crawl_spec.rb +74 -0
- metadata +24 -22
data/README.textile CHANGED

@@ -1,5 +1,5 @@
 
-h1. Cobweb v0.0.67
+h1. Cobweb v0.0.68
 
 !https://secure.travis-ci.org/stewartmckee/cobweb.png?branch=master!
 
@@ -17,13 +17,9 @@ h3. Standalone
 
 CobwebCrawler takes the same options as cobweb itself, so you can use any of the options available for that. An example is listed below.
 
-bq. crawler = CobwebCrawler.new(:cache => 600);
-
-bq. stats = crawler.crawl("http://www.pepsico.com")
-
 While the crawler is running, you can view statistics on http://localhost:4567
 
-h3. Data Returned
+h3. Data Returned For Each Page
 The data available in the returned hash are:
 
 * :url - url of the resource requested
@@ -44,14 +40,40 @@ h3. Data Returned
 
 The source for the links can be overridden, contact me for the syntax (don't have time to put it into this documentation, will as soon as i have time!)
 
+h3. Statistics
+
+Statistics are available during the crawl, you can create a Stats object passing in a hash with redis_options and crawl_id. Stats has a get_statistics method that returns a hash of the statistics available to you. It is also returned by default from the CobwebCrawler.crawl standalone crawling method.
+
+The data available within statistics is as follows:
+
+* :average_length - average size of each object
+* :minimum_length - minimum length returned
+* :queued_at - date and time that the crawl was started at (eg: "2012-09-10T23:10:08+01:00")
+* :maximum_length - maximum length of object received
+* :status_counts - hash with the status returned as the key and value as number of pages (eg: {"404" => 1, "200" => 1})
+* :mime_counts - hash containing the mime type as key and count of pages as value (eg: {"text/html" => 8, "image/jpeg" => 25})
+* :queue_counter - size of queue waiting to be processed for crawl
+* :page_count - number of html pages retrieved
+* :total_length - total size of data received
+* :current_status - current status of crawl
+* :asset_count - count of non-html objects received
+* :page_size - total size of pages received
+* :average_response_time - average response time of all objects
+* :crawl_counter - number of objects that have been crawled
+* :minimum_response_time - quickest response time of crawl
+* :maximum_response_time - longest response time of crawl
+* :asset_size - total size of all non-html objects received
+
 h2. Installation
 
 Install crawler as a gem
 
-
+bc. gem install cobweb
 
 h2. Usage
 
+h3. Cobweb
+
 h4. new(options)
 
 Creates a new crawler object based on a base_url
@@ -76,7 +98,7 @@ Creates a new crawler object based on a base_url
 ** :crawl_limit_by_page - sets the crawl counter to only use html page types when counting objects crawled
 ** :valid_mime_types - an array of mime types that takes wildcards (eg 'text/*') defaults to ['*/*']
 
-
+bc. crawler = Cobweb.new(:follow_redirects => false)
 
 h4. start(base_url)
 
@@ -86,7 +108,7 @@ Starts a crawl through resque. Requires the :processing_queue to be set to a va
 
 Once the crawler starts, if the first page is redirected (eg from http://www.test.com to http://test.com) then the endpoint scheme, host and domain is added to the internal_urls automatically.
 
-
+bc. crawler.start("http://www.google.com/")
 
 h4. get(url)
 
@@ -94,7 +116,7 @@ Simple get that obey's the options supplied in new.
 
 * url - url requested
 
-
+bc. crawler.get("http://www.google.com/")
 
 h4. head(url)
 
@@ -102,10 +124,35 @@ Simple get that obey's the options supplied in new.
 
 * url - url requested
 
-
+bc. crawler.head("http://www.google.com/")
+
+h3. CobwebCrawler
+
+CobwebCrawler is the standalone crawling class. If you don't want to use redis and just want to crawl the site within your ruby process, you can use this class.
+
+bc. crawler = CobwebCrawler.new(:cache => 600)
+statistics = crawler.crawl("http://www.pepsico.com")
+
+You can also run within a block and get access to each page as it is being crawled.
+
+bc. statistics = CobwebCrawler.new(:cache => 600).crawl("http://www.pepsico.com") do |page|
+  puts "Just crawled #{page[:url]} and got a status of #{page[:status_code]}."
+end
+puts "Finished Crawl in "
+
+
+
+h3. Crawl
+
+The crawl class is a helper class to assist in getting information about a crawl and to perform functions against the crawl
+
+bc. crawl = Crawl.new(options)
+
+* options - the hash of options passed into Cobweb.new (must include a :crawl_id)
+
 
 
-
+h2. Contributing/Testing
 
 Feel free to contribute small or large bits of code, just please make sure that there are rspec test for the features your submitting. We also test on travis at http://travis-ci.org/#!/stewartmckee/cobweb if you want to see the state of the project.
 
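The Statistics section added above describes the Stats interface without a complete example; a minimal sketch of the described usage (the empty :redis_options hash and the crawl_id variable are assumptions, standing in for your own Redis connection options and the id of a running crawl):

bc. stats = Stats.new(:redis_options => {}, :crawl_id => crawl_id)  # option keys as described in the README
summary = stats.get_statistics  # hash with the keys listed in the Statistics section
puts summary[:page_count]
puts summary[:current_status]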
data/lib/cobweb_version.rb CHANGED
data/lib/crawl.rb ADDED

@@ -0,0 +1,51 @@
+# The crawl class gives easy access to information about the crawl, and gives the ability to stop a crawl
+class Crawl
+
+  attr_accessor :id
+
+  BATCH_SIZE = 200
+  FINISHED = "Finished"
+  STARTING = "Starting"
+  CANCELLED = "Cancelled"
+
+  def initialize(data)
+    @data = data
+    @stats = Stats.new(data)
+  end
+
+  def destroy
+    queue_name = "cobweb_crawl_job"
+    # set status as cancelled now so that we don't enqueue any further pages
+    self.statistics.end_crawl(@data, true)
+
+    job_items = Resque.peek(queue_name, 0, BATCH_SIZE)
+    batch_count = 0
+    until job_items.empty?
+
+      job_items.each do |item|
+        if item["args"][0]["crawl_id"] == id
+          # remove this job from the queue
+          Resque.dequeue(CrawlJob, item["args"][0])
+        end
+      end
+
+      position = batch_count*BATCH_SIZE
+      batch_count += 1
+      job_items = Resque.peek(queue_name, position, BATCH_SIZE)
+    end
+
+  end
+
+  def statistics
+    @stats
+  end
+
+  def status
+    statistics.get_status
+  end
+
+  def id
+    @data[:crawl_id]
+  end
+
+end
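Taken together, the new class gives a small control surface over a running crawl; a minimal usage sketch (the crawl id value is illustrative):

bc. crawl = Crawl.new(:crawl_id => "my_crawl_id")  # hypothetical id of a crawl started elsewhere
puts crawl.status  # delegates to Stats#get_status
crawl.destroy      # marks the crawl Cancelled, then walks cobweb_crawl_job in BATCH_SIZE pages, dequeuing this crawl's jobs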
data/lib/crawl_job.rb CHANGED

@@ -14,6 +14,7 @@ class CrawlJob
     # change all hash keys to symbols
     content_request = HashUtil.deep_symbolize_keys(content_request)
     @content_request = content_request
+    @crawl = Crawl.new(content_request)
 
     content_request[:redis_options] = {} unless content_request.has_key? :redis_options
     content_request[:crawl_limit_by_page] = false unless content_request.has_key? :crawl_limit_by_page
@@ -27,8 +28,7 @@ class CrawlJob
     # check we haven't crawled this url before
     unless @redis.sismember "crawled", content_request[:url]
       # if there is no limit or we're still under it lets get the url
-      if within_crawl_limits?(content_request[:crawl_limit])
-        puts "cbpl: #{content_request[:url]}" if content_request[:crawl_limit_by_page]
+      if within_crawl_limits?(content_request[:crawl_limit]) and @crawl.status != Crawl::CANCELLED
         content = Cobweb.new(content_request).get(content_request[:url], content_request)
         if content_request[:url] == @redis.get("original_base_url")
           @redis.set("crawled_base_url", content[:base_url])
@@ -55,7 +55,7 @@ class CrawlJob
 
         # set the base url if this is the first page
         set_base_url @redis, content, content_request
-
+
         @cobweb_links = CobwebLinks.new(content_request)
         if within_queue_limits?(content_request[:crawl_limit])
           internal_links = ContentLinkParser.new(content_request[:url], content[:body], content_request).all_links(:valid_schemes => [:http, :https])
@@ -69,7 +69,11 @@ class CrawlJob
           internal_links.reject! { |link| @redis.sismember("queued", link) }
 
           internal_links.each do |link|
-
+            puts link
+            puts "Not enqueuing due to cancelled crawl" if @crawl.status == Crawl::CANCELLED
+            if within_queue_limits?(content_request[:crawl_limit]) and @crawl.status != Crawl::CANCELLED
+              enqueue_content(content_request, link)
+            end
           end
         end
 
@@ -92,7 +96,6 @@ class CrawlJob
     if content_request[:crawl_limit_by_page]
       if content[:mime_type].match("text/html")
         increment_crawl_counter
-        ap "clbp: #{crawl_counter}"
       end
     else
       increment_crawl_counter
@@ -112,8 +115,6 @@ class CrawlJob
     end
 
     decrement_queue_counter
-    puts content_request[:crawl_limit]
-    print_counters
     # if there's nothing left queued or the crawled limit has been reached
     if content_request[:crawl_limit].nil? || content_request[:crawl_limit] == 0
       if queue_counter + crawl_started_counter - crawl_counter == 0
@@ -125,10 +126,10 @@ class CrawlJob
 
   end
 
-  # Sets the crawl status to
+  # Sets the crawl status to Crawl::FINISHED and enqueues the crawl finished job
   def self.finished(content_request)
     # finished
-    if @
+    if @crawl.status != Crawl::FINISHED and @crawl.status != Crawl::CANCELLED
       ap "CRAWL FINISHED #{content_request[:url]}, #{counters}, #{@redis.get("original_base_url")}, #{@redis.get("crawled_base_url")}" if content_request[:debug]
       @stats.end_crawl(content_request)
 
data/lib/stats.rb CHANGED

@@ -16,13 +16,17 @@ class Stats
         @redis.hset "crawl_details", key, options[key].to_s
       end
     end
-    @redis.hset "statistics", "current_status",
+    @redis.hset "statistics", "current_status", Crawl::STARTING
   end
 
   # Removes the crawl from the running crawls and updates status
-  def end_crawl(options)
+  def end_crawl(options, cancelled=false)
     @full_redis.srem "cobweb_crawls", options[:crawl_id]
-
+    if cancelled
+      @redis.hset "statistics", "current_status", Crawl::CANCELLED
+    else
+      @redis.hset "statistics", "current_status", Crawl::FINISHED
+    end
     @redis.del "crawl_details"
   end
 
@@ -154,7 +158,7 @@ class Stats
 
   # Sets the current status of the crawl
   def update_status(status)
-
+    #@redis.hset("statistics", "current_status", status) unless status == Crawl::CANCELLED
   end
 
   # Returns the current status of the crawl
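The new cancelled flag on end_crawl is the hook Crawl#destroy relies on; a sketch of the two exit paths (crawl_id stands for the id used when the crawl was created):

bc. stats = Stats.new(:crawl_id => crawl_id)
stats.end_crawl({:crawl_id => crawl_id})        # normal finish: current_status becomes Crawl::FINISHED
stats.end_crawl({:crawl_id => crawl_id}, true)  # cancellation: current_status becomes Crawl::CANCELLED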
data/spec/cobweb/cobweb_job_spec.rb CHANGED

@@ -5,13 +5,13 @@ describe Cobweb, :local_only => true do
   before(:all) do
     #store all existing resque process ids so we don't kill them afterwards
    @existing_processes = `ps aux | grep resque | grep -v grep | grep -v resque-web | awk '{print $2}'`.split("\n")
-
+
     # START WORKERS ONLY FOR CRAWL QUEUE SO WE CAN COUNT ENQUEUED PROCESS AND FINISH QUEUES
     puts "Starting Workers... Please Wait..."
     `mkdir log`
     io = IO.popen("nohup rake resque:workers PIDFILE=./tmp/pids/resque.pid COUNT=1 QUEUE=cobweb_crawl_job > log/output.log &")
     puts "Workers Started."
-
+
   end
 
   before(:each) do
@@ -19,56 +19,90 @@ describe Cobweb, :local_only => true do
     @base_page_count = 77
     clear_queues
   end
-
+
+  describe "when crawl is cancelled" do
+    before(:each) do
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :crawl_limit => nil,
+        :quiet => false,
+        :debug => false,
+        :cache => nil
+      }
+      @cobweb = Cobweb.new @request
+    end
+    it "should not crawl anything if nothing has started" do
+      crawl = @cobweb.start(@base_url)
+      crawl_obj = Crawl.new(crawl)
+      crawl_obj.destroy
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should == 0
+    end
+
+    it "should not complete the crawl when cancelled" do
+      crawl = @cobweb.start(@base_url)
+      crawl_obj = Crawl.new(crawl)
+      sleep 6
+      crawl_obj.destroy
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should > 0
+      Resque.size("cobweb_process_job").should_not == @base_page_count
+    end
+
+  end
   describe "with no crawl limit" do
     before(:each) do
-
-
-
-
-
-
-
-
-    end
-
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :crawl_limit => nil,
+        :quiet => false,
+        :debug => false,
+        :cache => nil
+      }
+      @cobweb = Cobweb.new @request
+    end
+
     it "should crawl entire site" do
-
-
-
-
+      ap Resque.size("cobweb_process_job")
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      ap @stat.get_statistics
+      Resque.size("cobweb_process_job").should == @base_page_count
     end
     it "detect crawl finished once" do
-
-
-
-
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_finished_job").should == 1
+    end
+  end
+  describe "with limited mime_types" do
+    before(:each) do
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :quiet => true,
+        :cache => nil,
+        :valid_mime_types => ["text/html"]
+      }
+      @cobweb = Cobweb.new @request
+    end
+
+    it "should only crawl html pages" do
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should == 8
+
+      mime_types = Resque.peek("cobweb_process_job", 0, 100).map{|job| job["args"][0]["mime_type"]}
+      mime_types.count.should == 8
+      mime_types.map{|m| m.should == "text/html"}
+      mime_types.select{|m| m=="text/html"}.count.should == 8
     end
+
   end
-  describe "with limited mime_types" do
-    before(:each) do
-      @request = {
-        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
-        :quiet => true,
-        :cache => nil,
-        :valid_mime_types => ["text/html"]
-      }
-      @cobweb = Cobweb.new @request
-    end
-
-    it "should only crawl html pages" do
-      crawl = @cobweb.start(@base_url)
-      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
-      wait_for_crawl_finished crawl[:crawl_id]
-      Resque.size("cobweb_process_job").should == 8
-
-      mime_types = Resque.peek("cobweb_process_job", 0, 100).map{|job| job["args"][0]["mime_type"]}
-      mime_types.count.should == 8
-      mime_types.map{|m| m.should == "text/html"}
-      mime_types.select{|m| m=="text/html"}.count.should == 8
-    end
-
-  end
   describe "with a crawl limit" do
     before(:each) do
       @request = {
@@ -77,31 +111,31 @@ describe Cobweb, :local_only => true do
       :cache => nil
     }
   end
-
+
   describe "limit to 1" do
     before(:each) do
       @request[:crawl_limit] = 1
       @cobweb = Cobweb.new @request
     end
-
+
     it "should not crawl the entire site" do
       crawl = @cobweb.start(@base_url)
      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should_not == @base_page_count
-    end
+    end
     it "should only crawl 1 page" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should == 1
-    end
+    end
     it "should notify of crawl finished once" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_finished_job").should == 1
-    end
+    end
   end
 
   describe "for pages only" do
@@ -110,7 +144,7 @@ describe Cobweb, :local_only => true do
       @request[:crawl_limit] = 5
       @cobweb = Cobweb.new @request
     end
-
+
     it "should only use html pages towards the crawl limit" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
@@ -126,19 +160,19 @@ describe Cobweb, :local_only => true do
       @request[:crawl_limit] = 10
       @cobweb = Cobweb.new @request
     end
-
+
     it "should not crawl the entire site" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should_not == @base_page_count
-    end
+    end
     it "should notify of crawl finished once" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_finished_job").should == 1
-    end
+    end
     it "should only crawl 10 objects" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
@@ -146,40 +180,40 @@ describe Cobweb, :local_only => true do
       Resque.size("cobweb_process_job").should == 10
     end
   end
-
+
   describe "limit to 100" do
     before(:each) do
       @request[:crawl_limit] = 100
       @cobweb = Cobweb.new @request
     end
-
+
     it "should crawl the entire sample site" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should == @base_page_count
-    end
+    end
     it "should notify of crawl finished once" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_finished_job").should == 1
-    end
+    end
     it "should not crawl 100 pages" do
       crawl = @cobweb.start(@base_url)
       @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
       wait_for_crawl_finished crawl[:crawl_id]
       Resque.size("cobweb_process_job").should_not == 100
-    end
+    end
   end
 end
 
 after(:all) do
-
+
   @all_processes = `ps aux | grep resque | grep -v grep | grep -v resque-web | awk '{print $2}'`.split("\n")
   command = "kill -9 #{(@all_processes - @existing_processes).join(" ")}"
   IO.popen(command)
-
+
   clear_queues
 end
 
@@ -189,25 +223,23 @@ def wait_for_crawl_finished(crawl_id, timeout=20)
   counter = 0
   start_time = Time.now
   while(running?(crawl_id) && Time.now < start_time + timeout) do
-
-
-
-
+    sleep 0.5
+  end
+  if Time.now > start_time + timeout
+    raise "End of crawl not detected"
+  end
 end
-end
-
-def running?(crawl_id)
-  @stat.get_status != "Crawl Finished"
-end
 
-def
-
-  Resque.remove_queue(queue)
+def running?(crawl_id)
+  @stat.get_status != Crawl::FINISHED and @stat.get_status != Crawl::CANCELLED
 end
-
-  Resque.size("cobweb_process_job").should == 0
-  Resque.size("cobweb_finished_job").should == 0
-  Resque.peek("cobweb_process_job", 0, 200).should be_empty
-end
 
+def clear_queues
+  Resque.queues.each do |queue|
+    Resque.remove_queue(queue)
+  end
 
+  Resque.size("cobweb_process_job").should == 0
+  Resque.size("cobweb_finished_job").should == 0
+  Resque.peek("cobweb_process_job", 0, 200).should be_empty
+end
data/spec/cobweb/crawl_spec.rb ADDED

@@ -0,0 +1,74 @@
+require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
+
+describe Crawl do
+
+  # this spec tests the crawl object
+
+  describe "initialize" do
+    describe "without data" do
+      it "should raise an exception" do
+        lambda {Crawl.new}.should raise_exception
+      end
+    end
+
+    describe "with data" do
+      before(:each) do
+        data = {:crawl_id => "asdf"}
+        @crawl = Crawl.new(data)
+      end
+      it "should create a crawl object" do
+        @crawl.should be_an_instance_of Crawl
+      end
+      it "should return an id" do
+        @crawl.should respond_to "id"
+      end
+      it "should return a status" do
+        @crawl.should respond_to "status"
+      end
+
+      describe "the destroy method" do
+        before(:each) do
+          if Resque.size("cobweb_crawl_job") > 0
+            raise "cobweb_crawl_job is not empty, do not run specs until it is!"
+          end
+          105.times do |item_count|
+            2.times do |crawl_count|
+              item_data = {:crawl_id => "crawl_#{crawl_count}_id", :url => "http://crawl#{crawl_count}.com/page#{item_count}.html"}
+              Resque.enqueue(CrawlJob, item_data)
+            end
+          end
+        end
+        after(:each) do
+          Resque.remove_queue("cobweb_crawl_job")
+        end
+        it "should have a queue length of 210" do
+          Resque.size("cobweb_crawl_job").should == 210
+        end
+        describe "after called" do
+          before(:each) do
+            @crawl = Crawl.new({:crawl_id => "crawl_0_id"})
+            @crawl.destroy
+          end
+          it "should delete only the crawl specified" do
+            Resque.size("cobweb_crawl_job").should == 105
+          end
+          it "should not contain any crawl_0_id" do
+            Resque.peek("cobweb_crawl_job", 0, 200).map{|i| i["args"][0]}.each do |item|
+              item["crawl_id"].should_not == "crawl_0_id"
+            end
+          end
+          it "should only contain crawl_1_id" do
+            Resque.peek("cobweb_crawl_job", 0, 200).map{|i| i["args"][0]}.each do |item|
+              item["crawl_id"].should == "crawl_1_id"
+            end
+          end
+          it "should set status to 'Cancelled'" do
+            @crawl.status.should == "Cancelled"
+          end
+        end
+      end
+    end
+  end
+
+
+end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: cobweb
 version: !ruby/object:Gem::Version
-  version: 0.0.67
+  version: 0.0.68
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-09-
+date: 2012-09-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: resque
-  requirement: &
+  requirement: &70324863540700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863540700
 - !ruby/object:Gem::Dependency
   name: redis
-  requirement: &
+  requirement: &70324863539560 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863539560
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &
+  requirement: &70324863538960 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
      version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863538960
 - !ruby/object:Gem::Dependency
   name: addressable
-  requirement: &
+  requirement: &70324863537700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863537700
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &
+  requirement: &70324863537120 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -65,10 +65,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863537120
 - !ruby/object:Gem::Dependency
   name: awesome_print
-  requirement: &
+  requirement: &70324863536500 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -76,10 +76,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863536500
 - !ruby/object:Gem::Dependency
   name: sinatra
-  requirement: &
+  requirement: &70324863535620 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -87,10 +87,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863535620
 - !ruby/object:Gem::Dependency
   name: thin
-  requirement: &
+  requirement: &70324863534860 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -98,10 +98,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863534860
 - !ruby/object:Gem::Dependency
   name: haml
-  requirement: &
+  requirement: &70324863534000 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -109,10 +109,10 @@ dependencies:
       version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863534000
 - !ruby/object:Gem::Dependency
   name: namespaced_redis
-  requirement: &
+  requirement: &70324863533220 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -120,7 +120,7 @@ dependencies:
       version: 1.0.2
   type: :runtime
   prerelease: false
-  version_requirements: *
+  version_requirements: *70324863533220
 description: Cobweb is a web crawler that can use resque to cluster crawls to quickly
   crawl extremely large sites which is much more perofmant than multi-threaded crawlers. It
   is also a standalone crawler that has a sophisticated statistics monitoring interface
@@ -136,6 +136,7 @@ files:
 - spec/cobweb/cobweb_links_spec.rb
 - spec/cobweb/cobweb_spec.rb
 - spec/cobweb/content_link_parser_spec.rb
+- spec/cobweb/crawl_spec.rb
 - spec/cobweb/robots_spec.rb
 - spec/samples/robots.txt
 - spec/samples/sample_html_links.html
@@ -315,6 +316,7 @@ files:
 - lib/cobweb_process_job.rb
 - lib/cobweb_version.rb
 - lib/content_link_parser.rb
+- lib/crawl.rb
 - lib/crawl_job.rb
 - lib/encoding_safe_process_job.rb
 - lib/hash_util.rb