cobweb 0.0.67 → 0.0.68

data/README.textile CHANGED
@@ -1,5 +1,5 @@
 
- h1. Cobweb v0.0.67
+ h1. Cobweb v0.0.68
 
 !https://secure.travis-ci.org/stewartmckee/cobweb.png?branch=master!
 
@@ -17,13 +17,9 @@ h3. Standalone
 
 CobwebCrawler takes the same options as cobweb itself, so you can use any of the options available for that. An example is listed below.
 
- bq. crawler = CobwebCrawler.new(:cache => 600);
-
- bq. stats = crawler.crawl("http://www.pepsico.com")
-
 While the crawler is running, you can view statistics on http://localhost:4567
 
- h3. Data Returned
+ h3. Data Returned For Each Page
 
 The data available in the returned hash are:
 
 * :url - url of the resource requested
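For illustration only, a minimal sketch using the block form of CobwebCrawler#crawl shown later in this README (each crawled page is yielded as this hash; :status_code is one of the keys used in the later block example):

bc. CobwebCrawler.new(:cache => 600).crawl("http://www.pepsico.com") do |page|
  puts "#{page[:url]} returned #{page[:status_code]}"
end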
@@ -44,14 +40,40 @@ h3. Data Returned
 
 The source for the links can be overridden, contact me for the syntax (don't have time to put it into this documentation, will as soon as i have time!)
 
+ h3. Statistics
+
+ Statistics are available during the crawl; you can create a Stats object by passing in a hash with redis_options and crawl_id. Stats has a get_statistics method that returns a hash of the statistics available to you. The same hash is also returned by default from the CobwebCrawler.crawl standalone crawling method. A short sketch follows the list below.
+
+ The data available within statistics is as follows:
+
+ * :average_length - average size of each object
+ * :minimum_length - minimum length returned
+ * :queued_at - date and time that the crawl was started at (eg: "2012-09-10T23:10:08+01:00")
+ * :maximum_length - maximum length of object received
+ * :status_counts - hash with the status returned as the key and value as number of pages (eg: {"404" => 1, "200" => 1})
+ * :mime_counts - hash containing the mime type as key and count of pages as value (eg: {"text/html" => 8, "image/jpeg" => 25})
+ * :queue_counter - size of queue waiting to be processed for crawl
+ * :page_count - number of html pages retrieved
+ * :total_length - total size of data received
+ * :current_status - current status of crawl
+ * :asset_count - count of non-html objects received
+ * :page_size - total size of pages received
+ * :average_response_time - average response time of all objects
+ * :crawl_counter - number of objects that have been crawled
+ * :minimum_response_time - quickest response time of crawl
+ * :maximum_response_time - longest response time of crawl
+ * :asset_size - total size of non-html objects received
+
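As an illustrative sketch, the statistics can also be read directly during a crawl (the empty redis_options hash is an assumption; Stats.new and get_statistics are as described above):

bc. stats = Stats.new(:redis_options => {}, :crawl_id => crawl_id)
statistics = stats.get_statistics
puts statistics[:page_count]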
 h2. Installation
 
 Install crawler as a gem
 
- bq. gem install cobweb
+ bc. gem install cobweb
 
 h2. Usage
 
+ h3. Cobweb
+
 h4. new(options)
 
 Creates a new crawler object based on a base_url
@@ -76,7 +98,7 @@ Creates a new crawler object based on a base_url
 ** :crawl_limit_by_page - sets the crawl counter to only use html page types when counting objects crawled
 ** :valid_mime_types - an array of mime types that takes wildcards (eg 'text/*') defaults to ['*/*']
 
- bq. crawler = CobWeb.new(:follow_redirects => false)
+ bc. crawler = Cobweb.new(:follow_redirects => false)
 
 h4. start(base_url)
 
@@ -86,7 +108,7 @@ Starts a crawl through resque. Requires the :processing_queue to be set to a va
 
 Once the crawler starts, if the first page is redirected (eg from http://www.test.com to http://test.com) then the endpoint scheme, host and domain is added to the internal_urls automatically.
 
- bq. crawler.start("http://www.google.com/")
+ bc. crawler.start("http://www.google.com/")
 
 h4. get(url)
 
@@ -94,7 +116,7 @@ Simple get that obey's the options supplied in new.
 
 * url - url requested
 
- bq. crawler.get("http://www.google.com/")
+ bc. crawler.get("http://www.google.com/")
 
 h4. head(url)
 
@@ -102,10 +124,35 @@ Simple get that obey's the options supplied in new.
 
 * url - url requested
 
- bq. crawler.head("http://www.google.com/")
+ bc. crawler.head("http://www.google.com/")
+
+ h3. CobwebCrawler
+
+ CobwebCrawler is the standalone crawling class. If you don't want to use redis and just want to crawl the site within your Ruby process, you can use this class.
+
+ bc. crawler = CobwebCrawler.new(:cache => 600)
+ statistics = crawler.crawl("http://www.pepsico.com")
+
+ You can also run within a block and get access to each page as it is being crawled.
+
+ bc. statistics = CobwebCrawler.new(:cache => 600).crawl("http://www.pepsico.com") do |page|
+   puts "Just crawled #{page[:url]} and got a status of #{page[:status_code]}."
+ end
+ puts "Finished crawl with #{statistics[:page_count]} pages."
+
+ h3. Crawl
+
+ The Crawl class is a helper class for getting information about a crawl and for performing operations against it, such as cancelling it. A usage sketch follows the option below.
+
+ bc. crawl = Crawl.new(options)
+
+ * options - the hash of options passed into Cobweb.new (must include a :crawl_id)
+
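A hypothetical usage sketch (the crawl id is made up; status and destroy are the methods defined in lib/crawl.rb later in this diff):

bc. crawl = Crawl.new(:crawl_id => "my_crawl_id")
puts crawl.status   # "Starting", "Finished" or "Cancelled"
crawl.destroy       # cancel the crawl and dequeue its remaining jobs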
 
 
- h3. Contributing/Testing
+ h2. Contributing/Testing
 
 Feel free to contribute small or large bits of code, just please make sure that there are rspec tests for the features you're submitting. We also test on travis at http://travis-ci.org/#!/stewartmckee/cobweb if you want to see the state of the project.
 
data/lib/cobweb_version.rb CHANGED
@@ -3,7 +3,7 @@ class CobwebVersion
 
   # Returns a string of the current version
   def self.version
-    "0.0.67"
+    "0.0.68"
   end
 
 end
data/lib/crawl.rb ADDED
@@ -0,0 +1,51 @@
+ # The crawl class gives easy access to information about the crawl, and gives the ability to stop a crawl
+ class Crawl
+
+   attr_accessor :id
+
+   BATCH_SIZE = 200
+   FINISHED = "Finished"
+   STARTING = "Starting"
+   CANCELLED = "Cancelled"
+
+   def initialize(data)
+     @data = data
+     @stats = Stats.new(data)
+   end
+
+   def destroy
+     queue_name = "cobweb_crawl_job"
+     # set status as cancelled now so that we don't enqueue any further pages
+     self.statistics.end_crawl(@data, true)
+
+     job_items = Resque.peek(queue_name, 0, BATCH_SIZE)
+     batch_count = 0
+     until job_items.empty?
+
+       job_items.each do |item|
+         if item["args"][0]["crawl_id"] == id
+           # remove this job from the queue
+           Resque.dequeue(CrawlJob, item["args"][0])
+         end
+       end
+
+       position = batch_count * BATCH_SIZE
+       batch_count += 1
+       job_items = Resque.peek(queue_name, position, BATCH_SIZE)
+     end
+
+   end
+
+   def statistics
+     @stats
+   end
+
+   def status
+     statistics.get_status
+   end
+
+   def id
+     @data[:crawl_id]
+   end
+
+ end
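A minimal usage sketch of the class above, mirroring the behaviour exercised in the crawl spec later in this diff (the crawl ids and URLs are made up):

bc. Resque.enqueue(CrawlJob, {:crawl_id => "crawl_a", :url => "http://example.com/1.html"})
Resque.enqueue(CrawlJob, {:crawl_id => "crawl_b", :url => "http://example.com/2.html"})
Crawl.new({:crawl_id => "crawl_a"}).destroy
# crawl_a is now Cancelled and its queued job has been removed; only crawl_b's remains
Resque.peek("cobweb_crawl_job", 0, 200).map{|job| job["args"][0]["crawl_id"]}  # => ["crawl_b"]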
data/lib/crawl_job.rb CHANGED
@@ -14,6 +14,7 @@ class CrawlJob
     # change all hash keys to symbols
     content_request = HashUtil.deep_symbolize_keys(content_request)
     @content_request = content_request
+    @crawl = Crawl.new(content_request)
 
     content_request[:redis_options] = {} unless content_request.has_key? :redis_options
     content_request[:crawl_limit_by_page] = false unless content_request.has_key? :crawl_limit_by_page
@@ -27,8 +28,7 @@ class CrawlJob
     # check we haven't crawled this url before
     unless @redis.sismember "crawled", content_request[:url]
       # if there is no limit or we're still under it lets get the url
-      if within_crawl_limits?(content_request[:crawl_limit])
-        puts "cbpl: #{content_request[:url]}" if content_request[:crawl_limit_by_page]
+      if within_crawl_limits?(content_request[:crawl_limit]) and @crawl.status != Crawl::CANCELLED
         content = Cobweb.new(content_request).get(content_request[:url], content_request)
         if content_request[:url] == @redis.get("original_base_url")
           @redis.set("crawled_base_url", content[:base_url])
@@ -55,7 +55,7 @@ class CrawlJob
 
         # set the base url if this is the first page
         set_base_url @redis, content, content_request
-
+
         @cobweb_links = CobwebLinks.new(content_request)
         if within_queue_limits?(content_request[:crawl_limit])
           internal_links = ContentLinkParser.new(content_request[:url], content[:body], content_request).all_links(:valid_schemes => [:http, :https])
@@ -69,7 +69,11 @@ class CrawlJob
           internal_links.reject! { |link| @redis.sismember("queued", link) }
 
           internal_links.each do |link|
-            enqueue_content(content_request, link) if within_queue_limits?(content_request[:crawl_limit])
+            puts link
+            puts "Not enqueuing due to cancelled crawl" if @crawl.status == Crawl::CANCELLED
+            if within_queue_limits?(content_request[:crawl_limit]) and @crawl.status != Crawl::CANCELLED
+              enqueue_content(content_request, link)
+            end
           end
         end
 
@@ -92,7 +96,6 @@ class CrawlJob
       if content_request[:crawl_limit_by_page]
         if content[:mime_type].match("text/html")
           increment_crawl_counter
-          ap "clbp: #{crawl_counter}"
         end
       else
         increment_crawl_counter
@@ -112,8 +115,6 @@ class CrawlJob
       end
 
       decrement_queue_counter
-      puts content_request[:crawl_limit]
-      print_counters
       # if there's nothing left queued or the crawled limit has been reached
       if content_request[:crawl_limit].nil? || content_request[:crawl_limit] == 0
         if queue_counter + crawl_started_counter - crawl_counter == 0
@@ -125,10 +126,10 @@ class CrawlJob
 
   end
 
-  # Sets the crawl status to 'Crawl Finished' and enqueues the crawl finished job
+  # Sets the crawl status to Crawl::FINISHED and enqueues the crawl finished job
   def self.finished(content_request)
     # finished
-    if @redis.hget("statistics", "current_status") != "Crawl Finished"
+    if @crawl.status != Crawl::FINISHED and @crawl.status != Crawl::CANCELLED
       ap "CRAWL FINISHED #{content_request[:url]}, #{counters}, #{@redis.get("original_base_url")}, #{@redis.get("crawled_base_url")}" if content_request[:debug]
       @stats.end_crawl(content_request)
 
data/lib/stats.rb CHANGED
@@ -16,13 +16,17 @@ class Stats
         @redis.hset "crawl_details", key, options[key].to_s
       end
     end
-    @redis.hset "statistics", "current_status", "Crawl Starting..."
+    @redis.hset "statistics", "current_status", Crawl::STARTING
   end
 
   # Removes the crawl from the running crawls and updates status
-  def end_crawl(options)
+  def end_crawl(options, cancelled=false)
     @full_redis.srem "cobweb_crawls", options[:crawl_id]
-    @redis.hset "statistics", "current_status", "Crawl Finished"
+    if cancelled
+      @redis.hset "statistics", "current_status", Crawl::CANCELLED
+    else
+      @redis.hset "statistics", "current_status", Crawl::FINISHED
+    end
     @redis.del "crawl_details"
   end
 
@@ -154,7 +158,7 @@ class Stats
 
   # Sets the current status of the crawl
   def update_status(status)
-    @redis.hset "statistics", "current_status", status
+    #@redis.hset("statistics", "current_status", status) unless status == Crawl::CANCELLED
   end
 
   # Returns the current status of the crawl
@@ -5,13 +5,13 @@ describe Cobweb, :local_only => true do
   before(:all) do
     #store all existing resque process ids so we don't kill them afterwards
    @existing_processes = `ps aux | grep resque | grep -v grep | grep -v resque-web | awk '{print $2}'`.split("\n")
-
+
     # START WORKERS ONLY FOR CRAWL QUEUE SO WE CAN COUNT ENQUEUED PROCESS AND FINISH QUEUES
     puts "Starting Workers... Please Wait..."
     `mkdir log`
     io = IO.popen("nohup rake resque:workers PIDFILE=./tmp/pids/resque.pid COUNT=1 QUEUE=cobweb_crawl_job > log/output.log &")
     puts "Workers Started."
-
+
   end
 
   before(:each) do
@@ -19,56 +19,90 @@ describe Cobweb, :local_only => true do
     @base_page_count = 77
     clear_queues
   end
-
+
+  describe "when crawl is cancelled" do
+    before(:each) do
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :crawl_limit => nil,
+        :quiet => false,
+        :debug => false,
+        :cache => nil
+      }
+      @cobweb = Cobweb.new @request
+    end
+    it "should not crawl anything if nothing has started" do
+      crawl = @cobweb.start(@base_url)
+      crawl_obj = Crawl.new(crawl)
+      crawl_obj.destroy
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should == 0
+    end
+
+    it "should not complete the crawl when cancelled" do
+      crawl = @cobweb.start(@base_url)
+      crawl_obj = Crawl.new(crawl)
+      sleep 6
+      crawl_obj.destroy
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should > 0
+      Resque.size("cobweb_process_job").should_not == @base_page_count
+    end
+
+  end
   describe "with no crawl limit" do
     before(:each) do
-      @request = {
-        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
-        :crawl_limit => nil,
-        :quiet => false,
-        :debug => false,
-        :cache => nil
-      }
-      @cobweb = Cobweb.new @request
-    end
-
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :crawl_limit => nil,
+        :quiet => false,
+        :debug => false,
+        :cache => nil
+      }
+      @cobweb = Cobweb.new @request
+    end
+
     it "should crawl entire site" do
-      crawl = @cobweb.start(@base_url)
-      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
-      wait_for_crawl_finished crawl[:crawl_id]
-      Resque.size("cobweb_process_job").should == @base_page_count
+      ap Resque.size("cobweb_process_job")
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      ap @stat.get_statistics
+      Resque.size("cobweb_process_job").should == @base_page_count
     end
     it "detect crawl finished once" do
-      crawl = @cobweb.start(@base_url)
-      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
-      wait_for_crawl_finished crawl[:crawl_id]
-      Resque.size("cobweb_finished_job").should == 1
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_finished_job").should == 1
+    end
+  end
+  describe "with limited mime_types" do
+    before(:each) do
+      @request = {
+        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
+        :quiet => true,
+        :cache => nil,
+        :valid_mime_types => ["text/html"]
+      }
+      @cobweb = Cobweb.new @request
+    end
+
+    it "should only crawl html pages" do
+      crawl = @cobweb.start(@base_url)
+      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
+      wait_for_crawl_finished crawl[:crawl_id]
+      Resque.size("cobweb_process_job").should == 8
+
+      mime_types = Resque.peek("cobweb_process_job", 0, 100).map{|job| job["args"][0]["mime_type"]}
+      mime_types.count.should == 8
+      mime_types.map{|m| m.should == "text/html"}
+      mime_types.select{|m| m=="text/html"}.count.should == 8
     end
+
   end
-  describe "with limited mime_types" do
-    before(:each) do
-      @request = {
-        :crawl_id => Digest::SHA1.hexdigest("#{Time.now.to_i}.#{Time.now.usec}"),
-        :quiet => true,
-        :cache => nil,
-        :valid_mime_types => ["text/html"]
-      }
-      @cobweb = Cobweb.new @request
-    end
-
-    it "should only crawl html pages" do
-      crawl = @cobweb.start(@base_url)
-      @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
-      wait_for_crawl_finished crawl[:crawl_id]
-      Resque.size("cobweb_process_job").should == 8
-
-      mime_types = Resque.peek("cobweb_process_job", 0, 100).map{|job| job["args"][0]["mime_type"]}
-      mime_types.count.should == 8
-      mime_types.map{|m| m.should == "text/html"}
-      mime_types.select{|m| m=="text/html"}.count.should == 8
-    end
-
-  end
   describe "with a crawl limit" do
     before(:each) do
       @request = {
@@ -77,31 +111,31 @@ describe Cobweb, :local_only => true do
        :cache => nil
      }
    end
-
+
    describe "limit to 1" do
      before(:each) do
        @request[:crawl_limit] = 1
        @cobweb = Cobweb.new @request
      end
-
+
      it "should not crawl the entire site" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
        wait_for_crawl_finished crawl[:crawl_id]
        Resque.size("cobweb_process_job").should_not == @base_page_count
-      end
+      end
      it "should only crawl 1 page" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
        wait_for_crawl_finished crawl[:crawl_id]
        Resque.size("cobweb_process_job").should == 1
-      end
+      end
      it "should notify of crawl finished once" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
        wait_for_crawl_finished crawl[:crawl_id]
        Resque.size("cobweb_finished_job").should == 1
-      end
+      end
    end
 
    describe "for pages only" do
@@ -110,7 +144,7 @@ describe Cobweb, :local_only => true do
        @request[:crawl_limit] = 5
        @cobweb = Cobweb.new @request
      end
-
+
      it "should only use html pages towards the crawl limit" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
@@ -126,19 +160,19 @@ describe Cobweb, :local_only => true do
        @request[:crawl_limit] = 10
        @cobweb = Cobweb.new @request
      end
-
+
      it "should not crawl the entire site" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
        wait_for_crawl_finished crawl[:crawl_id]
        Resque.size("cobweb_process_job").should_not == @base_page_count
-      end
+      end
      it "should notify of crawl finished once" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
        wait_for_crawl_finished crawl[:crawl_id]
        Resque.size("cobweb_finished_job").should == 1
-      end
+      end
      it "should only crawl 10 objects" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
@@ -146,40 +180,40 @@ describe Cobweb, :local_only => true do
        Resque.size("cobweb_process_job").should == 10
      end
    end
-
+
    describe "limit to 100" do
      before(:each) do
        @request[:crawl_limit] = 100
        @cobweb = Cobweb.new @request
      end
-
+
      it "should crawl the entire sample site" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
        wait_for_crawl_finished crawl[:crawl_id]
        Resque.size("cobweb_process_job").should == @base_page_count
-      end
+      end
      it "should notify of crawl finished once" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
        wait_for_crawl_finished crawl[:crawl_id]
        Resque.size("cobweb_finished_job").should == 1
-      end
+      end
      it "should not crawl 100 pages" do
        crawl = @cobweb.start(@base_url)
        @stat = Stats.new({:crawl_id => crawl[:crawl_id]})
        wait_for_crawl_finished crawl[:crawl_id]
        Resque.size("cobweb_process_job").should_not == 100
-      end
+      end
    end
  end
 
  after(:all) do
-
+
    @all_processes = `ps aux | grep resque | grep -v grep | grep -v resque-web | awk '{print $2}'`.split("\n")
    command = "kill -9 #{(@all_processes - @existing_processes).join(" ")}"
    IO.popen(command)
-
+
    clear_queues
  end
 
@@ -189,25 +223,23 @@ def wait_for_crawl_finished(crawl_id, timeout=20)
   counter = 0
   start_time = Time.now
   while(running?(crawl_id) && Time.now < start_time + timeout) do
-    sleep 0.5
-  end
-  if Time.now > start_time + timeout
-    raise "End of crawl not detected"
+    sleep 0.5
+  end
+  if Time.now > start_time + timeout
+    raise "End of crawl not detected"
+  end
 end
-end
-
-def running?(crawl_id)
-  @stat.get_status != "Crawl Finished"
-end
 
-def clear_queues
-  Resque.queues.each do |queue|
-    Resque.remove_queue(queue)
+def running?(crawl_id)
+  @stat.get_status != Crawl::FINISHED and @stat.get_status != Crawl::CANCELLED
 end
-
-  Resque.size("cobweb_process_job").should == 0
-  Resque.size("cobweb_finished_job").should == 0
-  Resque.peek("cobweb_process_job", 0, 200).should be_empty
-end
 
+def clear_queues
+  Resque.queues.each do |queue|
+    Resque.remove_queue(queue)
+  end
 
+  Resque.size("cobweb_process_job").should == 0
+  Resque.size("cobweb_finished_job").should == 0
+  Resque.peek("cobweb_process_job", 0, 200).should be_empty
+end
data/spec/cobweb/crawl_spec.rb ADDED
@@ -0,0 +1,74 @@
+ require File.expand_path(File.dirname(__FILE__) + '/../spec_helper')
+
+ describe Crawl do
+
+   # this spec tests the crawl object
+
+   describe "initialize" do
+     describe "without data" do
+       it "should raise an exception" do
+         lambda {Crawl.new}.should raise_exception
+       end
+     end
+
+     describe "with data" do
+       before(:each) do
+         data = {:crawl_id => "asdf"}
+         @crawl = Crawl.new(data)
+       end
+       it "should create a crawl object" do
+         @crawl.should be_an_instance_of Crawl
+       end
+       it "should return an id" do
+         @crawl.should respond_to "id"
+       end
+       it "should return a status" do
+         @crawl.should respond_to "status"
+       end
+
+       describe "the destroy method" do
+         before(:each) do
+           if Resque.size("cobweb_crawl_job") > 0
+             raise "cobweb_crawl_job is not empty, do not run specs until it is!"
+           end
+           105.times do |item_count|
+             2.times do |crawl_count|
+               item_data = {:crawl_id => "crawl_#{crawl_count}_id", :url => "http://crawl#{crawl_count}.com/page#{item_count}.html"}
+               Resque.enqueue(CrawlJob, item_data)
+             end
+           end
+         end
+         after(:each) do
+           Resque.remove_queue("cobweb_crawl_job")
+         end
+         it "should have a queue length of 210" do
+           Resque.size("cobweb_crawl_job").should == 210
+         end
+         describe "after called" do
+           before(:each) do
+             @crawl = Crawl.new({:crawl_id => "crawl_0_id"})
+             @crawl.destroy
+           end
+           it "should delete only the crawl specified" do
+             Resque.size("cobweb_crawl_job").should == 105
+           end
+           it "should not contain any crawl_0_id" do
+             Resque.peek("cobweb_crawl_job", 0, 200).map{|i| i["args"][0]}.each do |item|
+               item["crawl_id"].should_not == "crawl_0_id"
+             end
+           end
+           it "should only contain crawl_1_id" do
+             Resque.peek("cobweb_crawl_job", 0, 200).map{|i| i["args"][0]}.each do |item|
+               item["crawl_id"].should == "crawl_1_id"
+             end
+           end
+           it "should set status to 'Cancelled'" do
+             @crawl.status.should == "Cancelled"
+           end
+         end
+       end
+     end
+   end
+
+ end
metadata CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: cobweb
 version: !ruby/object:Gem::Version
-  version: 0.0.67
+  version: 0.0.68
 prerelease:
 platform: ruby
 authors:
@@ -9,11 +9,11 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2012-09-07 00:00:00.000000000 Z
+date: 2012-09-10 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: resque
-  requirement: &70248783211420 !ruby/object:Gem::Requirement
+  requirement: &70324863540700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -21,10 +21,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783211420
+  version_requirements: *70324863540700
 - !ruby/object:Gem::Dependency
   name: redis
-  requirement: &70248783210160 !ruby/object:Gem::Requirement
+  requirement: &70324863539560 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -32,10 +32,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783210160
+  version_requirements: *70324863539560
 - !ruby/object:Gem::Dependency
   name: nokogiri
-  requirement: &70248783209580 !ruby/object:Gem::Requirement
+  requirement: &70324863538960 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -43,10 +43,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783209580
+  version_requirements: *70324863538960
 - !ruby/object:Gem::Dependency
   name: addressable
-  requirement: &70248783208340 !ruby/object:Gem::Requirement
+  requirement: &70324863537700 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -54,10 +54,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783208340
+  version_requirements: *70324863537700
 - !ruby/object:Gem::Dependency
   name: rspec
-  requirement: &70248783207700 !ruby/object:Gem::Requirement
+  requirement: &70324863537120 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -65,10 +65,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783207700
+  version_requirements: *70324863537120
 - !ruby/object:Gem::Dependency
   name: awesome_print
-  requirement: &70248783207100 !ruby/object:Gem::Requirement
+  requirement: &70324863536500 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -76,10 +76,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783207100
+  version_requirements: *70324863536500
 - !ruby/object:Gem::Dependency
   name: sinatra
-  requirement: &70248783206200 !ruby/object:Gem::Requirement
+  requirement: &70324863535620 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -87,10 +87,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783206200
+  version_requirements: *70324863535620
 - !ruby/object:Gem::Dependency
   name: thin
-  requirement: &70248783205520 !ruby/object:Gem::Requirement
+  requirement: &70324863534860 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -98,10 +98,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783205520
+  version_requirements: *70324863534860
 - !ruby/object:Gem::Dependency
   name: haml
-  requirement: &70248783204580 !ruby/object:Gem::Requirement
+  requirement: &70324863534000 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -109,10 +109,10 @@ dependencies:
         version: '0'
   type: :runtime
   prerelease: false
-  version_requirements: *70248783204580
+  version_requirements: *70324863534000
 - !ruby/object:Gem::Dependency
   name: namespaced_redis
-  requirement: &70248783203800 !ruby/object:Gem::Requirement
+  requirement: &70324863533220 !ruby/object:Gem::Requirement
     none: false
     requirements:
     - - ! '>='
@@ -120,7 +120,7 @@ dependencies:
       version: 1.0.2
   type: :runtime
   prerelease: false
-  version_requirements: *70248783203800
+  version_requirements: *70324863533220
 description: Cobweb is a web crawler that can use resque to cluster crawls to quickly
   crawl extremely large sites which is much more performant than multi-threaded crawlers. It
   is also a standalone crawler that has a sophisticated statistics monitoring interface
@@ -136,6 +136,7 @@ files:
 - spec/cobweb/cobweb_links_spec.rb
 - spec/cobweb/cobweb_spec.rb
 - spec/cobweb/content_link_parser_spec.rb
+- spec/cobweb/crawl_spec.rb
 - spec/cobweb/robots_spec.rb
 - spec/samples/robots.txt
 - spec/samples/sample_html_links.html
@@ -315,6 +316,7 @@ files:
 - lib/cobweb_process_job.rb
 - lib/cobweb_version.rb
 - lib/content_link_parser.rb
+- lib/crawl.rb
 - lib/crawl_job.rb
 - lib/encoding_safe_process_job.rb
 - lib/hash_util.rb