spiderkit 0.1.1 → 0.1.2
- data/README.md +34 -11
- data/lib/queue.rb +6 -2
- data/lib/version.rb +1 -1
- data/spec/visit_queue_spec.rb +26 -1
- metadata +2 -2
data/README.md
CHANGED
@@ -37,12 +37,13 @@ Using Spiderkit, you implement your spiders and bots around the idea of a visit
 * Once an object is iterated over, it's removed from the queue
 * Once an object is iterated over, it's string value is added to an already-visited list, at which point you can't add it again.
 * The queue will stop once it's empty, and optionally execute a final Proc that you pass to it
-* The queue will not fetch web pages or anything else on it's own - that's part of what *you* implement.
+* The queue will not fetch web pages or anything else on it's own - that's part of what *you* implement.
+
+Since you need to implement page fetching on your own (using any of a number of high quality gems or libararies), you'll also need to implement the associated error checking, network timeout handling, and sanity checking that's involved. If you handle redirects by pushing them onto the queue, however, then you'll at least get a little help where redirect loops are concerned.
 
 A basic example:
 
 ```ruby
-
 mybot = Spider::VisitQueue.new
 mybot.push_front('http://someurl.com')
 
@@ -58,17 +59,15 @@ A slightly fancier example:
 
 ```ruby
 #download robots.txt as variable txt
-#user agent is "my-bot
+#user agent for robots.txt is "my-bot"
 
 finalizer = Proc.new { puts "done"}
-
-mybot = Spider::VisitQueue.new(txt, "my-bot/1.0", finalizer)
+mybot = Spider::VisitQueue.new(txt, "my-bot", finalizer)
 ```
 
 As urls are fetched and added to the queue, any links already visited will be dropped transparently. You have the option to push objects to either the front or rear of the queue at any time. If you do push to the front of the queue while iterating over it, the things you push will be the next items visited, and vice versa if you push to the back:
 
 ```ruby
-
 mybot.visit_each do |url|
   #these will be visited next
   mybot.push_front(nexturls)
@@ -76,19 +75,26 @@ mybot.visit_each do |url|
   #these will be visited last
   mybot.push_back(lasturls)
 end
+```
+
+The already visited list is implemented as a Bloom filter, so you should be able to spider even fairly large domains (and there are quite a few out there) without re-visiting pages. You can get a count of pages you've already visited at any time with the visit_count method.
 
+If you need to clear the visited list at any point, use the clear_visited method:
+
+```ruby
+mybot.visit_each do |url|
+  mybot.clear_visited
+end
 ```
 
-
+After which you can push urls onto the queue regardless of if you visited them before clearing. However, the queue will still refuse to visit them once you've done so again. Note also that the count of visited pages will *not* reset.
 
 Finally, you can forcefully stop spidering at any point:
 
 ```ruby
-
 mybot.visit_each do |url|
   mybot.stop
 end
-
 ```
 
 The finalizer, if any, will still be executed after stopping iteration.
@@ -120,6 +126,25 @@ mybot.visit_each |url|
 end
 ```
 
+If you don't pass an agent string, then the parser will take it's configuration from the default agent specified in the robots.txt. If you want your bot to respond to directives for a given user agent, just pass the agent to either the queue when you create it, or the parser:
+
+```ruby
+#visit queue that will respond to any robots.txt
+#with User-agent: mybot in them
+mybot = Spider::VisitQueue(txt, 'mybot')
+
+#same thing as a standalone parser
+myparser = Spider::ExclusionParser.new(txt, 'mybot')
+```
+
+Note that user agent string passed in to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the user agent contained in robots.txt will usually be a subset of the HTTP user agent.
+
+For example:
+
+Googlebot/2.1 (+http://www.google.com/bot.html)
+
+should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the name 'bot' somewhere in their user agent strings.
+
 ## Wait Time
 
 Ideally a bot should wait for some period of time in between requests to avoid crashing websites (less likely) or being blacklisted (more likely). A WaitTime class is provided that encapsulates this waiting logic, and logic to respond to rate limit codes and the "crawl-delay" directives found in some robots.txt files. Times are in seconds.
@@ -127,7 +152,6 @@ Ideally a bot should wait for some period of time in between requests to avoid c
 You can create it standalone, or get it from an exclusion parser:
 
 ```ruby
-
 #download a robots.txt with a crawl-delay 40
 
 robots_txt = Spider::ExclusionParser.new(txt)
@@ -147,7 +171,6 @@ delay.reduce_wait
 
 #after one call to back_off and one call to reduce_wait
 delay.value => 160
-
 ```
 
 By default a WaitTime will specify an initial value of 2 seconds. You can pass a value to new to specify the wait seconds, although values larger than the max allowable value will be set to the max allowable value (3 minutes / 180 seconds).
data/lib/queue.rb
CHANGED
@@ -15,10 +15,10 @@ module Spider
     attr_accessor :robot_txt
 
     def initialize(robots=nil, agent=nil, finish=nil)
-      @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
       @robot_txt = ExclusionParser.new(robots, agent) if robots
       @finalize = finish
       @visit_count = 0
+      clear_visited
       @pending = []
     end
 
@@ -57,6 +57,10 @@ module Spider
     def stop
       raise IterationExit
     end
+
+    def clear_visited
+      @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
+    end
 
     private
 
@@ -71,7 +75,7 @@ module Spider
       urls.compact!
 
       urls.each do |url|
-        unless @visited.include?(url)
+        unless @visited.include?(url) || @pending.include?(url)
           yield url
         end
       end
data/lib/version.rb
CHANGED
data/spec/visit_queue_spec.rb
CHANGED
@@ -31,7 +31,6 @@ module Spider
       expect(order).to eq %w(one two divider)
     end
 
-
     it 'should allow appending to the back of the queue' do
       @queue.push_back('one')
       @queue.push_back('two')
@@ -86,6 +85,13 @@ module Spider
       expect(@queue.empty?).to be true
     end
 
+    it 'should not insert urls already in the pending queue' do
+      @queue.push_back(%w(one two three))
+      expect(@queue.size).to eq 4
+      @queue.push_back(%w(one two three))
+      expect(@queue.size).to eq 4
+    end
+
     it 'should track number of urls visited' do
       expect(@queue.visit_count).to eq 0
       @queue.push_back(%w(one two three four))
@@ -124,6 +130,25 @@ REND
       expect(queue.visit_count).to eq 1
       expect(flag).to be true
     end
+
+    it 'should allow you to clear the visited list' do
+      @queue.push_front(%w(one two three))
+      order = get_visit_order(@queue)
+      expect(order).to eq %w(three two one divider)
+      expect(@queue.visit_count).to eq 4
+
+      @queue.push_front(%w(one two three))
+      order = get_visit_order(@queue)
+      expect(order).to be_empty
+      expect(@queue.visit_count).to eq 4
+
+      @queue.clear_visited
+      @queue.push_front(%w(one two three))
+      order = get_visit_order(@queue)
+      expect(order).to eq %w(three two one)
+      expect(@queue.visit_count).to eq 7
+    end
+
   end
 
 end
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: spiderkit
 version: !ruby/object:Gem::Version
-  version: 0.1.
+  version: 0.1.2
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-07-
+date: 2016-07-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler