spiderkit 0.1.2 → 0.2.0
- checksums.yaml +7 -0
- data/README.md +74 -18
- data/lib/exclusion.rb +68 -57
- data/lib/queue.rb +27 -26
- data/lib/recorder.rb +111 -0
- data/lib/spiderkit.rb +3 -1
- data/lib/version.rb +1 -1
- data/lib/wait_time.rb +12 -13
- data/spec/exclusion_parser_spec.rb +49 -5
- data/spec/recorder_spec.rb +128 -0
- data/spec/visit_queue_spec.rb +14 -2
- data/spiderkit.gemspec +1 -0
- metadata +19 -26
checksums.yaml
ADDED
@@ -0,0 +1,7 @@
+---
+SHA1:
+  metadata.gz: 5d3b757d2369c3d6c9520e8f4b78f60f79fdfd95
+  data.tar.gz: 562120125d785f938f37b102a53e90dc78dc56cc
+SHA512:
+  metadata.gz: 391d8705750fe6e738152384439da6b512dc29471796c37bd48f1d90bd2b3909fc1a1c6257c1a95baff42533d662140f7cf6a999a3d9cedca17cdd8cb1f43c06
+  data.tar.gz: a243700c54bf0eecaf16a879d76e201c7debcb93094bed3ba32bea47f9615c321ba5f297451955cead398e3fda3f7ecd2892673eea87102339e0ca75c3df34de
data/README.md
CHANGED
@@ -58,8 +58,8 @@ end
 A slightly fancier example:
 
 ```ruby
-#download robots.txt as variable txt
-#user agent for robots.txt is "my-bot"
+# download robots.txt as variable txt
+# user agent for robots.txt is "my-bot"
 
 finalizer = Proc.new { puts "done"}
 mybot = Spider::VisitQueue.new(txt, "my-bot", finalizer)
@@ -69,10 +69,10 @@ As urls are fetched and added to the queue, any links already visited will be dr
 
 ```ruby
 mybot.visit_each do |url|
-  #these will be visited next
+  # these will be visited next
   mybot.push_front(nexturls)
 
-  #these will be visited last
+  # these will be visited last
   mybot.push_back(lasturls)
 end
 ```
@@ -104,24 +104,26 @@ The finalizer, if any, will still be executed after stopping iteration.
 Spiderkit also includes a robots.txt parser that can either work standalone, or be passed as an argument to the visit queue. If passed as an argument, urls that are excluded by the robots.txt will be dropped transparently.
 
 ```
-#fetch robots.txt as variable txt
+# fetch robots.txt as variable txt
 
-#create a stand alone parser
+# create a stand alone parser
 robots_txt = Spider::ExclusionParser.new(txt)
 
 robots_txt.excluded?("/") => true
 robots_txt.excluded?("/admin") => false
 robots_txt.allowed?("/blog") => true
 
-#pass text directly to visit queue
+# pass text directly to visit queue
 mybot = Spider::VisitQueue(txt)
 ```
 
 Note that you pass the robots.txt directly to the visit queue - no need to new up the parser yourself. The VisitQueue also has a robots_txt accessor that you can use to access and set the exclusion parser while iterating through the queue:
 
 ```ruby
+require 'open-uri'
+
 mybot.visit_each |url|
-
+  txt = open('http://wikipedia.org/robots.txt').read
   mybot.robot_txt = Spider::ExclusionParser.new(txt)
 end
 ```
@@ -129,21 +131,31 @@ end
 If you don't pass an agent string, then the parser will take its configuration from the default agent specified in the robots.txt. If you want your bot to respond to directives for a given user agent, just pass the agent to either the queue when you create it, or the parser:
 
 ```ruby
-#visit queue that will respond to any robots.txt
-#with User-agent: mybot in them
+# visit queue that will respond to any robots.txt
+# with User-agent: mybot in them
 mybot = Spider::VisitQueue(txt, 'mybot')
 
 #same thing as a standalone parser
 myparser = Spider::ExclusionParser.new(txt, 'mybot')
 ```
 
-Note that the user agent string passed in to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the user agent contained in robots.txt will usually be a subset of the HTTP user agent.
+You can also pass nil or a blank string as the agent to use the default agent. Note that the user agent string passed in to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the user agent contained in robots.txt will usually be a subset of the HTTP user agent.
 
 For example:
 
 Googlebot/2.1 (+http://www.google.com/bot.html)
 
-should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the name 'bot' somewhere in their user agent strings.
+should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the name 'bot' somewhere in their user agent strings. You can also pass the response code of the request that fetched the robots.txt file if you like, and let the exclusion parser decide what to do with it:
+
+```ruby
+require 'open-uri'
+
+status = 0
+data = open('http://wikipedia.org/robots.txt') { |f| status = f.status }
+mybot.robot_txt = Spider::ExclusionParser.new(data.read, 'mybot', status)
+```
+
+Finally, as a sanity check / to avoid DoS honeypots with malicious robots.txt files, the exclusion parser will process a maximum of one thousand non-whitespace lines before stopping.
 
 ## Wait Time
 
@@ -152,29 +164,73 @@ Ideally a bot should wait for some period of time in between requests to avoid c
 You can create it standalone, or get it from an exclusion parser:
 
 ```ruby
-#download a robots.txt with a crawl-delay 40
+# download a robots.txt with a crawl-delay 40
 
 robots_txt = Spider::ExclusionParser.new(txt)
 delay = robots_txt.wait_time
 delay.value => 40
 
-#receive a rate limit code, double wait time
+# receive a rate limit code, double wait time
 delay.back_off
 
-#actually do the waiting part
+# actually do the waiting part
 delay.wait
 
-#in response to some rate limit codes you'll want
-#to sleep for a while, then back off
+# in response to some rate limit codes you'll want
+# to sleep for a while, then back off
 delay.reduce_wait
 
 
-#after one call to back_off and one call to reduce_wait
+# after one call to back_off and one call to reduce_wait
 delay.value => 160
 ```
 
 By default a WaitTime will specify an initial value of 2 seconds. You can pass a value to new to specify the wait seconds, although values larger than the max allowable value will be set to the max allowable value (3 minutes / 180 seconds).
 
+## Recording Requests
+
+For convenience, an HTTP request recorder is provided, and is highly useful for helping write regression and integration tests. It accepts a block of code that returns a string containing the response data. The String class is monkey-patched to add http_status and http_headers accessors for ease of transporting other request data (yes, I know, monkey patching is evil). Information assigned to these accessors will be saved as well by the recorder, but their use is not required. The recorder class will manage the marshaling and unmarshaling of the request data behind the scenes, saving requests identified by their URL as a uniquely hashed file name with YAML-ized and Base64 encoded data in it. This is similar to VCR, and you can certainly use that instead. However, I personally ran into some troubles integrating it into some spiders I was writing, so I came up with this as a simple, lightweight alternative that works well with the rest of the Spiderkit.
+
+The recorder will not play back request data unless enabled, and it will not save request data unless recording is turned on. This is done with the **activate!** and **record!** methods, respectively. You can stop recording with the **pause!** method and stop playback with the **deactivate!** method.
+
+A simple spider for iterating pages and recording them might look like this:
+
+```ruby
+require 'spiderkit'
+require 'open-uri'
+
+mybot = Spider::VisitQueue.new
+mybot.push_front('http://someurl.com')
+
+Spider::VisitRecorder.config('/save/path')
+Spider::VisitRecorder.activate!
+Spider::VisitRecorder.record!
+
+mybot.visit_each do |url|
+
+  data = Spider::VisitRecorder.recall(url) do
+    text = ''
+    puts "fetch #{url}"
+    open(url) do |f|
+      text = f.read
+      # doing this is only necessary if you want to
+      # save this information in the recording
+      text.http_status = f.status.first.to_i
+    end
+
+    text
+  end
+
+  # extract links from data and push onto the
+  # spider queue
+end
+```
+
+After the first time the pages are spidered and saved, any subsequent run would simply replay the recorded data. You would find the saved request files in the working directory. The path that requests are saved to can be altered using the **config** method:
+
+```ruby
+Spider::VisitRecorder.config('/some/test/path')
+```
 
 ## Contributing
 
data/lib/exclusion.rb
CHANGED
@@ -4,7 +4,7 @@
 
 #==============================================
 # This is the class that parses robots.txt and implements the exclusion
-# checking logic therein. Works by breaking the file up into a hash of
+# checking logic therein. Works by breaking the file up into a hash of
 # arrays of directives for each specified user agent, and then parsing the
 # directives into internal arrays and iterating through the list to find a
 # match. Urls are matched case sensitive, everything else is case insensitive.
@@ -13,46 +13,45 @@
 require 'cgi'
 
 module Spider
+  class ExclusionParser
 
-  class ExclusionParser
-
     attr_accessor :wait_time
 
-
-
-
-
+    NULL_MATCH = '*!*'.freeze
+    DISALLOW = 'disallow'.freeze
+    DELAY = 'crawl-delay'.freeze
+    ALLOW = 'allow'.freeze
+
     MAX_DIRECTIVES = 1000
-
-
-    def initialize(text, agent=nil)
+
+    def initialize(text, agent = nil, status = 200)
       @skip_list = []
       @agent_key = agent
-
+
       return if text.nil? || text.length.zero?
-
-      if [401, 403].include?
+
+      if [401, 403].include? status
         @skip_list << [NULL_MATCH, true]
         return
       end
-
+
       begin
         config = parse_text(text)
         grab_list(config)
       rescue
       end
     end
-
+
     # Check to see if the given url is matched by any rule
     # in the file, and return it's associated status
-
+
     def excluded?(url)
       url = safe_unescape(url)
       @skip_list.each do |entry|
         return entry.last if url.include? entry.first
         return entry.last if entry.first == NULL_MATCH
       end
-
+
       false
     end
 
@@ -61,89 +60,101 @@ module Spider
     end
 
     private
-
+
     # Method to process the list of directives for a given user agent.
     # Picks the one that applies to us, and then processes it's directives
     # into the skip list by splitting the strings and taking the appropriate
     # action. Stops after a set number of directives to avoid malformed files
     # or denial of service attacks
-
+
     def grab_list(config)
-
-      config[@agent_key]
-
-
+      if config.include?(@agent_key)
+        section = config[@agent_key]
+      else
+        section = config['*']
+      end
+
+      if section.length > MAX_DIRECTIVES
         section.slice!(MAX_DIRECTIVES, section.length)
       end
-
+
       section.each do |pair|
         key, value = pair.split(':')
-
-        next if key.nil? || value.nil? ||
-
-
+
+        next if key.nil? || value.nil? ||
+                key.empty? || value.empty?
+
         key.downcase!
         key.lstrip!
         key.rstrip!
-
+
         value.lstrip!
         value.rstrip!
-
+
         disallow(value) if key == DISALLOW
         delay(value) if key == DELAY
-        allow(value) if key == ALLOW
+        allow(value) if key == ALLOW
       end
     end
-
+
     # Top level file parsing method - makes sure carriage returns work,
     # strips out any BOM, then loops through each line and opens up a new
-    # array of directives in the hash if a user-agent directive is found
-
+    # array of directives in the hash if a user-agent directive is found.
+
     def parse_text(text)
-      current_key =
+      current_key = ''
       config = {}
-
+
       text.gsub!("\r", "\n")
-      text.
-
+      text = text.force_encoding('UTF-8')
+      text.gsub!("\xEF\xBB\xBF".force_encoding('UTF-8'), '')
+
       text.each_line do |line|
         line.lstrip!
         line.rstrip!
-        line.gsub!
-
-
-
-
-
-
-
+        line.gsub!(/#.*/, '')
+
+        next unless line.length.nonzero? && line =~ /[^\s]/
+
+        if line =~ /User-agent:\s+(.+)/i
+          previous_key = current_key
+          current_key = $1.downcase
+          config[current_key] = [] unless config[current_key]
+
+          # If we've seen a new user-agent directive and the previous one
+          # is empty then we have a cascading user-agent string. Copy the
+          # new user agent array ref so both user agents are identical.
+
+          if config.key?(previous_key) && config[previous_key].size.zero?
+            config[previous_key] = config[current_key]
           end
-
+
+        else
           config[current_key] << line
         end
       end
-
+
       config
-    end
-
+    end
+
     def disallow(value)
-      token = (value ==
+      token = (value == '/' ? NULL_MATCH : value.chomp('*'))
       @skip_list << [safe_unescape(token), true]
     end
-
+
     def allow(value)
-      token = (value ==
+      token = (value == '/' ? NULL_MATCH : value.chomp('*'))
       @skip_list << [safe_unescape(token), false]
     end
-
+
     def delay(value)
       @wait_time = WaitTime.new(value.to_i)
     end
-
+
     def safe_unescape(target)
-      t = target.gsub
+      t = target.gsub(/%2f/, '^^^')
       t = CGI.unescape(t)
-      t.gsub
+      t.gsub(/\^\^\^/, '%2f')
     end
   end
 end
data/lib/queue.rb
CHANGED
@@ -6,45 +6,48 @@ require 'bloom-filter'
 require 'exclusion'
 
 module Spider
-
   class VisitQueue
-
-
+
+    IterationExit = Class.new(Exception)
 
     attr_accessor :visit_count
     attr_accessor :robot_txt
 
-    def initialize(robots=nil, agent=nil, finish=nil)
+    def initialize(robots = nil, agent = nil, finish = nil)
       @robot_txt = ExclusionParser.new(robots, agent) if robots
       @finalize = finish
       @visit_count = 0
       clear_visited
       @pending = []
-    end
+    end
 
     def visit_each
       begin
         until @pending.empty?
           url = @pending.pop
-
-
-
-
-
-        end
+          next unless url_okay(url)
+          yield url.clone if block_given?
+          @visited.insert(url)
+          @visit_count += 1
+        end
       rescue IterationExit
       end
-
+
       @finalize.call if @finalize
-    end
-
+    end
+
     def push_front(urls)
-      add_url(urls) {|u| @pending.push(u)}
-    end
-
+      add_url(urls) { |u| @pending.push(u) }
+    end
+
     def push_back(urls)
-      add_url(urls) {|u| @pending.unshift(u)}
-    end
+      add_url(urls) { |u| @pending.unshift(u) }
+    end
+
+    def mark(urls)
+      urls = [urls] unless urls.is_a? Array
+      urls.each { |u| @visited.insert(u) }
+    end
 
     def size
       @pending.size
@@ -61,23 +64,21 @@ module Spider
     def clear_visited
       @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
     end
-
-    private
 
     def url_okay(url)
       return false if @visited.include?(url)
       return false if @robot_txt && @robot_txt.excluded?(url)
       true
     end
-
+
+    private
+
     def add_url(urls)
       urls = [urls] unless urls.is_a? Array
       urls.compact!
-
+
       urls.each do |url|
-        unless @visited.include?(url) || @pending.include?(url)
-          yield url
-        end
+        yield url unless @visited.include?(url) || @pending.include?(url)
       end
     end
   end
data/lib/recorder.rb
ADDED
@@ -0,0 +1,111 @@
+# Author:: Robert Dormer (mailto:rdormer@gmail.com)
+# Copyright:: Copyright (c) 2016 Robert Dormer
+# License:: MIT
+
+#==============================================
+# Class for saving visited URIs as YAML-ized files with response codes and
+# headers. Takes the network request code as a block, hashes the URI to get
+# a file name, and then creates it and saves it if it's not present, or reads
+# and returns the contents if it is. Since the data returned from the file
+# is an exact copy of what the block returned for that URI, it's a constant,
+# deterministic recording that is highly useful for integration tests and the
+# like. Yes, you can use VCR if that's your thing, but I found it difficult
+# to integrate with real-world crawlers. This is a lightweight wrapper to
+# give you 90% of the same thing.
+#==============================================
+require 'digest'
+require 'base64'
+require 'yaml'
+
+module Spider
+  class VisitRecorder
+    @@directory = ''
+    @@active = false
+    @@recording = false
+
+    class << self
+      def activate!
+        @@active = true
+      end
+
+      def record!
+        @@recording = true
+      end
+
+      def deactivate!
+        @@active = false
+      end
+
+      def pause!
+        @@recording = false
+      end
+
+      def config(dir)
+        @@directory = dir
+      end
+
+      def recall(*args)
+        if @@active
+          url = args.first.to_s
+          data = ''
+
+          store = locate_file(url)
+
+          if store.size == 0
+            raise "Unexpected request: #{url}" unless @@recording
+            data = yield(*args) if block_given?
+
+            begin
+              store.write(package(url, data))
+            rescue StandardError => e
+              puts e.message
+              puts "On file #{store.path}"
+            end
+
+          else
+            data = unpackage(store, url)
+          end
+
+          return data
+
+        elsif block_given?
+          yield(*args)
+        end
+      end
+
+      private
+
+      def locate_file(url)
+        key = Digest::MD5.hexdigest(url)
+        path = File.expand_path(key, @@directory)
+        fsize = File.size?(path)
+        (fsize.nil? || fsize.zero? ? File.open(path, 'w') : File.open(path, 'r'))
+      end
+
+      def package(url, data)
+        payload = {}
+        payload[:url] = url.encode('UTF-8')
+        payload[:data] = Base64.encode64(data)
+
+        unless data.http_status.nil?
+          payload[:response] = data.http_status
+        end
+
+        unless data.http_headers.nil?
+          payload[:headers] = Base64.encode64(data.http_headers)
+        end
+
+        payload.to_yaml
+      end
+
+      def unpackage(store, url)
+        raw = YAML.load(store.read)
+        raise 'URL mismatch in recording' unless raw[:url] == url
+        data = Base64.decode64(raw[:data])
+        data.http_headers = Base64.decode64(raw[:headers])
+        data.http_status = raw[:response]
+        data
+      end
+    end
+  end
+end
data/lib/spiderkit.rb
CHANGED
@@ -2,12 +2,14 @@
 # Copyright:: Copyright (c) 2016 Robert Dormer
 # License:: MIT
 
-
+$LOAD_PATH << File.dirname(__FILE__)
 require 'wait_time'
 require 'exclusion'
+require 'recorder'
 require 'version'
 require 'queue'
 
 class String
   attr_accessor :http_status
+  attr_accessor :http_headers
 end
data/lib/version.rb
CHANGED
data/lib/wait_time.rb
CHANGED
@@ -3,33 +3,32 @@
 # License:: MIT
 
 #==============================================
-#Class to encapsulate the crawl delay being used.
-#Clamps the value to a maximum amount and implements
-#an exponential backoff function for responding to
-#rate limit requests
+# Class to encapsulate the crawl delay being used.
+# Clamps the value to a maximum amount and implements
+# an exponential backoff function for responding to
+# rate limit requests
 #==============================================
 
 module Spider
-
   class WaitTime
-
+
     MAX_WAIT = 180
     DEFAULT_WAIT = 2
     REDUCE_WAIT = 300
 
-    def initialize(period=nil)
-
-      @wait = (period > MAX_WAIT ? MAX_WAIT : period)
-    else
+    def initialize(period = nil)
+      if period.nil?
         @wait = DEFAULT_WAIT
+      else
+        @wait = (period > MAX_WAIT ? MAX_WAIT : period)
       end
     end
 
     def back_off
       if @wait.zero?
-      @wait = DEFAULT_WAIT
+        @wait = DEFAULT_WAIT
       else
-      waitval = @wait * 2
+        waitval = @wait * 2
         @wait = (waitval > MAX_WAIT ? MAX_WAIT : waitval)
       end
     end
@@ -42,7 +41,7 @@ module Spider
       sleep(REDUCE_WAIT)
       back_off
     end
-
+
     def value
       @wait
     end
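A quick sketch of the clamping and backoff behavior defined above (values follow from MAX_WAIT = 180 and DEFAULT_WAIT = 2):

```ruby
require 'spiderkit'

delay = Spider::WaitTime.new(500)
delay.value        # => 180, clamped to MAX_WAIT

delay = Spider::WaitTime.new
delay.value        # => 2 (DEFAULT_WAIT)
delay.back_off
delay.value        # => 4, doubled but never past MAX_WAIT
```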
data/spec/exclusion_parser_spec.rb
CHANGED
@@ -126,12 +126,10 @@ module Spider
           allow: /
         eos
 
-
-        @bottxt = described_class.new(txt)
+        @bottxt = described_class.new(txt, nil, 401)
         expect(@bottxt.excluded?('/')).to be true
 
-
-        @bottxt = described_class.new(txt)
+        @bottxt = described_class.new(txt, nil, 403)
         expect(@bottxt.excluded?('/')).to be true
       end
     end
@@ -243,7 +241,53 @@ module Spider
         expect(@bottxt.excluded?('/')).to be true
       end
 
-
+      it "should use default agent if passed nil agent string" do
+        txt = <<-eos
+          user-agent: testbot
+          disallow: /
+
+          user-agent: *
+          disallow:
+        eos
+
+        @bottxt = described_class.new(txt, nil)
+        expect(@bottxt.excluded?('/')).to be false
+      end
+
+      it "should use default agent if passed blank agent string" do
+        txt = <<-eos
+          user-agent: testbot
+          disallow: /
+
+          user-agent: *
+          disallow:
+        eos
+
+        @bottxt = described_class.new(txt, '')
+        expect(@bottxt.excluded?('/')).to be false
+      end
+
+      it "should allow cascading user-agent strings" do
+        txt = <<-eos
+          user-agent: agentfirst
+          user-agent: agentlast
+          disallow: /test_dir
+          allow: /other_test_dir
+        eos
+
+        bottxt_first = described_class.new(txt, 'agentfirst')
+        bottxt_last = described_class.new(txt, 'agentlast')
+
+        expect(bottxt_first.excluded?('/test_dir')).to be true
+        expect(bottxt_last.excluded?('/test_dir')).to be true
+        expect(bottxt_first.allowed?('/test_dir')).to be false
+        expect(bottxt_last.allowed?('/test_dir')).to be false
+
+        expect(bottxt_first.excluded?('/other_test_dir')).to be false
+        expect(bottxt_last.excluded?('/other_test_dir')).to be false
+        expect(bottxt_first.allowed?('/other_test_dir')).to be true
+        expect(bottxt_last.allowed?('/other_test_dir')).to be true
+      end
     end
 
     describe "Disallow directive" do
data/spec/recorder_spec.rb
ADDED
@@ -0,0 +1,128 @@
+# Author:: Robert Dormer (mailto:rdormer@gmail.com)
+# Copyright:: Copyright (c) 2016 Robert Dormer
+# License:: MIT
+
+require File.dirname(__FILE__) + '/../lib/spiderkit'
+
+module Spider
+  describe VisitRecorder do
+
+    describe 'when active' do
+      before(:each) do
+        described_class.activate!
+        described_class.record!
+        @url = "http://test.domain.123"
+      end
+
+      it 'should add http_status to string' do
+        expect("".respond_to? :http_status).to be true
+      end
+
+      it 'should add http_headers to string' do
+        expect("".respond_to? :http_headers).to be true
+      end
+
+      it 'should execute the block argument if recording data' do
+        run_data = ''
+        buffer = StringIO.new
+        allow(buffer).to receive(:size?).and_return(0)
+        allow(File).to receive(:open).and_return(buffer)
+        described_class.recall(@url) { |u| run_data = 'ran' }
+        expect(run_data).to eq 'ran'
+      end
+
+      describe 'saved information' do
+        before(:each) do
+          @buffer = StringIO.new
+          allow(File).to receive(:open).and_return(@buffer)
+          allow(@buffer).to receive(:size?).and_return(0)
+
+          @data = described_class.recall(@url) do |u|
+            rval = "this is the test body"
+            rval.http_headers = "test headers"
+            rval.http_status = 200
+            rval
+          end
+
+          @buffer.rewind
+        end
+
+        it 'should save and return headers' do
+          expect(@data.http_headers).to eq "test headers"
+          data = YAML.load(@buffer.read)
+          expect(data[:headers]).to eq Base64.encode64("test headers")
+        end
+
+        it 'should save the request url' do
+          data = YAML.load(@buffer.read)
+          expect(data[:url]).to eq @url
+        end
+
+        it 'should save response code' do
+          data = YAML.load(@buffer.read)
+          expect(data[:response]).to eq 200
+        end
+
+        it 'should save and return the response data' do
+          expect(@data).to eq "this is the test body"
+          data = YAML.load(@buffer.read)
+          expect(data[:data]).to eq Base64.encode64("this is the test body")
+        end
+
+        it 'should bypass the block argument if playing back data' do
+          run_flag = false
+          described_class.recall(@url) { |u| run_flag = true }
+          expect(run_flag).to be false
+        end
+      end
+
+      describe 'file operations' do
+        before(:each) do
+          @path = "test_path/"
+          @buffer = StringIO.new
+          @fname = Digest::MD5.hexdigest(@url)
+          allow(File).to receive(:open).and_return(@buffer)
+          expect(File).to receive(:size?).and_return(0)
+          described_class.config('/')
+        end
+
+        it 'name file as md5 hash of url with query' do
+          expect(File).to receive(:open).with('/' + @fname, 'w')
+          described_class.recall(@url) { |u| 'test data' }
+        end
+
+        it 'should not overwrite existing files' do
+          File.size?
+          expect(File).to receive(:size?).and_return(1234)
+          expect(File).to receive(:open).with('/' + @fname, 'r')
+          described_class.recall(@url) { |u| 'test data' }
+        end
+      end
+    end
+
+    describe 'when not active' do
+      before(:all) do
+        described_class.pause!
+        described_class.deactivate!
+      end
+
+      it 'should not record unless activated and recording enabled' do
+        expect(File).to_not receive(:size?)
+        expect(File).to_not receive(:open)
+      end
+
+      it 'should pass the return of the block through' do
+        expect(described_class.recall(@url) { |u| 'test data' }).to eq 'test data'
+      end
+
+      it 'should not playback if not active' do
+        data = {data: Base64.encode64('test data should not appear')}.to_yaml
+        @buffer = StringIO.new(data)
+        allow(File).to receive(:open).and_return(@buffer)
+        result = described_class.recall(@url) { |u| 'test data' }
+        expect(result).to_not eq 'test data should not appear'
+        expect(result).to eq 'test data'
+      end
+    end
+  end
+end
data/spec/visit_queue_spec.rb
CHANGED
@@ -14,7 +14,6 @@ def get_visit_order(q)
   order
 end
 
-
 module Spider
 
   describe VisitQueue do
@@ -125,7 +124,7 @@ REND
       flag = false
       final = Proc.new { flag = true }
       queue = described_class.new(nil, nil, final)
-      queue.push_back((1
+      queue.push_back(%w(1 2 3 4 5 6 7 8))
       queue.visit_each { queue.stop if queue.visit_count >= 1 }
       expect(queue.visit_count).to eq 1
       expect(flag).to be true
@@ -149,6 +148,19 @@ REND
       expect(@queue.visit_count).to eq 7
     end
 
+    it 'should not be affected by modification of url argument' do
+      @queue.push_front(%w(one three))
+
+      @queue.visit_each do |url|
+        url.gsub! /e/, '@'
+      end
+
+      expect(@queue.url_okay('one')).to be false
+      expect(@queue.url_okay('three')).to be false
+
+      expect(@queue.url_okay('on@')).to be true
+      expect(@queue.url_okay('thr@@')).to be true
+    end
   end
 
 end
data/spiderkit.gemspec
CHANGED
@@ -16,6 +16,7 @@ Gem::Specification.new do |spec|
   spec.files = `git ls-files`.split($/)
   spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
   spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
+  spec.required_ruby_version = '>= 1.9.2.330'
   spec.require_paths = ["lib"]
 
   spec.add_development_dependency "bundler", "~> 1.3"
metadata
CHANGED
@@ -1,78 +1,69 @@
 --- !ruby/object:Gem::Specification
 name: spiderkit
 version: !ruby/object:Gem::Version
-  version: 0.1.2
-  prerelease:
+  version: 0.2.0
 platform: ruby
 authors:
 - Robert Dormer
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2016-
+date: 2016-08-06 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: bundler
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.3'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: '1.3'
 - !ruby/object:Gem::Dependency
   name: rspec
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: 3.4.0
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - - ~>
+    - - "~>"
      - !ruby/object:Gem::Version
        version: 3.4.0
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - -
+    - - ">="
       - !ruby/object:Gem::Version
         version: '0'
 - !ruby/object:Gem::Dependency
   name: bloom-filter
   requirement: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: 0.2.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
-    none: false
     requirements:
-    - - ~>
+    - - "~>"
       - !ruby/object:Gem::Version
         version: 0.2.0
 description: Spiderkit library for basic spiders and bots
@@ -88,39 +79,41 @@ files:
 - Rakefile
 - lib/exclusion.rb
 - lib/queue.rb
+- lib/recorder.rb
 - lib/spiderkit.rb
 - lib/version.rb
 - lib/wait_time.rb
 - spec/exclusion_parser_spec.rb
+- spec/recorder_spec.rb
 - spec/visit_queue_spec.rb
 - spec/wait_time_spec.rb
 - spiderkit.gemspec
 homepage: http://github.com/rdormer/spiderkit
 licenses:
 - MIT
+metadata: {}
 post_install_message:
 rdoc_options: []
 require_paths:
 - lib
 required_ruby_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - ">="
     - !ruby/object:Gem::Version
-      version:
+      version: 1.9.2.330
 required_rubygems_version: !ruby/object:Gem::Requirement
-  none: false
   requirements:
-  - -
+  - - ">="
     - !ruby/object:Gem::Version
       version: '0'
 requirements: []
 rubyforge_project:
-rubygems_version:
+rubygems_version: 2.4.8
 signing_key:
-specification_version:
+specification_version: 4
 summary: Basic toolkit for writing web spiders and bots
 test_files:
 - spec/exclusion_parser_spec.rb
+- spec/recorder_spec.rb
 - spec/visit_queue_spec.rb
 - spec/wait_time_spec.rb