spiderkit 0.1.0

data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in spiderkit.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2016 Robert Dormer
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,158 @@
1
+ # Spiderkit
2
+
3
+ Spiderkit - Lightweight library for spiders and bots
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ gem 'spiderkit'
10
+
11
+ And then execute:
12
+
13
+ $ bundle
14
+
15
+ Or install it yourself as:
16
+
17
+ $ gem install spiderkit
18
+
19
+ ## Well-Behaved Spiders
20
+
21
+ That's not to say you can't write ill-behaved spiders with this gem, but you're kind of a jerk if you do, and I'd really rather you didn't! A well-behaved spider will do a few simple things:
22
+
23
+ * It will download and obey robots.txt
24
+ * It will avoid repeatedly re-visiting pages
25
+ * It will wait between requests and avoid aggressive spidering
26
+ * It will honor rate-limit return codes
27
+ * It will send a valid User-Agent string
28
+
29
+ This library is written with an eye towards rapidly prototyping spiders that will do all of these things, plus whatever else you can come up with.
30
+
31
+ ## Usage
32
+
33
+ Using Spiderkit, you implement your spiders and bots around the idea of a visit queue. URLs (or any objects you like) are added to the queue, and the queue is then iterated. It obeys a few simple rules:
34
+
35
+ * You can add any kind of object you like
36
+ * You can add more objects to the queue as you iterate through it
37
+ * Once an object is iterated over, it's removed from the queue
38
+ * Once an object is iterated over, its string value is added to an already-visited list, at which point you can't add it again.
39
+ * The queue will stop once it's empty, and optionally execute a final Proc that you pass to it
40
+ * The queue will not fetch web pages or anything else on its own - that's part of what *you* implement.
41
+
42
+ A basic example:
43
+
44
+ ```ruby
45
+
46
+ mybot = Spider::VisitQueue.new
47
+ mybot.push_front('http://someurl.com')
48
+
49
+ mybot.visit_each do |url|
50
+ #fetch the url
51
+ #pull out the links as linklist
52
+
53
+ mybot.push_back(linklist)
54
+ end
55
+ ```
56
+
57
+ A slightly fancier example:
58
+
59
+ ```ruby
60
+ #download robots.txt as variable txt
61
+ #user agent is "my-bot/1.0"
62
+
63
+ finalizer = Proc.new { puts "done" }
64
+
65
+ mybot = Spider::VisitQueue.new(txt, "my-bot/1.0", finalizer)
66
+ ```
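+
+ Spiderkit never fetches anything itself, so the download step is up to you. Purely as an illustration (the URL and agent string here are placeholders), the fetch above might look something like this with net/http:
+
+ ```ruby
+ require 'net/http'
+ require 'spiderkit'
+
+ uri = URI('http://someurl.com/robots.txt')
+ request = Net::HTTP::Get.new(uri)
+ request['User-Agent'] = 'my-bot/1.0'
+
+ response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
+
+ txt = response.body
+ txt.http_status = response.code.to_i  # accessor added by spiderkit (see lib/spiderkit.rb)
+
+ mybot = Spider::VisitQueue.new(txt, 'my-bot/1.0', finalizer)
+ ```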
67
+
68
+ As URLs are fetched and added to the queue, any links already visited will be dropped transparently. You have the option to push objects to either the front or rear of the queue at any time. If you push to the front of the queue while iterating over it, the items you push will be visited next; if you push to the back, they will be visited last:
69
+
70
+ ```ruby
71
+
72
+ mybot.visit_each do |url|
73
+ #these will be visited next
74
+ mybot.push_front(nexturls)
75
+
76
+ #these will be visited last
77
+ mybot.push_back(lasturls)
78
+ end
79
+
80
+ ```
81
+
82
+ The already-visited list is implemented as a Bloom filter, so you should be able to spider even fairly large domains (and there are quite a few out there) without re-visiting pages.
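+
+ For example, once a url has been visited, pushing it again (from either end, alone or in an array) is silently ignored:
+
+ ```ruby
+ mybot = Spider::VisitQueue.new
+ mybot.push_front('http://someurl.com')
+ mybot.visit_each
+
+ mybot.push_back('http://someurl.com')
+ mybot.push_front(['http://someurl.com'])
+ mybot.empty?  # => true
+ ```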
83
+
84
+ Finally, you can forcefully stop spidering at any point:
85
+
86
+ ```ruby
87
+
88
+ mybot.visit_each do |url|
89
+ mybot.stop
90
+ end
91
+
92
+ ```
93
+
94
+ The finalizer, if any, will still be executed after stopping iteration.
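+
+ A quick sketch of that guarantee, reusing the finalizer from the earlier example:
+
+ ```ruby
+ finalizer = Proc.new { puts "done" }
+
+ mybot = Spider::VisitQueue.new(nil, nil, finalizer)
+ mybot.push_back(%w(http://someurl.com/a http://someurl.com/b))
+
+ mybot.visit_each do |url|
+   mybot.stop
+ end
+
+ # only one url is visited, but "done" is still printed
+ ```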
95
+
96
+ ## Robots.txt
97
+
98
+ Spiderkit also includes a robots.txt parser that can either work standalone or be passed as an argument to the visit queue. If passed as an argument, URLs that are excluded by the robots.txt will be dropped transparently.
99
+
100
+ ```ruby
101
+ #fetch robots.txt as variable txt
102
+
103
+ #create a stand alone parser
104
+ robots_txt = Spider::ExclusionParser.new(txt)
105
+
106
+ robots_txt.excluded?("/") => true
107
+ robots_txt.excluded?("/admin") => false
108
+ robots_txt.allowed?("/blog") => true
109
+
110
+ #pass text directly to visit queue
111
+ mybot = Spider::VisitQueue.new(txt)
112
+ ```
113
+
114
+ Note that you pass the robots.txt text directly to the visit queue - no need to new up the parser yourself. The VisitQueue also has a robot_txt accessor that you can use to read and set the exclusion parser while iterating through the queue:
115
+
116
+ ```ruby
117
+ mybot.visit_each do |url|
118
+ #download a new robots.txt from somewhere
119
+ mybot.robot_txt = Spider::ExclusionParser.new(txt)
120
+ end
121
+ ```
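+
+ The exclusion parser also looks at the HTTP status of the robots.txt response: Spiderkit adds an http_status accessor to String, and a robots.txt tagged with a 401 or 403 status is treated as deny-all. A minimal sketch:
+
+ ```ruby
+ txt = "User-agent: *\nAllow: /"
+ txt.http_status = 403
+
+ robots_txt = Spider::ExclusionParser.new(txt)
+ robots_txt.excluded?("/anything")  # => true
+ ```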
122
+
123
+ ## Wait Time
124
+
125
+ Ideally a bot should wait for some period of time between requests to avoid crashing websites (less likely) or being blacklisted (more likely). A WaitTime class is provided that encapsulates this waiting logic, along with logic to respond to rate-limit codes and the "crawl-delay" directive found in some robots.txt files. Times are in seconds.
126
+
127
+ You can create it standalone, or get it from an exclusion parser:
128
+
129
+ ```ruby
130
+
131
+ #download a robots.txt with a crawl-delay of 40
132
+
133
+ robots_txt = Spider::ExclusionParser.new(txt)
134
+ delay = robots_txt.wait_time
135
+ delay.value => 40
136
+
137
+ #receive a rate limit code, double wait time
138
+ delay.back_off
139
+
140
+ #actually do the waiting part
141
+ delay.wait
142
+
143
+ #in response to some rate limit codes you'll want
144
+ #to sleep for a while, then back off
145
+ delay.reduce_wait
146
+
147
+
148
+ #after one call to back_off and one call to reduce_wait
149
+ delay.value => 160
150
+
151
+ ```
152
+
153
+ By default a WaitTime will have an initial value of 2 seconds. You can pass a value to new to specify the wait in seconds, although values larger than the maximum allowable value (3 minutes / 180 seconds) will be clamped to that maximum.
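+
+ A quick illustration of the default and the clamping:
+
+ ```ruby
+ delay = Spider::WaitTime.new
+ delay.value  # => 2
+
+ delay = Spider::WaitTime.new(1000)
+ delay.value  # => 180
+ ```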
154
+
155
+
156
+ ## Contributing
157
+
158
+ Bug reports, patches, and pull requests are warmly welcomed at http://github.com/rdormer/spiderkit
data/Rakefile ADDED
@@ -0,0 +1,5 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+ task :default => :spec
data/lib/exclusion.rb ADDED
@@ -0,0 +1,149 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ #==============================================
6
+ # This is the class that parses robots.txt and implements the exclusion
7
+ # checking logic therein. Works by breaking the file up into a hash of
8
+ # arrays of directives for each specified user agent, and then parsing the
9
+ # directives into internal arrays and iterating through the list to find a
10
+ # match. URLs are matched case-sensitively; everything else is case-insensitive.
11
+ # The root url is treated as a special case by using a token for it.
12
+ #==============================================
13
+ require 'cgi'
14
+
15
+ module Spider
16
+
17
+ class ExclusionParser
18
+
19
+ attr_accessor :wait_time
20
+
21
+ DISALLOW = "disallow"
22
+ DELAY = "crawl-delay"
23
+ ALLOW = "allow"
24
+
25
+ MAX_DIRECTIVES = 1000
26
+ NULL_MATCH = "*!*"
27
+
28
+ def initialize(text, agent=nil)
29
+ @skip_list = []
30
+ @agent_key = agent
31
+
32
+ return if text.nil? || text.length.zero?
33
+
34
+ if [401, 403].include? text.http_status
35
+ @skip_list << [NULL_MATCH, true]
36
+ return
37
+ end
38
+
39
+ begin
40
+ config = parse_text(text)
41
+ grab_list(config)
42
+ rescue
43
+ end
44
+ end
45
+
46
+ # Check to see if the given url is matched by any rule
47
+ # in the file, and return its associated status
48
+
49
+ def excluded?(url)
50
+ url = safe_unescape(url)
51
+ @skip_list.each do |entry|
52
+ return entry.last if url.include? entry.first
53
+ return entry.last if entry.first == NULL_MATCH
54
+ end
55
+
56
+ false
57
+ end
58
+
59
+ def allowed?(url)
60
+ !excluded?(url)
61
+ end
62
+
63
+ private
64
+
65
+ # Method to process the list of directives for a given user agent.
66
+ # Picks the one that applies to us, and then processes its directives
67
+ # into the skip list by splitting the strings and taking the appropriate
68
+ # action. Stops after a set number of directives to avoid malformed files
69
+ # or denial of service attacks
70
+
71
+ def grab_list(config)
72
+ section = (config.include?(@agent_key) ?
73
+ config[@agent_key] : config['*'])
74
+
75
+ if(section.length > MAX_DIRECTIVES)
76
+ section.slice!(MAX_DIRECTIVES, section.length)
77
+ end
78
+
79
+ section.each do |pair|
80
+ key, value = pair.split(':')
81
+
82
+ next if key.nil? || value.nil? ||
83
+ key.empty? || value.empty?
84
+
85
+ key.downcase!
86
+ key.lstrip!
87
+ key.rstrip!
88
+
89
+ value.lstrip!
90
+ value.rstrip!
91
+
92
+ disallow(value) if key == DISALLOW
93
+ delay(value) if key == DELAY
94
+ allow(value) if key == ALLOW
95
+ end
96
+ end
97
+
98
+ # Top level file parsing method - makes sure carriage returns work,
99
+ # strips out any BOM, then loops through each line and opens up a new
100
+ # array of directives in the hash if a user-agent directive is found
101
+
102
+ def parse_text(text)
103
+ current_key = ""
104
+ config = {}
105
+
106
+ text.gsub!("\r", "\n")
107
+ text.gsub!("\xEF\xBB\xBF".force_encoding("ASCII-8BIT"), '')
108
+
109
+ text.each_line do |line|
110
+ line.lstrip!
111
+ line.rstrip!
112
+ line.gsub! /#.*/, ''
113
+
114
+ if line.length.nonzero? && line =~ /[^\s]/
115
+
116
+ if line =~ /User-agent:\s+(.+)/i
117
+ current_key = $1.downcase
118
+ config[current_key] = [] unless config[current_key]
119
+ next
120
+ end
121
+
122
+ config[current_key] << line
123
+ end
124
+ end
125
+
126
+ config
127
+ end
128
+
129
+ def disallow(value)
130
+ token = (value == "/" ? NULL_MATCH : value.chomp('*'))
131
+ @skip_list << [safe_unescape(token), true]
132
+ end
133
+
134
+ def allow(value)
135
+ token = (value == "/" ? NULL_MATCH : value.chomp('*'))
136
+ @skip_list << [safe_unescape(token), false]
137
+ end
138
+
139
+ def delay(value)
140
+ @wait_time = WaitTime.new(value.to_i)
141
+ end
142
+
143
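+ # Unescape url-encoded characters, but keep a literal %2f intact so that an
+ # encoded slash is not confused with a real path separator
+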
+ def safe_unescape(target)
144
+ t = target.gsub /%2f/, '^^^'
145
+ t = CGI.unescape(t)
146
+ t.gsub /\^\^\^/, '%2f'
147
+ end
148
+ end
149
+ end
data/lib/queue.rb ADDED
@@ -0,0 +1,80 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ require 'bloom-filter'
6
+ require 'exclusion'
7
+
8
+ module Spider
9
+
10
+ class VisitQueue
11
+
12
+ class IterationExit < Exception; end
13
+
14
+ attr_accessor :visit_count
15
+ attr_accessor :robot_txt
16
+
17
+ def initialize(robots=nil, agent=nil, finish=nil)
18
+ @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
19
+ @robot_txt = ExclusionParser.new(robots, agent) if robots
20
+ @finalize = finish
21
+ @visit_count = 0
22
+ @pending = []
23
+ end
24
+
25
+ def visit_each
26
+ begin
27
+ until @pending.empty?
28
+ url = @pending.pop
29
+ if url_okay(url)
30
+ yield url if block_given?
31
+ @visited.insert(url)
32
+ @visit_count += 1
33
+ end
34
+ end
35
+ rescue IterationExit
36
+ end
37
+
38
+ @finalize.call if @finalize
39
+ end
40
+
41
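+ # Urls are consumed from the end of the @pending array (see the pop in
+ # visit_each), so push_front appends to the end and push_back prepends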
+ def push_front(urls)
42
+ add_url(urls) {|u| @pending.push(u)}
43
+ end
44
+
45
+ def push_back(urls)
46
+ add_url(urls) {|u| @pending.unshift(u)}
47
+ end
48
+
49
+ def size
50
+ @pending.size
51
+ end
52
+
53
+ def empty?
54
+ @pending.empty?
55
+ end
56
+
57
+ def stop
58
+ raise IterationExit
59
+ end
60
+
61
+ private
62
+
63
+ def url_okay(url)
64
+ return false if @visited.include?(url)
65
+ return false if @robot_txt && @robot_txt.excluded?(url)
66
+ true
67
+ end
68
+
69
+ def add_url(urls)
70
+ urls = [urls] unless urls.is_a? Array
71
+ urls.compact!
72
+
73
+ urls.each do |url|
74
+ unless @visited.include?(url)
75
+ yield url
76
+ end
77
+ end
78
+ end
79
+ end
80
+ end
data/lib/spiderkit.rb ADDED
@@ -0,0 +1,14 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ $: << File.dirname(__FILE__)
6
+ require 'wait_time'
7
+ require 'exclusion'
8
+ require 'urltree'
9
+ require 'version'
10
+ require 'queue'
11
+
12
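+ # Robots.txt text is passed around as a plain string; this accessor lets callers
+ # attach the HTTP status of the fetch so the exclusion parser can treat a
+ # 401 or 403 response as deny-all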
+ class String
13
+ attr_accessor :http_status
14
+ end
data/lib/version.rb ADDED
@@ -0,0 +1,7 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ module Spider
6
+ VERSION = "0.1.0"
7
+ end
data/lib/wait_time.rb ADDED
@@ -0,0 +1,50 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ #==============================================
6
+ #Class to encapsulate the crawl delay being used.
7
+ #Clamps the value to a maximum amount and implements
8
+ #an exponential backoff function for responding to
9
+ #rate limit requests
10
+ #==============================================
11
+
12
+ module Spider
13
+
14
+ class WaitTime
15
+
16
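+ # All times are in seconds; REDUCE_WAIT is how long reduce_wait sleeps
+ # before backing off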
+ MAX_WAIT = 180
17
+ DEFAULT_WAIT = 2
18
+ REDUCE_WAIT = 300
19
+
20
+ def initialize(period=nil)
21
+ unless period.nil?
22
+ @wait = (period > MAX_WAIT ? MAX_WAIT : period)
23
+ else
24
+ @wait = DEFAULT_WAIT
25
+ end
26
+ end
27
+
28
+ def back_off
29
+ if @wait.zero?
30
+ @wait = DEFAULT_WAIT
31
+ else
32
+ waitval = @wait * 2
33
+ @wait = (waitval > MAX_WAIT ? MAX_WAIT : waitval)
34
+ end
35
+ end
36
+
37
+ def wait
38
+ sleep(@wait)
39
+ end
40
+
41
+ def reduce_wait
42
+ sleep(REDUCE_WAIT)
43
+ back_off
44
+ end
45
+
46
+ def value
47
+ @wait
48
+ end
49
+ end
50
+ end
@@ -0,0 +1,481 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ #See:
6
+ #http://www.robotstxt.org/orig.html
7
+ #http://www.robotstxt.org/norobots-rfc.txt
8
+
9
+ require File.dirname(__FILE__) + '/../lib/spiderkit'
10
+
11
+ module Spider
12
+
13
+ describe ExclusionParser do
14
+
15
+ describe "General file handling" do
16
+ it "should ignore comments" do
17
+ txt = <<-eos
18
+ user-agent: *
19
+ allow: /
20
+ #disallow: /
21
+ eos
22
+
23
+ @bottxt = described_class.new(txt)
24
+ expect(@bottxt.excluded?('/')).to be false
25
+ end
26
+
27
+ it "should ignore comments starting with whitespace" do
28
+ txt = <<-eos
29
+ user-agent: *
30
+ allow: /
31
+ #disallow: /
32
+ eos
33
+
34
+ @bottxt = described_class.new(txt)
35
+ expect(@bottxt.excluded?('/')).to be false
36
+ end
37
+
38
+ it "should cleanly handle winged comments" do
39
+ txt = <<-eos
40
+ user-agent: *
41
+ allow: / #disallow: /
42
+ eos
43
+
44
+ @bottxt = described_class.new(txt)
45
+ expect(@bottxt.excluded?('/')).to be false
46
+ end
47
+
48
+ it "should ignore unrecognized headers" do
49
+ txt = <<-eos
50
+ user-agent: *
51
+ allow: /
52
+ whargarbl: /
53
+ eos
54
+
55
+ @bottxt = described_class.new(txt)
56
+ expect(@bottxt.excluded?('/')).to be false
57
+ end
58
+
59
+ it "should completely ignore an empty file" do
60
+ @bottxt = described_class.new('')
61
+ expect(@bottxt.excluded?('/')).to be false
62
+ expect(@bottxt.excluded?('/test')).to be false
63
+ end
64
+
65
+ it "should stop processing after 1000 directives" do
66
+ txt = <<-eos
67
+ user-agent: *
68
+ eos
69
+
70
+ (1..1002).each {|x| txt += "disallow: /#{x}--\r\n"}
71
+ @bottxt = described_class.new(txt)
72
+
73
+ #remember, we're doing start-of-string matching here,
74
+ #so we need a delimiter or else 100 matches 1001, 1002...
75
+
76
+ expect(@bottxt.excluded?('/1--')).to be true
77
+ expect(@bottxt.excluded?('/100--')).to be true
78
+ expect(@bottxt.excluded?('/1000--')).to be true
79
+ expect(@bottxt.excluded?('/1001--')).to be false
80
+ expect(@bottxt.excluded?('/1002--')).to be false
81
+ end
82
+
83
+ it "should die cleanly on html" do
84
+ txt = <<-eos
85
+ <html>
86
+ <head></head>
87
+ <body></body>
88
+ </html>
89
+ eos
90
+
91
+ @bottxt = described_class.new(txt)
92
+ expect(@bottxt.excluded?('/')).to be false
93
+ end
94
+
95
+ it "should drop byte order marks" do
96
+ txt = <<-eos
97
+ \xEF\xBB\xBF
98
+ user-agent: *
99
+ disallow: /
100
+ eos
101
+
102
+ @bottxt = described_class.new(txt)
103
+ expect(@bottxt.excluded?('/')).to be true
104
+ end
105
+
106
+ it "should be open if no user agent matches and there is no default" do
107
+ txt = <<-eos
108
+ user-agent: test1
109
+ disallow: /
110
+ user-agent: test2
111
+ disallow: /
112
+ eos
113
+
114
+ @bottxt = described_class.new(txt)
115
+ expect(@bottxt.excluded?('/')).to be false
116
+ end
117
+
118
+ it "should handle nil text" do
119
+ @bottxt = described_class.new(nil)
120
+ expect(@bottxt.excluded?('/')).to be false
121
+ end
122
+
123
+ it "should default to deny-all if unauthorized" do
124
+ txt = <<-eos
125
+ user-agent: *
126
+ allow: /
127
+ eos
128
+
129
+ txt.http_status = 401
130
+ @bottxt = described_class.new(txt)
131
+ expect(@bottxt.excluded?('/')).to be true
132
+
133
+ txt.http_status = 403
134
+ @bottxt = described_class.new(txt)
135
+ expect(@bottxt.excluded?('/')).to be true
136
+ end
137
+ end
138
+
139
+ describe "General directive handling" do
140
+ it "should split on CR" do
141
+ txt = "user-agent: *\rdisallow: /"
142
+ @bottxt = described_class.new(txt)
143
+ expect(@bottxt.excluded?('/')).to be true
144
+ end
145
+
146
+ it "should split on NL" do
147
+ txt = "user-agent: *\ndisallow: /"
148
+ @bottxt = described_class.new(txt)
149
+ expect(@bottxt.excluded?('/')).to be true
150
+ end
151
+
152
+ it "should split on CR/NL" do
153
+ txt = "user-agent: *\r\ndisallow: /"
154
+ @bottxt = described_class.new(txt)
155
+ expect(@bottxt.excluded?('/')).to be true
156
+ end
157
+
158
+ it "should be whitespace insensitive" do
159
+ txt = <<-eos
160
+ user-agent: *
161
+ allow: /tmp
162
+ disallow: /
163
+ eos
164
+
165
+ @bottxt = described_class.new(txt)
166
+ expect(@bottxt.excluded?('/')).to be true
167
+ expect(@bottxt.excluded?('/tmp')).to be false
168
+ end
169
+
170
+ it "should match directives case insensitively" do
171
+ txt = <<-eos
172
+ user-agent: *
173
+ DISALLOW: /test1
174
+ ALLOW: /test2
175
+ CRAWL-DELAY: 60
176
+ eos
177
+
178
+ @bottxt = described_class.new(txt)
179
+ expect(@bottxt.excluded?('/test1')).to be true
180
+ expect(@bottxt.excluded?('/test2')).to be false
181
+ end
182
+ end
183
+
184
+ describe "User agent handling" do
185
+ it "should do a case insensitive agent match" do
186
+ txt1 = <<-eos
187
+ user-agent: testbot
188
+ disallow: /
189
+ eos
190
+
191
+ txt2 = <<-eos
192
+ user-agent: TESTbot
193
+ disallow: /
194
+ eos
195
+
196
+ txt3 = <<-eos
197
+ user-agent: TESTBOT
198
+ disallow: /
199
+ eos
200
+
201
+ @bottxt1 = described_class.new(txt1, 'testbot')
202
+ @bottxt2 = described_class.new(txt2, 'testbot')
203
+ @bottxt3 = described_class.new(txt3, 'testbot')
204
+
205
+ expect(@bottxt1.excluded?('/')).to be true
206
+ expect(@bottxt2.excluded?('/')).to be true
207
+ expect(@bottxt3.excluded?('/')).to be true
208
+ end
209
+
210
+ it "should handle default user agent" do
211
+ txt = <<-eos
212
+ user-agent: *
213
+ disallow: /test
214
+ eos
215
+
216
+ @bottxt = described_class.new(txt)
217
+ expect(@bottxt.excluded?('/test')).to be true
218
+ end
219
+
220
+ it "should use only the first of multiple default user agents" do
221
+ txt = <<-eos
222
+ user-agent: *
223
+ disallow: /
224
+
225
+ user-agent: *
226
+ allow: /
227
+ eos
228
+
229
+ @bottxt = described_class.new(txt)
230
+ expect(@bottxt.excluded?('/')).to be true
231
+ end
232
+
233
+ it "should give precedence to a matching user agent over default" do
234
+ txt = <<-eos
235
+ user-agent: testbot
236
+ disallow: /
237
+
238
+ user-agent: *
239
+ disallow:
240
+ eos
241
+
242
+ @bottxt = described_class.new(txt, 'testbot')
243
+ expect(@bottxt.excluded?('/')).to be true
244
+ end
245
+
246
+ xit "should allow cascading user-agent strings"
247
+ end
248
+
249
+ describe "Disallow directive" do
250
+ it "should allow all urls if disallow is empty" do
251
+ txt = <<-eos
252
+ user-agent: *
253
+ disallow:
254
+ eos
255
+
256
+ @bottxt = described_class.new(txt)
257
+ expect(@bottxt.excluded?('/')).to be false
258
+ expect(@bottxt.excluded?('test')).to be false
259
+ expect(@bottxt.excluded?('/test')).to be false
260
+ end
261
+
262
+ it "should blacklist any url starting with the specified string" do
263
+ txt = <<-eos
264
+ user-agent: *
265
+ disallow: /tmp
266
+ eos
267
+
268
+ @bottxt = described_class.new(txt)
269
+ expect(@bottxt.excluded?('/tmp')).to be true
270
+ expect(@bottxt.excluded?('/tmp1234')).to be true
271
+ expect(@bottxt.excluded?('/tmp/stuff')).to be true
272
+ expect(@bottxt.excluded?('/tmporary')).to be true
273
+
274
+ expect(@bottxt.excluded?('/nottmp')).to be false
275
+ expect(@bottxt.excluded?('tmp')).to be false
276
+ end
277
+
278
+ it "should blacklist all urls if root is specified" do
279
+ txt = <<-eos
280
+ user-agent: *
281
+ disallow: /
282
+ eos
283
+
284
+ @bottxt = described_class.new(txt)
285
+ expect(@bottxt.excluded?('/')).to be true
286
+ expect(@bottxt.excluded?('/nottmp')).to be true
287
+ expect(@bottxt.excluded?('/test')).to be true
288
+ expect(@bottxt.excluded?('nottmp')).to be true
289
+ expect(@bottxt.excluded?('test')).to be true
290
+ end
291
+
292
+ it "should match urls case sensitively" do
293
+ txt = <<-eos
294
+ user-agent: *
295
+ disallow: /tmp
296
+ eos
297
+
298
+ @bottxt = described_class.new(txt)
299
+ expect(@bottxt.excluded?('/tmp')).to be true
300
+ expect(@bottxt.excluded?('/TMP')).to be false
301
+ expect(@bottxt.excluded?('/Tmp')).to be false
302
+ end
303
+
304
+ it "should decode url encoded characters" do
305
+ txt = <<-eos
306
+ user-agent: *
307
+ disallow: /a%3cd.html
308
+ eos
309
+
310
+ @bottxt = described_class.new(txt)
311
+ expect(@bottxt.excluded?('/a%3cd.html')).to be true
312
+ expect(@bottxt.excluded?('/a%3Cd.html')).to be true
313
+
314
+ txt = <<-eos
315
+ user-agent: *
316
+ disallow: /a%3Cd.html
317
+ eos
318
+
319
+ @bottxt = described_class.new(txt)
320
+ expect(@bottxt.excluded?('/a%3cd.html')).to be true
321
+ expect(@bottxt.excluded?('/a%3Cd.html')).to be true
322
+ end
323
+
324
+ it "should not decode %2f" do
325
+ txt = <<-eos
326
+ user-agent: *
327
+ disallow: /a%2fb.html
328
+ eos
329
+
330
+ @bottxt = described_class.new(txt)
331
+ expect(@bottxt.excluded?('/a%2fb.html')).to be true
332
+ expect(@bottxt.excluded?('/a/b.html')).to be false
333
+
334
+ txt = <<-eos
335
+ user-agent: *
336
+ disallow: /a/b.html
337
+ eos
338
+
339
+ @bottxt = described_class.new(txt)
340
+ expect(@bottxt.excluded?('/a%2fb.html')).to be false
341
+ expect(@bottxt.excluded?('/a/b.html')).to be true
342
+ end
343
+
344
+ it "should override allow if it comes first" do
345
+ txt = <<-eos
346
+ user-agent: *
347
+ disallow: /tmp
348
+ allow: /tmp
349
+ eos
350
+
351
+ @bottxt = described_class.new(txt)
352
+ expect(@bottxt.excluded?('/tmp')).to be true
353
+ end
354
+ end
355
+
356
+ describe "Allow directive" do
357
+ it "should override disallow if it comes first" do
358
+ txt = <<-eos
359
+ user-agent: *
360
+ allow: /tmp
361
+ disallow: /tmp
362
+ eos
363
+
364
+ @bottxt = described_class.new(txt)
365
+ expect(@bottxt.excluded?('/tmp')).to be false
366
+ end
367
+
368
+ it "should override disallow root if it comes first" do
369
+ txt = <<-eos
370
+ user-agent: *
371
+ allow: /tmp
372
+ allow: /test
373
+ disallow: /
374
+ eos
375
+
376
+ @bottxt = described_class.new(txt)
377
+ expect(@bottxt.excluded?('/tmp')).to be false
378
+ expect(@bottxt.excluded?('/test')).to be false
379
+ expect(@bottxt.excluded?('/other1')).to be true
380
+ expect(@bottxt.excluded?('/other2')).to be true
381
+ end
382
+
383
+ it "allowing root should blacklist nothing" do
384
+ txt = <<-eos
385
+ user-agent: *
386
+ allow: /
387
+ eos
388
+
389
+ @bottxt = described_class.new(txt)
390
+ expect(@bottxt.excluded?('/tmp')).to be false
391
+ expect(@bottxt.excluded?('/test')).to be false
392
+ expect(@bottxt.excluded?('/zzz')).to be false
393
+ end
394
+ end
395
+
396
+ describe "Crawl-Delay directive" do
397
+ it "should set the crawl delay" do
398
+ txt = <<-eos
399
+ user-agent: *
400
+ crawl-delay: 100
401
+ eos
402
+
403
+ @bottxt = described_class.new(txt)
404
+ expect(@bottxt.wait_time.value).to eq 100
405
+ end
406
+
407
+ it "should limit wait time to 180 seconds" do
408
+ txt = <<-eos
409
+ user-agent: *
410
+ crawl-delay: 1000
411
+ eos
412
+
413
+ @bottxt = described_class.new(txt)
414
+ expect(@bottxt.wait_time.value).to eq 180
415
+ end
416
+ end
417
+
418
+ describe "RFC Examples" do
419
+ it "#1" do
420
+ txt = <<-eos
421
+ User-agent: *
422
+ Disallow: /org/plans.html
423
+ Allow: /org/
424
+ Allow: /serv
425
+ Allow: /~mak
426
+ Disallow: /
427
+ eos
428
+
429
+ @bottxt = described_class.new(txt)
430
+ expect(@bottxt.excluded?('/')).to be true
431
+ expect(@bottxt.excluded?('/index.html')).to be true
432
+ expect(@bottxt.excluded?('/server.html')).to be false
433
+ expect(@bottxt.excluded?('/services/fast.html')).to be false
434
+ expect(@bottxt.excluded?('/services/slow.html')).to be false
435
+ expect(@bottxt.excluded?('/orgo.gif')).to be true
436
+ expect(@bottxt.excluded?('/org/about.html')).to be false
437
+ expect(@bottxt.excluded?('/org/plans.html')).to be true
438
+ expect(@bottxt.excluded?('/%7Ejim/jim.html ')).to be true
439
+ expect(@bottxt.excluded?('/%7Emak/mak.html')).to be false
440
+ end
441
+
442
+ it "#2" do
443
+ txt = <<-eos
444
+ User-agent: *
445
+ Disallow: /
446
+ eos
447
+
448
+ @bottxt = described_class.new(txt)
449
+ expect(@bottxt.excluded?('/')).to be true
450
+ expect(@bottxt.excluded?('/index.html')).to be true
451
+ expect(@bottxt.excluded?('/server.html')).to be true
452
+ expect(@bottxt.excluded?('/services/fast.html')).to be true
453
+ expect(@bottxt.excluded?('/services/slow.html')).to be true
454
+ expect(@bottxt.excluded?('/orgo.gif')).to be true
455
+ expect(@bottxt.excluded?('/org/about.html')).to be true
456
+ expect(@bottxt.excluded?('/org/plans.html')).to be true
457
+ expect(@bottxt.excluded?('/%7Ejim/jim.html ')).to be true
458
+ expect(@bottxt.excluded?('/%7Emak/mak.html')).to be true
459
+ end
460
+
461
+ it "#3" do
462
+ txt = <<-eos
463
+ User-agent: *
464
+ Disallow:
465
+ eos
466
+
467
+ @bottxt = described_class.new(txt)
468
+ expect(@bottxt.excluded?('/')).to be false
469
+ expect(@bottxt.excluded?('/index.html')).to be false
470
+ expect(@bottxt.excluded?('/server.html')).to be false
471
+ expect(@bottxt.excluded?('/services/fast.html')).to be false
472
+ expect(@bottxt.excluded?('/services/slow.html')).to be false
473
+ expect(@bottxt.excluded?('/orgo.gif')).to be false
474
+ expect(@bottxt.excluded?('/org/about.html')).to be false
475
+ expect(@bottxt.excluded?('/org/plans.html')).to be false
476
+ expect(@bottxt.excluded?('/%7Ejim/jim.html ')).to be false
477
+ expect(@bottxt.excluded?('/%7Emak/mak.html')).to be false
478
+ end
479
+ end
480
+ end
481
+ end
@@ -0,0 +1,129 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ require File.dirname(__FILE__) + '/../lib/spiderkit'
6
+
7
+ def get_visit_order(q)
8
+ order = []
9
+
10
+ q.visit_each do |t|
11
+ order << t
12
+ end
13
+
14
+ order
15
+ end
16
+
17
+
18
+ module Spider
19
+
20
+ describe VisitQueue do
21
+ before(:each) do
22
+ @queue = described_class.new
23
+ @queue.push_front('divider')
24
+ end
25
+
26
+ it 'should allow appending to front of the queue' do
27
+ @queue.push_front('two')
28
+ @queue.push_front('one')
29
+
30
+ order = get_visit_order(@queue)
31
+ expect(order).to eq %w(one two divider)
32
+ end
33
+
34
+
35
+ it 'should allow appending to the back of the queue' do
36
+ @queue.push_back('one')
37
+ @queue.push_back('two')
38
+
39
+ order = get_visit_order(@queue)
40
+ expect(order).to eq %w(divider one two)
41
+ end
42
+
43
+ it 'should allow appending array of urls to front of the queue' do
44
+ @queue.push_front(%w(two one))
45
+ order = get_visit_order(@queue)
46
+ expect(order).to eq %w(one two divider)
47
+ end
48
+
49
+ it 'should allow appending array of urls to back of the queue' do
50
+ @queue.push_back(%w(one two))
51
+ order = get_visit_order(@queue)
52
+ expect(order).to eq %w(divider one two)
53
+ end
54
+
55
+ it 'should not allow appending nil to the queue' do
56
+ expect(@queue.size).to eq 1
57
+
58
+ @queue.push_back(nil)
59
+ @queue.push_front(nil)
60
+ @queue.push_back([nil, nil, nil])
61
+ @queue.push_front([nil, nil, nil])
62
+
63
+ expect(@queue.size).to eq 1
64
+ end
65
+
66
+ it 'should visit urls appended during iteration' do
67
+ @queue.push_front(%w(one two))
68
+ extra_urls = %w(three four five)
69
+ order = []
70
+
71
+ @queue.visit_each do |t|
72
+ order << t
73
+ @queue.push_back(extra_urls.pop)
74
+ end
75
+
76
+ expect(order).to eq %w(two one divider five four three)
77
+ end
78
+
79
+ it 'should ignore appending if url has already been visited' do
80
+ @queue.visit_each
81
+ expect(@queue.empty?).to be true
82
+ @queue.push_back('divider')
83
+ @queue.push_front('divider')
84
+ @queue.push_back(%w(divider divider))
85
+ @queue.push_front(%w(divider divider))
86
+ expect(@queue.empty?).to be true
87
+ end
88
+
89
+ it 'should track number of urls visited' do
90
+ expect(@queue.visit_count).to eq 0
91
+ @queue.push_back(%w(one two three four))
92
+ @queue.visit_each
93
+ expect(@queue.visit_count).to eq 5
94
+ end
95
+
96
+ it 'should not visit urls blocked by robots.txt' do
97
+ rtext = <<-REND
98
+ User-agent: *
99
+ disallow: two
100
+ disallow: three
101
+ disallow: four
102
+ REND
103
+
104
+ rtext_queue = described_class.new(rtext)
105
+ rtext_queue.push_front(%w(one two three four five six))
106
+ order = get_visit_order(rtext_queue)
107
+ expect(order).to eq %w(six five one)
108
+ end
109
+
110
+ it 'should execute a finalizer if given' do
111
+ flag = false
112
+ final = Proc.new { flag = true }
113
+ queue = described_class.new(nil, nil, final)
114
+ queue.visit_each
115
+ expect(flag).to be true
116
+ end
117
+
118
+ it 'should execute the finalizer even when breaking the loop' do
119
+ flag = false
120
+ final = Proc.new { flag = true }
121
+ queue = described_class.new(nil, nil, final)
122
+ queue.push_back((1..20).to_a)
123
+ queue.visit_each { queue.stop if queue.visit_count >= 1 }
124
+ expect(queue.visit_count).to eq 1
125
+ expect(flag).to be true
126
+ end
127
+ end
128
+
129
+ end
@@ -0,0 +1,51 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ require File.dirname(__FILE__) + '/../lib/spiderkit'
6
+
7
+ module Spider
8
+ describe WaitTime do
9
+
10
+ it 'should have a getter for the value' do
11
+ wait = described_class.new(100)
12
+ expect(wait.value).to eq(100)
13
+ end
14
+
15
+ it 'should clamp the wait time argument to three minutes' do
16
+ wait = described_class.new(1000)
17
+ expect(wait.value).to eq(180)
18
+ end
19
+
20
+ it 'should have a default wait time' do
21
+ wait = described_class.new
22
+ expect(wait.value).to eq(2)
23
+ end
24
+
25
+ describe '#back_off' do
26
+ it 'if wait is zero, should set default wait time' do
27
+ wait = described_class.new(0)
28
+ wait.back_off
29
+ expect(wait.value).to eq(2)
30
+ end
31
+
32
+ it 'should double the wait time every time called' do
33
+ wait = described_class.new(10)
34
+ wait.back_off
35
+ expect(wait.value).to eq(20)
36
+ wait.back_off
37
+ expect(wait.value).to eq(40)
38
+ wait.back_off
39
+ expect(wait.value).to eq(80)
40
+ end
41
+
42
+ it 'should not double beyond the maximum value' do
43
+ wait = described_class.new(90)
44
+ wait.back_off
45
+ expect(wait.value).to eq(180)
46
+ wait.back_off
47
+ expect(wait.value).to eq(180)
48
+ end
49
+ end
50
+ end
51
+ end
data/spiderkit.gemspec ADDED
@@ -0,0 +1,26 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "spiderkit"
8
+ spec.version = Spider::VERSION
9
+ spec.authors = ["Robert Dormer"]
10
+ spec.email = ["rdormer@gmail.com"]
11
+ spec.description = %q{Spiderkit library for basic spiders and bots}
12
+ spec.summary = %q{Basic toolkit for writing web spiders and bots}
13
+ spec.homepage = "http://github.com/rdormer/spiderkit"
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files`.split($/)
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.3"
22
+ spec.add_development_dependency "rspec", "~> 3.4.0"
23
+ spec.add_development_dependency "rake"
24
+
25
+ spec.add_dependency "bloom-filter", "~> 0.2.0"
26
+ end
metadata ADDED
@@ -0,0 +1,126 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: spiderkit
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Robert Dormer
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2016-07-10 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: bundler
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ~>
20
+ - !ruby/object:Gem::Version
21
+ version: '1.3'
22
+ type: :development
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ~>
28
+ - !ruby/object:Gem::Version
29
+ version: '1.3'
30
+ - !ruby/object:Gem::Dependency
31
+ name: rspec
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ~>
36
+ - !ruby/object:Gem::Version
37
+ version: 3.4.0
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ~>
44
+ - !ruby/object:Gem::Version
45
+ version: 3.4.0
46
+ - !ruby/object:Gem::Dependency
47
+ name: rake
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ - !ruby/object:Gem::Dependency
63
+ name: bloom-filter
64
+ requirement: !ruby/object:Gem::Requirement
65
+ none: false
66
+ requirements:
67
+ - - ~>
68
+ - !ruby/object:Gem::Version
69
+ version: 0.2.0
70
+ type: :runtime
71
+ prerelease: false
72
+ version_requirements: !ruby/object:Gem::Requirement
73
+ none: false
74
+ requirements:
75
+ - - ~>
76
+ - !ruby/object:Gem::Version
77
+ version: 0.2.0
78
+ description: Spiderkit library for basic spiders and bots
79
+ email:
80
+ - rdormer@gmail.com
81
+ executables: []
82
+ extensions: []
83
+ extra_rdoc_files: []
84
+ files:
85
+ - Gemfile
86
+ - LICENSE.txt
87
+ - README.md
88
+ - Rakefile
89
+ - lib/exclusion.rb
90
+ - lib/queue.rb
91
+ - lib/spiderkit.rb
92
+ - lib/version.rb
93
+ - lib/wait_time.rb
94
+ - spec/exclusion_parser_spec.rb
95
+ - spec/visit_queue_spec.rb
96
+ - spec/wait_time_spec.rb
97
+ - spiderkit.gemspec
98
+ homepage: http://github.com/rdormer/spiderkit
99
+ licenses:
100
+ - MIT
101
+ post_install_message:
102
+ rdoc_options: []
103
+ require_paths:
104
+ - lib
105
+ required_ruby_version: !ruby/object:Gem::Requirement
106
+ none: false
107
+ requirements:
108
+ - - ! '>='
109
+ - !ruby/object:Gem::Version
110
+ version: '0'
111
+ required_rubygems_version: !ruby/object:Gem::Requirement
112
+ none: false
113
+ requirements:
114
+ - - ! '>='
115
+ - !ruby/object:Gem::Version
116
+ version: '0'
117
+ requirements: []
118
+ rubyforge_project:
119
+ rubygems_version: 1.8.25
120
+ signing_key:
121
+ specification_version: 3
122
+ summary: Basic toolkit for writing web spiders and bots
123
+ test_files:
124
+ - spec/exclusion_parser_spec.rb
125
+ - spec/visit_queue_spec.rb
126
+ - spec/wait_time_spec.rb