spiderkit 0.1.2 → 0.2.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 5d3b757d2369c3d6c9520e8f4b78f60f79fdfd95
4
+ data.tar.gz: 562120125d785f938f37b102a53e90dc78dc56cc
5
+ SHA512:
6
+ metadata.gz: 391d8705750fe6e738152384439da6b512dc29471796c37bd48f1d90bd2b3909fc1a1c6257c1a95baff42533d662140f7cf6a999a3d9cedca17cdd8cb1f43c06
7
+ data.tar.gz: a243700c54bf0eecaf16a879d76e201c7debcb93094bed3ba32bea47f9615c321ba5f297451955cead398e3fda3f7ecd2892673eea87102339e0ca75c3df34de
data/README.md CHANGED
@@ -58,8 +58,8 @@ end
58
58
  A slightly fancier example:
59
59
 
60
60
  ```ruby
61
- #download robots.txt as variable txt
62
- #user agent for robots.txt is "my-bot"
61
+ # download robots.txt as variable txt
62
+ # user agent for robots.txt is "my-bot"
63
63
 
64
64
  finalizer = Proc.new { puts "done"}
65
65
  mybot = Spider::VisitQueue.new(txt, "my-bot", finalizer)
@@ -69,10 +69,10 @@ As urls are fetched and added to the queue, any links already visited will be dr
69
69
 
70
70
  ```ruby
71
71
  mybot.visit_each do |url|
72
- #these will be visited next
72
+ # these will be visited next
73
73
  mybot.push_front(nexturls)
74
74
 
75
- #these will be visited last
75
+ # these will be visited last
76
76
  mybot.push_back(lasturls)
77
77
  end
78
78
  ```
@@ -104,24 +104,26 @@ The finalizer, if any, will still be executed after stopping iteration.
104
104
  Spiderkit also includes a robots.txt parser that can either work standalone, or be passed as an argument to the visit queue. If passed as an argument, urls that are excluded by the robots.txt will be dropped transparently.
105
105
 
106
106
  ```ruby
107
- #fetch robots.txt as variable txt
107
+ # fetch robots.txt as variable txt
108
108
 
109
- #create a stand alone parser
109
+ # create a stand alone parser
110
110
  robots_txt = Spider::ExclusionParser.new(txt)
111
111
 
112
112
  robots_txt.excluded?("/") => true
113
113
  robots_txt.excluded?("/admin") => false
114
114
  robots_txt.allowed?("/blog") => true
115
115
 
116
- #pass text directly to visit queue
116
+ # pass text directly to visit queue
117
117
  mybot = Spider::VisitQueue.new(txt)
118
118
  ```
119
119
 
120
120
  Note that you pass the robots.txt directly to the visit queue - no need to new up the parser yourself. The VisitQueue also has a robots_txt accessor that you can use to access and set the exclusion parser while iterating through the queue:
121
121
 
122
122
  ```ruby
123
+ require 'open-uri'
124
+
123
125
  mybot.visit_each do |url|
124
- #download a new robots.txt from somewhere
126
+ txt = open('http://wikipedia.org/robots.txt').read
125
127
  mybot.robot_txt = Spider::ExclusionParser.new(txt)
126
128
  end
127
129
  ```
@@ -129,21 +131,31 @@ end
129
131
  If you don't pass an agent string, then the parser will take its configuration from the default agent specified in the robots.txt. If you want your bot to respond to directives for a given user agent, just pass the agent either to the queue when you create it, or to the parser:
130
132
 
131
133
  ```ruby
132
- #visit queue that will respond to any robots.txt
133
- #with User-agent: mybot in them
134
+ # visit queue that will respond to any robots.txt
135
+ # with User-agent: mybot in them
134
136
  mybot = Spider::VisitQueue.new(txt, 'mybot')
135
137
 
136
138
  # same thing as a standalone parser
137
139
  myparser = Spider::ExclusionParser.new(txt, 'mybot')
138
140
  ```
139
141
 
140
- Note that user agent string passed in to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the user agent contained in robots.txt will usually be a subset of the HTTP user agent.
142
+ You can also pass nil or a blank string as the agent to fall back to the default agent. Note that the user agent string passed to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the agent name used in robots.txt will usually be a substring of the HTTP user agent.
141
143
 
142
144
  For example:
143
145
 
144
146
  Googlebot/2.1 (+http://www.google.com/bot.html)
145
147
 
146
- should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the name 'bot' somewhere in their user agent strings.
148
+ should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the name 'bot' somewhere in their user agent strings. You can also pass the response code of the request that fetched the robots.txt file if you like, and let the exclusion parser decide what to do with it:
149
+
150
+ ```ruby
151
+ require 'open-uri'
152
+
153
+ status = 0
154
+ data = open('http://wikipedia.org/robots.txt') { |f| status = f.status.first.to_i; f.read }
155
+ mybot.robot_txt = Spider::ExclusionParser.new(data, 'mybot', status)
156
+ ```
157
+
158
+ Finally, as a sanity check / to avoid DoS honeypots with malicious robots.txt files, the exclusion parser will process a maximum of one thousand non-whitespace lines before stopping.
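To tie the agent handling together, here is a minimal sketch; the agent name `testbot` and the robots.txt body are made up for illustration (they mirror the new specs):

```ruby
txt = <<-ROBOTS
  User-agent: testbot
  Disallow: /

  User-agent: *
  Disallow:
ROBOTS

# a matching agent picks up its own section
Spider::ExclusionParser.new(txt, 'testbot').excluded?('/')  # => true

# nil or a blank agent falls back to the default (*) section
Spider::ExclusionParser.new(txt, nil).excluded?('/')        # => false
Spider::ExclusionParser.new(txt, '').excluded?('/')         # => false
```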
147
159
 
148
160
  ## Wait Time
149
161
 
@@ -152,29 +164,73 @@ Ideally a bot should wait for some period of time in between requests to avoid c
152
164
  You can create it standalone, or get it from an exclusion parser:
153
165
 
154
166
  ```ruby
155
- #download a robots.txt with a crawl-delay 40
167
+ # download a robots.txt with a crawl-delay 40
156
168
 
157
169
  robots_txt = Spider::ExclusionParser.new(txt)
158
170
  delay = robots_txt.wait_time
159
171
  delay.value => 40
160
172
 
161
- #receive a rate limit code, double wait time
173
+ # receive a rate limit code, double wait time
162
174
  delay.back_off
163
175
 
164
- #actually do the waiting part
176
+ # actually do the waiting part
165
177
  delay.wait
166
178
 
167
- #in response to some rate limit codes you'll want
168
- #to sleep for a while, then back off
179
+ # in response to some rate limit codes you'll want
180
+ # to sleep for a while, then back off
169
181
  delay.reduce_wait
170
182
 
171
183
 
172
- #after one call to back_off and one call to reduce_wait
184
+ # after one call to back_off and one call to reduce_wait
173
185
  delay.value => 160
174
186
  ```
175
187
 
176
188
  By default a WaitTime will specify an initial value of 2 seconds. You can pass a value to new to specify the wait in seconds, although values larger than the maximum allowable value (3 minutes / 180 seconds) will be clamped to that maximum.
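A quick sketch of the default and the clamping (the values come from the DEFAULT_WAIT and MAX_WAIT constants in wait_time.rb):

```ruby
Spider::WaitTime.new.value       # => 2   (DEFAULT_WAIT)
Spider::WaitTime.new(40).value   # => 40
Spider::WaitTime.new(500).value  # => 180 (clamped to MAX_WAIT)
```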
177
189
 
190
+ ## Recording Requests
191
+
192
+ For convenience, an HTTP request recorder is provided, which is highly useful for writing regression and integration tests. It accepts a block of code that returns a string containing the response data. The String class is monkey-patched to add http_status and http_headers accessors for ease of transporting other request data (yes, I know, monkey patching is evil). Information assigned to these accessors will be saved by the recorder as well, but their use is not required. The recorder class manages the marshaling and unmarshaling of the request data behind the scenes, saving each request under a file name hashed from its URL, with the contents YAML-ized and Base64 encoded. This is similar to VCR, and you can certainly use that instead. However, I personally ran into some trouble integrating it into some spiders I was writing, so I came up with this as a simple, lightweight alternative that works well with the rest of Spiderkit.
193
+
194
+ The recorder will not play back request data unless enabled, and it will not save request data unless recording is turned on. This is done with the **activate!** and **record!** methods, respectively. You can stop recording with the **pause!** method and stop playback with the **deactivate!** method.
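In code, that is just a matter of flipping the class-level switches:

```ruby
Spider::VisitRecorder.activate!    # enable playback of saved requests
Spider::VisitRecorder.record!      # also allow new requests to be saved

# later on...
Spider::VisitRecorder.pause!       # stop saving, keep playing back
Spider::VisitRecorder.deactivate!  # turn playback off entirely
```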
195
+
196
+ A simple spider for iterating pages and recording them might look like this:
197
+
198
+ ```ruby
199
+ require 'spiderkit'
200
+ require 'open-uri'
201
+
202
+ mybot = Spider::VisitQueue.new
203
+ mybot.push_front('http://someurl.com')
204
+
205
+ Spider::VisitRecorder.config('/save/path')
206
+ Spider::VisitRecorder.activate!
207
+ Spider::VisitRecorder.record!
208
+
209
+ mybot.visit_each do |url|
210
+
211
+ data = Spider::VisitRecorder.recall(url) do
212
+ text = ''
213
+ puts "fetch #{url}"
214
+ open(url) do |f|
215
+ text = f.read
216
+ # doing this is only necessary if you want to
217
+ # save this information in the recording
218
+ text.http_status = f.status.first.to_i
219
+ end
220
+
221
+ text
222
+ end
223
+
224
+ # extract links from data and push onto the
225
+ # spider queue
226
+ end
227
+ ```
228
+
229
+ After the first time the pages are spidered and saved, any subsequent run will simply replay the recorded data. By default the saved request files end up in the working directory; the path that requests are saved to can be altered using the **config** method:
230
+
231
+ ```ruby
232
+ Spider::VisitRecorder.config('/some/test/path')
233
+ ```
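In a test run you would typically turn on playback without recording; a minimal sketch of that pattern (the path and URL here are just placeholders):

```ruby
require 'spiderkit'
require 'open-uri'

Spider::VisitRecorder.config('/some/test/path')
Spider::VisitRecorder.activate!   # playback only: record! is never called

data = Spider::VisitRecorder.recall('http://someurl.com') do
  # this block is bypassed when a recording exists; an unrecorded
  # URL raises "Unexpected request" instead of hitting the network
  open('http://someurl.com').read
end
```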
178
234
 
179
235
  ## Contributing
180
236
 
data/lib/exclusion.rb CHANGED
@@ -4,7 +4,7 @@
4
4
 
5
5
  #==============================================
6
6
  # This is the class that parses robots.txt and implements the exclusion
7
- # checking logic therein. Works by breaking the file up into a hash of
7
+ # checking logic therein. Works by breaking the file up into a hash of
8
8
  # arrays of directives for each specified user agent, and then parsing the
9
9
  # directives into internal arrays and iterating through the list to find a
10
10
  # match. Urls are matched case sensitive, everything else is case insensitive.
@@ -13,46 +13,45 @@
13
13
  require 'cgi'
14
14
 
15
15
  module Spider
16
+ class ExclusionParser
16
17
 
17
- class ExclusionParser
18
-
19
18
  attr_accessor :wait_time
20
19
 
21
- DISALLOW = "disallow"
22
- DELAY = "crawl-delay"
23
- ALLOW = "allow"
24
-
20
+ NULL_MATCH = '*!*'.freeze
21
+ DISALLOW = 'disallow'.freeze
22
+ DELAY = 'crawl-delay'.freeze
23
+ ALLOW = 'allow'.freeze
24
+
25
25
  MAX_DIRECTIVES = 1000
26
- NULL_MATCH = "*!*"
27
-
28
- def initialize(text, agent=nil)
26
+
27
+ def initialize(text, agent = nil, status = 200)
29
28
  @skip_list = []
30
29
  @agent_key = agent
31
-
30
+
32
31
  return if text.nil? || text.length.zero?
33
-
34
- if [401, 403].include? text.http_status
32
+
33
+ if [401, 403].include? status
35
34
  @skip_list << [NULL_MATCH, true]
36
35
  return
37
36
  end
38
-
37
+
39
38
  begin
40
39
  config = parse_text(text)
41
40
  grab_list(config)
42
41
  rescue
43
42
  end
44
43
  end
45
-
44
+
46
45
  # Check to see if the given url is matched by any rule
47
46
  # in the file, and return it's associated status
48
-
47
+
49
48
  def excluded?(url)
50
49
  url = safe_unescape(url)
51
50
  @skip_list.each do |entry|
52
51
  return entry.last if url.include? entry.first
53
52
  return entry.last if entry.first == NULL_MATCH
54
53
  end
55
-
54
+
56
55
  false
57
56
  end
58
57
 
@@ -61,89 +60,101 @@ module Spider
61
60
  end
62
61
 
63
62
  private
64
-
63
+
65
64
  # Method to process the list of directives for a given user agent.
66
65
  # Picks the one that applies to us, and then processes it's directives
67
66
  # into the skip list by splitting the strings and taking the appropriate
68
67
  # action. Stops after a set number of directives to avoid malformed files
69
68
  # or denial of service attacks
70
-
69
+
71
70
  def grab_list(config)
72
- section = (config.include?(@agent_key) ?
73
- config[@agent_key] : config['*'])
74
-
75
- if(section.length > MAX_DIRECTIVES)
71
+ if config.include?(@agent_key)
72
+ section = config[@agent_key]
73
+ else
74
+ section = config['*']
75
+ end
76
+
77
+ if section.length > MAX_DIRECTIVES
76
78
  section.slice!(MAX_DIRECTIVES, section.length)
77
79
  end
78
-
80
+
79
81
  section.each do |pair|
80
82
  key, value = pair.split(':')
81
-
82
- next if key.nil? || value.nil? ||
83
- key.empty? || value.empty?
84
-
83
+
84
+ next if key.nil? || value.nil? ||
85
+ key.empty? || value.empty?
86
+
85
87
  key.downcase!
86
88
  key.lstrip!
87
89
  key.rstrip!
88
-
90
+
89
91
  value.lstrip!
90
92
  value.rstrip!
91
-
93
+
92
94
  disallow(value) if key == DISALLOW
93
95
  delay(value) if key == DELAY
94
- allow(value) if key == ALLOW
96
+ allow(value) if key == ALLOW
95
97
  end
96
98
  end
97
-
99
+
98
100
  # Top level file parsing method - makes sure carriage returns work,
99
101
  # strips out any BOM, then loops through each line and opens up a new
100
- # array of directives in the hash if a user-agent directive is found
101
-
102
+ # array of directives in the hash if a user-agent directive is found.
103
+
102
104
  def parse_text(text)
103
- current_key = ""
105
+ current_key = ''
104
106
  config = {}
105
-
107
+
106
108
  text.gsub!("\r", "\n")
107
- text.gsub!("\xEF\xBB\xBF".force_encoding("ASCII-8BIT"), '')
108
-
109
+ text = text.force_encoding('UTF-8')
110
+ text.gsub!("\xEF\xBB\xBF".force_encoding('UTF-8'), '')
111
+
109
112
  text.each_line do |line|
110
113
  line.lstrip!
111
114
  line.rstrip!
112
- line.gsub! /#.*/, ''
113
-
114
- if line.length.nonzero? && line =~ /[^\s]/
115
-
116
- if line =~ /User-agent:\s+(.+)/i
117
- current_key = $1.downcase
118
- config[current_key] = [] unless config[current_key]
119
- next
115
+ line.gsub!(/#.*/, '')
116
+
117
+ next unless line.length.nonzero? && line =~ /[^\s]/
118
+
119
+ if line =~ /User-agent:\s+(.+)/i
120
+ previous_key = current_key
121
+ current_key = $1.downcase
122
+ config[current_key] = [] unless config[current_key]
123
+
124
+ # If we've seen a new user-agent directive and the previous one
125
+ # is empty then we have a cascading user-agent string. Copy the
126
+ # new user agent array ref so both user agents are identical.
127
+
128
+ if config.key?(previous_key) && config[previous_key].size.zero?
129
+ config[previous_key] = config[current_key]
120
130
  end
121
-
131
+
132
+ else
122
133
  config[current_key] << line
123
134
  end
124
135
  end
125
-
136
+
126
137
  config
127
- end
128
-
138
+ end
139
+
129
140
  def disallow(value)
130
- token = (value == "/" ? NULL_MATCH : value.chomp('*'))
141
+ token = (value == '/' ? NULL_MATCH : value.chomp('*'))
131
142
  @skip_list << [safe_unescape(token), true]
132
143
  end
133
-
144
+
134
145
  def allow(value)
135
- token = (value == "/" ? NULL_MATCH : value.chomp('*'))
146
+ token = (value == '/' ? NULL_MATCH : value.chomp('*'))
136
147
  @skip_list << [safe_unescape(token), false]
137
148
  end
138
-
149
+
139
150
  def delay(value)
140
151
  @wait_time = WaitTime.new(value.to_i)
141
152
  end
142
-
153
+
143
154
  def safe_unescape(target)
144
- t = target.gsub /%2f/, '^^^'
155
+ t = target.gsub(/%2f/, '^^^')
145
156
  t = CGI.unescape(t)
146
- t.gsub /\^\^\^/, '%2f'
157
+ t.gsub(/\^\^\^/, '%2f')
147
158
  end
148
159
  end
149
160
  end
data/lib/queue.rb CHANGED
@@ -6,45 +6,48 @@ require 'bloom-filter'
6
6
  require 'exclusion'
7
7
 
8
8
  module Spider
9
-
10
9
  class VisitQueue
11
-
12
- class IterationExit < Exception; end
10
+
11
+ IterationExit = Class.new(Exception)
13
12
 
14
13
  attr_accessor :visit_count
15
14
  attr_accessor :robot_txt
16
15
 
17
- def initialize(robots=nil, agent=nil, finish=nil)
16
+ def initialize(robots = nil, agent = nil, finish = nil)
18
17
  @robot_txt = ExclusionParser.new(robots, agent) if robots
19
18
  @finalize = finish
20
19
  @visit_count = 0
21
20
  clear_visited
22
21
  @pending = []
23
- end
22
+ end
24
23
 
25
24
  def visit_each
26
25
  begin
27
26
  until @pending.empty?
28
27
  url = @pending.pop
29
- if url_okay(url)
30
- yield url if block_given?
31
- @visited.insert(url)
32
- @visit_count += 1
33
- end
34
- end
28
+ next unless url_okay(url)
29
+ yield url.clone if block_given?
30
+ @visited.insert(url)
31
+ @visit_count += 1
32
+ end
35
33
  rescue IterationExit
36
34
  end
37
-
35
+
38
36
  @finalize.call if @finalize
39
- end
40
-
37
+ end
38
+
41
39
  def push_front(urls)
42
- add_url(urls) {|u| @pending.push(u)}
43
- end
44
-
40
+ add_url(urls) { |u| @pending.push(u) }
41
+ end
42
+
45
43
  def push_back(urls)
46
- add_url(urls) {|u| @pending.unshift(u)}
47
- end
44
+ add_url(urls) { |u| @pending.unshift(u) }
45
+ end
46
+
47
+ def mark(urls)
48
+ urls = [urls] unless urls.is_a? Array
49
+ urls.each { |u| @visited.insert(u) }
50
+ end
48
51
 
49
52
  def size
50
53
  @pending.size
@@ -61,23 +64,21 @@ module Spider
61
64
  def clear_visited
62
65
  @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
63
66
  end
64
-
65
- private
66
67
 
67
68
  def url_okay(url)
68
69
  return false if @visited.include?(url)
69
70
  return false if @robot_txt && @robot_txt.excluded?(url)
70
71
  true
71
72
  end
72
-
73
+
74
+ private
75
+
73
76
  def add_url(urls)
74
77
  urls = [urls] unless urls.is_a? Array
75
78
  urls.compact!
76
-
79
+
77
80
  urls.each do |url|
78
- unless @visited.include?(url) || @pending.include?(url)
79
- yield url
80
- end
81
+ yield url unless @visited.include?(url) || @pending.include?(url)
81
82
  end
82
83
  end
83
84
  end
data/lib/recorder.rb ADDED
@@ -0,0 +1,111 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ #==============================================
6
+ # Class for saving visited URIs as YAML-ized files with response codes and
7
+ # headers. Takes the network request code as a block, hashes the URI to get
8
+ # a file name, and then creates it and saves it if it's not present, or reads
9
+ # and returns the contents if it is. Since the data returned from the file
10
+ # is an exact copy of what the block returned for that URI, it's a constant,
11
+ # deterministic recording that is highly useful for integration tests and the
12
+ # like. Yes, you can use VCR if that's your thing, but I found it difficult
13
+ # to integrate with real-world crawlers. This is a lightweight wrapper to
14
+ # give you 90% of the same thing.
15
+ #==============================================
16
+ require 'digest'
17
+ require 'base64'
18
+ require 'yaml'
19
+
20
+ module Spider
21
+ class VisitRecorder
22
+ @@directory = ''
23
+ @@active = false
24
+ @@recording = false
25
+
26
+ class << self
27
+ def activate!
28
+ @@active = true
29
+ end
30
+
31
+ def record!
32
+ @@recording = true
33
+ end
34
+
35
+ def deactivate!
36
+ @@active = false
37
+ end
38
+
39
+ def pause!
40
+ @@recording = false
41
+ end
42
+
43
+ def config(dir)
44
+ @@directory = dir
45
+ end
46
+
47
+ def recall(*args)
48
+ if @@active
49
+ url = args.first.to_s
50
+ data = ''
51
+
52
+ store = locate_file(url)
53
+
54
+ if store.size == 0
55
+ raise "Unexpected request: #{url}" unless @@recording
56
+ data = yield(*args) if block_given?
57
+
58
+ begin
59
+ store.write(package(url, data))
60
+ rescue StandardError => e
61
+ puts e.message
62
+ puts "On file #{store.path}"
63
+ end
64
+
65
+ else
66
+ data = unpackage(store, url)
67
+ end
68
+
69
+ return data
70
+
71
+ elsif block_given?
72
+ yield(*args)
73
+ end
74
+ end
75
+
76
+ private
77
+
78
+ def locate_file(url)
79
+ key = Digest::MD5.hexdigest(url)
80
+ path = File.expand_path(key, @@directory)
81
+ fsize = File.size?(path)
82
+ (fsize.nil? || fsize.zero? ? File.open(path, 'w') : File.open(path, 'r'))
83
+ end
84
+
85
+ def package(url, data)
86
+ payload = {}
87
+ payload[:url] = url.encode('UTF-8')
88
+ payload[:data] = Base64.encode64(data)
89
+
90
+ unless data.http_status.nil?
91
+ payload[:response] = data.http_status
92
+ end
93
+
94
+ unless data.http_headers.nil?
95
+ payload[:headers] = Base64.encode64(data.http_headers)
96
+ end
97
+
98
+ payload.to_yaml
99
+ end
100
+
101
+ def unpackage(store, url)
102
+ raw = YAML.load(store.read)
103
+ raise 'URL mismatch in recording' unless raw[:url] == url
104
+ data = Base64.decode64(raw[:data])
105
+ data.http_headers = Base64.decode64(raw[:headers])
106
+ data.http_status = raw[:response]
107
+ data
108
+ end
109
+ end
110
+ end
111
+ end
data/lib/spiderkit.rb CHANGED
@@ -2,12 +2,14 @@
2
2
  # Copyright:: Copyright (c) 2016 Robert Dormer
3
3
  # License:: MIT
4
4
 
5
- $: << File.dirname(__FILE__)
5
+ $LOAD_PATH << File.dirname(__FILE__)
6
6
  require 'wait_time'
7
7
  require 'exclusion'
8
+ require 'recorder'
8
9
  require 'version'
9
10
  require 'queue'
10
11
 
11
12
  class String
12
13
  attr_accessor :http_status
14
+ attr_accessor :http_headers
13
15
  end
data/lib/version.rb CHANGED
@@ -3,5 +3,5 @@
3
3
  # License:: MIT
4
4
 
5
5
  module Spider
6
- VERSION = "0.1.2"
6
+ VERSION = "0.2.0"
7
7
  end
data/lib/wait_time.rb CHANGED
@@ -3,33 +3,32 @@
3
3
  # License:: MIT
4
4
 
5
5
  #==============================================
6
- #Class to encapsulate the crawl delay being used.
7
- #Clamps the value to a maximum amount and implements
8
- #an exponential backoff function for responding to
9
- #rate limit requests
6
+ # Class to encapsulate the crawl delay being used.
7
+ # Clamps the value to a maximum amount and implements
8
+ # an exponential backoff function for responding to
9
+ # rate limit requests
10
10
  #==============================================
11
11
 
12
12
  module Spider
13
-
14
13
  class WaitTime
15
-
14
+
16
15
  MAX_WAIT = 180
17
16
  DEFAULT_WAIT = 2
18
17
  REDUCE_WAIT = 300
19
18
 
20
- def initialize(period=nil)
21
- unless period.nil?
22
- @wait = (period > MAX_WAIT ? MAX_WAIT : period)
23
- else
19
+ def initialize(period = nil)
20
+ if period.nil?
24
21
  @wait = DEFAULT_WAIT
22
+ else
23
+ @wait = (period > MAX_WAIT ? MAX_WAIT : period)
25
24
  end
26
25
  end
27
26
 
28
27
  def back_off
29
28
  if @wait.zero?
30
- @wait = DEFAULT_WAIT
29
+ @wait = DEFAULT_WAIT
31
30
  else
32
- waitval = @wait * 2
31
+ waitval = @wait * 2
33
32
  @wait = (waitval > MAX_WAIT ? MAX_WAIT : waitval)
34
33
  end
35
34
  end
@@ -42,7 +41,7 @@ module Spider
42
41
  sleep(REDUCE_WAIT)
43
42
  back_off
44
43
  end
45
-
44
+
46
45
  def value
47
46
  @wait
48
47
  end
data/spec/exclusion_parser_spec.rb CHANGED
@@ -126,12 +126,10 @@ module Spider
126
126
  allow: /
127
127
  eos
128
128
 
129
- txt.http_status = 401
130
- @bottxt = described_class.new(txt)
129
+ @bottxt = described_class.new(txt, nil, 401)
131
130
  expect(@bottxt.excluded?('/')).to be true
132
131
 
133
- txt.http_status = 403
134
- @bottxt = described_class.new(txt)
132
+ @bottxt = described_class.new(txt, nil, 403)
135
133
  expect(@bottxt.excluded?('/')).to be true
136
134
  end
137
135
  end
@@ -243,7 +241,53 @@ module Spider
243
241
  expect(@bottxt.excluded?('/')).to be true
244
242
  end
245
243
 
246
- xit "should allow cascading user-agent strings"
244
+ it "should use default agent if passed nil agent string" do
245
+ txt = <<-eos
246
+ user-agent: testbot
247
+ disallow: /
248
+
249
+ user-agent: *
250
+ disallow:
251
+ eos
252
+
253
+ @bottxt = described_class.new(txt, nil)
254
+ expect(@bottxt.excluded?('/')).to be false
255
+ end
256
+
257
+ it "should use default agent if passed blank agent string" do
258
+ txt = <<-eos
259
+ user-agent: testbot
260
+ disallow: /
261
+
262
+ user-agent: *
263
+ disallow:
264
+ eos
265
+
266
+ @bottxt = described_class.new(txt, '')
267
+ expect(@bottxt.excluded?('/')).to be false
268
+ end
269
+
270
+ it "should allow cascading user-agent strings" do
271
+ txt = <<-eos
272
+ user-agent: agentfirst
273
+ user-agent: agentlast
274
+ disallow: /test_dir
275
+ allow: /other_test_dir
276
+ eos
277
+
278
+ bottxt_first = described_class.new(txt, 'agentfirst')
279
+ bottxt_last = described_class.new(txt, 'agentlast')
280
+
281
+ expect(bottxt_first.excluded?('/test_dir')).to be true
282
+ expect(bottxt_last.excluded?('/test_dir')).to be true
283
+ expect(bottxt_first.allowed?('/test_dir')).to be false
284
+ expect(bottxt_last.allowed?('/test_dir')).to be false
285
+
286
+ expect(bottxt_first.excluded?('/other_test_dir')).to be false
287
+ expect(bottxt_last.excluded?('/other_test_dir')).to be false
288
+ expect(bottxt_first.allowed?('/other_test_dir')).to be true
289
+ expect(bottxt_last.allowed?('/other_test_dir')).to be true
290
+ end
247
291
  end
248
292
 
249
293
  describe "Disallow directive" do
data/spec/recorder_spec.rb ADDED
@@ -0,0 +1,128 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ require File.dirname(__FILE__) + '/../lib/spiderkit'
6
+
7
+ module Spider
8
+ describe VisitRecorder do
9
+
10
+ describe 'when active' do
11
+ before(:each) do
12
+ described_class.activate!
13
+ described_class.record!
14
+ @url = "http://test.domain.123"
15
+ end
16
+
17
+ it 'should add http_status to string' do
18
+ expect("".respond_to? :http_status).to be true
19
+ end
20
+
21
+ it 'should add http_headers to string' do
22
+ expect("".respond_to? :http_headers).to be true
23
+ end
24
+
25
+ it 'should execute the block argument if recording data' do
26
+ run_data = ''
27
+ buffer = StringIO.new
28
+ allow(buffer).to receive(:size?).and_return(0)
29
+ allow(File).to receive(:open).and_return(buffer)
30
+ described_class.recall(@url) { |u| run_data = 'ran' }
31
+ expect(run_data).to eq 'ran'
32
+ end
33
+
34
+ describe 'saved information' do
35
+ before(:each) do
36
+ @buffer = StringIO.new
37
+ allow(File).to receive(:open).and_return(@buffer)
38
+ allow(@buffer).to receive(:size?).and_return(0)
39
+
40
+ @data = described_class.recall(@url) do |u|
41
+ rval = "this is the test body"
42
+ rval.http_headers = "test headers"
43
+ rval.http_status = 200
44
+ rval
45
+ end
46
+
47
+ @buffer.rewind
48
+ end
49
+
50
+ it 'should save and return headers' do
51
+ expect(@data.http_headers).to eq "test headers"
52
+ data = YAML.load(@buffer.read)
53
+ expect(data[:headers]).to eq Base64.encode64("test headers")
54
+ end
55
+
56
+ it 'should save the request url' do
57
+ data = YAML.load(@buffer.read)
58
+ expect(data[:url]).to eq @url
59
+ end
60
+
61
+ it 'should save response code' do
62
+ data = YAML.load(@buffer.read)
63
+ expect(data[:response]).to eq 200
64
+ end
65
+
66
+ it 'should save and return the response data' do
67
+ expect(@data).to eq "this is the test body"
68
+ data = YAML.load(@buffer.read)
69
+ expect(data[:data]).to eq Base64.encode64("this is the test body")
70
+ end
71
+
72
+ it 'should bypass the block argument if playing back data' do
73
+ run_flag = false
74
+ described_class.recall(@url) { |u| run_flag = true }
75
+ expect(run_flag).to be false
76
+ end
77
+ end
78
+
79
+ describe 'file operations' do
80
+ before(:each) do
81
+ @path = "test_path/"
82
+ @buffer = StringIO.new
83
+ @fname = Digest::MD5.hexdigest(@url)
84
+ allow(File).to receive(:open).and_return(@buffer)
85
+ expect(File).to receive(:size?).and_return(0)
86
+ described_class.config('/')
87
+ end
88
+
89
+ it 'name file as md5 hash of url with query' do
90
+ expect(File).to receive(:open).with('/' + @fname, 'w')
91
+ described_class.recall(@url) { |u| 'test data' }
92
+ end
93
+
94
+ it 'should not overwrite existing files' do
95
+ File.size?
96
+ expect(File).to receive(:size?).and_return(1234)
97
+ expect(File).to receive(:open).with('/' + @fname, 'r')
98
+ described_class.recall(@url) { |u| 'test data' }
99
+ end
100
+ end
101
+ end
102
+
103
+ describe 'when not active' do
104
+ before(:all) do
105
+ described_class.pause!
106
+ described_class.deactivate!
107
+ end
108
+
109
+ it 'should not record unless activated and recording enabled' do
110
+ expect(File).to_not receive(:size?)
111
+ expect(File).to_not receive(:open)
112
+ end
113
+
114
+ it 'should pass the return of the block through' do
115
+ expect(described_class.recall(@url) { |u| 'test data' }).to eq 'test data'
116
+ end
117
+
118
+ it 'should not playback if not active' do
119
+ data = {data: Base64.encode64('test data should not appear')}.to_yaml
120
+ @buffer = StringIO.new(data)
121
+ allow(File).to receive(:open).and_return(@buffer)
122
+ result = described_class.recall(@url) { |u| 'test data' }
123
+ expect(result).to_not eq 'test data should not appear'
124
+ expect(result).to eq 'test data'
125
+ end
126
+ end
127
+ end
128
+ end
data/spec/visit_queue_spec.rb CHANGED
@@ -14,7 +14,6 @@ def get_visit_order(q)
14
14
  order
15
15
  end
16
16
 
17
-
18
17
  module Spider
19
18
 
20
19
  describe VisitQueue do
@@ -125,7 +124,7 @@ REND
125
124
  flag = false
126
125
  final = Proc.new { flag = true }
127
126
  queue = described_class.new(nil, nil, final)
128
- queue.push_back((1..20).to_a)
127
+ queue.push_back(%w(1 2 3 4 5 6 7 8))
129
128
  queue.visit_each { queue.stop if queue.visit_count >= 1 }
130
129
  expect(queue.visit_count).to eq 1
131
130
  expect(flag).to be true
@@ -149,6 +148,19 @@ REND
149
148
  expect(@queue.visit_count).to eq 7
150
149
  end
151
150
 
151
+ it 'should not be affected by modification of url argument' do
152
+ @queue.push_front(%w(one three))
153
+
154
+ @queue.visit_each do |url|
155
+ url.gsub! /e/, '@'
156
+ end
157
+
158
+ expect(@queue.url_okay('one')).to be false
159
+ expect(@queue.url_okay('three')).to be false
160
+
161
+ expect(@queue.url_okay('on@')).to be true
162
+ expect(@queue.url_okay('thr@@')).to be true
163
+ end
152
164
  end
153
165
 
154
166
  end
data/spiderkit.gemspec CHANGED
@@ -16,6 +16,7 @@ Gem::Specification.new do |spec|
16
16
  spec.files = `git ls-files`.split($/)
17
17
  spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
18
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.required_ruby_version = '>= 1.9.2.330'
19
20
  spec.require_paths = ["lib"]
20
21
 
21
22
  spec.add_development_dependency "bundler", "~> 1.3"
metadata CHANGED
@@ -1,78 +1,69 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: spiderkit
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.2
5
- prerelease:
4
+ version: 0.2.0
6
5
  platform: ruby
7
6
  authors:
8
7
  - Robert Dormer
9
8
  autorequire:
10
9
  bindir: bin
11
10
  cert_chain: []
12
- date: 2016-07-15 00:00:00.000000000 Z
11
+ date: 2016-08-06 00:00:00.000000000 Z
13
12
  dependencies:
14
13
  - !ruby/object:Gem::Dependency
15
14
  name: bundler
16
15
  requirement: !ruby/object:Gem::Requirement
17
- none: false
18
16
  requirements:
19
- - - ~>
17
+ - - "~>"
20
18
  - !ruby/object:Gem::Version
21
19
  version: '1.3'
22
20
  type: :development
23
21
  prerelease: false
24
22
  version_requirements: !ruby/object:Gem::Requirement
25
- none: false
26
23
  requirements:
27
- - - ~>
24
+ - - "~>"
28
25
  - !ruby/object:Gem::Version
29
26
  version: '1.3'
30
27
  - !ruby/object:Gem::Dependency
31
28
  name: rspec
32
29
  requirement: !ruby/object:Gem::Requirement
33
- none: false
34
30
  requirements:
35
- - - ~>
31
+ - - "~>"
36
32
  - !ruby/object:Gem::Version
37
33
  version: 3.4.0
38
34
  type: :development
39
35
  prerelease: false
40
36
  version_requirements: !ruby/object:Gem::Requirement
41
- none: false
42
37
  requirements:
43
- - - ~>
38
+ - - "~>"
44
39
  - !ruby/object:Gem::Version
45
40
  version: 3.4.0
46
41
  - !ruby/object:Gem::Dependency
47
42
  name: rake
48
43
  requirement: !ruby/object:Gem::Requirement
49
- none: false
50
44
  requirements:
51
- - - ! '>='
45
+ - - ">="
52
46
  - !ruby/object:Gem::Version
53
47
  version: '0'
54
48
  type: :development
55
49
  prerelease: false
56
50
  version_requirements: !ruby/object:Gem::Requirement
57
- none: false
58
51
  requirements:
59
- - - ! '>='
52
+ - - ">="
60
53
  - !ruby/object:Gem::Version
61
54
  version: '0'
62
55
  - !ruby/object:Gem::Dependency
63
56
  name: bloom-filter
64
57
  requirement: !ruby/object:Gem::Requirement
65
- none: false
66
58
  requirements:
67
- - - ~>
59
+ - - "~>"
68
60
  - !ruby/object:Gem::Version
69
61
  version: 0.2.0
70
62
  type: :runtime
71
63
  prerelease: false
72
64
  version_requirements: !ruby/object:Gem::Requirement
73
- none: false
74
65
  requirements:
75
- - - ~>
66
+ - - "~>"
76
67
  - !ruby/object:Gem::Version
77
68
  version: 0.2.0
78
69
  description: Spiderkit library for basic spiders and bots
@@ -88,39 +79,41 @@ files:
88
79
  - Rakefile
89
80
  - lib/exclusion.rb
90
81
  - lib/queue.rb
82
+ - lib/recorder.rb
91
83
  - lib/spiderkit.rb
92
84
  - lib/version.rb
93
85
  - lib/wait_time.rb
94
86
  - spec/exclusion_parser_spec.rb
87
+ - spec/recorder_spec.rb
95
88
  - spec/visit_queue_spec.rb
96
89
  - spec/wait_time_spec.rb
97
90
  - spiderkit.gemspec
98
91
  homepage: http://github.com/rdormer/spiderkit
99
92
  licenses:
100
93
  - MIT
94
+ metadata: {}
101
95
  post_install_message:
102
96
  rdoc_options: []
103
97
  require_paths:
104
98
  - lib
105
99
  required_ruby_version: !ruby/object:Gem::Requirement
106
- none: false
107
100
  requirements:
108
- - - ! '>='
101
+ - - ">="
109
102
  - !ruby/object:Gem::Version
110
- version: '0'
103
+ version: 1.9.2.330
111
104
  required_rubygems_version: !ruby/object:Gem::Requirement
112
- none: false
113
105
  requirements:
114
- - - ! '>='
106
+ - - ">="
115
107
  - !ruby/object:Gem::Version
116
108
  version: '0'
117
109
  requirements: []
118
110
  rubyforge_project:
119
- rubygems_version: 1.8.25
111
+ rubygems_version: 2.4.8
120
112
  signing_key:
121
- specification_version: 3
113
+ specification_version: 4
122
114
  summary: Basic toolkit for writing web spiders and bots
123
115
  test_files:
124
116
  - spec/exclusion_parser_spec.rb
117
+ - spec/recorder_spec.rb
125
118
  - spec/visit_queue_spec.rb
126
119
  - spec/wait_time_spec.rb