spiderkit 0.1.2 → 0.2.0

checksums.yaml ADDED
@@ -0,0 +1,7 @@
1
+ ---
2
+ SHA1:
3
+ metadata.gz: 5d3b757d2369c3d6c9520e8f4b78f60f79fdfd95
4
+ data.tar.gz: 562120125d785f938f37b102a53e90dc78dc56cc
5
+ SHA512:
6
+ metadata.gz: 391d8705750fe6e738152384439da6b512dc29471796c37bd48f1d90bd2b3909fc1a1c6257c1a95baff42533d662140f7cf6a999a3d9cedca17cdd8cb1f43c06
7
+ data.tar.gz: a243700c54bf0eecaf16a879d76e201c7debcb93094bed3ba32bea47f9615c321ba5f297451955cead398e3fda3f7ecd2892673eea87102339e0ca75c3df34de
data/README.md CHANGED
@@ -58,8 +58,8 @@ end
58
58
  A slightly fancier example:
59
59
 
60
60
  ```ruby
61
- #download robots.txt as variable txt
62
- #user agent for robots.txt is "my-bot"
61
+ # download robots.txt as variable txt
62
+ # user agent for robots.txt is "my-bot"
63
63
 
64
64
  finalizer = Proc.new { puts "done"}
65
65
  mybot = Spider::VisitQueue.new(txt, "my-bot", finalizer)
@@ -69,10 +69,10 @@ As urls are fetched and added to the queue, any links already visited will be dr
69
69
 
70
70
  ```ruby
71
71
  mybot.visit_each do |url|
72
- #these will be visited next
72
+ # these will be visited next
73
73
  mybot.push_front(nexturls)
74
74
 
75
- #these will be visited last
75
+ # these will be visited last
76
76
  mybot.push_back(lasturls)
77
77
  end
78
78
  ```
@@ -104,24 +104,26 @@ The finalizer, if any, will still be executed after stopping iteration.
104
104
  Spiderkit also includes a robots.txt parser that can either work standalone, or be passed as an argument to the visit queue. If passed as an argument, urls that are excluded by the robots.txt will be dropped transparently.
105
105
 
106
106
  ```ruby
107
- #fetch robots.txt as variable txt
107
+ # fetch robots.txt as variable txt
108
108
 
109
- #create a stand alone parser
109
+ # create a stand alone parser
110
110
  robots_txt = Spider::ExclusionParser.new(txt)
111
111
 
112
112
  robots_txt.excluded?("/") => true
113
113
  robots_txt.excluded?("/admin") => false
114
114
  robots_txt.allowed?("/blog") => true
115
115
 
116
- #pass text directly to visit queue
116
+ # pass text directly to visit queue
117
117
  mybot = Spider::VisitQueue.new(txt)
118
118
  ```
119
119
 
120
120
  Note that you pass the robots.txt directly to the visit queue - no need to new up the parser yourself. The VisitQueue also has a robots_txt accessor that you can use to access and set the exclusion parser while iterating through the queue:
121
121
 
122
122
  ```ruby
123
+ require 'open-uri'
124
+
123
125
  mybot.visit_each do |url|
124
- #download a new robots.txt from somewhere
126
+ txt = open('http://wikipedia.org/robots.txt').read
125
127
  mybot.robot_txt = Spider::ExclusionParser.new(txt)
126
128
  end
127
129
  ```
@@ -129,21 +131,31 @@ end
129
131
  If you don't pass an agent string, then the parser will take its configuration from the default agent specified in the robots.txt. If you want your bot to respond to directives for a given user agent, just pass the agent either to the queue when you create it, or to the parser:
130
132
 
131
133
  ```ruby
132
- #visit queue that will respond to any robots.txt
133
- #with User-agent: mybot in them
134
+ # visit queue that will respond to any robots.txt
135
+ # with User-agent: mybot in them
134
136
  mybot = Spider::VisitQueue.new(txt, 'mybot')
135
137
 
136
138
  # same thing as a standalone parser
137
139
  myparser = Spider::ExclusionParser.new(txt, 'mybot')
138
140
  ```
139
141
 
140
- Note that user agent string passed in to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the user agent contained in robots.txt will usually be a subset of the HTTP user agent.
142
+ You can also pass nil or a blank string as the agent to fall back to the default agent. Note that the user agent string passed to your exclusion parser and the user agent string sent along with HTTP requests are not necessarily one and the same, although the agent name used in robots.txt will usually be a substring of the HTTP user agent.
141
143
 
142
144
  For example:
143
145
 
144
146
  Googlebot/2.1 (+http://www.google.com/bot.html)
145
147
 
146
- should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the name 'bot' somewhere in their user agent strings.
148
+ should respond to "googlebot" in robots.txt. By convention, bots and spiders usually have the name 'bot' somewhere in their user agent strings. You can also pass the response code of the request that fetched the robots.txt file if you like, and let the exclusion parser decide what to do with it:
149
+
150
+ ```ruby
151
+ require 'open-uri'
152
+
153
+ status = 0
154
+ data = open('http://wikipedia.org/robots.txt') { |f| status = f.status.first.to_i; f.read }
155
+ mybot.robot_txt = Spider::ExclusionParser.new(data, 'mybot', status)
156
+ ```
157
+
158
+ Finally, as a sanity check / to avoid DoS honeypots with malicious robots.txt files, the exclusion parser will process a maximum of one thousand non-whitespace lines before stopping.
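To tie the agent handling together, here is a minimal sketch; the agent name `testbot` and the robots.txt body are made up for illustration (they mirror the new specs):

```ruby
txt = <<-ROBOTS
  User-agent: testbot
  Disallow: /

  User-agent: *
  Disallow:
ROBOTS

# a matching agent picks up its own section
Spider::ExclusionParser.new(txt, 'testbot').excluded?('/')  # => true

# nil or a blank agent falls back to the default (*) section
Spider::ExclusionParser.new(txt, nil).excluded?('/')        # => false
Spider::ExclusionParser.new(txt, '').excluded?('/')         # => false
```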
147
159
 
148
160
  ## Wait Time
149
161
 
@@ -152,29 +164,73 @@ Ideally a bot should wait for some period of time in between requests to avoid c
152
164
  You can create it standalone, or get it from an exclusion parser:
153
165
 
154
166
  ```ruby
155
- #download a robots.txt with a crawl-delay 40
167
+ # download a robots.txt with a crawl-delay 40
156
168
 
157
169
  robots_txt = Spider::ExclusionParser.new(txt)
158
170
  delay = robots_txt.wait_time
159
171
  delay.value => 40
160
172
 
161
- #receive a rate limit code, double wait time
173
+ # receive a rate limit code, double wait time
162
174
  delay.back_off
163
175
 
164
- #actually do the waiting part
176
+ # actually do the waiting part
165
177
  delay.wait
166
178
 
167
- #in response to some rate limit codes you'll want
168
- #to sleep for a while, then back off
179
+ # in response to some rate limit codes you'll want
180
+ # to sleep for a while, then back off
169
181
  delay.reduce_wait
170
182
 
171
183
 
172
- #after one call to back_off and one call to reduce_wait
184
+ # after one call to back_off and one call to reduce_wait
173
185
  delay.value => 160
174
186
  ```
175
187
 
176
188
  By default a WaitTime will specify an initial value of 2 seconds. You can pass a value to new to specify the wait in seconds, although values larger than the maximum allowable value (3 minutes / 180 seconds) will be clamped to that maximum.
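A quick sketch of the default and the clamping (the values come from the DEFAULT_WAIT and MAX_WAIT constants in wait_time.rb):

```ruby
Spider::WaitTime.new.value       # => 2   (DEFAULT_WAIT)
Spider::WaitTime.new(40).value   # => 40
Spider::WaitTime.new(500).value  # => 180 (clamped to MAX_WAIT)
```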
177
189
 
190
+ ## Recording Requests
191
+
192
+ For convenience, an HTTP request recorder is provided, which is highly useful for writing regression and integration tests. It accepts a block of code that returns a string containing the response data. The String class is monkey-patched to add http_status and http_headers accessors for ease of transporting other request data (yes, I know, monkey patching is evil). Information assigned to these accessors will be saved by the recorder as well, but their use is not required. The recorder class manages the marshaling and unmarshaling of the request data behind the scenes, saving each request under a file name hashed from its URL, with the contents YAML-ized and Base64 encoded. This is similar to VCR, and you can certainly use that instead. However, I personally ran into some trouble integrating it into some spiders I was writing, so I came up with this as a simple, lightweight alternative that works well with the rest of Spiderkit.
193
+
194
+ The recorder will not play back request data unless enabled, and it will not save request data unless recording is turned on. This is done with the **activate!** and **record!** methods, respectively. You can stop recording with the **pause!** method and stop playback with the **deactivate!** method.
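In code, that is just a matter of flipping the class-level switches:

```ruby
Spider::VisitRecorder.activate!    # enable playback of saved requests
Spider::VisitRecorder.record!      # also allow new requests to be saved

# later on...
Spider::VisitRecorder.pause!       # stop saving, keep playing back
Spider::VisitRecorder.deactivate!  # turn playback off entirely
```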
195
+
196
+ A simple spider for iterating pages and recording them might look like this:
197
+
198
+ ```ruby
199
+ require 'spiderkit'
200
+ require 'open-uri'
201
+
202
+ mybot = Spider::VisitQueue.new
203
+ mybot.push_front('http://someurl.com')
204
+
205
+ Spider::VisitRecorder.config('/save/path')
206
+ Spider::VisitRecorder.activate!
207
+ Spider::VisitRecorder.record!
208
+
209
+ mybot.visit_each do |url|
210
+
211
+ data = Spider::VisitRecorder.recall(url) do
212
+ text = ''
213
+ puts "fetch #{url}"
214
+ open(url) do |f|
215
+ text = f.read
216
+ # doing this is only necessary if you want to
217
+ # save this information in the recording
218
+ text.http_status = f.status.first.to_i
219
+ end
220
+
221
+ text
222
+ end
223
+
224
+ # extract links from data and push onto the
225
+ # spider queue
226
+ end
227
+ ```
228
+
229
+ After the first time the pages are spidered and saved, any subsequent run will simply replay the recorded data. By default the saved request files end up in the working directory; the path that requests are saved to can be altered using the **config** method:
230
+
231
+ ```ruby
232
+ Spider::VisitRecorder.config('/some/test/path')
233
+ ```
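In a test run you would typically turn on playback without recording; a minimal sketch of that pattern (the path and URL here are just placeholders):

```ruby
require 'spiderkit'
require 'open-uri'

Spider::VisitRecorder.config('/some/test/path')
Spider::VisitRecorder.activate!   # playback only: record! is never called

data = Spider::VisitRecorder.recall('http://someurl.com') do
  # this block is bypassed when a recording exists; an unrecorded
  # URL raises "Unexpected request" instead of hitting the network
  open('http://someurl.com').read
end
```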
178
234
 
179
235
  ## Contributing
180
236
 
data/lib/exclusion.rb CHANGED
@@ -4,7 +4,7 @@
4
4
 
5
5
  #==============================================
6
6
  # This is the class that parses robots.txt and implements the exclusion
7
- # checking logic therein. Works by breaking the file up into a hash of
7
+ # checking logic therein. Works by breaking the file up into a hash of
8
8
  # arrays of directives for each specified user agent, and then parsing the
9
9
  # directives into internal arrays and iterating through the list to find a
10
10
  # match. Urls are matched case sensitive, everything else is case insensitive.
@@ -13,46 +13,45 @@
13
13
  require 'cgi'
14
14
 
15
15
  module Spider
16
+ class ExclusionParser
16
17
 
17
- class ExclusionParser
18
-
19
18
  attr_accessor :wait_time
20
19
 
21
- DISALLOW = "disallow"
22
- DELAY = "crawl-delay"
23
- ALLOW = "allow"
24
-
20
+ NULL_MATCH = '*!*'.freeze
21
+ DISALLOW = 'disallow'.freeze
22
+ DELAY = 'crawl-delay'.freeze
23
+ ALLOW = 'allow'.freeze
24
+
25
25
  MAX_DIRECTIVES = 1000
26
- NULL_MATCH = "*!*"
27
-
28
- def initialize(text, agent=nil)
26
+
27
+ def initialize(text, agent = nil, status = 200)
29
28
  @skip_list = []
30
29
  @agent_key = agent
31
-
30
+
32
31
  return if text.nil? || text.length.zero?
33
-
34
- if [401, 403].include? text.http_status
32
+
33
+ if [401, 403].include? status
35
34
  @skip_list << [NULL_MATCH, true]
36
35
  return
37
36
  end
38
-
37
+
39
38
  begin
40
39
  config = parse_text(text)
41
40
  grab_list(config)
42
41
  rescue
43
42
  end
44
43
  end
45
-
44
+
46
45
  # Check to see if the given url is matched by any rule
47
46
  # in the file, and return it's associated status
48
-
47
+
49
48
  def excluded?(url)
50
49
  url = safe_unescape(url)
51
50
  @skip_list.each do |entry|
52
51
  return entry.last if url.include? entry.first
53
52
  return entry.last if entry.first == NULL_MATCH
54
53
  end
55
-
54
+
56
55
  false
57
56
  end
58
57
 
@@ -61,89 +60,101 @@ module Spider
61
60
  end
62
61
 
63
62
  private
64
-
63
+
65
64
  # Method to process the list of directives for a given user agent.
66
65
  # Picks the one that applies to us, and then processes it's directives
67
66
  # into the skip list by splitting the strings and taking the appropriate
68
67
  # action. Stops after a set number of directives to avoid malformed files
69
68
  # or denial of service attacks
70
-
69
+
71
70
  def grab_list(config)
72
- section = (config.include?(@agent_key) ?
73
- config[@agent_key] : config['*'])
74
-
75
- if(section.length > MAX_DIRECTIVES)
71
+ if config.include?(@agent_key)
72
+ section = config[@agent_key]
73
+ else
74
+ section = config['*']
75
+ end
76
+
77
+ if section.length > MAX_DIRECTIVES
76
78
  section.slice!(MAX_DIRECTIVES, section.length)
77
79
  end
78
-
80
+
79
81
  section.each do |pair|
80
82
  key, value = pair.split(':')
81
-
82
- next if key.nil? || value.nil? ||
83
- key.empty? || value.empty?
84
-
83
+
84
+ next if key.nil? || value.nil? ||
85
+ key.empty? || value.empty?
86
+
85
87
  key.downcase!
86
88
  key.lstrip!
87
89
  key.rstrip!
88
-
90
+
89
91
  value.lstrip!
90
92
  value.rstrip!
91
-
93
+
92
94
  disallow(value) if key == DISALLOW
93
95
  delay(value) if key == DELAY
94
- allow(value) if key == ALLOW
96
+ allow(value) if key == ALLOW
95
97
  end
96
98
  end
97
-
99
+
98
100
  # Top level file parsing method - makes sure carriage returns work,
99
101
  # strips out any BOM, then loops through each line and opens up a new
100
- # array of directives in the hash if a user-agent directive is found
101
-
102
+ # array of directives in the hash if a user-agent directive is found.
103
+
102
104
  def parse_text(text)
103
- current_key = ""
105
+ current_key = ''
104
106
  config = {}
105
-
107
+
106
108
  text.gsub!("\r", "\n")
107
- text.gsub!("\xEF\xBB\xBF".force_encoding("ASCII-8BIT"), '')
108
-
109
+ text = text.force_encoding('UTF-8')
110
+ text.gsub!("\xEF\xBB\xBF".force_encoding('UTF-8'), '')
111
+
109
112
  text.each_line do |line|
110
113
  line.lstrip!
111
114
  line.rstrip!
112
- line.gsub! /#.*/, ''
113
-
114
- if line.length.nonzero? && line =~ /[^\s]/
115
-
116
- if line =~ /User-agent:\s+(.+)/i
117
- current_key = $1.downcase
118
- config[current_key] = [] unless config[current_key]
119
- next
115
+ line.gsub!(/#.*/, '')
116
+
117
+ next unless line.length.nonzero? && line =~ /[^\s]/
118
+
119
+ if line =~ /User-agent:\s+(.+)/i
120
+ previous_key = current_key
121
+ current_key = $1.downcase
122
+ config[current_key] = [] unless config[current_key]
123
+
124
+ # If we've seen a new user-agent directive and the previous one
125
+ # is empty then we have a cascading user-agent string. Copy the
126
+ # new user agent array ref so both user agents are identical.
127
+
128
+ if config.key?(previous_key) && config[previous_key].size.zero?
129
+ config[previous_key] = config[current_key]
120
130
  end
121
-
131
+
132
+ else
122
133
  config[current_key] << line
123
134
  end
124
135
  end
125
-
136
+
126
137
  config
127
- end
128
-
138
+ end
139
+
129
140
  def disallow(value)
130
- token = (value == "/" ? NULL_MATCH : value.chomp('*'))
141
+ token = (value == '/' ? NULL_MATCH : value.chomp('*'))
131
142
  @skip_list << [safe_unescape(token), true]
132
143
  end
133
-
144
+
134
145
  def allow(value)
135
- token = (value == "/" ? NULL_MATCH : value.chomp('*'))
146
+ token = (value == '/' ? NULL_MATCH : value.chomp('*'))
136
147
  @skip_list << [safe_unescape(token), false]
137
148
  end
138
-
149
+
139
150
  def delay(value)
140
151
  @wait_time = WaitTime.new(value.to_i)
141
152
  end
142
-
153
+
143
154
  def safe_unescape(target)
144
- t = target.gsub /%2f/, '^^^'
155
+ t = target.gsub(/%2f/, '^^^')
145
156
  t = CGI.unescape(t)
146
- t.gsub /\^\^\^/, '%2f'
157
+ t.gsub(/\^\^\^/, '%2f')
147
158
  end
148
159
  end
149
160
  end
data/lib/queue.rb CHANGED
@@ -6,45 +6,48 @@ require 'bloom-filter'
6
6
  require 'exclusion'
7
7
 
8
8
  module Spider
9
-
10
9
  class VisitQueue
11
-
12
- class IterationExit < Exception; end
10
+
11
+ IterationExit = Class.new(Exception)
13
12
 
14
13
  attr_accessor :visit_count
15
14
  attr_accessor :robot_txt
16
15
 
17
- def initialize(robots=nil, agent=nil, finish=nil)
16
+ def initialize(robots = nil, agent = nil, finish = nil)
18
17
  @robot_txt = ExclusionParser.new(robots, agent) if robots
19
18
  @finalize = finish
20
19
  @visit_count = 0
21
20
  clear_visited
22
21
  @pending = []
23
- end
22
+ end
24
23
 
25
24
  def visit_each
26
25
  begin
27
26
  until @pending.empty?
28
27
  url = @pending.pop
29
- if url_okay(url)
30
- yield url if block_given?
31
- @visited.insert(url)
32
- @visit_count += 1
33
- end
34
- end
28
+ next unless url_okay(url)
29
+ yield url.clone if block_given?
30
+ @visited.insert(url)
31
+ @visit_count += 1
32
+ end
35
33
  rescue IterationExit
36
34
  end
37
-
35
+
38
36
  @finalize.call if @finalize
39
- end
40
-
37
+ end
38
+
41
39
  def push_front(urls)
42
- add_url(urls) {|u| @pending.push(u)}
43
- end
44
-
40
+ add_url(urls) { |u| @pending.push(u) }
41
+ end
42
+
45
43
  def push_back(urls)
46
- add_url(urls) {|u| @pending.unshift(u)}
47
- end
44
+ add_url(urls) { |u| @pending.unshift(u) }
45
+ end
46
+
47
+ def mark(urls)
48
+ urls = [urls] unless urls.is_a? Array
49
+ urls.each { |u| @visited.insert(u) }
50
+ end
48
51
 
49
52
  def size
50
53
  @pending.size
@@ -61,23 +64,21 @@ module Spider
61
64
  def clear_visited
62
65
  @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
63
66
  end
64
-
65
- private
66
67
 
67
68
  def url_okay(url)
68
69
  return false if @visited.include?(url)
69
70
  return false if @robot_txt && @robot_txt.excluded?(url)
70
71
  true
71
72
  end
72
-
73
+
74
+ private
75
+
73
76
  def add_url(urls)
74
77
  urls = [urls] unless urls.is_a? Array
75
78
  urls.compact!
76
-
79
+
77
80
  urls.each do |url|
78
- unless @visited.include?(url) || @pending.include?(url)
79
- yield url
80
- end
81
+ yield url unless @visited.include?(url) || @pending.include?(url)
81
82
  end
82
83
  end
83
84
  end
data/lib/recorder.rb ADDED
@@ -0,0 +1,111 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ #==============================================
6
+ # Class for saving visited URIs as YAML-ized files with response codes and
7
+ # headers. Takes the network request code as a block, hashes the URI to get
8
+ # a file name, and then creates it and saves it if it's not present, or reads
9
+ # and returns the contents if it is. Since the data returned from the file
10
+ # is an exact copy of what the block returned for that URI, it's a constant,
11
+ # deterministic recording that is highly useful for integration tests and the
12
+ # like. Yes, you can use VCR if that's your thing, but I found it difficult
13
+ # to integrate with real-world crawlers. This is a lightweight wrapper to
14
+ # give you 90% of the same thing.
15
+ #==============================================
16
+ require 'digest'
17
+ require 'base64'
18
+ require 'yaml'
19
+
20
+ module Spider
21
+ class VisitRecorder
22
+ @@directory = ''
23
+ @@active = false
24
+ @@recording = false
25
+
26
+ class << self
27
+ def activate!
28
+ @@active = true
29
+ end
30
+
31
+ def record!
32
+ @@recording = true
33
+ end
34
+
35
+ def deactivate!
36
+ @@active = false
37
+ end
38
+
39
+ def pause!
40
+ @@recording = false
41
+ end
42
+
43
+ def config(dir)
44
+ @@directory = dir
45
+ end
46
+
47
+ def recall(*args)
48
+ if @@active
49
+ url = args.first.to_s
50
+ data = ''
51
+
52
+ store = locate_file(url)
53
+
54
+ if store.size == 0
55
+ raise "Unexpected request: #{url}" unless @@recording
56
+ data = yield(*args) if block_given?
57
+
58
+ begin
59
+ store.write(package(url, data))
60
+ rescue StandardError => e
61
+ puts e.message
62
+ puts "On file #{store.path}"
63
+ end
64
+
65
+ else
66
+ data = unpackage(store, url)
67
+ end
68
+
69
+ return data
70
+
71
+ elsif block_given?
72
+ yield(*args)
73
+ end
74
+ end
75
+
76
+ private
77
+
78
+ def locate_file(url)
79
+ key = Digest::MD5.hexdigest(url)
80
+ path = File.expand_path(key, @@directory)
81
+ fsize = File.size?(path)
82
+ (fsize.nil? || fsize.zero? ? File.open(path, 'w') : File.open(path, 'r'))
83
+ end
84
+
85
+ def package(url, data)
86
+ payload = {}
87
+ payload[:url] = url.encode('UTF-8')
88
+ payload[:data] = Base64.encode64(data)
89
+
90
+ unless data.http_status.nil?
91
+ payload[:response] = data.http_status
92
+ end
93
+
94
+ unless data.http_headers.nil?
95
+ payload[:headers] = Base64.encode64(data.http_headers)
96
+ end
97
+
98
+ payload.to_yaml
99
+ end
100
+
101
+ def unpackage(store, url)
102
+ raw = YAML.load(store.read)
103
+ raise 'URL mismatch in recording' unless raw[:url] == url
104
+ data = Base64.decode64(raw[:data])
105
+ data.http_headers = Base64.decode64(raw[:headers])
106
+ data.http_status = raw[:response]
107
+ data
108
+ end
109
+ end
110
+ end
111
+ end
data/lib/spiderkit.rb CHANGED
@@ -2,12 +2,14 @@
2
2
  # Copyright:: Copyright (c) 2016 Robert Dormer
3
3
  # License:: MIT
4
4
 
5
- $: << File.dirname(__FILE__)
5
+ $LOAD_PATH << File.dirname(__FILE__)
6
6
  require 'wait_time'
7
7
  require 'exclusion'
8
+ require 'recorder'
8
9
  require 'version'
9
10
  require 'queue'
10
11
 
11
12
  class String
12
13
  attr_accessor :http_status
14
+ attr_accessor :http_headers
13
15
  end
data/lib/version.rb CHANGED
@@ -3,5 +3,5 @@
3
3
  # License:: MIT
4
4
 
5
5
  module Spider
6
- VERSION = "0.1.2"
6
+ VERSION = "0.2.0"
7
7
  end
data/lib/wait_time.rb CHANGED
@@ -3,33 +3,32 @@
3
3
  # License:: MIT
4
4
 
5
5
  #==============================================
6
- #Class to encapsulate the crawl delay being used.
7
- #Clamps the value to a maximum amount and implements
8
- #an exponential backoff function for responding to
9
- #rate limit requests
6
+ # Class to encapsulate the crawl delay being used.
7
+ # Clamps the value to a maximum amount and implements
8
+ # an exponential backoff function for responding to
9
+ # rate limit requests
10
10
  #==============================================
11
11
 
12
12
  module Spider
13
-
14
13
  class WaitTime
15
-
14
+
16
15
  MAX_WAIT = 180
17
16
  DEFAULT_WAIT = 2
18
17
  REDUCE_WAIT = 300
19
18
 
20
- def initialize(period=nil)
21
- unless period.nil?
22
- @wait = (period > MAX_WAIT ? MAX_WAIT : period)
23
- else
19
+ def initialize(period = nil)
20
+ if period.nil?
24
21
  @wait = DEFAULT_WAIT
22
+ else
23
+ @wait = (period > MAX_WAIT ? MAX_WAIT : period)
25
24
  end
26
25
  end
27
26
 
28
27
  def back_off
29
28
  if @wait.zero?
30
- @wait = DEFAULT_WAIT
29
+ @wait = DEFAULT_WAIT
31
30
  else
32
- waitval = @wait * 2
31
+ waitval = @wait * 2
33
32
  @wait = (waitval > MAX_WAIT ? MAX_WAIT : waitval)
34
33
  end
35
34
  end
@@ -42,7 +41,7 @@ module Spider
42
41
  sleep(REDUCE_WAIT)
43
42
  back_off
44
43
  end
45
-
44
+
46
45
  def value
47
46
  @wait
48
47
  end
data/spec/exclusion_parser_spec.rb CHANGED
@@ -126,12 +126,10 @@ module Spider
126
126
  allow: /
127
127
  eos
128
128
 
129
- txt.http_status = 401
130
- @bottxt = described_class.new(txt)
129
+ @bottxt = described_class.new(txt, nil, 401)
131
130
  expect(@bottxt.excluded?('/')).to be true
132
131
 
133
- txt.http_status = 403
134
- @bottxt = described_class.new(txt)
132
+ @bottxt = described_class.new(txt, nil, 403)
135
133
  expect(@bottxt.excluded?('/')).to be true
136
134
  end
137
135
  end
@@ -243,7 +241,53 @@ module Spider
243
241
  expect(@bottxt.excluded?('/')).to be true
244
242
  end
245
243
 
246
- xit "should allow cascading user-agent strings"
244
+ it "should use default agent if passed nil agent string" do
245
+ txt = <<-eos
246
+ user-agent: testbot
247
+ disallow: /
248
+
249
+ user-agent: *
250
+ disallow:
251
+ eos
252
+
253
+ @bottxt = described_class.new(txt, nil)
254
+ expect(@bottxt.excluded?('/')).to be false
255
+ end
256
+
257
+ it "should use default agent if passed blank agent string" do
258
+ txt = <<-eos
259
+ user-agent: testbot
260
+ disallow: /
261
+
262
+ user-agent: *
263
+ disallow:
264
+ eos
265
+
266
+ @bottxt = described_class.new(txt, '')
267
+ expect(@bottxt.excluded?('/')).to be false
268
+ end
269
+
270
+ it "should allow cascading user-agent strings" do
271
+ txt = <<-eos
272
+ user-agent: agentfirst
273
+ user-agent: agentlast
274
+ disallow: /test_dir
275
+ allow: /other_test_dir
276
+ eos
277
+
278
+ bottxt_first = described_class.new(txt, 'agentfirst')
279
+ bottxt_last = described_class.new(txt, 'agentlast')
280
+
281
+ expect(bottxt_first.excluded?('/test_dir')).to be true
282
+ expect(bottxt_last.excluded?('/test_dir')).to be true
283
+ expect(bottxt_first.allowed?('/test_dir')).to be false
284
+ expect(bottxt_last.allowed?('/test_dir')).to be false
285
+
286
+ expect(bottxt_first.excluded?('/other_test_dir')).to be false
287
+ expect(bottxt_last.excluded?('/other_test_dir')).to be false
288
+ expect(bottxt_first.allowed?('/other_test_dir')).to be true
289
+ expect(bottxt_last.allowed?('/other_test_dir')).to be true
290
+ end
247
291
  end
248
292
 
249
293
  describe "Disallow directive" do
data/spec/recorder_spec.rb ADDED
@@ -0,0 +1,128 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ require File.dirname(__FILE__) + '/../lib/spiderkit'
6
+
7
+ module Spider
8
+ describe VisitRecorder do
9
+
10
+ describe 'when active' do
11
+ before(:each) do
12
+ described_class.activate!
13
+ described_class.record!
14
+ @url = "http://test.domain.123"
15
+ end
16
+
17
+ it 'should add http_status to string' do
18
+ expect("".respond_to? :http_status).to be true
19
+ end
20
+
21
+ it 'should add http_headers to string' do
22
+ expect("".respond_to? :http_headers).to be true
23
+ end
24
+
25
+ it 'should execute the block argument if recording data' do
26
+ run_data = ''
27
+ buffer = StringIO.new
28
+ allow(buffer).to receive(:size?).and_return(0)
29
+ allow(File).to receive(:open).and_return(buffer)
30
+ described_class.recall(@url) { |u| run_data = 'ran' }
31
+ expect(run_data).to eq 'ran'
32
+ end
33
+
34
+ describe 'saved information' do
35
+ before(:each) do
36
+ @buffer = StringIO.new
37
+ allow(File).to receive(:open).and_return(@buffer)
38
+ allow(@buffer).to receive(:size?).and_return(0)
39
+
40
+ @data = described_class.recall(@url) do |u|
41
+ rval = "this is the test body"
42
+ rval.http_headers = "test headers"
43
+ rval.http_status = 200
44
+ rval
45
+ end
46
+
47
+ @buffer.rewind
48
+ end
49
+
50
+ it 'should save and return headers' do
51
+ expect(@data.http_headers).to eq "test headers"
52
+ data = YAML.load(@buffer.read)
53
+ expect(data[:headers]).to eq Base64.encode64("test headers")
54
+ end
55
+
56
+ it 'should save the request url' do
57
+ data = YAML.load(@buffer.read)
58
+ expect(data[:url]).to eq @url
59
+ end
60
+
61
+ it 'should save response code' do
62
+ data = YAML.load(@buffer.read)
63
+ expect(data[:response]).to eq 200
64
+ end
65
+
66
+ it 'should save and return the response data' do
67
+ expect(@data).to eq "this is the test body"
68
+ data = YAML.load(@buffer.read)
69
+ expect(data[:data]).to eq Base64.encode64("this is the test body")
70
+ end
71
+
72
+ it 'should bypass the block argument if playing back data' do
73
+ run_flag = false
74
+ described_class.recall(@url) { |u| run_flag = true }
75
+ expect(run_flag).to be false
76
+ end
77
+ end
78
+
79
+ describe 'file operations' do
80
+ before(:each) do
81
+ @path = "test_path/"
82
+ @buffer = StringIO.new
83
+ @fname = Digest::MD5.hexdigest(@url)
84
+ allow(File).to receive(:open).and_return(@buffer)
85
+ expect(File).to receive(:size?).and_return(0)
86
+ described_class.config('/')
87
+ end
88
+
89
+ it 'name file as md5 hash of url with query' do
90
+ expect(File).to receive(:open).with('/' + @fname, 'w')
91
+ described_class.recall(@url) { |u| 'test data' }
92
+ end
93
+
94
+ it 'should not overwrite existing files' do
95
+ File.size?
96
+ expect(File).to receive(:size?).and_return(1234)
97
+ expect(File).to receive(:open).with('/' + @fname, 'r')
98
+ described_class.recall(@url) { |u| 'test data' }
99
+ end
100
+ end
101
+ end
102
+
103
+ describe 'when not active' do
104
+ before(:all) do
105
+ described_class.pause!
106
+ described_class.deactivate!
107
+ end
108
+
109
+ it 'should not record unless activated and recording enabled' do
110
+ expect(File).to_not receive(:size?)
111
+ expect(File).to_not receive(:open)
112
+ end
113
+
114
+ it 'should pass the return of the block through' do
115
+ expect(described_class.recall(@url) { |u| 'test data' }).to eq 'test data'
116
+ end
117
+
118
+ it 'should not playback if not active' do
119
+ data = {data: Base64.encode64('test data should not appear')}.to_yaml
120
+ @buffer = StringIO.new(data)
121
+ allow(File).to receive(:open).and_return(@buffer)
122
+ result = described_class.recall(@url) { |u| 'test data' }
123
+ expect(result).to_not eq 'test data should not appear'
124
+ expect(result).to eq 'test data'
125
+ end
126
+ end
127
+ end
128
+ end
data/spec/visit_queue_spec.rb CHANGED
@@ -14,7 +14,6 @@ def get_visit_order(q)
14
14
  order
15
15
  end
16
16
 
17
-
18
17
  module Spider
19
18
 
20
19
  describe VisitQueue do
@@ -125,7 +124,7 @@ REND
125
124
  flag = false
126
125
  final = Proc.new { flag = true }
127
126
  queue = described_class.new(nil, nil, final)
128
- queue.push_back((1..20).to_a)
127
+ queue.push_back(%w(1 2 3 4 5 6 7 8))
129
128
  queue.visit_each { queue.stop if queue.visit_count >= 1 }
130
129
  expect(queue.visit_count).to eq 1
131
130
  expect(flag).to be true
@@ -149,6 +148,19 @@ REND
149
148
  expect(@queue.visit_count).to eq 7
150
149
  end
151
150
 
151
+ it 'should not be affected by modification of url argument' do
152
+ @queue.push_front(%w(one three))
153
+
154
+ @queue.visit_each do |url|
155
+ url.gsub! /e/, '@'
156
+ end
157
+
158
+ expect(@queue.url_okay('one')).to be false
159
+ expect(@queue.url_okay('three')).to be false
160
+
161
+ expect(@queue.url_okay('on@')).to be true
162
+ expect(@queue.url_okay('thr@@')).to be true
163
+ end
152
164
  end
153
165
 
154
166
  end
data/spiderkit.gemspec CHANGED
@@ -16,6 +16,7 @@ Gem::Specification.new do |spec|
16
16
  spec.files = `git ls-files`.split($/)
17
17
  spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
18
  spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.required_ruby_version = '>= 1.9.2.330'
19
20
  spec.require_paths = ["lib"]
20
21
 
21
22
  spec.add_development_dependency "bundler", "~> 1.3"
metadata CHANGED
@@ -1,78 +1,69 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: spiderkit
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.2
5
- prerelease:
4
+ version: 0.2.0
6
5
  platform: ruby
7
6
  authors:
8
7
  - Robert Dormer
9
8
  autorequire:
10
9
  bindir: bin
11
10
  cert_chain: []
12
- date: 2016-07-15 00:00:00.000000000 Z
11
+ date: 2016-08-06 00:00:00.000000000 Z
13
12
  dependencies:
14
13
  - !ruby/object:Gem::Dependency
15
14
  name: bundler
16
15
  requirement: !ruby/object:Gem::Requirement
17
- none: false
18
16
  requirements:
19
- - - ~>
17
+ - - "~>"
20
18
  - !ruby/object:Gem::Version
21
19
  version: '1.3'
22
20
  type: :development
23
21
  prerelease: false
24
22
  version_requirements: !ruby/object:Gem::Requirement
25
- none: false
26
23
  requirements:
27
- - - ~>
24
+ - - "~>"
28
25
  - !ruby/object:Gem::Version
29
26
  version: '1.3'
30
27
  - !ruby/object:Gem::Dependency
31
28
  name: rspec
32
29
  requirement: !ruby/object:Gem::Requirement
33
- none: false
34
30
  requirements:
35
- - - ~>
31
+ - - "~>"
36
32
  - !ruby/object:Gem::Version
37
33
  version: 3.4.0
38
34
  type: :development
39
35
  prerelease: false
40
36
  version_requirements: !ruby/object:Gem::Requirement
41
- none: false
42
37
  requirements:
43
- - - ~>
38
+ - - "~>"
44
39
  - !ruby/object:Gem::Version
45
40
  version: 3.4.0
46
41
  - !ruby/object:Gem::Dependency
47
42
  name: rake
48
43
  requirement: !ruby/object:Gem::Requirement
49
- none: false
50
44
  requirements:
51
- - - ! '>='
45
+ - - ">="
52
46
  - !ruby/object:Gem::Version
53
47
  version: '0'
54
48
  type: :development
55
49
  prerelease: false
56
50
  version_requirements: !ruby/object:Gem::Requirement
57
- none: false
58
51
  requirements:
59
- - - ! '>='
52
+ - - ">="
60
53
  - !ruby/object:Gem::Version
61
54
  version: '0'
62
55
  - !ruby/object:Gem::Dependency
63
56
  name: bloom-filter
64
57
  requirement: !ruby/object:Gem::Requirement
65
- none: false
66
58
  requirements:
67
- - - ~>
59
+ - - "~>"
68
60
  - !ruby/object:Gem::Version
69
61
  version: 0.2.0
70
62
  type: :runtime
71
63
  prerelease: false
72
64
  version_requirements: !ruby/object:Gem::Requirement
73
- none: false
74
65
  requirements:
75
- - - ~>
66
+ - - "~>"
76
67
  - !ruby/object:Gem::Version
77
68
  version: 0.2.0
78
69
  description: Spiderkit library for basic spiders and bots
@@ -88,39 +79,41 @@ files:
88
79
  - Rakefile
89
80
  - lib/exclusion.rb
90
81
  - lib/queue.rb
82
+ - lib/recorder.rb
91
83
  - lib/spiderkit.rb
92
84
  - lib/version.rb
93
85
  - lib/wait_time.rb
94
86
  - spec/exclusion_parser_spec.rb
87
+ - spec/recorder_spec.rb
95
88
  - spec/visit_queue_spec.rb
96
89
  - spec/wait_time_spec.rb
97
90
  - spiderkit.gemspec
98
91
  homepage: http://github.com/rdormer/spiderkit
99
92
  licenses:
100
93
  - MIT
94
+ metadata: {}
101
95
  post_install_message:
102
96
  rdoc_options: []
103
97
  require_paths:
104
98
  - lib
105
99
  required_ruby_version: !ruby/object:Gem::Requirement
106
- none: false
107
100
  requirements:
108
- - - ! '>='
101
+ - - ">="
109
102
  - !ruby/object:Gem::Version
110
- version: '0'
103
+ version: 1.9.2.330
111
104
  required_rubygems_version: !ruby/object:Gem::Requirement
112
- none: false
113
105
  requirements:
114
- - - ! '>='
106
+ - - ">="
115
107
  - !ruby/object:Gem::Version
116
108
  version: '0'
117
109
  requirements: []
118
110
  rubyforge_project:
119
- rubygems_version: 1.8.25
111
+ rubygems_version: 2.4.8
120
112
  signing_key:
121
- specification_version: 3
113
+ specification_version: 4
122
114
  summary: Basic toolkit for writing web spiders and bots
123
115
  test_files:
124
116
  - spec/exclusion_parser_spec.rb
117
+ - spec/recorder_spec.rb
125
118
  - spec/visit_queue_spec.rb
126
119
  - spec/wait_time_spec.rb