spiderkit 0.1.0

data/Gemfile ADDED
@@ -0,0 +1,4 @@
1
+ source 'https://rubygems.org'
2
+
3
+ # Specify your gem's dependencies in spiderkit.gemspec
4
+ gemspec
data/LICENSE.txt ADDED
@@ -0,0 +1,22 @@
1
+ Copyright (c) 2016 Robert Dormer
2
+
3
+ MIT License
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining
6
+ a copy of this software and associated documentation files (the
7
+ "Software"), to deal in the Software without restriction, including
8
+ without limitation the rights to use, copy, modify, merge, publish,
9
+ distribute, sublicense, and/or sell copies of the Software, and to
10
+ permit persons to whom the Software is furnished to do so, subject to
11
+ the following conditions:
12
+
13
+ The above copyright notice and this permission notice shall be
14
+ included in all copies or substantial portions of the Software.
15
+
16
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
17
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
18
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
19
+ NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
20
+ LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
21
+ OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
22
+ WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.md ADDED
@@ -0,0 +1,158 @@
1
+ # Spiderkit
2
+
3
+ Spiderkit - Lightweight library for spiders and bots
4
+
5
+ ## Installation
6
+
7
+ Add this line to your application's Gemfile:
8
+
9
+ gem 'spiderkit'
10
+
11
+ And then execute:
12
+
13
+ $ bundle
14
+
15
+ Or install it yourself as:
16
+
17
+ $ gem install spiderkit
18
+
19
+ ## Well-Behaved Spiders
20
+
21
+ That's not to say you can't write ill-behaved spiders with this gem, but you're kind of a jerk if you do, and I'd really rather you didn't! A well-behaved spider will do a few simple things:
22
+
23
+ * It will download and obey robots.txt
24
+ * It will avoid repeatedly re-visiting pages
25
+ * It will wait between requests and avoid aggressive spidering
26
+ * It will honor rate-limit return codes
27
+ * It will send a valid User-Agent string
28
+
29
+ This library is written with an eye towards rapidly prototyping spiders that will do all of these things, plus whatever else you can come up with.
30
+
31
+ ## Usage
32
+
33
+ Using Spiderkit, you implement your spiders and bots around the idea of a visit queue. URLs (or any objects you like) are added to the queue, and the queue is then iterated. It obeys a few simple rules:
34
+
35
+ * You can add any kind of object you like
36
+ * You can add more objects to the queue as you iterate through it
37
+ * Once an object is iterated over, it's removed from the queue
38
+ * Once an object is iterated over, its string value is added to an already-visited list, at which point you can't add it again.
39
+ * The queue will stop once it's empty, and optionally execute a final Proc that you pass to it
40
+ * The queue will not fetch web pages or anything else on its own - that's part of what *you* implement.
41
+
42
+ A basic example:
43
+
44
+ ```ruby
45
+
46
+ mybot = Spider::VisitQueue.new
47
+ mybot.push_front('http://someurl.com')
48
+
49
+ mybot.visit_each do |url|
50
+ #fetch the url
51
+ #pull out the links as linklist
52
+
53
+ mybot.push_back(linklist)
54
+ end
55
+ ```
56
+
57
+ A slightly fancier example:
58
+
59
+ ```ruby
60
+ #download robots.txt as variable txt
61
+ #user agent is "my-bot/1.0"
62
+
63
+ finalizer = Proc.new { puts "done" }
64
+
65
+ mybot = Spider::VisitQueue.new(txt, "my-bot/1.0", finalizer)
66
+ ```
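+
+ Spiderkit never fetches anything itself, so the download step is up to you. Purely as an illustration (the URL and agent string here are placeholders), the fetch above might look something like this with net/http:
+
+ ```ruby
+ require 'net/http'
+ require 'spiderkit'
+
+ uri = URI('http://someurl.com/robots.txt')
+ request = Net::HTTP::Get.new(uri)
+ request['User-Agent'] = 'my-bot/1.0'
+
+ response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
+
+ txt = response.body
+ txt.http_status = response.code.to_i  # accessor added by spiderkit (see lib/spiderkit.rb)
+
+ mybot = Spider::VisitQueue.new(txt, 'my-bot/1.0', finalizer)
+ ```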
67
+
68
+ As URLs are fetched and added to the queue, any links already visited will be dropped transparently. You have the option to push objects to either the front or rear of the queue at any time. If you push to the front of the queue while iterating over it, the items you push will be visited next; if you push to the back, they will be visited last:
69
+
70
+ ```ruby
71
+
72
+ mybot.visit_each do |url|
73
+ #these will be visited next
74
+ mybot.push_front(nexturls)
75
+
76
+ #these will be visited last
77
+ mybot.push_back(lasturls)
78
+ end
79
+
80
+ ```
81
+
82
+ The already-visited list is implemented as a Bloom filter, so you should be able to spider even fairly large domains (and there are quite a few out there) without re-visiting pages.
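+
+ For example, once a url has been visited, pushing it again (from either end, alone or in an array) is silently ignored:
+
+ ```ruby
+ mybot = Spider::VisitQueue.new
+ mybot.push_front('http://someurl.com')
+ mybot.visit_each
+
+ mybot.push_back('http://someurl.com')
+ mybot.push_front(['http://someurl.com'])
+ mybot.empty?  # => true
+ ```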
83
+
84
+ Finally, you can forcefully stop spidering at any point:
85
+
86
+ ```ruby
87
+
88
+ mybot.visit_each do |url|
89
+ mybot.stop
90
+ end
91
+
92
+ ```
93
+
94
+ The finalizer, if any, will still be executed after stopping iteration.
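+
+ A quick sketch of that guarantee, reusing the finalizer from the earlier example:
+
+ ```ruby
+ finalizer = Proc.new { puts "done" }
+
+ mybot = Spider::VisitQueue.new(nil, nil, finalizer)
+ mybot.push_back(%w(http://someurl.com/a http://someurl.com/b))
+
+ mybot.visit_each do |url|
+   mybot.stop
+ end
+
+ # only one url is visited, but "done" is still printed
+ ```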
95
+
96
+ ## Robots.txt
97
+
98
+ Spiderkit also includes a robots.txt parser that can either work standalone or be passed as an argument to the visit queue. If passed as an argument, URLs that are excluded by the robots.txt will be dropped transparently.
99
+
100
+ ```ruby
101
+ #fetch robots.txt as variable txt
102
+
103
+ #create a stand alone parser
104
+ robots_txt = Spider::ExclusionParser.new(txt)
105
+
106
+ robots_txt.excluded?("/") => true
107
+ robots_txt.excluded?("/admin") => false
108
+ robots_txt.allowed?("/blog") => true
109
+
110
+ #pass text directly to visit queue
111
+ mybot = Spider::VisitQueue.new(txt)
112
+ ```
113
+
114
+ Note that you pass the robots.txt text directly to the visit queue - no need to new up the parser yourself. The VisitQueue also has a robot_txt accessor that you can use to read and set the exclusion parser while iterating through the queue:
115
+
116
+ ```ruby
117
+ mybot.visit_each do |url|
118
+ #download a new robots.txt from somewhere
119
+ mybot.robot_txt = Spider::ExclusionParser.new(txt)
120
+ end
121
+ ```
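+
+ The exclusion parser also looks at the HTTP status of the robots.txt response: Spiderkit adds an http_status accessor to String, and a robots.txt tagged with a 401 or 403 status is treated as deny-all. A minimal sketch:
+
+ ```ruby
+ txt = "User-agent: *\nAllow: /"
+ txt.http_status = 403
+
+ robots_txt = Spider::ExclusionParser.new(txt)
+ robots_txt.excluded?("/anything")  # => true
+ ```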
122
+
123
+ ## Wait Time
124
+
125
+ Ideally a bot should wait for some period of time between requests to avoid crashing websites (less likely) or being blacklisted (more likely). A WaitTime class is provided that encapsulates this waiting logic, along with logic to respond to rate-limit codes and the "crawl-delay" directive found in some robots.txt files. Times are in seconds.
126
+
127
+ You can create it standalone, or get it from an exclusion parser:
128
+
129
+ ```ruby
130
+
131
+ #download a robots.txt with a crawl-delay of 40
132
+
133
+ robots_txt = Spider::ExclusionParser.new(txt)
134
+ delay = robots_txt.wait_time
135
+ delay.value => 40
136
+
137
+ #receive a rate limit code, double wait time
138
+ delay.back_off
139
+
140
+ #actually do the waiting part
141
+ delay.wait
142
+
143
+ #in response to some rate limit codes you'll want
144
+ #to sleep for a while, then back off
145
+ delay.reduce_wait
146
+
147
+
148
+ #after one call to back_off and one call to reduce_wait
149
+ delay.value => 160
150
+
151
+ ```
152
+
153
+ By default a WaitTime will have an initial value of 2 seconds. You can pass a value to new to specify the wait in seconds, although values larger than the maximum allowable value (3 minutes / 180 seconds) will be clamped to that maximum.
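+
+ A quick illustration of the default and the clamping:
+
+ ```ruby
+ delay = Spider::WaitTime.new
+ delay.value  # => 2
+
+ delay = Spider::WaitTime.new(1000)
+ delay.value  # => 180
+ ```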
154
+
155
+
156
+ ## Contributing
157
+
158
+ Bug reports, patches, and pull requests are warmly welcomed at http://github.com/rdormer/spiderkit
data/Rakefile ADDED
@@ -0,0 +1,5 @@
1
+ require "bundler/gem_tasks"
2
+ require "rspec/core/rake_task"
3
+
4
+ RSpec::Core::RakeTask.new(:spec)
5
+ task :default => :spec
data/lib/exclusion.rb ADDED
@@ -0,0 +1,149 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ #==============================================
6
+ # This is the class that parses robots.txt and implements the exclusion
7
+ # checking logic therein. Works by breaking the file up into a hash of
8
+ # arrays of directives for each specified user agent, and then parsing the
9
+ # directives into internal arrays and iterating through the list to find a
10
+ # match. URLs are matched case-sensitively; everything else is case-insensitive.
11
+ # The root url is treated as a special case by using a token for it.
12
+ #==============================================
13
+ require 'cgi'
14
+
15
+ module Spider
16
+
17
+ class ExclusionParser
18
+
19
+ attr_accessor :wait_time
20
+
21
+ DISALLOW = "disallow"
22
+ DELAY = "crawl-delay"
23
+ ALLOW = "allow"
24
+
25
+ MAX_DIRECTIVES = 1000
26
+ NULL_MATCH = "*!*"
27
+
28
+ def initialize(text, agent=nil)
29
+ @skip_list = []
30
+ @agent_key = agent
31
+
32
+ return if text.nil? || text.length.zero?
33
+
34
+ if [401, 403].include? text.http_status
35
+ @skip_list << [NULL_MATCH, true]
36
+ return
37
+ end
38
+
39
+ begin
40
+ config = parse_text(text)
41
+ grab_list(config)
42
+ rescue
43
+ end
44
+ end
45
+
46
+ # Check to see if the given url is matched by any rule
47
+ # in the file, and return its associated status
48
+
49
+ def excluded?(url)
50
+ url = safe_unescape(url)
51
+ @skip_list.each do |entry|
52
+ return entry.last if url.include? entry.first
53
+ return entry.last if entry.first == NULL_MATCH
54
+ end
55
+
56
+ false
57
+ end
58
+
59
+ def allowed?(url)
60
+ !excluded?(url)
61
+ end
62
+
63
+ private
64
+
65
+ # Method to process the list of directives for a given user agent.
66
+ # Picks the one that applies to us, and then processes its directives
67
+ # into the skip list by splitting the strings and taking the appropriate
68
+ # action. Stops after a set number of directives to avoid malformed files
69
+ # or denial of service attacks
70
+
71
+ def grab_list(config)
72
+ section = (config.include?(@agent_key) ?
73
+ config[@agent_key] : config['*'])
74
+
75
+ if(section.length > MAX_DIRECTIVES)
76
+ section.slice!(MAX_DIRECTIVES, section.length)
77
+ end
78
+
79
+ section.each do |pair|
80
+ key, value = pair.split(':')
81
+
82
+ next if key.nil? || value.nil? ||
83
+ key.empty? || value.empty?
84
+
85
+ key.downcase!
86
+ key.lstrip!
87
+ key.rstrip!
88
+
89
+ value.lstrip!
90
+ value.rstrip!
91
+
92
+ disallow(value) if key == DISALLOW
93
+ delay(value) if key == DELAY
94
+ allow(value) if key == ALLOW
95
+ end
96
+ end
97
+
98
+ # Top level file parsing method - makes sure carriage returns work,
99
+ # strips out any BOM, then loops through each line and opens up a new
100
+ # array of directives in the hash if a user-agent directive is found
101
+
102
+ def parse_text(text)
103
+ current_key = ""
104
+ config = {}
105
+
106
+ text.gsub!("\r", "\n")
107
+ text.gsub!("\xEF\xBB\xBF".force_encoding("ASCII-8BIT"), '')
108
+
109
+ text.each_line do |line|
110
+ line.lstrip!
111
+ line.rstrip!
112
+ line.gsub! /#.*/, ''
113
+
114
+ if line.length.nonzero? && line =~ /[^\s]/
115
+
116
+ if line =~ /User-agent:\s+(.+)/i
117
+ current_key = $1.downcase
118
+ config[current_key] = [] unless config[current_key]
119
+ next
120
+ end
121
+
122
+ config[current_key] << line
123
+ end
124
+ end
125
+
126
+ config
127
+ end
128
+
129
+ def disallow(value)
130
+ token = (value == "/" ? NULL_MATCH : value.chomp('*'))
131
+ @skip_list << [safe_unescape(token), true]
132
+ end
133
+
134
+ def allow(value)
135
+ token = (value == "/" ? NULL_MATCH : value.chomp('*'))
136
+ @skip_list << [safe_unescape(token), false]
137
+ end
138
+
139
+ def delay(value)
140
+ @wait_time = WaitTime.new(value.to_i)
141
+ end
142
+
143
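+ # Unescape url-encoded characters, but keep a literal %2f intact so that an
+ # encoded slash is not confused with a real path separator
+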
+ def safe_unescape(target)
144
+ t = target.gsub /%2f/, '^^^'
145
+ t = CGI.unescape(t)
146
+ t.gsub /\^\^\^/, '%2f'
147
+ end
148
+ end
149
+ end
data/lib/queue.rb ADDED
@@ -0,0 +1,80 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ require 'bloom-filter'
6
+ require 'exclusion'
7
+
8
+ module Spider
9
+
10
+ class VisitQueue
11
+
12
+ class IterationExit < Exception; end
13
+
14
+ attr_accessor :visit_count
15
+ attr_accessor :robot_txt
16
+
17
+ def initialize(robots=nil, agent=nil, finish=nil)
18
+ @visited = BloomFilter.new(size: 10_000, error_rate: 0.001)
19
+ @robot_txt = ExclusionParser.new(robots, agent) if robots
20
+ @finalize = finish
21
+ @visit_count = 0
22
+ @pending = []
23
+ end
24
+
25
+ def visit_each
26
+ begin
27
+ until @pending.empty?
28
+ url = @pending.pop
29
+ if url_okay(url)
30
+ yield url if block_given?
31
+ @visited.insert(url)
32
+ @visit_count += 1
33
+ end
34
+ end
35
+ rescue IterationExit
36
+ end
37
+
38
+ @finalize.call if @finalize
39
+ end
40
+
41
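+ # Urls are consumed from the end of the @pending array (see the pop in
+ # visit_each), so push_front appends to the end and push_back prepends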
+ def push_front(urls)
42
+ add_url(urls) {|u| @pending.push(u)}
43
+ end
44
+
45
+ def push_back(urls)
46
+ add_url(urls) {|u| @pending.unshift(u)}
47
+ end
48
+
49
+ def size
50
+ @pending.size
51
+ end
52
+
53
+ def empty?
54
+ @pending.empty?
55
+ end
56
+
57
+ def stop
58
+ raise IterationExit
59
+ end
60
+
61
+ private
62
+
63
+ def url_okay(url)
64
+ return false if @visited.include?(url)
65
+ return false if @robot_txt && @robot_txt.excluded?(url)
66
+ true
67
+ end
68
+
69
+ def add_url(urls)
70
+ urls = [urls] unless urls.is_a? Array
71
+ urls.compact!
72
+
73
+ urls.each do |url|
74
+ unless @visited.include?(url)
75
+ yield url
76
+ end
77
+ end
78
+ end
79
+ end
80
+ end
data/lib/spiderkit.rb ADDED
@@ -0,0 +1,14 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ $: << File.dirname(__FILE__)
6
+ require 'wait_time'
7
+ require 'exclusion'
8
+ require 'urltree'
9
+ require 'version'
10
+ require 'queue'
11
+
12
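+ # Robots.txt text is passed around as a plain string; this accessor lets callers
+ # attach the HTTP status of the fetch so the exclusion parser can treat a
+ # 401 or 403 response as deny-all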
+ class String
13
+ attr_accessor :http_status
14
+ end
data/lib/version.rb ADDED
@@ -0,0 +1,7 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ module Spider
6
+ VERSION = "0.1.0"
7
+ end
data/lib/wait_time.rb ADDED
@@ -0,0 +1,50 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ #==============================================
6
+ #Class to encapsulate the crawl delay being used.
7
+ #Clamps the value to a maximum amount and implements
8
+ #an exponential backoff function for responding to
9
+ #rate limit requests
10
+ #==============================================
11
+
12
+ module Spider
13
+
14
+ class WaitTime
15
+
16
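+ # All times are in seconds; REDUCE_WAIT is how long reduce_wait sleeps
+ # before backing off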
+ MAX_WAIT = 180
17
+ DEFAULT_WAIT = 2
18
+ REDUCE_WAIT = 300
19
+
20
+ def initialize(period=nil)
21
+ unless period.nil?
22
+ @wait = (period > MAX_WAIT ? MAX_WAIT : period)
23
+ else
24
+ @wait = DEFAULT_WAIT
25
+ end
26
+ end
27
+
28
+ def back_off
29
+ if @wait.zero?
30
+ @wait = DEFAULT_WAIT
31
+ else
32
+ waitval = @wait * 2
33
+ @wait = (waitval > MAX_WAIT ? MAX_WAIT : waitval)
34
+ end
35
+ end
36
+
37
+ def wait
38
+ sleep(@wait)
39
+ end
40
+
41
+ def reduce_wait
42
+ sleep(REDUCE_WAIT)
43
+ back_off
44
+ end
45
+
46
+ def value
47
+ @wait
48
+ end
49
+ end
50
+ end
@@ -0,0 +1,481 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ #See:
6
+ #http://www.robotstxt.org/orig.html
7
+ #http://www.robotstxt.org/norobots-rfc.txt
8
+
9
+ require File.dirname(__FILE__) + '/../lib/spiderkit'
10
+
11
+ module Spider
12
+
13
+ describe ExclusionParser do
14
+
15
+ describe "General file handling" do
16
+ it "should ignore comments" do
17
+ txt = <<-eos
18
+ user-agent: *
19
+ allow: /
20
+ #disallow: /
21
+ eos
22
+
23
+ @bottxt = described_class.new(txt)
24
+ expect(@bottxt.excluded?('/')).to be false
25
+ end
26
+
27
+ it "should ignore comments starting with whitespace" do
28
+ txt = <<-eos
29
+ user-agent: *
30
+ allow: /
31
+ #disallow: /
32
+ eos
33
+
34
+ @bottxt = described_class.new(txt)
35
+ expect(@bottxt.excluded?('/')).to be false
36
+ end
37
+
38
+ it "should cleanly handle winged comments" do
39
+ txt = <<-eos
40
+ user-agent: *
41
+ allow: / #disallow: /
42
+ eos
43
+
44
+ @bottxt = described_class.new(txt)
45
+ expect(@bottxt.excluded?('/')).to be false
46
+ end
47
+
48
+ it "should ignore unrecognized headers" do
49
+ txt = <<-eos
50
+ user-agent: *
51
+ allow: /
52
+ whargarbl: /
53
+ eos
54
+
55
+ @bottxt = described_class.new(txt)
56
+ expect(@bottxt.excluded?('/')).to be false
57
+ end
58
+
59
+ it "should completely ignore an empty file" do
60
+ @bottxt = described_class.new('')
61
+ expect(@bottxt.excluded?('/')).to be false
62
+ expect(@bottxt.excluded?('/test')).to be false
63
+ end
64
+
65
+ it "should stop processing after 1000 directives" do
66
+ txt = <<-eos
67
+ user-agent: *
68
+ eos
69
+
70
+ (1..1002).each {|x| txt += "disallow: /#{x}--\r\n"}
71
+ @bottxt = described_class.new(txt)
72
+
73
+ #remember, we're doing start-of-string matching here,
74
+ #so we need a delimiter or else 100 matches 1001, 1002...
75
+
76
+ expect(@bottxt.excluded?('/1--')).to be true
77
+ expect(@bottxt.excluded?('/100--')).to be true
78
+ expect(@bottxt.excluded?('/1000--')).to be true
79
+ expect(@bottxt.excluded?('/1001--')).to be false
80
+ expect(@bottxt.excluded?('/1002--')).to be false
81
+ end
82
+
83
+ it "should die cleanly on html" do
84
+ txt = <<-eos
85
+ <html>
86
+ <head></head>
87
+ <body></body>
88
+ </html>
89
+ eos
90
+
91
+ @bottxt = described_class.new(txt)
92
+ expect(@bottxt.excluded?('/')).to be false
93
+ end
94
+
95
+ it "should drop byte order marks" do
96
+ txt = <<-eos
97
+ \xEF\xBB\xBF
98
+ user-agent: *
99
+ disallow: /
100
+ eos
101
+
102
+ @bottxt = described_class.new(txt)
103
+ expect(@bottxt.excluded?('/')).to be true
104
+ end
105
+
106
+ it "should be open if no user agent matches and there is no default" do
107
+ txt = <<-eos
108
+ user-agent: test1
109
+ disallow: /
110
+ user-agent: test2
111
+ disallow: /
112
+ eos
113
+
114
+ @bottxt = described_class.new(txt)
115
+ expect(@bottxt.excluded?('/')).to be false
116
+ end
117
+
118
+ it "should handle nil text" do
119
+ @bottxt = described_class.new(nil)
120
+ expect(@bottxt.excluded?('/')).to be false
121
+ end
122
+
123
+ it "should default to deny-all if unauthorized" do
124
+ txt = <<-eos
125
+ user-agent: *
126
+ allow: /
127
+ eos
128
+
129
+ txt.http_status = 401
130
+ @bottxt = described_class.new(txt)
131
+ expect(@bottxt.excluded?('/')).to be true
132
+
133
+ txt.http_status = 403
134
+ @bottxt = described_class.new(txt)
135
+ expect(@bottxt.excluded?('/')).to be true
136
+ end
137
+ end
138
+
139
+ describe "General directive handling" do
140
+ it "should split on CR" do
141
+ txt = "user-agent: *\rdisallow: /"
142
+ @bottxt = described_class.new(txt)
143
+ expect(@bottxt.excluded?('/')).to be true
144
+ end
145
+
146
+ it "should split on NL" do
147
+ txt = "user-agent: *\ndisallow: /"
148
+ @bottxt = described_class.new(txt)
149
+ expect(@bottxt.excluded?('/')).to be true
150
+ end
151
+
152
+ it "should split on CR/NL" do
153
+ txt = "user-agent: *\r\ndisallow: /"
154
+ @bottxt = described_class.new(txt)
155
+ expect(@bottxt.excluded?('/')).to be true
156
+ end
157
+
158
+ it "should be whitespace insensitive" do
159
+ txt = <<-eos
160
+ user-agent: *
161
+ allow: /tmp
162
+ disallow: /
163
+ eos
164
+
165
+ @bottxt = described_class.new(txt)
166
+ expect(@bottxt.excluded?('/')).to be true
167
+ expect(@bottxt.excluded?('/tmp')).to be false
168
+ end
169
+
170
+ it "should match directives case insensitively" do
171
+ txt = <<-eos
172
+ user-agent: *
173
+ DISALLOW: /test1
174
+ ALLOW: /test2
175
+ CRAWL-DELAY: 60
176
+ eos
177
+
178
+ @bottxt = described_class.new(txt)
179
+ expect(@bottxt.excluded?('/test1')).to be true
180
+ expect(@bottxt.excluded?('/test2')).to be false
181
+ end
182
+ end
183
+
184
+ describe "User agent handling" do
185
+ it "should do a case insensitive agent match" do
186
+ txt1 = <<-eos
187
+ user-agent: testbot
188
+ disallow: /
189
+ eos
190
+
191
+ txt2 = <<-eos
192
+ user-agent: TESTbot
193
+ disallow: /
194
+ eos
195
+
196
+ txt3 = <<-eos
197
+ user-agent: TESTBOT
198
+ disallow: /
199
+ eos
200
+
201
+ @bottxt1 = described_class.new(txt1, 'testbot')
202
+ @bottxt2 = described_class.new(txt2, 'testbot')
203
+ @bottxt3 = described_class.new(txt3, 'testbot')
204
+
205
+ expect(@bottxt1.excluded?('/')).to be true
206
+ expect(@bottxt2.excluded?('/')).to be true
207
+ expect(@bottxt3.excluded?('/')).to be true
208
+ end
209
+
210
+ it "should handle default user agent" do
211
+ txt = <<-eos
212
+ user-agent: *
213
+ disallow: /test
214
+ eos
215
+
216
+ @bottxt = described_class.new(txt)
217
+ expect(@bottxt.excluded?('/test')).to be true
218
+ end
219
+
220
+ it "should use only the first of multiple default user agents" do
221
+ txt = <<-eos
222
+ user-agent: *
223
+ disallow: /
224
+
225
+ user-agent: *
226
+ allow: /
227
+ eos
228
+
229
+ @bottxt = described_class.new(txt)
230
+ expect(@bottxt.excluded?('/')).to be true
231
+ end
232
+
233
+ it "should give precedence to a matching user agent over default" do
234
+ txt = <<-eos
235
+ user-agent: testbot
236
+ disallow: /
237
+
238
+ user-agent: *
239
+ disallow:
240
+ eos
241
+
242
+ @bottxt = described_class.new(txt, 'testbot')
243
+ expect(@bottxt.excluded?('/')).to be true
244
+ end
245
+
246
+ xit "should allow cascading user-agent strings"
247
+ end
248
+
249
+ describe "Disallow directive" do
250
+ it "should allow all urls if disallow is empty" do
251
+ txt = <<-eos
252
+ user-agent: *
253
+ disallow:
254
+ eos
255
+
256
+ @bottxt = described_class.new(txt)
257
+ expect(@bottxt.excluded?('/')).to be false
258
+ expect(@bottxt.excluded?('test')).to be false
259
+ expect(@bottxt.excluded?('/test')).to be false
260
+ end
261
+
262
+ it "should blacklist any url starting with the specified string" do
263
+ txt = <<-eos
264
+ user-agent: *
265
+ disallow: /tmp
266
+ eos
267
+
268
+ @bottxt = described_class.new(txt)
269
+ expect(@bottxt.excluded?('/tmp')).to be true
270
+ expect(@bottxt.excluded?('/tmp1234')).to be true
271
+ expect(@bottxt.excluded?('/tmp/stuff')).to be true
272
+ expect(@bottxt.excluded?('/tmporary')).to be true
273
+
274
+ expect(@bottxt.excluded?('/nottmp')).to be false
275
+ expect(@bottxt.excluded?('tmp')).to be false
276
+ end
277
+
278
+ it "should blacklist all urls if root is specified" do
279
+ txt = <<-eos
280
+ user-agent: *
281
+ disallow: /
282
+ eos
283
+
284
+ @bottxt = described_class.new(txt)
285
+ expect(@bottxt.excluded?('/')).to be true
286
+ expect(@bottxt.excluded?('/nottmp')).to be true
287
+ expect(@bottxt.excluded?('/test')).to be true
288
+ expect(@bottxt.excluded?('nottmp')).to be true
289
+ expect(@bottxt.excluded?('test')).to be true
290
+ end
291
+
292
+ it "should match urls case sensitively" do
293
+ txt = <<-eos
294
+ user-agent: *
295
+ disallow: /tmp
296
+ eos
297
+
298
+ @bottxt = described_class.new(txt)
299
+ expect(@bottxt.excluded?('/tmp')).to be true
300
+ expect(@bottxt.excluded?('/TMP')).to be false
301
+ expect(@bottxt.excluded?('/Tmp')).to be false
302
+ end
303
+
304
+ it "should decode url encoded characters" do
305
+ txt = <<-eos
306
+ user-agent: *
307
+ disallow: /a%3cd.html
308
+ eos
309
+
310
+ @bottxt = described_class.new(txt)
311
+ expect(@bottxt.excluded?('/a%3cd.html')).to be true
312
+ expect(@bottxt.excluded?('/a%3Cd.html')).to be true
313
+
314
+ txt = <<-eos
315
+ user-agent: *
316
+ disallow: /a%3Cd.html
317
+ eos
318
+
319
+ @bottxt = described_class.new(txt)
320
+ expect(@bottxt.excluded?('/a%3cd.html')).to be true
321
+ expect(@bottxt.excluded?('/a%3Cd.html')).to be true
322
+ end
323
+
324
+ it "should not decode %2f" do
325
+ txt = <<-eos
326
+ user-agent: *
327
+ disallow: /a%2fb.html
328
+ eos
329
+
330
+ @bottxt = described_class.new(txt)
331
+ expect(@bottxt.excluded?('/a%2fb.html')).to be true
332
+ expect(@bottxt.excluded?('/a/b.html')).to be false
333
+
334
+ txt = <<-eos
335
+ user-agent: *
336
+ disallow: /a/b.html
337
+ eos
338
+
339
+ @bottxt = described_class.new(txt)
340
+ expect(@bottxt.excluded?('/a%2fb.html')).to be false
341
+ expect(@bottxt.excluded?('/a/b.html')).to be true
342
+ end
343
+
344
+ it "should override allow if it comes first" do
345
+ txt = <<-eos
346
+ user-agent: *
347
+ disallow: /tmp
348
+ allow: /tmp
349
+ eos
350
+
351
+ @bottxt = described_class.new(txt)
352
+ expect(@bottxt.excluded?('/tmp')).to be true
353
+ end
354
+ end
355
+
356
+ describe "Allow directive" do
357
+ it "should override disallow if it comes first" do
358
+ txt = <<-eos
359
+ user-agent: *
360
+ allow: /tmp
361
+ disallow: /tmp
362
+ eos
363
+
364
+ @bottxt = described_class.new(txt)
365
+ expect(@bottxt.excluded?('/tmp')).to be false
366
+ end
367
+
368
+ it "should override disallow root if it comes first" do
369
+ txt = <<-eos
370
+ user-agent: *
371
+ allow: /tmp
372
+ allow: /test
373
+ disallow: /
374
+ eos
375
+
376
+ @bottxt = described_class.new(txt)
377
+ expect(@bottxt.excluded?('/tmp')).to be false
378
+ expect(@bottxt.excluded?('/test')).to be false
379
+ expect(@bottxt.excluded?('/other1')).to be true
380
+ expect(@bottxt.excluded?('/other2')).to be true
381
+ end
382
+
383
+ it "allowing root should blacklist nothing" do
384
+ txt = <<-eos
385
+ user-agent: *
386
+ allow: /
387
+ eos
388
+
389
+ @bottxt = described_class.new(txt)
390
+ expect(@bottxt.excluded?('/tmp')).to be false
391
+ expect(@bottxt.excluded?('/test')).to be false
392
+ expect(@bottxt.excluded?('/zzz')).to be false
393
+ end
394
+ end
395
+
396
+ describe "Crawl-Delay directive" do
397
+ it "should set the crawl delay" do
398
+ txt = <<-eos
399
+ user-agent: *
400
+ crawl-delay: 100
401
+ eos
402
+
403
+ @bottxt = described_class.new(txt)
404
+ expect(@bottxt.wait_time.value).to eq 100
405
+ end
406
+
407
+ it "should limit wait time to 180 seconds" do
408
+ txt = <<-eos
409
+ user-agent: *
410
+ crawl-delay: 1000
411
+ eos
412
+
413
+ @bottxt = described_class.new(txt)
414
+ expect(@bottxt.wait_time.value).to eq 180
415
+ end
416
+ end
417
+
418
+ describe "RFC Examples" do
419
+ it "#1" do
420
+ txt = <<-eos
421
+ User-agent: *
422
+ Disallow: /org/plans.html
423
+ Allow: /org/
424
+ Allow: /serv
425
+ Allow: /~mak
426
+ Disallow: /
427
+ eos
428
+
429
+ @bottxt = described_class.new(txt)
430
+ expect(@bottxt.excluded?('/')).to be true
431
+ expect(@bottxt.excluded?('/index.html')).to be true
432
+ expect(@bottxt.excluded?('/server.html')).to be false
433
+ expect(@bottxt.excluded?('/services/fast.html')).to be false
434
+ expect(@bottxt.excluded?('/services/slow.html')).to be false
435
+ expect(@bottxt.excluded?('/orgo.gif')).to be true
436
+ expect(@bottxt.excluded?('/org/about.html')).to be false
437
+ expect(@bottxt.excluded?('/org/plans.html')).to be true
438
+ expect(@bottxt.excluded?('/%7Ejim/jim.html ')).to be true
439
+ expect(@bottxt.excluded?('/%7Emak/mak.html')).to be false
440
+ end
441
+
442
+ it "#2" do
443
+ txt = <<-eos
444
+ User-agent: *
445
+ Disallow: /
446
+ eos
447
+
448
+ @bottxt = described_class.new(txt)
449
+ expect(@bottxt.excluded?('/')).to be true
450
+ expect(@bottxt.excluded?('/index.html')).to be true
451
+ expect(@bottxt.excluded?('/server.html')).to be true
452
+ expect(@bottxt.excluded?('/services/fast.html')).to be true
453
+ expect(@bottxt.excluded?('/services/slow.html')).to be true
454
+ expect(@bottxt.excluded?('/orgo.gif')).to be true
455
+ expect(@bottxt.excluded?('/org/about.html')).to be true
456
+ expect(@bottxt.excluded?('/org/plans.html')).to be true
457
+ expect(@bottxt.excluded?('/%7Ejim/jim.html ')).to be true
458
+ expect(@bottxt.excluded?('/%7Emak/mak.html')).to be true
459
+ end
460
+
461
+ it "#3" do
462
+ txt = <<-eos
463
+ User-agent: *
464
+ Disallow:
465
+ eos
466
+
467
+ @bottxt = described_class.new(txt)
468
+ expect(@bottxt.excluded?('/')).to be false
469
+ expect(@bottxt.excluded?('/index.html')).to be false
470
+ expect(@bottxt.excluded?('/server.html')).to be false
471
+ expect(@bottxt.excluded?('/services/fast.html')).to be false
472
+ expect(@bottxt.excluded?('/services/slow.html')).to be false
473
+ expect(@bottxt.excluded?('/orgo.gif')).to be false
474
+ expect(@bottxt.excluded?('/org/about.html')).to be false
475
+ expect(@bottxt.excluded?('/org/plans.html')).to be false
476
+ expect(@bottxt.excluded?('/%7Ejim/jim.html ')).to be false
477
+ expect(@bottxt.excluded?('/%7Emak/mak.html')).to be false
478
+ end
479
+ end
480
+ end
481
+ end
@@ -0,0 +1,129 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ require File.dirname(__FILE__) + '/../lib/spiderkit'
6
+
7
+ def get_visit_order(q)
8
+ order = []
9
+
10
+ q.visit_each do |t|
11
+ order << t
12
+ end
13
+
14
+ order
15
+ end
16
+
17
+
18
+ module Spider
19
+
20
+ describe VisitQueue do
21
+ before(:each) do
22
+ @queue = described_class.new
23
+ @queue.push_front('divider')
24
+ end
25
+
26
+ it 'should allow appending to front of the queue' do
27
+ @queue.push_front('two')
28
+ @queue.push_front('one')
29
+
30
+ order = get_visit_order(@queue)
31
+ expect(order).to eq %w(one two divider)
32
+ end
33
+
34
+
35
+ it 'should allow appending to the back of the queue' do
36
+ @queue.push_back('one')
37
+ @queue.push_back('two')
38
+
39
+ order = get_visit_order(@queue)
40
+ expect(order).to eq %w(divider one two)
41
+ end
42
+
43
+ it 'should allow appending array of urls to front of the queue' do
44
+ @queue.push_front(%w(two one))
45
+ order = get_visit_order(@queue)
46
+ expect(order).to eq %w(one two divider)
47
+ end
48
+
49
+ it 'should allow appending array of urls to back of the queue' do
50
+ @queue.push_back(%w(one two))
51
+ order = get_visit_order(@queue)
52
+ expect(order).to eq %w(divider one two)
53
+ end
54
+
55
+ it 'should not allow appending nil to the queue' do
56
+ expect(@queue.size).to eq 1
57
+
58
+ @queue.push_back(nil)
59
+ @queue.push_front(nil)
60
+ @queue.push_back([nil, nil, nil])
61
+ @queue.push_front([nil, nil, nil])
62
+
63
+ expect(@queue.size).to eq 1
64
+ end
65
+
66
+ it 'should visit urls appended during iteration' do
67
+ @queue.push_front(%w(one two))
68
+ extra_urls = %w(three four five)
69
+ order = []
70
+
71
+ @queue.visit_each do |t|
72
+ order << t
73
+ @queue.push_back(extra_urls.pop)
74
+ end
75
+
76
+ expect(order).to eq %w(two one divider five four three)
77
+ end
78
+
79
+ it 'should ignore appending if url has already been visited' do
80
+ @queue.visit_each
81
+ expect(@queue.empty?).to be true
82
+ @queue.push_back('divider')
83
+ @queue.push_front('divider')
84
+ @queue.push_back(%w(divider divider))
85
+ @queue.push_front(%w(divider divider))
86
+ expect(@queue.empty?).to be true
87
+ end
88
+
89
+ it 'should track number of urls visited' do
90
+ expect(@queue.visit_count).to eq 0
91
+ @queue.push_back(%w(one two three four))
92
+ @queue.visit_each
93
+ expect(@queue.visit_count).to eq 5
94
+ end
95
+
96
+ it 'should not visit urls blocked by robots.txt' do
97
+ rtext = <<-REND
98
+ User-agent: *
99
+ disallow: two
100
+ disallow: three
101
+ disallow: four
102
+ REND
103
+
104
+ rtext_queue = described_class.new(rtext)
105
+ rtext_queue.push_front(%w(one two three four five six))
106
+ order = get_visit_order(rtext_queue)
107
+ expect(order).to eq %w(six five one)
108
+ end
109
+
110
+ it 'should execute a finalizer if given' do
111
+ flag = false
112
+ final = Proc.new { flag = true }
113
+ queue = described_class.new(nil, nil, final)
114
+ queue.visit_each
115
+ expect(flag).to be true
116
+ end
117
+
118
+ it 'should execute the finalizer even when breaking the loop' do
119
+ flag = false
120
+ final = Proc.new { flag = true }
121
+ queue = described_class.new(nil, nil, final)
122
+ queue.push_back((1..20).to_a)
123
+ queue.visit_each { queue.stop if queue.visit_count >= 1 }
124
+ expect(queue.visit_count).to eq 1
125
+ expect(flag).to be true
126
+ end
127
+ end
128
+
129
+ end
@@ -0,0 +1,51 @@
1
+ # Author:: Robert Dormer (mailto:rdormer@gmail.com)
2
+ # Copyright:: Copyright (c) 2016 Robert Dormer
3
+ # License:: MIT
4
+
5
+ require File.dirname(__FILE__) + '/../lib/spiderkit'
6
+
7
+ module Spider
8
+ describe WaitTime do
9
+
10
+ it 'should have a getter for the value' do
11
+ wait = described_class.new(100)
12
+ expect(wait.value).to eq(100)
13
+ end
14
+
15
+ it 'should clamp the wait time argument to three minutes' do
16
+ wait = described_class.new(1000)
17
+ expect(wait.value).to eq(180)
18
+ end
19
+
20
+ it 'should have a default wait time' do
21
+ wait = described_class.new
22
+ expect(wait.value).to eq(2)
23
+ end
24
+
25
+ describe '#back_off' do
26
+ it 'if wait is zero, should set default wait time' do
27
+ wait = described_class.new(0)
28
+ wait.back_off
29
+ expect(wait.value).to eq(2)
30
+ end
31
+
32
+ it 'should double the wait time every time called' do
33
+ wait = described_class.new(10)
34
+ wait.back_off
35
+ expect(wait.value).to eq(20)
36
+ wait.back_off
37
+ expect(wait.value).to eq(40)
38
+ wait.back_off
39
+ expect(wait.value).to eq(80)
40
+ end
41
+
42
+ it 'should not double beyond the maximum value' do
43
+ wait = described_class.new(90)
44
+ wait.back_off
45
+ expect(wait.value).to eq(180)
46
+ wait.back_off
47
+ expect(wait.value).to eq(180)
48
+ end
49
+ end
50
+ end
51
+ end
data/spiderkit.gemspec ADDED
@@ -0,0 +1,26 @@
1
+ # coding: utf-8
2
+ lib = File.expand_path('../lib', __FILE__)
3
+ $LOAD_PATH.unshift(lib) unless $LOAD_PATH.include?(lib)
4
+ require 'version'
5
+
6
+ Gem::Specification.new do |spec|
7
+ spec.name = "spiderkit"
8
+ spec.version = Spider::VERSION
9
+ spec.authors = ["Robert Dormer"]
10
+ spec.email = ["rdormer@gmail.com"]
11
+ spec.description = %q{Spiderkit library for basic spiders and bots}
12
+ spec.summary = %q{Basic toolkit for writing web spiders and bots}
13
+ spec.homepage = "http://github.com/rdormer/spiderkit"
14
+ spec.license = "MIT"
15
+
16
+ spec.files = `git ls-files`.split($/)
17
+ spec.executables = spec.files.grep(%r{^bin/}) { |f| File.basename(f) }
18
+ spec.test_files = spec.files.grep(%r{^(test|spec|features)/})
19
+ spec.require_paths = ["lib"]
20
+
21
+ spec.add_development_dependency "bundler", "~> 1.3"
22
+ spec.add_development_dependency "rspec", "~> 3.4.0"
23
+ spec.add_development_dependency "rake"
24
+
25
+ spec.add_dependency "bloom-filter", "~> 0.2.0"
26
+ end
metadata ADDED
@@ -0,0 +1,126 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: spiderkit
3
+ version: !ruby/object:Gem::Version
4
+ version: 0.1.0
5
+ prerelease:
6
+ platform: ruby
7
+ authors:
8
+ - Robert Dormer
9
+ autorequire:
10
+ bindir: bin
11
+ cert_chain: []
12
+ date: 2016-07-10 00:00:00.000000000 Z
13
+ dependencies:
14
+ - !ruby/object:Gem::Dependency
15
+ name: bundler
16
+ requirement: !ruby/object:Gem::Requirement
17
+ none: false
18
+ requirements:
19
+ - - ~>
20
+ - !ruby/object:Gem::Version
21
+ version: '1.3'
22
+ type: :development
23
+ prerelease: false
24
+ version_requirements: !ruby/object:Gem::Requirement
25
+ none: false
26
+ requirements:
27
+ - - ~>
28
+ - !ruby/object:Gem::Version
29
+ version: '1.3'
30
+ - !ruby/object:Gem::Dependency
31
+ name: rspec
32
+ requirement: !ruby/object:Gem::Requirement
33
+ none: false
34
+ requirements:
35
+ - - ~>
36
+ - !ruby/object:Gem::Version
37
+ version: 3.4.0
38
+ type: :development
39
+ prerelease: false
40
+ version_requirements: !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ~>
44
+ - !ruby/object:Gem::Version
45
+ version: 3.4.0
46
+ - !ruby/object:Gem::Dependency
47
+ name: rake
48
+ requirement: !ruby/object:Gem::Requirement
49
+ none: false
50
+ requirements:
51
+ - - ! '>='
52
+ - !ruby/object:Gem::Version
53
+ version: '0'
54
+ type: :development
55
+ prerelease: false
56
+ version_requirements: !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ! '>='
60
+ - !ruby/object:Gem::Version
61
+ version: '0'
62
+ - !ruby/object:Gem::Dependency
63
+ name: bloom-filter
64
+ requirement: !ruby/object:Gem::Requirement
65
+ none: false
66
+ requirements:
67
+ - - ~>
68
+ - !ruby/object:Gem::Version
69
+ version: 0.2.0
70
+ type: :runtime
71
+ prerelease: false
72
+ version_requirements: !ruby/object:Gem::Requirement
73
+ none: false
74
+ requirements:
75
+ - - ~>
76
+ - !ruby/object:Gem::Version
77
+ version: 0.2.0
78
+ description: Spiderkit library for basic spiders and bots
79
+ email:
80
+ - rdormer@gmail.com
81
+ executables: []
82
+ extensions: []
83
+ extra_rdoc_files: []
84
+ files:
85
+ - Gemfile
86
+ - LICENSE.txt
87
+ - README.md
88
+ - Rakefile
89
+ - lib/exclusion.rb
90
+ - lib/queue.rb
91
+ - lib/spiderkit.rb
92
+ - lib/version.rb
93
+ - lib/wait_time.rb
94
+ - spec/exclusion_parser_spec.rb
95
+ - spec/visit_queue_spec.rb
96
+ - spec/wait_time_spec.rb
97
+ - spiderkit.gemspec
98
+ homepage: http://github.com/rdormer/spiderkit
99
+ licenses:
100
+ - MIT
101
+ post_install_message:
102
+ rdoc_options: []
103
+ require_paths:
104
+ - lib
105
+ required_ruby_version: !ruby/object:Gem::Requirement
106
+ none: false
107
+ requirements:
108
+ - - ! '>='
109
+ - !ruby/object:Gem::Version
110
+ version: '0'
111
+ required_rubygems_version: !ruby/object:Gem::Requirement
112
+ none: false
113
+ requirements:
114
+ - - ! '>='
115
+ - !ruby/object:Gem::Version
116
+ version: '0'
117
+ requirements: []
118
+ rubyforge_project:
119
+ rubygems_version: 1.8.25
120
+ signing_key:
121
+ specification_version: 3
122
+ summary: Basic toolkit for writing web spiders and bots
123
+ test_files:
124
+ - spec/exclusion_parser_spec.rb
125
+ - spec/visit_queue_spec.rb
126
+ - spec/wait_time_spec.rb