RubyGems - spidr - Versions diffs - 0.1.9 → 0.2.0 - Mend

spidr 0.1.9 → 0.2.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Files changed (37)

data.tar.gz.sig +0 -0
data/History.txt +43 -0
data/Manifest.txt +19 -0
data/README.txt +100 -11
data/Rakefile +15 -5
data/lib/spidr/actions.rb +2 -0
data/lib/spidr/actions/actions.rb +79 -0
data/lib/spidr/actions/exceptions.rb +4 -0
data/lib/spidr/actions/exceptions/action.rb +6 -0
data/lib/spidr/actions/exceptions/paused.rb +8 -0
data/lib/spidr/actions/exceptions/skip_link.rb +8 -0
data/lib/spidr/actions/exceptions/skip_page.rb +8 -0
data/lib/spidr/agent.rb +385 -444
data/lib/spidr/events.rb +87 -0
data/lib/spidr/extensions.rb +1 -0
data/lib/spidr/extensions/uri.rb +45 -0
data/lib/spidr/filters.rb +438 -0
data/lib/spidr/page.rb +211 -70
data/lib/spidr/rules.rb +40 -18
data/lib/spidr/spidr.rb +57 -7
data/lib/spidr/version.rb +2 -1
data/spec/actions_spec.rb +61 -0
data/spec/agent_spec.rb +24 -31
data/spec/extensions/uri_spec.rb +39 -0
data/spec/filters_spec.rb +53 -0
data/spec/helpers/page.rb +8 -0
data/spec/page_examples.rb +17 -0
data/spec/page_spec.rb +81 -0
data/spec/rules_spec.rb +43 -0
data/spec/spec_helper.rb +1 -1
data/spec/spidr_spec.rb +30 -0
data/static/course/specs.json +1 -1
data/tasks/course.rb +8 -1
data/tasks/spec.rb +1 -0
data/tasks/yard.rb +12 -0
metadata +45 -6
metadata.gz.sig +0 -0

data.tar.gz.sig CHANGED

Binary file

data/History.txt CHANGED

@@ -1,3 +1,46 @@
+=== 0.2.0 / 2009-10-10
+* Added URI.expand_path.
+* Added Spidr::Page#search.
+* Added Spidr::Page#at.
+* Added Spidr::Page#title.
+* Added Spidr::Agent#failures=.
+* Added a HTTP session cache to Spidr::Agent, per suggestion of falter.
+  * Added Spidr::Agent#get_session.
+  * Added Spidr::Agent#kill_session.
+* Added Spidr.proxy=.
+* Added Spidr.disable_proxy!.
+* Aliased Spidr::Page#txt? to Spidr::Page#plain_text?.
+* Aliased Spidr::Page#ok? to Spidr::Page#is_ok?.
+* Aliased Spidr::Page#redirect? to Spidr::Page#is_redirect?.
+* Aliased Spidr::Page#unauthorized? to Spidr::Page#is_unauthorized?.
+* Aliased Spidr::Page#forbidden? to Spidr::Page#is_forbidden?.
+* Aliased Spidr::Page#missing? to Spidr::Page#is_missing?.
+* Split URL filtering code out of Spidr::Agent and into Spidr::Filtering.
+* Split URL / Page event code out of Spidr::Agent and into Spidr::Events.
+* Split pause! / continue! / skip_link! / skip_page! methods out of
+  Spidr::Agent and into Spidr::Actions.
+* Fixed a bug in Spidr::Page#code, where it was not returning an Integer.
+* Make sure Spidr::Page#doc returns Nokogiri::XML::Document objects for
+  RSS/RDF/Atom pages as well.
+* Fixed the handling of the Location header in Spidr::Page#links
+  (thanks falter).
+* Fixed a bug in Spidr::Page#to_absolute where trailing '/' characters on
+  URI paths were not being preserved (thanks falter).
+* Fixed a bug where the URI query was not being sent with the request
+  in Spidr::Agent#get_page (thanks Damian Steer).
+* Fixed a bug where SSL sessions were not being properly setup
+  (thanks falter).
+* Switched Spidr::Agent#history to be a Set, to improve search-time
+  of the history (thanks falter).
+* Switched Spidr::Agent#failures to a Set.
+* Allow a block to be passed to Spidr::Agent#run, which will receive all
+  pages visited.
+* Allow Spidr::Agent#start_at and Spidr::Agent#continue! to pass blocks to
+  Spidr::Agent#run.
+* Made Spidr::Agent#visit_page public.
+* Moved to YARD based documentation.
 === 0.1.9 / 2009-06-13
 * Upgraded to Hoe 2.0.0.
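
The changelog entry about switching `Spidr::Agent#history` to a `Set` can be illustrated with a minimal sketch. It uses only the standard `set` and `uri` libraries; the `normalize` helper is hypothetical and simply mirrors the `URI(url.to_s)` coercion visible in the diffed `history=` writer.

```ruby
require 'set'
require 'uri'

# Hypothetical helper (not part of spidr): coerce Strings to URI objects,
# leaving existing URI objects untouched.
def normalize(url)
  url.kind_of?(URI) ? url : URI(url.to_s)
end

history = Set[]

# The first URL appears twice on purpose.
['http://example.com/', 'http://example.com/about', 'http://example.com/'].each do |url|
  history << normalize(url)
end

# Set#include? is a hash lookup, so membership checks stay cheap as the
# history grows, and duplicate URLs are collapsed on insertion.
puts history.size
puts history.include?(normalize('http://example.com/about'))
```

An `Array` would need a linear scan for every `include?` call, which is the search-time cost the changelog entry refers to.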

data/Manifest.txt CHANGED

@@ -3,15 +3,34 @@ Manifest.txt
 README.txt
 Rakefile
 lib/spidr.rb
+lib/spidr/extensions.rb
+lib/spidr/extensions/uri.rb
 lib/spidr/page.rb
 lib/spidr/rules.rb
+lib/spidr/filters.rb
+lib/spidr/events.rb
+lib/spidr/actions.rb
+lib/spidr/actions/exceptions.rb
+lib/spidr/actions/exceptions/action.rb
+lib/spidr/actions/exceptions/paused.rb
+lib/spidr/actions/exceptions/skip_link.rb
+lib/spidr/actions/exceptions/skip_page.rb
+lib/spidr/actions/actions.rb
 lib/spidr/agent.rb
 lib/spidr/spidr.rb
 lib/spidr/version.rb
 tasks/spec.rb
+tasks/yard.rb
 tasks/course.rb
 spec/spec_helper.rb
 spec/helpers/course.rb
+spec/helpers/page.rb
+spec/extensions/uri_spec.rb
+spec/page_examples.rb
+spec/page_spec.rb
+spec/rules_spec.rb
+spec/filters_spec.rb
+spec/actions_spec.rb
 spec/agent_spec.rb
 spec/spidr_spec.rb
 static/course/index.html

data/README.txt CHANGED

@@ -28,19 +28,14 @@ and easy to use.
   * Every visited URL.
   * Every visited URL that matches a specified pattern.
   * Every URL that failed to be visited.
-* Pause and continue spidering.
+* Provides action methods to:
+  * Pause spidering.
+  * Skip processing of pages.
+  * Skip processing of links.
 * Restore the spidering queue and history from a previous session.
 * Custom User-Agent strings.
 * Custom proxy settings.
-== REQUIREMENTS:
-* {nokogiri}[http://nokogiri.rubyforge.org/]
-== INSTALL:
-  $ sudo gem install spidr
 == EXAMPLES:
 * Start spidering from a URL:
@@ -49,11 +44,32 @@ and easy to use.
 * Spider a host:
-    Spidr.host('www.0x000000.com')
+    Spidr.host('coderrr.wordpress.com')
 * Spider a site:
-    Spidr.site('http://hackety.org/')
+    Spidr.site('http://rubyflow.com/')
+* Spider multiple hosts:
+    Spidr.start_at(
+      'http://company.com/',
+      :hosts => [
+        'company.com',
+	/host\d\.company\.com/
+      ]
+    )
+* Do not spider certain links:
+    Spidr.site('http://matasano.com/', :ignore_links => [/log/])
+* Do not spider links on certain ports:
+    Spidr.site(
+      'http://sketchy.content.com/',
+      :ignore_ports => [8000, 8010, 8080]
+    )
 * Print out visited URLs:
@@ -61,6 +77,79 @@ and easy to use.
       spider.every_url { |url| puts url }
     end
+* Print out the URLs that could not be requested:
+    Spidr.site('http://sketchy.content.com/') do |spider|
+      spider.every_failed_url { |url| puts url }
+    end
+* Search HTML and XML pages:
+    Spidr.site('http://company.withablog.com/') do |spider|
+      spider.every_page do |page|
+        puts "[-] #{page.url}"
+        page.search('//meta').each do |meta|
+	  name = (meta.attributes['name'] || meta.attributes['http-equiv'])
+	  value = meta.attributes['content']
+	  puts "    #{name} = #{value}"
+	end
+      end
+    end
+* Print out the titles from every page:
+    Spidr.site('http://www.rubypulse.com/') do |spider|
+      spider.every_page do |page|
+        puts page.title if page.html?
+      end
+    end
+* Find what kinds of web servers a host is using, by accessing the headers:
+    servers = Set[]
+    Spidr.host('generic.company.com') do |spider|
+      spider.all_headers do |headers|
+        servers << headers['server']
+      end
+    end
+* Pause the spider on a forbidden page:
+    spider = Spidr.host('overnight.startup.com') do |spider|
+      spider.every_page do |page|
+        spider.pause! if page.forbidden?
+      end
+    end
+* Skip the processing of a page:
+    Spidr.host('sketchy.content.com') do |spider|
+      spider.every_page do |page|
+        spider.skip_page! if page.not_found?
+      end
+    end
+* Skip the processing of links:
+    Spidr.host('sketchy.content.com') do |spider|
+      spider.every_url do |url|
+        if url.path.split('/').find { |dir| dir.to_i > 1000 }
+	  spider.skip_link!
+	end
+      end
+    end
+== REQUIREMENTS:
+* {nokogiri}[http://nokogiri.rubyforge.org/] >= 1.2.0
+== INSTALL:
+  $ sudo gem install spidr
 == LICENSE:
 The MIT License
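
The `:hosts` option in the README hunk above mixes a String with a Regexp. The sketch below shows one way such mixed patterns could be tested against a host name. `host_allowed?` is a hypothetical helper written for illustration; it is not spidr's API and makes no claim about how spidr implements filtering internally.

```ruby
# Hypothetical helper: returns true when +host+ matches any pattern.
# Regexps may match anywhere in the host, Procs are called with the host,
# and anything else is compared as an exact String.
def host_allowed?(host, patterns)
  patterns.any? do |pattern|
    case pattern
    when Regexp then !(host =~ pattern).nil?
    when Proc   then pattern.call(host) ? true : false
    else             pattern.to_s == host
    end
  end
end

patterns = ['company.com', /host\d\.company\.com/]

puts host_allowed?('company.com', patterns)
puts host_allowed?('host3.company.com', patterns)
puts host_allowed?('other.org', patterns)
```

Note that the Regexp in this sketch is unanchored, so it would also accept a host such as `host3.company.com.other.org`. Anchoring with `\A` and `\z` is a reasonable precaution when the pattern is meant to match the whole host name.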

data/Rakefile CHANGED

@@ -4,14 +4,24 @@ require 'rubygems'
 require 'hoe'
 require 'hoe/signing'
 require './tasks/spec.rb'
+require './tasks/yard.rb'
 require './tasks/course.rb'
 require './lib/spidr/version.rb'
-Hoe.spec('spidr') do |p|
-  p.rubyforge_name = 'spidr'
-  p.developer('Postmodern', 'postmodern.mod3@gmail.com')
-  p.remote_rdoc_dir = 'docs'
-  p.extra_deps = ['nokogiri']
+Hoe.spec('spidr') do
+  self.rubyforge_name = 'spidr'
+  self.developer('Postmodern', 'postmodern.mod3@gmail.com')
+  self.remote_rdoc_dir = 'docs'
+  self.extra_deps = [
+    ['nokogiri', '>=1.2.0']
+  ]
+  self.extra_dev_deps = [
+    ['rspec', '>=1.2.8'],
+    ['yard', '>=0.2.3.5']
+  ]
+  self.spec_extras = {:has_rdoc => 'yard'}
 end
 # vim: syntax=Ruby
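
The Rakefile hunk replaces the block parameter `p` with explicit `self.` receivers. That suggests the block is now evaluated in the context of the spec object, for example through `instance_eval`. This is an inference from the diff rather than a statement about Hoe's internals. The sketch below contrasts the two block styles using a hypothetical `SpecSketch` class.

```ruby
# Hypothetical stand-in for a spec object; this is not Hoe's real class.
class SpecSketch
  attr_accessor :name, :extra_deps

  # Old style: the object is yielded to the block as a parameter.
  def self.with_param(name)
    spec = new
    spec.name = name
    yield spec
    spec
  end

  # New style: the block runs with self set to the object.
  def self.with_self(name, &block)
    spec = new
    spec.name = name
    spec.instance_eval(&block)
    spec
  end
end

old_style = SpecSketch.with_param('spidr') do |p|
  p.extra_deps = ['nokogiri']
end

new_style = SpecSketch.with_self('spidr') do
  # The explicit receiver matters: a bare `extra_deps = ...` would only
  # create a local variable inside the block.
  self.extra_deps = [['nokogiri', '>=1.2.0']]
end

puts old_style.extra_deps.inspect
puts new_style.extra_deps.inspect
```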

data/lib/spidr/actions.rb ADDED

@@ -0,0 +1,2 @@
+require 'spidr/actions/exceptions'
+require 'spidr/actions/actions'

data/lib/spidr/actions/actions.rb ADDED

@@ -0,0 +1,79 @@
+require 'spidr/actions/exceptions/paused'
+require 'spidr/actions/exceptions/skip_link'
+require 'spidr/actions/exceptions/skip_page'
+module Spidr
+  module Actions
+    def initialize(options={})
+      @paused = false
+      super(options)
+    end
+    #
+    # Continue spidering.
+    #
+    # @yield [page]
+    #   If a block is given, it will be passed every page visited.
+    #
+    # @yieldparam [Page] page
+    #   The page to be visited.
+    #
+    def continue!(&block)
+      @paused = false
+      return run(&block)
+    end
+    #
+    # Sets the pause state of the agent.
+    #
+    # @param [Boolean] state
+    #   The new pause state of the agent.
+    #
+    def pause=(state)
+      @paused = state
+    end
+    #
+    # Pauses the agent, causing spidering to temporarily stop.
+    #
+    # @raise [Paused]
+    #   Indicates to the agent, that it should pause spidering.
+    #
+    def pause!
+      @paused = true
+      raise(Paused)
+    end
+    #
+    # Determines whether the agent is paused.
+    #
+    # @return [Boolean]
+    #   Specifies whether the agent is paused.
+    #
+    def paused?
+      @paused == true
+    end
+    #
+    # Causes the agent to skip the link being enqueued.
+    #
+    # @raise [SkipLink]
+    #   Indicates to the agent, that the current link should be skipped,
+    #   and not enqueued or visited.
+    #
+    def skip_link!
+      raise(SkipLink)
+    end
+    #
+    # Causes the agent to skip the page being visited.
+    #
+    # @raise [SkipPage]
+    #   Indicates to the agent, that the current page should be skipped.
+    #
+    def skip_page!
+      raise(SkipPage)
+    end
+  end
+end
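
The action methods above signal the agent by raising exceptions instead of returning values. The following is a minimal, self-contained sketch of that control-flow pattern. `TinyAgent` and the top-level exception classes are hypothetical and only loosely follow the names in the diff; no HTTP requests are made.

```ruby
# Hypothetical miniature of the pattern; not spidr's actual classes.
class Action < RuntimeError; end
class Paused < Action; end
class SkipPage < Action; end

class TinyAgent
  attr_reader :queue, :visited

  def initialize(queue)
    @queue = queue.dup
    @visited = []
    @paused = false
  end

  def paused?
    @paused == true
  end

  # Record the paused state, then unwind out of the callback.
  def pause!
    @paused = true
    raise(Paused)
  end

  def skip_page!
    raise(SkipPage)
  end

  def continue!(&block)
    @paused = false
    run(&block)
  end

  # Process items until the queue is empty or a callback pauses the agent.
  def run
    until @queue.empty? || paused?
      item = @queue.shift
      begin
        yield item if block_given?
        @visited << item
      rescue Paused
        # Stop immediately; the item being processed is not re-queued here.
        return self
      rescue Action
        # Any other action (such as SkipPage) moves on to the next item.
      end
    end
    self
  end
end

agent = TinyAgent.new(%w[a b c d])
agent.run do |item|
  agent.skip_page! if item == 'b'
  agent.pause! if item == 'c'
end
visited_before = agent.visited.dup

agent.continue!
puts visited_before.inspect
puts agent.visited.inspect
```

Rescuing the more specific `Paused` before the base `Action` class is what lets a pause stop the loop while other actions merely skip one item.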

data/lib/spidr/actions/exceptions.rb ADDED

@@ -0,0 +1,4 @@
+require 'spidr/actions/exceptions/action'
+require 'spidr/actions/exceptions/paused'
+require 'spidr/actions/exceptions/skip_link'
+require 'spidr/actions/exceptions/skip_page'

data/lib/spidr/actions/exceptions/action.rb ADDED

@@ -0,0 +1,6 @@
+module Spidr
+  module Actions
+    class Action < RuntimeError
+    end
+  end
+end

data/lib/spidr/actions/exceptions/paused.rb ADDED

@@ -0,0 +1,8 @@
+require 'spidr/actions/exceptions/action'
+module Spidr
+  module Actions
+    class Paused < Action
+    end
+  end
+end

data/lib/spidr/actions/exceptions/skip_link.rb ADDED

@@ -0,0 +1,8 @@
+require 'spidr/actions/exceptions/action'
+module Spidr
+  module Actions
+    class SkipLink < Action
+    end
+  end
+end

data/lib/spidr/actions/exceptions/skip_page.rb ADDED

@@ -0,0 +1,8 @@
+require 'spidr/actions/exceptions/action'
+module Spidr
+  module Actions
+    class SkipPage < Action
+    end
+  end
+end

data/lib/spidr/agent.rb CHANGED

@@ -1,12 +1,19 @@
-require 'spidr/rules'
+require 'spidr/filters'
+require 'spidr/events'
+require 'spidr/actions'
 require 'spidr/page'
 require 'spidr/spidr'
 require 'net/http'
+require 'set'
 module Spidr
   class Agent
+    include Filters
+    include Events
+    include Actions
     # Proxy to use
     attr_accessor :proxy
@@ -19,9 +26,6 @@ module Spidr
     # Delay in between fetching pages
     attr_accessor :delay
-    # List of acceptable URL schemes to follow
-    attr_reader :schemes
     # History containing visited URLs
     attr_reader :history
@@ -32,105 +36,81 @@ module Spidr
     attr_reader :queue
     #
-    # Creates a new Agent object with the given _options_ and _block_.
-    # If a _block_ is given, it will be passed the newly created
-    # Agent object.
-    #
-    # _options_ may contain the following keys:
-    # <tt>:proxy</tt>:: The proxy to use while spidering.
-    # <tt>:user_agent</tt>:: The User-Agent string to send.
-    # <tt>:referer</tt>:: The referer URL to send.
-    # <tt>:delay</tt>:: Duration in seconds to pause between spidering each
-    #                   link. Defaults to 0.
-    # <tt>:schemes</tt>:: The list of acceptable URL schemes to follow.
-    #                     Defaults to +http+ and +https+. +https+ URL
-    #                     schemes will be ignored if <tt>net/http</tt>
-    #                     cannot be loaded.
-    # <tt>:host</tt>:: The host-name to visit.
-    # <tt>:hosts</tt>:: An +Array+ of host patterns to visit.
-    # <tt>:ignore_hosts</tt>:: An +Array+ of host patterns to not visit.
-    # <tt>:ports</tt>:: An +Array+ of port patterns to visit.
-    # <tt>:ignore_ports</tt>:: An +Array+ of port patterns to not visit.
-    # <tt>:links</tt>:: An +Array+ of link patterns to visit.
-    # <tt>:ignore_links</tt>:: An +Array+ of link patterns to not visit.
-    # <tt>:exts</tt>:: An +Array+ of File extension patterns to visit.
-    # <tt>:ignore_exts</tt>:: An +Array+ of File extension patterns to not
-    #                         visit.
-    # <tt>:queue</tt>:: An initial queue of URLs to visit.
-    # <tt>:history</tt>:: An initial list of visited URLs.
+    # Creates a new Agent object.
+    #
+    # @param [Hash] options
+    #   Additional options
+    #
+    # @option options [Hash] :proxy (Spidr.proxy)
+    #   The proxy information to use.
+    #
+    # @option :proxy [String] :host
+    #   The host the proxy is running on.
+    #
+    # @option :proxy [Integer] :port
+    #   The port the proxy is running on.
+    #
+    # @option :proxy [String] :user
+    #   The user to authenticate as with the proxy.
+    #
+    # @option :proxy [String] :password
+    #   The password to authenticate with.
+    #
+    # @option options [String] :user_agent (Spidr.user_agent)
+    #   The User-Agent string to send with each requests.
+    #
+    # @option options [String] :referer
+    #   The Referer URL to send with each request.
+    #
+    # @option options [Integer] :delay (0)
+    #   The number of seconds to pause between each request.
+    #
+    # @option options [Set, Array] :queue
+    #   The initial queue of URLs to visit.
+    #
+    # @option options [Set, Array] :history
+    #   The initial list of visited URLs.
+    #
+    # @yield [agent]
+    #   If a block is given, it will be passed the newly created agent
+    #   for further configuration.
+    #
+    # @yieldparam [Agent] agent
+    #   The newly created agent.
     #
     def initialize(options={},&block)
       @proxy = (options[:proxy] || Spidr.proxy)
       @user_agent = (options[:user_agent] || Spidr.user_agent)
       @referer = options[:referer]
-      @schemes = []
-      if options[:schemes]
-        @schemes += options[:schemes]
-      else
-        @schemes << 'http'
-        begin
-          require 'net/https'
-          @schemes << 'https'
-        rescue Gem::LoadError => e
-          raise(e)
-        rescue ::LoadError
-          STDERR.puts "Warning: cannot load 'net/https', https support disabled"
-        end
-      end
-      @host_rules = Rules.new(
-        :accept => options[:hosts],
-        :reject => options[:ignore_hosts]
-      )
-      @port_rules = Rules.new(
-        :accept => options[:ports],
-        :reject => options[:ignore_ports]
-      )
-      @link_rules = Rules.new(
-        :accept => options[:links],
-        :reject => options[:ignore_links]
-      )
-      @ext_rules = Rules.new(
-        :accept => options[:exts],
-        :reject => options[:ignore_exts]
-      )
-      @every_url_blocks = []
-      @every_failed_url_blocks = []
-      @urls_like_blocks = Hash.new { |hash,key| hash[key] = [] }
-      @every_page_blocks = []
+      @running = false
       @delay = (options[:delay] || 0)
-      @history = []
-      @failures = []
+      @history = Set[]
+      @failures = Set[]
       @queue = []
-      @paused = true
-      if options[:host]
-        visit_hosts_like(options[:host])
-      end
+      @sessions = {}
-      if options[:queue]
-        self.queue = options[:queue]
-      end
-      if options[:history]
-        self.history = options[:history]
-      end
+      super(options)
       block.call(self) if block
     end
     #
-    # Creates a new Agent object with the given _options_ and will begin
-    # spidering at the specified _url_. If a _block_ is given it will be
-    # passed the newly created Agent object, before the agent begins
-    # spidering.
+    # Creates a new agent and begin spidering at the given URL.
+    #
+    # @param [URI::HTTP, String] url
+    #   The URL to start spidering at.
+    #
+    # @param [Hash] options
+    #   Additional options. See {Agent#initialize}.
+    #
+    # @yield [agent]
+    #   If a block is given, it will be passed the newly created agent
+    #   before it begins spidering.
+    #
+    # @yieldparam [Agent] agent
+    #   The newly created agent.
     #
     def self.start_at(url,options={},&block)
       self.new(options) do |spider|
@@ -141,10 +121,20 @@ module Spidr
     end
     #
-    # Creates a new Agent object with the given _options_ and will begin
-    # spidering the specified host _name_. If a _block_ is given it will be
-    # passed the newly created Agent object, before the agent begins
-    # spidering.
+    # Creates a new agent and spiders the given host.
+    #
+    # @param [String]
+    #   The host-name to spider.
+    #
+    # @param [Hash] options
+    #   Additional options. See {Agent#initialize}.
+    #
+    # @yield [agent]
+    #   If a block is given, it will be passed the newly created agent
+    #   before it begins spidering.
+    #
+    # @yieldparam [Agent] agent
+    #   The newly created agent.
     #
     def self.host(name,options={},&block)
       self.new(options.merge(:host => name)) do |spider|
@@ -155,10 +145,20 @@ module Spidr
     end
     #
-    # Creates a new Agent object with the given _options_ and will begin
-    # spidering the host of the specified _url_. If a _block_ is given it
-    # will be passed the newly created Agent object, before the agent
-    # begins spidering.
+    # Creates a new agent and spiders the web-site located at the given URL.
+    #
+    # @param [URI::HTTP, String] url
+    #   The web-site to spider.
+    #
+    # @param [Hash] options
+    #   Additional options. See {Agent#initialize}.
+    #
+    # @yield [agent]
+    #   If a block is given, it will be passed the newly created agent
+    #   before it begins spidering.
+    #
+    # @yieldparam [Agent] agent
+    #   The newly created agent.
     #
     def self.site(url,options={},&block)
       url = URI(url.to_s)
@@ -171,348 +171,280 @@ module Spidr
     end
     #
-    # Returns the +Array+ of host patterns to visit.
-    #
-    def visit_hosts
-      @host_rules.accept
-    end
-    #
-    # Adds the given _pattern_ to the visit_hosts. If a _block_ is given,
-    # it will be added to the visit_hosts.
+    # Clears the history of the agent.
     #
-    def visit_hosts_like(pattern=nil,&block)
-      if pattern
-        visit_hosts << pattern
-      elsif block
-        visit_hosts << block
-      end
+    def clear
+      @queue.clear
+      @history.clear
+      @failures.clear
       return self
     end
     #
-    # Returns the +Array+ of URL host patterns to not visit.
+    # Start spidering at a given URL.
     #
-    def ignore_hosts
-      @host_rules.reject
-    end
+    # @param [URI::HTTP, String] url
+    #   The URL to start spidering at.
     #
-    # Adds the given _pattern_ to the ignore_hosts. If a _block_ is given,
-    # it will be added to the ignore_hosts.
+    # @yield [page]
+    #   If a block is given, it will be passed every page visited.
     #
-    def ignore_hosts_like(pattern=nil,&block)
-      if pattern
-        ignore_hosts << pattern
-      elsif block
-        ignore_hosts << block
-      end
+    # @yieldparam [Page] page
+    #   A page which has been visited.
+    #
+    def start_at(url,&block)
+      enqueue(url)
-      return self
+      return run(&block)
     end
     #
-    # Returns the +Array+ of URL port patterns to visit.
+    # Start spidering until the queue becomes empty or the agent is
+    # paused.
     #
-    def visit_ports
-      @port_rules.accept
-    end
+    # @yield [page]
+    #   If a block is given, it will be passed every page visited.
     #
-    # Adds the given _pattern_ to the visit_ports. If a _block_ is given,
-    # it will be added to the visit_ports.
+    # @yieldparam [Page] page
+    #   A page which has been visited.
     #
-    def visit_ports_like(pattern=nil,&block)
-      if pattern
-        visit_ports << pattern
-      elsif block
-        visit_ports << block
-      end
+    def run(&block)
+      @running = true
-      return self
-    end
+      until (@queue.empty? || paused?)
+        begin
+          visit_page(dequeue,&block)
+        rescue Actions::Paused
+          return self
+        rescue Actions::Action
+        end
+      end
-    #
-    # Returns the +Array+ of URL port patterns to not visit.
-    #
-    def ignore_ports
-      @port_rules.reject
-    end
+      @running = false
-    #
-    # Adds the given _pattern_ to the ignore_hosts. If a _block_ is given,
-    # it will be added to the ignore_hosts.
-    #
-    def ignore_ports_like(pattern=nil,&block)
-      if pattern
-        ignore_ports << pattern
-      elsif block
-        ignore_ports << block
+      @sessions.each_value do |sess|
+        begin
+          sess.finish
+        rescue IOError
+          nil
+        end
       end
+      @sessions.clear
       return self
     end
     #
-    # Returns the +Array+ of link patterns to visit.
+    # Determines if the agent is running.
     #
-    def visit_links
-      @link_rules.accept
+    # @return [Boolean]
+    #   Specifies whether the agent is running or stopped.
+    #
+    def running?
+      @running == true
     end
     #
-    # Adds the given _pattern_ to the visit_links. If a _block_ is given,
-    # it will be added to the visit_links.
+    # Sets the history of URLs that were previously visited.
     #
-    def visit_links_like(pattern=nil,&block)
-      if pattern
-        visit_links << pattern
-      elsif block
-        visit_links << block
+    # @param [#each] new_history
+    #   A list of URLs to populate the history with.
+    #
+    # @return [Set<URI::HTTP>]
+    #   The history of the agent.
+    #
+    # @example
+    #   agent.history = ['http://tenderlovemaking.com/2009/05/06/ann-nokogiri-130rc1-has-been-released/']
+    #
+    def history=(new_history)
+      @history.clear
+      new_history.each do |url|
+        @history << unless url.kind_of?(URI)
+                      URI(url.to_s)
+                    else
+                      url
+                    end
       end
-      return self
+      return @history
     end
+    alias visited_urls history
+    #
+    # Specifies the links which have been visited.
     #
-    # Returns the +Array+ of link patterns to not visit.
+    # @return [Array<String>]
+    #   The links which have been visited.
     #
-    def ignore_links
-      @link_rules.reject
+    def visited_links
+      @history.map { |url| url.to_s }
     end
     #
-    # Adds the given _pattern_ to the ignore_links. If a _block_ is given,
-    # it will be added to the ignore_links.
+    # Specifies all hosts that were visited.
     #
-    def ignore_links_like(pattern=nil,&block)
-      if pattern
-        ignore_links << pattern
-      elsif block
-        ignore_links << block
-      end
-      return self
+    # @return [Array<String>]
+    #   The hosts which have been visited.
+    #
+    def visited_hosts
+      visited_urls.map { |uri| uri.host }.uniq
     end
     #
-    # Returns the +Array+ of URL extension patterns to visit.
+    # Determines whether a URL was visited or not.
     #
-    def visit_exts
-      @ext_rules.accept
-    end
+    # @param [URI::HTTP, String] url
+    #   The URL to search for.
     #
-    # Adds the given _pattern_ to the visit_exts. If a _block_ is given,
-    # it will be added to the visit_exts.
+    # @return [Boolean]
+    #   Specifies whether a URL was visited.
     #
-    def visit_exts_like(pattern=nil,&block)
-      if pattern
-        visit_exts << pattern
-      elsif block
-        visit_exts << block
-      end
+    def visited?(url)
+      url = URI(url.to_s) unless url.kind_of?(URI)
-      return self
+      return @history.include?(url)
     end
     #
-    # Returns the +Array+ of URL extension patterns to not visit.
+    # Sets the list of failed URLs.
     #
-    def ignore_exts
-      @ext_rules.reject
-    end
+    # @param [#each]
+    #   The new list of failed URLs.
+    #
+    # @return [Array<URI::HTTP>]
+    #   The list of failed URLs.
     #
-    # Adds the given _pattern_ to the ignore_exts. If a _block_ is given,
-    # it will be added to the ignore_exts.
+    # @example
+    #   agent.failures = ['http://localhost/']
     #
-    def ignore_exts_like(pattern=nil,&block)
-      if pattern
-        ignore_exts << pattern
-      elsif block
-        ignore_exts << block
+    def failures=(new_failures)
+      @failures.clear
+      new_failures.each do |url|
+        @failures << unless url.kind_of?(URI)
+                    URI(url.to_s)
+                  else
+                    url
+                  end
       end
-      return self
+      return @failures
     end
     #
-    # For every URL that the agent visits it will be passed to the
-    # specified _block_.
+    # Determines whether a given URL could not be visited.
     #
-    def every_url(&block)
-      @every_url_blocks << block
-      return self
-    end
+    # @param [URI::HTTP, String] url
+    #   The URL to check for failures.
     #
-    # For every URL that the agent is unable to visit, it will be passed
-    # to the specified _block_.
+    # @return [Boolean]
+    #   Specifies whether the given URL was unable to be visited.
     #
-    def every_failed_url(&block)
-      @every_failed_url_blocks << block
-      return self
-    end
+    def failed?(url)
+      url = URI(url.to_s) unless url.kind_of?(URI)
-    #
-    # For every URL that the agent visits and matches the specified
-    # _pattern_, it will be passed to the specified _block_.
-    #
-    def urls_like(pattern,&block)
-      @urls_like_blocks[pattern] << block
-      return self
+      return @failures.include?(url)
     end
-    #
-    # For every Page that the agent visits, pass the page to the
-    # specified _block_.
-    #
-    def every_page(&block)
-      @every_page_blocks << block
-      return self
-    end
+    alias pending_urls queue
     #
-    # For every Page that the agent visits, pass the headers to the given
-    # _block_.
+    # Sets the queue of URLs to visit.
     #
-    def all_headers(&block)
-      every_page { |page| block.call(page.headers) }
-    end
-    #
-    # Clears the history of the agent.
+    # @param [#each]
+    #   The new list of URLs to visit.
     #
-    def clear
-      @queue.clear
-      @history.clear
-      @failures.clear
-      return self
-    end
+    # @return [Array<URI::HTTP>]
+    #   The list of URLs to visit.
     #
-    # Start spidering at the specified _url_.
+    # @example
+    #   agent.queue = ['http://www.vimeo.com/', 'http://www.reddit.com/']
     #
-    def start_at(url)
-      enqueue(url)
-      return continue!
-    end
+    def queue=(new_queue)
+      @queue.clear
-    #
-    # Start spidering until the queue becomes empty or the agent is
-    # paused.
-    #
-    def run
-      until (@queue.empty? || @paused == true)
-        visit_page(dequeue)
+      new_queue.each do |url|
+        @queue << unless url.kind_of?(URI)
+                    URI(url.to_s)
+                  else
+                    url
+                  end
       end
-      return self
+      return @queue
     end
     #
-    # Continue spidering.
+    # Determines whether a given URL has been enqueued.
     #
-    def continue!
-      @paused = false
-      return run
-    end
+    # @param [URI::HTTP] url
+    #   The URL to search for in the queue.
     #
-    # Returns +true+ if the agent is still spidering, returns +false+
-    # otherwise.
+    # @return [Boolean]
+    #   Specifies whether the given URL has been queued for visiting.
     #
-    def running?
-      @paused == false
+    def queued?(url)
+      @queue.include?(url)
     end
     #
-    # Returns +true+ if the agent is paused, returns +false+ otherwise.
+    # Enqueues a given URL for visiting, only if it passes all of the
+    # agent's rules for visiting a given URL.
     #
-    def paused?
-      @paused == true
-    end
+    # @param [URI::HTTP, String] url
+    #   The URL to enqueue for visiting.
     #
-    # Pauses the agent, causing spidering to temporarily stop.
+    # @return [Boolean]
+    #   Specifies whether the URL was enqueued, or ignored.
     #
-    def pause!
-      @paused = true
-      return self
-    end
+    def enqueue(url)
+      link = url.to_s
+      url = URI(link) unless url.kind_of?(URI)
-    #
-    # Sets the list of acceptable URL schemes to follow to the
-    # _new_schemes_.
-    #
-    #   agent.schemes = ['http']
-    #
-    def schemes=(new_schemes)
-      @schemes = new_schemes.map { |scheme| scheme.to_s }
-    end
+      if (!(queued?(url)) && visit?(url))
+        begin
+          @every_url_blocks.each { |block| block.call(url) }
-    #
-    # Sets the history of links that were previously visited to the
-    # specified _new_history_.
-    #
-    #   agent.history = ['http://tenderlovemaking.com/2009/05/06/ann-nokogiri-130rc1-has-been-released/']
-    #
-    def history=(new_history)
-      @history = new_history.map do |url|
-        unless url.kind_of?(URI)
-          URI(url.to_s)
-        else
-          url
+          @urls_like_blocks.each do |pattern,blocks|
+            if ((pattern.kind_of?(Regexp) && link =~ pattern) || pattern == link || pattern == url)
+              blocks.each { |url_block| url_block.call(url) }
+            end
+          end
+        rescue Actions::Paused => action
+          raise(action)
+        rescue Actions::SkipLink
+          return false
+        rescue Actions::Action
         end
-      end
-    end
-    alias visited_urls history
+        @queue << url
+        return true
+      end
-    #
-    # Returns the +Array+ of visited URLs.
-    #
-    def visited_links
-      @history.map { |uri| uri.to_s }
+      return false
     end
     #
-    # Return the +Array+ of hosts that were visited.
+    # Requests and creates a new Page object from a given URL.
     #
-    def visited_hosts
-      @history.map { |uri| uri.host }.uniq
-    end
+    # @param [URI::HTTP] url
+    #   The URL to request.
     #
-    # Returns +true+ if the specified _url_ was visited, returns +false+
-    # otherwise.
+    # @yield [page]
+    #   If a block is given, it will be passed the page that represents the
+    #   response.
     #
-    def visited?(url)
-      url = URI(url) unless url.kind_of?(URI)
-      return @history.include?(url)
-    end
+    # @yieldparam [Page] page
+    #   The page for the response.
     #
-    # Returns +true+ if the specified _url_ was unable to be visited,
-    # returns +false+ otherwise.
-    #
-    def failed?(url)
-      url = URI(url) unless url.kind_of?(URI)
-      return @failures.include?(url)
-    end
-    alias pending_urls queue
-    #
-    # Creates a new Page object from the specified _url_. If a _block_ is
-    # given, it will be passed the newly created Page object.
+    # @return [Page, nil]
+    #   The page for the response, or +nil+ if the request failed.
     #
     def get_page(url,&block)
+      url = URI(url.to_s) unless url.kind_of?(URI)
       host = url.host
       port = url.port
@@ -522,15 +454,12 @@ module Spidr
         path = '/'
       end
-      proxy_host = @proxy[:host]
-      proxy_port = @proxy[:port]
-      proxy_user = @proxy[:user]
-      proxy_password = @proxy[:password]
+      # append the URL query to the path
+      path += "?#{url.query}" if url.query
       begin
-        Net::HTTP::Proxy(proxy_host,proxy_port,proxy_user,proxy_password).start(host,port) do |sess|
+        get_session(url.scheme,host,port) do |sess|
           headers = {}
           headers['User-Agent'] = @user_agent if @user_agent
           headers['Referer'] = @referer if @referer
@@ -539,157 +468,169 @@ module Spidr
           block.call(new_page) if block
           return new_page
         end
-      rescue SystemCallError, Net::HTTPBadResponse
+      rescue SystemCallError, Timeout::Error, Net::HTTPBadResponse, IOError
         failed(url)
+        kill_session(url.scheme,host,port)
         return nil
       end
     end
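Note on the hunk above: the new get_page coerces its argument to a URI and appends the query string to the request path. A minimal sketch of that path handling, assuming the elided condition around `path = '/'` is an empty-path check; `request_path` is a hypothetical helper for illustration, not a Spidr method.

```ruby
require 'uri'

# Hypothetical helper (not part of Spidr): builds the path that would be
# sent in the HTTP request line for a given URL.
def request_path(url)
  # Accept Strings as well as URI objects, as the new get_page does.
  url = URI(url.to_s) unless url.kind_of?(URI)

  path = url.path
  # Assumption: an empty path falls back to the root path.
  path = '/' if path.empty?

  # Append the URL query to the path, mirroring the added lines.
  path += "?#{url.query}" if url.query
  path
end
```

Without the query append, two URLs that differ only in their query string would produce the same request path.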
     #
-    # Returns the agent represented as a Hash containing the agents
-    # +history+ and +queue+ information.
+    # Visits a given URL, and enqueus the links recovered from the URL
+    # to be visited later.
     #
-    def to_hash
-      {:history => @history, :queue => @queue}
-    end
+    # @param [URI::HTTP, String] url
+    #   The URL to visit.
     #
-    # Sets the queue of links to visit to the specified _new_queue_.
+    # @yield [page]
+    #   If a block is given, it will be passed the page which was visited.
     #
-    #   agent.queue = ['http://www.vimeo.com/', 'http://www.reddit.com/']
+    # @yieldparam [Page] page
+    #   The page which was visited.
     #
-    def queue=(new_queue)
-      @queue = new_queue.map do |url|
-        unless url.kind_of?(URI)
-          URI(url.to_s)
-        else
-          url
+    # @return [Page, nil]
+    #   The page that was visited. If +nil+ is returned, either the request
+    #   for the page failed, or the page was skipped.
+    #
+    def visit_page(url,&block)
+      url = URI(url.to_s) unless url.kind_of?(URI)
+      get_page(url) do |page|
+        @history << page.url
+        begin
+          @every_page_blocks.each { |page_block| page_block.call(page) }
+          block.call(page) if block
+        rescue Actions::Paused => action
+          raise(action)
+        rescue Actions::SkipPage
+          return nil
+        rescue Actions::Action
         end
+        page.urls.each { |next_url| enqueue(next_url) }
       end
     end
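Note on visit_page above: callbacks steer the agent by raising action exceptions, and the order of the rescue clauses decides which ones escape. A minimal sketch of that control flow, assuming Paused and SkipPage both subclass a common Action base (the rescue order in the diff suggests this, but the class bodies are not shown here). `DemoActions` and `run_callbacks` are hypothetical names, not Spidr's.

```ruby
# Hypothetical stand-ins for the exception hierarchy (assumed shape).
module DemoActions
  class Action < RuntimeError; end
  class Paused < Action; end
  class SkipPage < Action; end
end

# Runs every callback against the page and reports what happened.
def run_callbacks(page, callbacks)
  callbacks.each { |callback| callback.call(page) }
  :completed
rescue DemoActions::Paused
  # Pausing must reach the caller, so it is re-raised unchanged.
  raise
rescue DemoActions::SkipPage
  # Skipping stops processing of this page only.
  :skipped
rescue DemoActions::Action
  # Any other action is swallowed and processing continues.
  :completed
end
```

The specific rescues have to come before the generic `Action` rescue. Ruby tries the clauses in order, so a leading `rescue DemoActions::Action` would also catch Paused and SkipPage.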
     #
-    # Returns +true+ if the specified _url_ is queued for visiting, returns
-    # +false+ otherwise.
+    # Converts the agent into a Hash.
     #
-    def queued?(url)
-      @queue.include?(url)
+    # @return [Hash]
+    #   The agent represented as a Hash containing the +history+ and
+    #   the +queue+ of the agent.
+    #
+    def to_hash
+      {:history => @history, :queue => @queue}
     end
+    protected
     #
-    # Enqueues the specified _url_ for visiting, only if it passes all the
-    # agent's rules for visiting a given URL. Returns +true+ if the _url_
-    # was successfully enqueued, returns +false+ otherwise.
+    # Provides an active HTTP session for the given scheme, host
+    # and port.
     #
-    def enqueue(url)
-      link = url.to_s
-      url = URI(link)
+    # @param [String] scheme
+    #   The scheme of the URL, which will be requested later.
+    #
+    # @param [String] host
+    #   The host that the session is needed with.
+    #
+    # @param [Integer] port
+    #   The port that the session is needed for.
+    #
+    # @yield [session]
+    #   If a block is given, it will be passed the active HTTP session.
+    #
+    # @yieldparam [Net::HTTP] session
+    #   The active HTTP session object.
+    #
+    def get_session(scheme,host,port,&block)
+      key = [scheme,host,port]
-      if (!(queued?(url)) && visit?(url))
-        @every_url_blocks.each { |block| block.call(url) }
+      unless @sessions[key]
+        session = Net::HTTP::Proxy(
+          @proxy[:host],
+          @proxy[:port],
+          @proxy[:user],
+          @proxy[:password]
+        ).new(host,port)
-        @urls_like_blocks.each do |pattern,blocks|
-          if ((pattern.kind_of?(Regexp) && link =~ pattern) || pattern == link || pattern == url)
-            blocks.each { |url_block| url_block.call(url) }
-          end
+        if scheme == 'https'
+          session.use_ssl = true
+          session.verify_mode = OpenSSL::SSL::VERIFY_NONE
         end
-        @queue << url
-        return true
+        @sessions[key] = session
       end
-      return false
+      session = @sessions[key]
+      block.call(session) if block
+      return session
     end
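Note on get_session above: sessions are cached in a Hash keyed by the [scheme, host, port] triple, so repeated requests to the same origin reuse one Net::HTTP object. A minimal sketch of that caching idea follows. `SessionCache` is a hypothetical class for illustration, not Spidr's implementation. It omits proxy settings and, unlike the diff, leaves certificate verification at the Net::HTTP default.

```ruby
require 'net/http'

# Hypothetical stand-in for the agent's session cache (not Spidr's class).
class SessionCache
  def initialize
    @sessions = {}
  end

  # Returns the cached session for the triple, creating it on first use.
  # Net::HTTP.new only builds the object; no connection is opened here.
  def fetch(scheme, host, port)
    key = [scheme, host, port]
    @sessions[key] ||= Net::HTTP.new(host, port).tap do |session|
      session.use_ssl = true if scheme == 'https'
    end
  end

  # Drops the session for the triple, closing it if it was started.
  def kill(scheme, host, port)
    session = @sessions.delete([scheme, host, port])
    session.finish if session && session.started?
    nil
  end

  # Number of sessions currently cached.
  def size
    @sessions.size
  end
end
```

Keying on the scheme as well as the host and port keeps an `http` and an `https` session to the same host apart, even if both were to use the same port.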
-    protected
     #
-    # Dequeues a URL that will later be visited.
+    # Destroys an HTTP session for the given scheme, host and port.
     #
-    def dequeue
-      @queue.shift
-    end
+    # @param [String] scheme
+    #   The scheme of the URL, which was requested through the session.
     #
-    # Returns +true+ if the specified _url_ should be visited, based on
-    # it's scheme, returns +false+ otherwise.
+    # @param [String] host
+    #   The host that the session was connected with.
     #
-    def visit_scheme?(url)
-      if url.scheme
-        return @schemes.include?(url.scheme)
-      else
-        return true
+    # @param [Integer] port
+    #   The port that the session was connected to.
+    #
+    def kill_session(scheme,host,port,&block)
+      key = [scheme,host,port]
+      sess = @sessions[key]
+      begin
+        sess.finish
+      rescue IOError
+        nil
       end
-    end
-    #
-    # Returns +true+ if the specified _url_ should be visited, based on
-    # the host of the _url_, returns +false+ otherwise.
-    #
-    def visit_host?(url)
-      @host_rules.accept?(url.host)
+      @sessions.delete(key)
+      block.call if block
+      return nil
     end
     #
-    # Returns +true+ if the specified _url_ should be visited, based on
-    # the port of the _url_, returns +false+ otherwise.
-    #
-    def visit_port?(url)
-      @port_rules.accept?(url.port)
-    end
+    # Dequeues a URL that will later be visited.
     #
-    # Returns +true+ if the specified _url_ should be visited, based on
-    # the pattern of the _url_, returns +false+ otherwise.
+    # @return [URI::HTTP]
+    #   The URL that was at the front of the queue.
     #
-    def visit_link?(url)
-      @link_rules.accept?(url.to_s)
+    def dequeue
+      @queue.shift
     end
     #
-    # Returns +true+ if the specified _url_ should be visited, based on
-    # the file extension of the _url_, returns +false+ otherwise.
+    # Determines if a given URL should be visited.
     #
-    def visit_ext?(url)
-      @ext_rules.accept?(File.extname(url.path)[1..-1])
-    end
+    # @param [URI::HTTP] url
+    #   The URL in question.
     #
-    # Returns +true+ if the specified URL should be visited, returns
-    # +false+ otherwise.
+    # @return [Boolean]
+    #   Specifies whether the given URL should be visited.
     #
     def visit?(url)
       (!(visited?(url)) &&
-       visit_scheme?(url) &&
-       visit_host?(url) &&
-       visit_port?(url) &&
-       visit_link?(url) &&
-       visit_ext?(url))
+       visit_scheme?(url.scheme) &&
+       visit_host?(url.host) &&
+       visit_port?(url.port) &&
+       visit_link?(url.to_s) &&
+       visit_ext?(url.path))
     end
     #
-    # Visits the spedified _url_ and enqueus it's links for visiting. If a
-    # _block_ is given, it will be passed a newly created Page object
-    # for the specified _url_.
-    #
-    def visit_page(url,&block)
-      get_page(url) do |page|
-        @history << page.url
-        page.urls.each { |next_url| enqueue(next_url) }
-        @every_page_blocks.each { |page_block| page_block.call(page) }
-        block.call(page) if block
-      end
-    end
+    # Adds a given URL to the failures list.
     #
-    # Adds the specified _url_ to the failures list.
+    # @param [URI::HTTP] url
+    #   The URL to add to the failures list.
     #
     def failed(url)
-      url = URI(url.to_s) unless url.kind_of?(URI)
       @every_failed_url_blocks.each { |block| block.call(url) }
       @failures << url
       return true