spidr 0.2.7 → 0.3.0 (RubyGems)

Files changed (33)

data/.rspec ADDED

@@ -0,0 +1 @@
+--colour --format documentation

data/ChangeLog.md CHANGED

@@ -1,19 +1,44 @@
+### 0.3.0 / 2011-04-14
+* Switched from Jeweler to [Ore](http://github.com/ruby-ore/ore).
+* Split all header related methods out of {Spidr::Page} and into
+  {Spidr::Headers}.
+* Split all body related methods out of {Spidr::Page} and into
+  {Spidr::Body}.
+* Split all link related methods out of {Spidr::Page} and into
+  {Spidr::Links}.
+* Added {Spidr::Headers#directory?}.
+* Added {Spidr::Headers#json?}.
+* Added {Spidr::Links#each_url}.
+* Added {Spidr::Links#each_link}.
+* Added {Spidr::Links#each_redirect}.
+* Added {Spidr::Links#each_meta_redirect}.
+* Aliased {Spidr::Headers#raw_cookie} to {Spidr::Headers#cookie}.
+* Aliased {Spidr::Body#to_s} to {Spidr::Body#body}.
+* Also check for `application/xml` in {Spidr::Headers#xml?}.
+* Catch all exceptions when merging URIs in {Spidr::Links#to_absolute}.
+* Always prepend a `/` to all FTP URI paths. Fixes a Ruby 1.8 specific
+  bug, where it expects an absolute path for all FTP URIs.
+* Refactored {URI.expand_path}.
+* Start the session in {Spidr::SessionCache#[]} to prevent multiple
+  `CONNECT` commands being sent to HTTP Proxies (thanks falaise).
 ### 0.2.7 / 2010-08-17
 * Added {Spidr::CookieJar#cookies_for_host} (thanks zapnap).
-* Renamed `Spidr::Page#cookie` to {Spidr::Page#raw_cookie}.
+* Renamed `Spidr::Page#cookie` to `Spidr::Page#raw_cookie`.
 * Rescue `URI::InvalidComponentError` exceptions in
-  {Spidr::Page#to_absolute} (thanks zapnap).
+  `Spidr::Page#to_absolute` (thanks zapnap).
 ### 0.2.6 / 2010-07-05
-* Fixed a bug in {Spidr::Page#meta_redirect}, by calling
+* Fixed a bug in `Spidr::Page#meta_redirect`, by calling
   `Nokogiri::XML::Element#get_attribute` instead of `attr`.
 ### 0.2.5 / 2010-07-02
-* Added {Spidr::Page#meta_redirect}.
-* Added {Spidr::Page#meta_redirect?}.
+* Added `Spidr::Page#meta_redirect`.
+* Added `Spidr::Page#meta_redirect?`.
 * Manage development dependencies with Bundler.
 * Support following "old-school" meta-refresh redirects (thanks zapnap).
 * Allow {Spidr::CookieJar} to inherit cookies set by a parent domain.
@@ -26,10 +51,10 @@
 * Added {Spidr::Filters#visit_urls_like}.
 * Added {Spidr::Filters#ignore_urls}.
 * Added {Spidr::Filters#ignore_urls_like}.
-* Added {Spidr::Page#is_content_type?}.
-* Default {Spidr::Page#body} to an empty String.
-* Default {Spidr::Page#content_type} to an empty String.
-* Default {Spidr::Page#content_types} to an empty Array.
+* Added `Spidr::Page#is_content_type?`.
+* Default `Spidr::Page#body` to an empty String.
+* Default `Spidr::Page#content_type` to an empty String.
+* Default `Spidr::Page#content_types` to an empty Array.
 * Improved reliability of {Spidr::Page#is_redirect?}.
 * Improved content type detection in {Spidr::Page} to handle `Content-Type`
   headers containing charsets (thanks Josh Lindsey).
@@ -47,10 +72,10 @@
 * Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
 * Integrated the new WSOC into the specs.
 * Removed the built-in Web Spider Obstacle Course.
-* Added {Spidr::Page#content_types}.
-* Added {Spidr::Page#cookie}.
-* Added {Spidr::Page#cookies}.
-* Added {Spidr::Page#cookie_params}.
+* Added `Spidr::Page#content_types`.
+* Added `Spidr::Page#cookie`.
+* Added `Spidr::Page#cookies`.
+* Added `Spidr::Page#cookie_params`.
 * Added {Spidr::Sanitizers}.
 * Added {Spidr::SessionCache}.
 * Added {Spidr::CookieJar} (thanks Nick Plante).
@@ -93,33 +118,33 @@
 ### 0.2.0 / 2009-10-10
 * Added {URI.expand_path}.
-* Added {Spidr::Page#search}.
-* Added {Spidr::Page#at}.
-* Added {Spidr::Page#title}.
+* Added `Spidr::Page#search`.
+* Added `Spidr::Page#at`.
+* Added `Spidr::Page#title`.
 * Added {Spidr::Agent#failures=}.
 * Added a HTTP session cache to {Spidr::Agent}, per suggestion of falter.
   * Added `Spidr::Agent#get_session`.
   * Added `Spidr::Agent#kill_session`.
 * Added {Spidr.proxy=}.
 * Added {Spidr.disable_proxy!}.
-* Aliased `Spidr::Page#txt?` to {Spidr::Page#plain_text?}.
-* Aliased `Spidr::Page#ok?` to {Spidr::Page#is_ok?}.
-* Aliased `Spidr::Page#redirect?` to {Spidr::Page#is_redirect?}.
-* Aliased `Spidr::Page#unauthorized?` to {Spidr::Page#is_unauthorized?}.
-* Aliased `Spidr::Page#forbidden?` to {Spidr::Page#is_forbidden?}.
-* Aliased `Spidr::Page#missing?` to {Spidr::Page#is_missing?}.
+* Aliased `Spidr::Page#txt?` to `Spidr::Page#plain_text?`.
+* Aliased `Spidr::Page#ok?` to `Spidr::Page#is_ok?`.
+* Aliased `Spidr::Page#redirect?` to `Spidr::Page#is_redirect?`.
+* Aliased `Spidr::Page#unauthorized?` to `Spidr::Page#is_unauthorized?`.
+* Aliased `Spidr::Page#forbidden?` to `Spidr::Page#is_forbidden?`.
+* Aliased `Spidr::Page#missing?` to `Spidr::Page#is_missing?`.
 * Split URL filtering code out of {Spidr::Agent} and into
   {Spidr::Filters}.
 * Split URL / Page event code out of {Spidr::Agent} and into
   {Spidr::Events}.
 * Split pause! / continue! / skip_link! / skip_page! methods out of
   {Spidr::Agent} and into {Spidr::Actions}.
-* Fixed a bug in {Spidr::Page#code}, where it was not returning an Integer.
-* Make sure {Spidr::Page#doc} returns `Nokogiri::XML::Document` objects for
+* Fixed a bug in `Spidr::Page#code`, where it was not returning an Integer.
+* Make sure `Spidr::Page#doc` returns `Nokogiri::XML::Document` objects for
   RSS/RDF/Atom pages as well.
-* Fixed the handling of the Location header in {Spidr::Page#links}
+* Fixed the handling of the Location header in `Spidr::Page#links`
   (thanks falter).
-* Fixed a bug in {Spidr::Page#to_absolute} where trailing `/` characters on
+* Fixed a bug in `Spidr::Page#to_absolute` where trailing `/` characters on
   URI paths were not being preserved (thanks falter).
 * Fixed a bug where the URI query was not being sent with the request
   in {Spidr::Agent#get_page} (thanks Damian Steer).
@@ -169,7 +194,7 @@
 * Added `Spidr::Agent#all_headers`.
 * Fixed a bug where {Spidr::Page#headers} was always `nil`.
-* {Spidr::Spidr::Agent} will now follow the Location header in HTTP 300,
+* {Spidr::Agent} will now follow the Location header in HTTP 300,
   301, 302, 303 and 307 Redirects.
 * {Spidr::Agent} will now follow iframe and frame tags.
@@ -189,8 +214,8 @@
 ### 0.1.5 / 2009-03-22
-* Catch malformed URIs in {Spidr::Page#to_absolute} and return `nil`.
-* Filter out `nil` URIs in {Spidr::Page#urls}.
+* Catch malformed URIs in `Spidr::Page#to_absolute` and return `nil`.
+* Filter out `nil` URIs in `Spidr::Page#urls`.
 ### 0.1.4 / 2009-01-15
@@ -204,9 +229,9 @@
 ### 0.1.2 / 2008-11-06
-* Fixed a bug in {Spidr::Page#to_absolute} where URLs with no path were not
+* Fixed a bug in `Spidr::Page#to_absolute` where URLs with no path were not
   receiving a default path of `/`.
-* Fixed a bug in {Spidr::Page#to_absolute} where URL paths were not being
+* Fixed a bug in `Spidr::Page#to_absolute` where URL paths were not being
   expanded, in order to remove `..` and `.` directories.
 * Fixed a bug where absolute URLs could have a blank path, thus causing
   {Spidr::Agent#get_page} to crash when it performed the HTTP request.
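
The 0.3.0 entry "Catch all exceptions when merging URIs in {Spidr::Links#to_absolute}" can be illustrated with a minimal sketch. It uses only the standard library's `URI.join`, not Spidr's own implementation; `to_absolute_sketch` is a hypothetical name, and returning `nil` on failure is an assumption taken from the older 0.1.5 entry ("Catch malformed URIs ... and return `nil`").

```ruby
require 'uri'

# Hypothetical helper, not part of Spidr: resolves a link against a base
# URL and returns nil instead of raising when the merge fails.
def to_absolute_sketch(base, link)
  URI.join(base.to_s, link.to_s)
rescue StandardError
  # "Catch all exceptions": any failure while merging yields nil, so the
  # caller can filter out unusable links.
  nil
end

good = to_absolute_sketch('http://example.com/docs/index.html', '../about')
bad  = to_absolute_sketch('http://example.com/', 'http://exa mple.com/')

puts good.inspect
puts bad.inspect
```

Rescuing broadly trades precision for robustness: a crawler usually prefers skipping one malformed link over aborting the whole page.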

data/Gemfile CHANGED

@@ -1,27 +1,13 @@
 source 'https://rubygems.org'
-group(:runtime) do
-  gem 'nokogiri',	'>= 1.3.0'
-end
+gemspec
-group(:development) do
-  gem 'rake',			'~> 0.8.7'
-  gem 'jeweler',		'~> 1.4.0', :git => 'git://github.com/technicalpickles/jeweler.git'
-end
+group :development do
+  gem 'rake',         '~> 0.8.7'
-group(:doc) do
-  case RUBY_PLATFORM
-  when 'java'
-    gem 'maruku',	'~> 0.6.0'
-  else
-    gem 'rdiscount',	'~> 1.6.3'
-  end
+  gem 'ore-tasks',    '~> 0.4'
+  gem 'rspec',        '~> 2.4'
+  gem 'wsoc',         '~> 0.1.3'
-  gem 'yard',		'~> 0.5.3'
+  gem 'kramdown',     '~> 0.12'
 end
-group(:test) do
-  gem 'wsoc',	'~> 0.1.3'
-end
-gem 'rspec',	'~> 1.3.0', :group => [:development, :test]

data/LICENSE.txt CHANGED

@@ -1,5 +1,4 @@
-Copyright (c) 2008-2010 Hal Brodigan
+Copyright (c) 2008-2011 Hal Brodigan
 Permission is hereby granted, free of charge, to any person obtaining
 a copy of this software and associated documentation files (the

data/README.md CHANGED

@@ -1,9 +1,9 @@
 # Spidr
-* [spidr.rubyforge.org](http://spidr.rubyforge.org/)
-* [github.com/postmodern/spidr](http://github.com/postmodern/spidr)
-* [github.com/postmodern/spidr/issues](http://github.com/postmodern/spidr/issues)
-* [groups.google.com/group/spidr](http://groups.google.com/group/spidr)
+* [Homepage](http://spidr.rubyforge.org/)
+* [Source](http://github.com/postmodern/spidr)
+* [Issues](http://github.com/postmodern/spidr/issues)
+* [Mailing List](http://groups.google.com/group/spidr)
 * irc.freenode.net #spidr
 ## Description
@@ -177,7 +177,7 @@ Skip the processing of links:
 ## Requirements
-* [nokogiri](http://nokogiri.rubyforge.org/) >= 1.3.0
+* [nokogiri](http://nokogiri.rubyforge.org/) ~> 1.3
 ## Install
@@ -185,5 +185,6 @@ Skip the processing of links:
 ## License
-See {file:LICENSE.txt} for license information.
+Copyright (c) 2008-2011 Hal Brodigan
+See {file:LICENSE.txt} for license information.

data/Rakefile CHANGED

@@ -1,8 +1,15 @@
 require 'rubygems'
-require 'bundler'
 begin
-  Bundler.setup(:development, :doc)
+  require 'bundler'
+rescue LoadError => e
+  STDERR.puts e.message
+  STDERR.puts "Run `gem install bundler` to install Bundler."
+  exit e.status_code
+end
+begin
+  Bundler.setup(:development)
 rescue Bundler::BundlerError => e
   STDERR.puts e.message
   STDERR.puts "Run `bundle install` to install missing gems"
@@ -10,29 +17,12 @@ rescue Bundler::BundlerError => e
 end
 require 'rake'
-require 'jeweler'
-require './lib/spidr/version.rb'
-Jeweler::Tasks.new do |gem|
-  gem.name = 'spidr'
-  gem.version = Spidr::VERSION
-  gem.license = 'MIT'
-  gem.summary = %Q{A versatile Ruby web spidering library}
-  gem.description = %Q{Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.}
-  gem.email = 'postmodern.mod3@gmail.com'
-  gem.homepage = 'http://github.com/postmodern/spidr'
-  gem.authors = ['Postmodern']
-  gem.has_rdoc = 'yard'
-end
-Jeweler::GemcutterTasks.new
-require 'spec/rake/spectask'
-Spec::Rake::SpecTask.new(:spec) do |spec|
-  spec.libs += ['lib', 'spec']
-  spec.spec_files = FileList['spec/**/*_spec.rb']
-  spec.spec_opts = ['--options', '.specopts']
-end
+require 'ore/tasks'
+Ore::Tasks.new
+require 'rspec/core/rake_task'
+RSpec::Core::RakeTask.new
 task :default => :spec
 require 'yard'
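
The new Rakefile guards `require 'bundler'` with a `rescue LoadError` and then calls `e.status_code`. Standard Ruby's `LoadError` does not define `status_code` (`Bundler::BundlerError` does), so that rescue branch would itself raise `NoMethodError`. The sketch below shows the same guarded-require idea with a fixed exit status; `require_or_hint` is a hypothetical helper name, not something in the gem.

```ruby
# Hypothetical helper: tries to load a library and prints an install hint
# on failure. Returns true when the library loaded, false otherwise.
def require_or_hint(library, hint)
  require library
  true
rescue LoadError => e
  warn e.message
  warn hint
  false
end

# Possible use at the top of a Rakefile (left commented out so the sketch
# has no side effects when run on its own):
#
#   unless require_or_hint('bundler', 'Run `gem install bundler` to install Bundler.')
#     exit 1
#   end
```

Returning a boolean and letting the caller decide on `exit 1` keeps the helper testable and avoids relying on a method the exception may not have.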

data/gemspec.yml ADDED

@@ -0,0 +1,19 @@
+name: spidr
+summary: A versatile Ruby web spidering library
+description:
+  Spidr is a versatile Ruby web spidering library that can spider a site,
+  multiple domains, certain links or infinitely. Spidr is designed to be
+  fast and easy to use.
+license: MIT
+authors: Postmodern
+email: postmodern.mod3@gmail.com
+homepage: http://github.com/postmodern/spidr
+has_yard: true
+dependencies:
+  nokogiri: ~> 1.3
+development_dependencies:
+  bundler: ~> 1.0.0
+  yard: ~> 0.6.0
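
This release moves the nokogiri constraint from `>= 1.3.0` to `~> 1.3`. As a quick illustration of what that changes, the sketch below compares the two constraints with RubyGems' own `Gem::Requirement`; the version numbers tested are arbitrary examples, not versions this gem was verified against.

```ruby
require 'rubygems'

old_requirement = Gem::Requirement.new('>= 1.3.0') # open-ended upper bound
new_requirement = Gem::Requirement.new('~> 1.3')   # >= 1.3 and < 2.0

# Arbitrary example versions, chosen only to show the boundary behaviour.
%w[1.3.0 1.4.4 2.0.0].each do |string|
  version = Gem::Version.new(string)
  puts format('%-6s  >= 1.3.0: %-5s  ~> 1.3: %s',
              string,
              old_requirement.satisfied_by?(version),
              new_requirement.satisfied_by?(version))
end
```

The pessimistic operator still accepts every 1.x release from 1.3 onward but rejects a future 2.0, which the old `>=` constraint would have allowed.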

data/lib/spidr/actions/actions.rb CHANGED

@@ -4,7 +4,7 @@ require 'spidr/actions/exceptions/skip_page'
 module Spidr
   #
-  # The {Actions} module adds methods to {Agent} for controling the
+  # The {Actions} module adds methods to {Agent} for controlling the
   # spidering of links.
   #
   module Actions

data/lib/spidr/agent.rb CHANGED

@@ -48,6 +48,12 @@ module Spidr
     # Cached cookies
     attr_reader :cookies
+    # Maximum depth
+    attr_reader :max_depth
+    # The visited URLs and their depth within a site
+    attr_reader :levels
     #
     # Creates a new Agent object.
@@ -91,6 +97,9 @@ module Spidr
     # @option options [Set, Array] :history
     #   The initial list of visited URLs.
     #
+    # @option options [Integer] :max_depth
+    #   The maximum link depth to follow.
+    #
     # @yield [agent]
     #   If a block is given, it will be passed the newly created agent
     #   for further configuration.
@@ -119,6 +128,9 @@ module Spidr
       @failures = Set[]
       @queue = []
+      @levels = Hash.new(0)
+      @max_depth = options[:max_depth]
       super(options)
       yield self if block_given?
@@ -450,7 +462,7 @@ module Spidr
     # @return [Boolean]
     #   Specifies whether the URL was enqueued, or ignored.
     #
-    def enqueue(url)
+    def enqueue(url,level=0)
       url = sanitize_url(url)
       if (!(queued?(url)) && visit?(url))
@@ -477,14 +489,15 @@ module Spidr
           return false
         rescue Actions::Action
         end
         @queue << url
+        @levels[url] = level
         return true
       end
       return false
     end
     #
     # Requests and creates a new Page object from a given URL.
     #
@@ -568,7 +581,7 @@ module Spidr
     #   for the page failed, or the page was skipped.
     #
     def visit_page(url)
-      url = URI(url.to_s) unless url.kind_of?(URI)
+      url = sanitize_url(url)
       get_page(url) do |page|
         @history << page.url
@@ -584,7 +597,7 @@ module Spidr
         rescue Actions::Action
         end
-        page.urls.each do |next_url|
+        page.each_url do |next_url|
           begin
             @every_link_blocks.each do |link_block|
               link_block.call(page.url,next_url)
@@ -596,7 +609,9 @@ module Spidr
           rescue Actions::Action
           end
-          enqueue(next_url)
+          if (@max_depth.nil? || @max_depth > @levels[url])
+            enqueue(next_url,@levels[url] + 1)
+          end
         end
       end
     end
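
The `max_depth` / `levels` bookkeeping added above can be tried in isolation. The following is a minimal sketch, assuming a hypothetical in-memory link graph instead of real HTTP requests; `LINK_GRAPH` and `crawl_sketch` are illustrative names, and the `seen` set stands in for the agent's own `queued?` / `visit?` checks.

```ruby
require 'set'

# Hypothetical site: each URL maps to the URLs it links to.
LINK_GRAPH = {
  '/'      => ['/a', '/b'],
  '/a'     => ['/a/1'],
  '/b'     => [],
  '/a/1'   => ['/a/1/x'],
  '/a/1/x' => []
}.freeze

# Breadth-first crawl that records the depth of every enqueued URL and
# stops following links once max_depth is reached (nil means no limit).
def crawl_sketch(graph, start, max_depth = nil)
  levels  = Hash.new(0)   # URL => depth, defaulting to 0 as in the diff
  queue   = [start]
  seen    = Set[start]
  visited = []
  levels[start] = 0

  until queue.empty?
    url = queue.shift
    visited << url

    graph.fetch(url, []).each do |next_url|
      next if seen.include?(next_url)

      # Same condition as the diff: only enqueue while the current page's
      # depth is still below the limit.
      if max_depth.nil? || max_depth > levels[url]
        seen << next_url
        queue << next_url
        levels[next_url] = levels[url] + 1
      end
    end
  end

  [visited, levels]
end

visited, levels = crawl_sketch(LINK_GRAPH, '/', 1)
puts visited.inspect
puts levels.inspect
```

With `max_depth` set to 1 the sketch visits the start page and its direct links only; pages two or more links away are never enqueued.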

data/lib/spidr/auth_store.rb CHANGED

@@ -24,7 +24,7 @@ module Spidr
     # Given a URL, return the most specific matching auth credential.
     #
     # @param [URI] url
-    #   A fully qualified url includig optional path.
+    #   A fully qualified url including optional path.
     #
     # @return [AuthCredential, nil]
     #   Closest matching {AuthCredential} values for the URL,

data/lib/spidr/body.rb ADDED

@@ -0,0 +1,99 @@
+require 'nokogiri'
+module Spidr
+  module Body
+    #
+    # The body of the response.
+    #
+    # @return [String]
+    #   The body of the response.
+    #
+    def body
+      (response.body || '')
+    end
+    #
+    # Returns a parsed document object for HTML, XML, RSS and Atom pages.
+    #
+    # @return [Nokogiri::HTML::Document, Nokogiri::XML::Document, nil]
+    #   The document that represents HTML or XML pages.
+    #   Returns `nil` if the page is neither HTML, XML, RSS, Atom or if
+    #   the page could not be parsed properly.
+    #
+    # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html
+    # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/HTML/Document.html
+    #
+    def doc
+      return nil if body.empty?
+      begin
+        if html?
+          return @doc ||= Nokogiri::HTML(body)
+        elsif (xml? || xsl? || rss? || atom?)
+          return @doc ||= Nokogiri::XML(body)
+        end
+      rescue
+        return nil
+      end
+    end
+    #
+    # Searches the document for XPath or CSS Path paths.
+    #
+    # @param [Array<String>] paths
+    #   CSS or XPath expressions to search the document with.
+    #
+    # @return [Array]
+    #   The matched nodes from the document.
+    #   Returns an empty Array if no nodes were matched, or if the page
+    #   is not an HTML or XML document.
+    #
+    # @example
+    #   page.search('//a[@href]')
+    #
+    # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Node.html#M000239
+    #
+    def search(*paths)
+      if doc
+        doc.search(*paths)
+      else
+        []
+      end
+    end
+    #
+    # Searches for the first occurrence of an XPath or CSS Path expression.
+    #
+    # @return [Nokogiri::HTML::Node, Nokogiri::XML::Node, nil]
+    #   The first matched node. Returns `nil` if no nodes could be matched,
+    #   or if the page is not a HTML or XML document.
+    #
+    # @example
+    #   page.at('//title')
+    #
+    # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Node.html#M000251
+    #
+    def at(*arguments)
+      if doc
+        doc.at(*arguments)
+      end
+    end
+    alias / search
+    alias % at
+    #
+    # The title of the HTML page.
+    #
+    # @return [String]
+    #   The inner-text of the title element of the page.
+    #
+    def title
+      if (node = at('//title'))
+        node.inner_text
+      end
+    end
+    alias to_s body
+  end
+end
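
The new `Spidr::Body` module is a mixin that relies on the including class to provide a `response` method. A stripped-down sketch of that pattern, without Nokogiri, might look like the following; `FakeResponse`, `BodySketch` and `PageSketch` are hypothetical stand-ins, not classes from the gem.

```ruby
# Hypothetical stand-in for an HTTP response object; only #body matters here.
FakeResponse = Struct.new(:body)

# Mixin in the style of the module above: it never stores state itself and
# expects the including class to define #response.
module BodySketch
  # Falls back to an empty String when the response has no body.
  def body
    (response.body || '')
  end

  alias to_s body
end

# Hypothetical including class that supplies #response.
class PageSketch
  include BodySketch

  attr_reader :response

  def initialize(response)
    @response = response
  end
end

puts PageSketch.new(FakeResponse.new('<html></html>')).to_s
puts PageSketch.new(FakeResponse.new(nil)).body.empty?
```

Because the module is included ahead of `Object` in the ancestor chain, the aliased `to_s` takes precedence over the default `Object#to_s`.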