spidr 0.2.2 → 0.2.3

data/.gitignore ADDED
@@ -0,0 +1,8 @@
+ pkg
+ doc
+ web
+ tmp
+ .DS_Store
+ .yardoc
+ *.swp
+ *~
data/.specopts ADDED
@@ -0,0 +1 @@
+ --colour --format specdoc
data/.yardopts ADDED
@@ -0,0 +1 @@
+ --markup markdown --title 'Spidr Documentation' --protected --files ChangeLog.md,LICENSE.txt
@@ -1,4 +1,12 @@
- === 0.2.2 / 2010-01-06
+ ### 0.2.3 / 2010-02-27
+
+ * Migrated to Jeweler for packaging and releasing RubyGems.
+ * Switched to Markdown-formatted YARD documentation.
+ * Added {Spidr::Events#every_link}.
+ * Added {Spidr::SessionCache#active?}.
+ * Added specs for {Spidr::SessionCache}.
+
+ ### 0.2.2 / 2010-01-06
 
  * Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
  * Integrated the new WSOC into the specs.
@@ -12,10 +20,10 @@
  * Added {Spidr::CookieJar} (thanks Nick Plante).
  * Added {Spidr::AuthStore} (thanks Nick Plante).
  * Added {Spidr::Agent#post_page} (thanks Nick Plante).
- * Renamed Spidr::Agent#get_session to {Spidr::SessionCache#[]}.
- * Renamed Spidr::Agent#kill_session to {Spidr::SessionCache#kill!}.
+ * Renamed `Spidr::Agent#get_session` to {Spidr::SessionCache#[]}.
+ * Renamed `Spidr::Agent#kill_session` to {Spidr::SessionCache#kill!}.
 
- === 0.2.1 / 2009-11-25
+ ### 0.2.1 / 2009-11-25
 
  * Added {Spidr::Events#every_ok_page}.
  * Added {Spidr::Events#every_redirect_page}.
@@ -44,9 +52,9 @@
  * Added {Spidr::Events#every_zip_page}.
  * Fixed a bug where {Spidr::Agent#delay} was not being used to delay
    requesting pages.
- * Spider +link+ and +script+ tags in HTML pages (thanks Nick Plante).
+ * Spider `link` and `script` tags in HTML pages (thanks Nick Plante).
 
- === 0.2.0 / 2009-10-10
+ ### 0.2.0 / 2009-10-10
 
  * Added {URI.expand_path}.
  * Added {Spidr::Page#search}.
@@ -54,16 +62,16 @@
  * Added {Spidr::Page#title}.
  * Added {Spidr::Agent#failures=}.
  * Added an HTTP session cache to {Spidr::Agent}, per suggestion of falter.
- * Added Spidr::Agent#get_session.
- * Added Spidr::Agent#kill_session.
+ * Added `Spidr::Agent#get_session`.
+ * Added `Spidr::Agent#kill_session`.
  * Added {Spidr.proxy=}.
  * Added {Spidr.disable_proxy!}.
- * Aliased Spidr::Page#txt? to {Spidr::Page#plain_text?}.
- * Aliased Spidr::Page#ok? to {Spidr::Page#is_ok?}.
- * Aliased Spidr::Page#redirect? to {Spidr::Page#is_redirect?}.
- * Aliased Spidr::Page#unauthorized? to {Spidr::Page#is_unauthorized?}.
- * Aliased Spidr::Page#forbidden? to {Spidr::Page#is_forbidden?}.
- * Aliased Spidr::Page#missing? to {Spidr::Page#is_missing?}.
+ * Aliased `Spidr::Page#txt?` to {Spidr::Page#plain_text?}.
+ * Aliased `Spidr::Page#ok?` to {Spidr::Page#is_ok?}.
+ * Aliased `Spidr::Page#redirect?` to {Spidr::Page#is_redirect?}.
+ * Aliased `Spidr::Page#unauthorized?` to {Spidr::Page#is_unauthorized?}.
+ * Aliased `Spidr::Page#forbidden?` to {Spidr::Page#is_forbidden?}.
+ * Aliased `Spidr::Page#missing?` to {Spidr::Page#is_missing?}.
  * Split URL filtering code out of {Spidr::Agent} and into
    {Spidr::Filters}.
  * Split URL / Page event code out of {Spidr::Agent} and into
@@ -71,11 +79,11 @@
  * Split pause! / continue! / skip_link! / skip_page! methods out of
    {Spidr::Agent} and into {Spidr::Actions}.
  * Fixed a bug in {Spidr::Page#code}, where it was not returning an Integer.
- * Make sure {Spidr::Page#doc} returns Nokogiri::XML::Document objects for
+ * Make sure {Spidr::Page#doc} returns `Nokogiri::XML::Document` objects for
    RSS/RDF/Atom pages as well.
  * Fixed the handling of the Location header in {Spidr::Page#links}
    (thanks falter).
- * Fixed a bug in {Spidr::Page#to_absolute} where trailing '/' characters on
+ * Fixed a bug in {Spidr::Page#to_absolute} where trailing `/` characters on
    URI paths were not being preserved (thanks falter).
  * Fixed a bug where the URI query was not being sent with the request
    in {Spidr::Agent#get_page} (thanks Damian Steer).
@@ -86,17 +94,17 @@
  * Switched {Spidr::Agent#failures} to a Set.
  * Allow a block to be passed to {Spidr::Agent#run}, which will receive all
    pages visited.
- * Allow Spidr::Agent#start_at and Spidr::Agent#continue! to pass blocks
+ * Allow `Spidr::Agent#start_at` and `Spidr::Agent#continue!` to pass blocks
    to {Spidr::Agent#run}.
  * Made {Spidr::Agent#visit_page} public.
  * Moved to YARD-based documentation.
 
- === 0.1.9 / 2009-06-13
+ ### 0.1.9 / 2009-06-13
 
  * Upgraded to Hoe 2.0.0.
  * Use Hoe.spec instead of Hoe.new.
  * Use the Hoe signing task for signed gems.
- * Added the Spidr::Agent#schemes and Spidr::Agent#schemes= methods.
+ * Added the `Spidr::Agent#schemes` and `Spidr::Agent#schemes=` methods.
  * Added a warning message if 'net/https' cannot be loaded.
  * Allow the list of acceptable URL schemes to be passed into
    {Spidr::Agent#initialize}.
@@ -108,10 +116,10 @@
    could not be loaded.
  * Removed Spidr::Agent::SCHEMES.
 
- === 0.1.8 / 2009-05-27
+ ### 0.1.8 / 2009-05-27
 
- * Added the Spidr::Agent#pause! and Spidr::Agent#continue! methods.
- * Added the Spidr::Agent#running? and Spidr::Agent#paused? methods.
+ * Added the `Spidr::Agent#pause!` and `Spidr::Agent#continue!` methods.
+ * Added the `Spidr::Agent#running?` and `Spidr::Agent#paused?` methods.
  * Added an alias for pending_urls to the queue methods.
  * Added {Spidr::Agent#queue} to provide read access to the queue.
  * Added {Spidr::Agent#queue=} and {Spidr::Agent#history=} for setting the
@@ -121,49 +129,49 @@
  * Made {Spidr::Agent#enqueue} and {Spidr::Agent#queued?} public.
  * Added more specs.
 
- === 0.1.7 / 2009-04-24
+ ### 0.1.7 / 2009-04-24
 
- * Added Spidr::Agent#all_headers.
- * Fixed a bug where Page#headers was always +nil+.
+ * Added `Spidr::Agent#all_headers`.
+ * Fixed a bug where {Spidr::Page#headers} was always `nil`.
  * {Spidr::Agent} will now follow the Location header in HTTP 300,
    301, 302, 303 and 307 Redirects.
  * {Spidr::Agent} will now follow iframe and frame tags.
 
- === 0.1.6 / 2009-04-14
+ ### 0.1.6 / 2009-04-14
 
  * Added {Spidr::Agent#failures}, a list of URLs which could not be visited.
  * Added {Spidr::Agent#failed?}.
- * Added Spidr::Agent#every_failed_url.
+ * Added `Spidr::Agent#every_failed_url`.
  * Added {Spidr::Agent#clear}, which clears the history and failures URL
    lists.
  * Improved fault tolerance in {Spidr::Agent#get_page}.
  * If a Network or HTTP error is encountered, the URL will be added to
    the failures list and the next URL will be visited.
- * Fixed a typo in Spidr::Agent#ignore_exts_like.
+ * Fixed a typo in `Spidr::Agent#ignore_exts_like`.
  * Updated the Web Spider Obstacle Course with links that always fail to be
    visited.
 
- === 0.1.5 / 2009-03-22
+ ### 0.1.5 / 2009-03-22
 
- * Catch malformed URIs in {Spidr::Page#to_absolute} and return +nil+.
- * Filter out +nil+ URIs in {Spidr::Page#urls}.
+ * Catch malformed URIs in {Spidr::Page#to_absolute} and return `nil`.
+ * Filter out `nil` URIs in {Spidr::Page#urls}.
 
- === 0.1.4 / 2009-01-15
+ ### 0.1.4 / 2009-01-15
 
  * Use Nokogiri for HTML and XML parsing.
 
- === 0.1.3 / 2009-01-10
+ ### 0.1.3 / 2009-01-10
 
- * Added the :host options to {Spidr::Agent#initialize}.
+ * Added the `:host` option to {Spidr::Agent#initialize}.
  * Added the Web Spider Obstacle Course files to the Manifest.
  * Aliased {Spidr::Agent#visited_urls} to {Spidr::Agent#history}.
 
- === 0.1.2 / 2008-11-06
+ ### 0.1.2 / 2008-11-06
 
  * Fixed a bug in {Spidr::Page#to_absolute} where URLs with no path were not
-   receiving a default path of <tt>/</tt>.
+   receiving a default path of `/`.
  * Fixed a bug in {Spidr::Page#to_absolute} where URL paths were not being
-   expanded, in order to remove <tt>..</tt> and <tt>.</tt> directories.
+   expanded, in order to remove `..` and `.` directories.
  * Fixed a bug where absolute URLs could have a blank path, thus causing
    {Spidr::Agent#get_page} to crash when it performed the HTTP request.
  * Added RSpec spec tests.
@@ -171,12 +179,12 @@
    (http://spidr.rubyforge.org/course/start.html) which is used in the spec
    tests.
 
- === 0.1.1 / 2008-10-04
+ ### 0.1.1 / 2008-10-04
 
  * Added a reader method for the response instance variable in Page.
  * Fixed a bug in {Spidr::Page#method_missing}.
 
- === 0.1.0 / 2008-05-23
+ ### 0.1.0 / 2008-05-23
 
  * Initial release.
  * Black-list or white-list URLs based upon:
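The 0.2.3 changelog above introduces {Spidr::Events#every_link}, which yields each link's origin and destination URL during a crawl. A minimal offline sketch of the reverse-link map this callback enables — the `(origin, dest)` pairs below are simulated stand-ins for what a real spider run would yield, and the URLs are hypothetical:

```ruby
# Build a reverse link map the way an every_link callback would:
# for each destination URL, record which pages link to it.
# Hash.new with a block gives every new key its own empty Array.
url_map = Hash.new { |hash, key| hash[key] = [] }

# Simulated (origin, dest) pairs, standing in for a real crawl.
links = [
  ['http://intranet.com/',      'http://intranet.com/about'],
  ['http://intranet.com/',      'http://intranet.com/missing'],
  ['http://intranet.com/about', 'http://intranet.com/missing']
]

links.each { |origin, dest| url_map[dest] << origin }

# How many pages link to a given destination:
puts url_map['http://intranet.com/missing'].length  # prints 2
```

In a real crawl the `links.each` loop is replaced by `spider.every_link do |origin, dest| ... end` inside a `Spidr.site` block, as the README's URL-map example shows.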
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+
+ Copyright (c) 2008-2010 Hal Brodigan
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ 'Software'), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -1,18 +1,18 @@
- = Spidr
+ # Spidr
 
- * http://spidr.rubyforge.org
- * http://github.com/postmodern/spidr
- * http://github.com/postmodern/spidr/issues
- * http://groups.google.com/group/spidr
+ * [spidr.rubyforge.org](http://spidr.rubyforge.org/)
+ * [github.com/postmodern/spidr](http://github.com/postmodern/spidr)
+ * [github.com/postmodern/spidr/issues](http://github.com/postmodern/spidr/issues)
+ * [groups.google.com/group/spidr](http://groups.google.com/group/spidr)
  * irc.freenode.net #spidr
 
- == DESCRIPTION:
+ ## Description
 
  Spidr is a versatile Ruby web spidering library that can spider a site,
  multiple domains, certain links or infinitely. Spidr is designed to be fast
  and easy to use.
 
- == FEATURES:
+ ## Features
 
  * Follows:
    * a tags.
@@ -31,6 +31,7 @@ and easy to use.
    * Every visited Page.
    * Every visited URL.
    * Every visited URL that matches a specified pattern.
+   * Every origin and destination URI of a link.
    * Every URL that failed to be visited.
  * Provides action methods to:
    * Pause spidering.
@@ -39,22 +40,23 @@ and easy to use.
  * Restore the spidering queue and history from a previous session.
  * Custom User-Agent strings.
  * Custom proxy settings.
+ * HTTPS support.
 
- == EXAMPLES:
+ ## Examples
 
- * Start spidering from a URL:
+ Start spidering from a URL:
 
      Spidr.start_at('http://tenderlovemaking.com/')
 
- * Spider a host:
+ Spider a host:
 
      Spidr.host('coderrr.wordpress.com')
 
- * Spider a site:
+ Spider a site:
 
      Spidr.site('http://rubyflow.com/')
 
- * Spider multiple hosts:
+ Spider multiple hosts:
 
      Spidr.start_at(
        'http://company.com/',
@@ -64,30 +66,56 @@ and easy to use.
        ]
      )
 
- * Do not spider certain links:
+ Do not spider certain links:
 
      Spidr.site('http://matasano.com/', :ignore_links => [/log/])
 
- * Do not spider links on certain ports:
+ Do not spider links on certain ports:
 
      Spidr.site(
        'http://sketchy.content.com/',
        :ignore_ports => [8000, 8010, 8080]
      )
 
- * Print out visited URLs:
+ Print out visited URLs:
 
      Spidr.site('http://rubyinside.org/') do |spider|
        spider.every_url { |url| puts url }
      end
 
- * Print out the URLs that could not be requested:
+ Build a URL map of a site:
+
+     url_map = Hash.new { |hash,key| hash[key] = [] }
+
+     Spidr.site('http://intranet.com/') do |spider|
+       spider.every_link do |origin,dest|
+         url_map[dest] << origin
+       end
+     end
+
+ Print out the URLs that could not be requested:
 
      Spidr.site('http://sketchy.content.com/') do |spider|
        spider.every_failed_url { |url| puts url }
      end
 
- * Search HTML and XML pages:
+ Find all pages which have broken links:
+
+     url_map = Hash.new { |hash,key| hash[key] = [] }
+
+     spider = Spidr.site('http://intranet.com/') do |spider|
+       spider.every_link do |origin,dest|
+         url_map[dest] << origin
+       end
+     end
+
+     spider.failures.each do |url|
+       puts "Broken link #{url} found in:"
+
+       url_map[url].each { |page| puts "  #{page}" }
+     end
+
+ Search HTML and XML pages:
 
      Spidr.site('http://company.withablog.com/') do |spider|
        spider.every_page do |page|
@@ -98,11 +126,11 @@ and easy to use.
          value = meta.attributes['content']
 
          puts " #{name} = #{value}"
-       end
+         end
        end
      end
 
- * Print out the titles from every page:
+ Print out the titles from every page:
 
      Spidr.site('http://www.rubypulse.com/') do |spider|
        spider.every_html_page do |page|
@@ -110,7 +138,7 @@ and easy to use.
        end
      end
 
- * Find what kinds of web servers a host is using, by accessing the headers:
+ Find what kinds of web servers a host is using, by accessing the headers:
 
      servers = Set[]
 
@@ -120,7 +148,7 @@ and easy to use.
        end
      end
 
- * Pause the spider on a forbidden page:
+ Pause the spider on a forbidden page:
 
      spider = Spidr.host('overnight.startup.com') do |spider|
        spider.every_forbidden_page do |page|
@@ -128,7 +156,7 @@ and easy to use.
        end
      end
 
- * Skip the processing of a page:
+ Skip the processing of a page:
 
      Spidr.host('sketchy.content.com') do |spider|
        spider.every_missing_page do |page|
@@ -136,7 +164,7 @@ and easy to use.
        end
      end
 
- * Skip the processing of links:
+ Skip the processing of links:
 
      Spidr.host('sketchy.content.com') do |spider|
        spider.every_url do |url|
@@ -146,35 +174,15 @@ and easy to use.
        end
      end
 
- == REQUIREMENTS:
-
- * {nokogiri}[http://nokogiri.rubyforge.org/] >= 1.2.0
-
- == INSTALL:
-
-   $ sudo gem install spidr
+ ## Requirements
 
- == LICENSE:
+ * [nokogiri](http://nokogiri.rubyforge.org/) >= 1.2.0
 
- The MIT License
+ ## Install
 
- Copyright (c) 2008-2010 Hal Brodigan
+     $ sudo gem install spidr
 
- Permission is hereby granted, free of charge, to any person obtaining
- a copy of this software and associated documentation files (the
- 'Software'), to deal in the Software without restriction, including
- without limitation the rights to use, copy, modify, merge, publish,
- distribute, sublicense, and/or sell copies of the Software, and to
- permit persons to whom the Software is furnished to do so, subject to
- the following conditions:
+ ## License
 
- The above copyright notice and this permission notice shall be
- included in all copies or substantial portions of the Software.
+ See {file:LICENSE.txt} for license information.
 
- THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Rakefile CHANGED
@@ -1,29 +1,43 @@
- # -*- ruby -*-
-
  require 'rubygems'
- require 'hoe'
- require 'hoe/signing'
- require './tasks/spec.rb'
- require './tasks/yard.rb'
+ require 'rake'
+ require './lib/spidr/version.rb'
 
- Hoe.spec('spidr') do
-   self.developer('Postmodern', 'postmodern.mod3@gmail.com')
+ begin
+   require 'jeweler'
+   Jeweler::Tasks.new do |gem|
+     gem.name = 'spidr'
+     gem.version = Spidr::VERSION
+     gem.summary = %Q{A versatile Ruby web spidering library}
+     gem.description = %Q{Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.}
+     gem.email = 'postmodern.mod3@gmail.com'
+     gem.homepage = 'http://github.com/postmodern/spidr'
+     gem.authors = ['Postmodern']
+     gem.add_dependency 'nokogiri', '>= 1.2.0'
+     gem.add_development_dependency 'rspec', '>= 1.3.0'
+     gem.add_development_dependency 'yard', '>= 0.5.3'
+     gem.add_development_dependency 'wsoc', '>= 0.1.1'
+     gem.has_rdoc = 'yard'
+   end
+ rescue LoadError
+   puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
+ end
 
-   self.readme_file = 'README.rdoc'
-   self.history_file = 'History.rdoc'
-   self.remote_rdoc_dir = 'docs'
+ require 'spec/rake/spectask'
+ Spec::Rake::SpecTask.new(:spec) do |spec|
+   spec.libs += ['lib', 'spec']
+   spec.spec_files = FileList['spec/**/*_spec.rb']
+   spec.spec_opts = ['--options', '.specopts']
+ end
 
-   self.extra_deps = [
-     ['nokogiri', '>=1.2.0']
-   ]
+ task :spec => :check_dependencies
+ task :default => :spec
 
-   self.extra_dev_deps = [
-     ['rspec', '>=1.2.8'],
-     ['yard', '>=0.4.0'],
-     ['wsoc', '>=0.1.1']
-   ]
+ begin
+   require 'yard'
 
-   self.spec_extras = {:has_rdoc => 'yard'}
+   YARD::Rake::YardocTask.new
+ rescue LoadError
+   task :yard do
+     abort "YARD is not available. In order to run yard, you must: gem install yard"
+   end
  end
-
- # vim: syntax=Ruby
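The rewritten Rakefile wraps its optional tooling (Jeweler, YARD) in `begin` / `require` / `rescue LoadError` blocks, so the remaining tasks still load when a development gem is not installed. A standalone sketch of that guard pattern — `some_optional_gem_that_is_not_installed` is a hypothetical gem name used only for illustration:

```ruby
# Attempt to load a library; report whether it was available instead of
# letting a missing gem abort the whole file (as the Rakefile does for
# Jeweler and YARD).
def load_optional_tool(name)
  require name
  :loaded
rescue LoadError
  :missing
end

puts load_optional_tool('some_optional_gem_that_is_not_installed')  # prints "missing"
```

In the Rakefile itself, the `rescue LoadError` branch defines a stub task (or prints install instructions) rather than returning a symbol, but the control flow is the same.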