spidr 0.1.0
This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.
- data/History.txt +13 -0
- data/Manifest.txt +11 -0
- data/README.txt +55 -0
- data/Rakefile +13 -0
- data/lib/spidr.rb +3 -0
- data/lib/spidr/agent.rb +490 -0
- data/lib/spidr/page.rb +159 -0
- data/lib/spidr/rules.rb +61 -0
- data/lib/spidr/spidr.rb +48 -0
- data/lib/spidr/version.rb +3 -0
- data/test/test_spidr.rb +0 -0
- metadata +84 -0
data/History.txt
ADDED
@@ -0,0 +1,13 @@
|
|
1
|
+
=== 0.1.0 / 2008-05-23
|
2
|
+
|
3
|
+
* Initial release.
|
4
|
+
* Black-list or white-list URLs based upon:
|
5
|
+
* Host name
|
6
|
+
* Port number
|
7
|
+
* Full link
|
8
|
+
* URL extension
|
9
|
+
* Provides call-backs for:
|
10
|
+
* Every visited Page.
|
11
|
+
* Every visited URL.
|
12
|
+
* Every visited URL that matches a specified pattern.
|
13
|
+
|
data/Manifest.txt
ADDED
data/README.txt
ADDED
@@ -0,0 +1,55 @@
|
|
1
|
+
= Spidr
|
2
|
+
|
3
|
+
* http://spidr.rubyforge.org/
|
4
|
+
* Postmodern Modulus III (postmodern.mod3@gmail.com)
|
5
|
+
|
6
|
+
== DESCRIPTION:
|
7
|
+
|
8
|
+
Spidr is a versatile Ruby web spidering library that can spider a site,
|
9
|
+
multiple domains, certain links or infinitely. Spidr is designed to be fast
|
10
|
+
and easy to use.
|
11
|
+
|
12
|
+
== FEATURES/PROBLEMS:
|
13
|
+
|
14
|
+
* Black-list or white-list URLs based upon:
|
15
|
+
* Host name
|
16
|
+
* Port number
|
17
|
+
* Full link
|
18
|
+
* URL extension
|
19
|
+
* Provides call-backs for:
|
20
|
+
* Every visited Page.
|
21
|
+
* Every visited URL.
|
22
|
+
* Every visited URL that matches a specified pattern.
|
23
|
+
|
24
|
+
== REQUIREMENTS:
|
25
|
+
|
26
|
+
* Hpricot
|
27
|
+
|
28
|
+
== INSTALL:
|
29
|
+
|
30
|
+
$ sudo gem install spidr
|
31
|
+
|
32
|
+
== LICENSE:
|
33
|
+
|
34
|
+
The MIT License
|
35
|
+
|
36
|
+
Copyright (c) 2008 Hal Brodigan
|
37
|
+
|
38
|
+
Permission is hereby granted, free of charge, to any person obtaining
|
39
|
+
a copy of this software and associated documentation files (the
|
40
|
+
'Software'), to deal in the Software without restriction, including
|
41
|
+
without limitation the rights to use, copy, modify, merge, publish,
|
42
|
+
distribute, sublicense, and/or sell copies of the Software, and to
|
43
|
+
permit persons to whom the Software is furnished to do so, subject to
|
44
|
+
the following conditions:
|
45
|
+
|
46
|
+
The above copyright notice and this permission notice shall be
|
47
|
+
included in all copies or substantial portions of the Software.
|
48
|
+
|
49
|
+
THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
|
50
|
+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
|
51
|
+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
|
52
|
+
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
|
53
|
+
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
|
54
|
+
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
|
55
|
+
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/Rakefile
ADDED
@@ -0,0 +1,13 @@
|
|
1
|
+
# -*- ruby -*-

require 'rubygems'
require 'hoe'
require './lib/spidr/version.rb'

# Hoe generates the standard gem packaging, test and release tasks
# for the spidr project, using the version defined in
# lib/spidr/version.rb.
Hoe.new('spidr', Spidr::VERSION) do |p|
  p.rubyforge_name = 'spidr'
  p.developer('Postmodern Modulus III', 'postmodern.mod3@gmail.com')
  # hpricot is the runtime dependency used for HTML parsing.
  p.extra_deps = ['hpricot']
end

# vim: syntax=Ruby
|
data/lib/spidr.rb
ADDED
data/lib/spidr/agent.rb
ADDED
@@ -0,0 +1,490 @@
|
|
1
|
+
require 'spidr/rules'
|
2
|
+
require 'spidr/page'
|
3
|
+
require 'spidr/spidr'
|
4
|
+
|
5
|
+
require 'net/http'
|
6
|
+
require 'hpricot'
|
7
|
+
|
8
|
+
module Spidr
  class Agent

    # URL schemes the agent will visit
    SCHEMES = ['http', 'https']

    # Proxy to use
    attr_accessor :proxy

    # User-Agent to use
    attr_accessor :user_agent

    # Referer to use
    attr_accessor :referer

    # Delay in seconds between fetching pages
    attr_accessor :delay

    # History containing visited URLs
    attr_accessor :history

    #
    # Creates a new Agent object with the given _options_ and _block_.
    # If a _block_ is given, it will be passed the newly created
    # Agent object.
    #
    # _options_ may contain the following keys:
    # <tt>:proxy</tt>:: The proxy to use while spidering.
    # <tt>:user_agent</tt>:: The User-Agent string to send.
    # <tt>:referer</tt>:: The referer URL to send.
    # <tt>:delay</tt>:: Duration in seconds to pause between spidering each
    #                   link. Defaults to 0.
    # <tt>:hosts</tt>:: An +Array+ of host patterns to visit.
    # <tt>:ignore_hosts</tt>:: An +Array+ of host patterns to not visit.
    # <tt>:ports</tt>:: An +Array+ of port patterns to visit.
    # <tt>:ignore_ports</tt>:: An +Array+ of port patterns to not visit.
    # <tt>:links</tt>:: An +Array+ of link patterns to visit.
    # <tt>:ignore_links</tt>:: An +Array+ of link patterns to not visit.
    # <tt>:exts</tt>:: An +Array+ of File extension patterns to visit.
    # <tt>:ignore_exts</tt>:: An +Array+ of File extension patterns to not
    #                         visit.
    #
    def initialize(options={},&block)
      @proxy = (options[:proxy] || Spidr.proxy)
      @user_agent = (options[:user_agent] || Spidr.user_agent)
      @referer = options[:referer]

      @host_rules = Rules.new(:accept => options[:hosts],
                              :reject => options[:ignore_hosts])
      @port_rules = Rules.new(:accept => options[:ports],
                              :reject => options[:ignore_ports])
      @link_rules = Rules.new(:accept => options[:links],
                              :reject => options[:ignore_links])
      @ext_rules = Rules.new(:accept => options[:exts],
                             :reject => options[:ignore_exts])

      @every_url_blocks = []
      # Maps a URL pattern to the Array of callbacks registered for it.
      @urls_like_blocks = Hash.new { |hash,key| hash[key] = [] }

      @every_page_blocks = []

      @delay = (options[:delay] || 0)
      @history = []
      @queue = []

      block.call(self) if block
    end

    #
    # Creates a new Agent object with the given _options_ and will begin
    # spidering at the specified _url_. If a _block_ is given it will be
    # passed the newly created Agent object, before the agent begins
    # spidering.
    #
    def self.start_at(url,options={},&block)
      self.new(options) do |spider|
        block.call(spider) if block

        spider.start_at(url)
      end
    end

    #
    # Creates a new Agent object with the given _options_ and will begin
    # spidering the specified host _name_. If a _block_ is given it will be
    # passed the newly created Agent object, before the agent begins
    # spidering.
    #
    def self.host(name,options={},&block)
      self.new(options.merge(:hosts => [name.to_s])) do |spider|
        block.call(spider) if block

        spider.start_at("http://#{name}/")
      end
    end

    #
    # Creates a new Agent object with the given _options_ and will begin
    # spidering the host of the specified _url_. If a _block_ is given it
    # will be passed the newly created Agent object, before the agent
    # begins spidering.
    #
    def self.site(url,options={},&block)
      url = URI(url.to_s)

      return self.new(options.merge(:hosts => [url.host])) do |spider|
        block.call(spider) if block

        spider.start_at(url)
      end
    end

    #
    # Returns the +Array+ of host patterns to visit.
    #
    def visit_hosts
      @host_rules.accept
    end

    #
    # Adds the given _pattern_ to the visit_hosts. If a _block_ is given,
    # it will be added to the visit_hosts.
    #
    def visit_hosts_like(pattern=nil,&block)
      if pattern
        visit_hosts << pattern
      elsif block
        visit_hosts << block
      end

      return self
    end

    #
    # Returns the +Array+ of URL host patterns to not visit.
    #
    def ignore_hosts
      @host_rules.reject
    end

    #
    # Adds the given _pattern_ to the ignore_hosts. If a _block_ is given,
    # it will be added to the ignore_hosts.
    #
    def ignore_hosts_like(pattern=nil,&block)
      if pattern
        ignore_hosts << pattern
      elsif block
        ignore_hosts << block
      end

      return self
    end

    #
    # Returns the +Array+ of URL port patterns to visit.
    #
    def visit_ports
      @port_rules.accept
    end

    #
    # Adds the given _pattern_ to the visit_ports. If a _block_ is given,
    # it will be added to the visit_ports.
    #
    def visit_ports_like(pattern=nil,&block)
      if pattern
        visit_ports << pattern
      elsif block
        visit_ports << block
      end

      return self
    end

    #
    # Returns the +Array+ of URL port patterns to not visit.
    #
    def ignore_ports
      @port_rules.reject
    end

    #
    # Adds the given _pattern_ to the ignore_ports. If a _block_ is given,
    # it will be added to the ignore_ports.
    #
    def ignore_ports_like(pattern=nil,&block)
      if pattern
        ignore_ports << pattern
      elsif block
        ignore_ports << block
      end

      return self
    end

    #
    # Returns the +Array+ of link patterns to visit.
    #
    def visit_links
      @link_rules.accept
    end

    #
    # Adds the given _pattern_ to the visit_links. If a _block_ is given,
    # it will be added to the visit_links.
    #
    def visit_links_like(pattern=nil,&block)
      if pattern
        visit_links << pattern
      elsif block
        visit_links << block
      end

      return self
    end

    #
    # Returns the +Array+ of link patterns to not visit.
    #
    def ignore_links
      @link_rules.reject
    end

    #
    # Adds the given _pattern_ to the ignore_links. If a _block_ is given,
    # it will be added to the ignore_links.
    #
    def ignore_links_like(pattern=nil,&block)
      if pattern
        ignore_links << pattern
      elsif block
        ignore_links << block
      end

      return self
    end

    #
    # Returns the +Array+ of URL extension patterns to visit.
    #
    def visit_exts
      @ext_rules.accept
    end

    #
    # Adds the given _pattern_ to the visit_exts. If a _block_ is given,
    # it will be added to the visit_exts.
    #
    def visit_exts_like(pattern=nil,&block)
      if pattern
        visit_exts << pattern
      elsif block
        visit_exts << block
      end

      return self
    end

    #
    # Returns the +Array+ of URL extension patterns to not visit.
    #
    def ignore_exts
      @ext_rules.reject
    end

    #
    # Adds the given _pattern_ to the ignore_exts. If a _block_ is given,
    # it will be added to the ignore_exts.
    #
    # BUGFIX: the original signature was ignore_exts_like(&block), but the
    # body referenced a (non-existent) pattern argument, raising NameError.
    def ignore_exts_like(pattern=nil,&block)
      if pattern
        ignore_exts << pattern
      elsif block
        ignore_exts << block
      end

      return self
    end

    #
    # For every URL that the agent visits it will be passed to the
    # specified _block_.
    #
    def every_url(&block)
      @every_url_blocks << block
      return self
    end

    #
    # For every URL that the agent visits and matches the specified
    # _pattern_, it will be passed to the specified _block_.
    #
    def urls_like(pattern,&block)
      @urls_like_blocks[pattern] << block
      return self
    end

    #
    # For every Page that the agent visits it will be passed to the
    # specified _block_.
    #
    def every_page(&block)
      @every_page_blocks << block
      return self
    end

    #
    # Clear the history and start spidering at the specified _url_.
    #
    def start_at(url)
      @history.clear
      return run(url)
    end

    #
    # Start spidering at the specified _url_.
    #
    def run(url)
      enqueue(url)

      until @queue.empty?
        visit_page(dequeue)
      end

      return self
    end

    #
    # Returns the +Array+ of visited URLs.
    #
    def visited_urls
      @history
    end

    #
    # Returns the +Array+ of visited URLs, as +String+s.
    #
    def visited_links
      @history.map { |uri| uri.to_s }
    end

    #
    # Return the +Array+ of hosts that were visited.
    #
    def visited_hosts
      @history.map { |uri| uri.host }.uniq
    end

    #
    # Returns +true+ if the specified _url_ was visited, returns +false+
    # otherwise.
    #
    # BUGFIX: the history contains URI objects; the original compared
    # URI(url).to_s (a String) against them, so non-URI arguments never
    # matched. Coerce to URI before the lookup instead.
    def visited?(url)
      url = URI(url.to_s) unless url.kind_of?(URI)

      return @history.include?(url)
    end

    protected

    #
    # Returns +true+ if the specified _url_ is queued for visiting, returns
    # +false+ otherwise.
    #
    def queued?(url)
      @queue.include?(url)
    end

    #
    # Enqueues the specified _url_ for visiting, only if it passes all the
    # agent's rules for visiting a given URL. Returns +true+ if the _url_
    # was successfully enqueued, returns +false+ otherwise.
    #
    def enqueue(url)
      link = url.to_s
      url = URI(link)

      if (!(queued?(url)) && visit?(url))
        @every_url_blocks.each { |block| block.call(url) }

        @urls_like_blocks.each do |pattern,blocks|
          # A pattern may be a Regexp matched against the link text, or a
          # literal String/URI compared for equality.
          if ((pattern.kind_of?(Regexp) && link =~ pattern) || pattern == link || pattern == url)
            blocks.each { |url_block| url_block.call(url) }
          end
        end

        @queue << url
        return true
      end

      return false
    end

    #
    # Dequeues a URL that will later be visited.
    #
    def dequeue
      @queue.shift
    end

    #
    # Returns +true+ if the specified URL should be visited, returns
    # +false+ otherwise.
    #
    def visit?(url)
      (!(visited?(url)) &&
       visit_scheme?(url) &&
       visit_host?(url) &&
       visit_port?(url) &&
       visit_link?(url) &&
       visit_ext?(url))
    end

    #
    # Visits the specified _url_ and enqueues its links for visiting. If a
    # _block_ is given, it will be passed a newly created Page object
    # for the specified _url_.
    #
    def visit_page(url,&block)
      get_page(url) do |page|
        @history << page.url

        page.urls.each { |next_url| enqueue(next_url) }

        @every_page_blocks.each { |page_block| page_block.call(page) }

        block.call(page) if block
      end
    end

    private

    # Relative URLs have no scheme and are always visitable.
    def visit_scheme?(url)
      if url.scheme
        return SCHEMES.include?(url.scheme)
      else
        return true
      end
    end

    def visit_host?(url)
      @host_rules.accept?(url.host)
    end

    def visit_port?(url)
      @port_rules.accept?(url.port)
    end

    def visit_link?(url)
      @link_rules.accept?(url.to_s)
    end

    def visit_ext?(url)
      # File.extname includes the leading '.'; strip it before matching.
      @ext_rules.accept?(File.extname(url.path)[1..-1])
    end

    #
    # Fetches the specified _url_ and yields/returns a new Page object.
    # Honors the configured request delay and proxy settings, and sends
    # the configured User-Agent and Referer headers when present.
    #
    def get_page(url,&block)
      host = url.host
      port = url.port

      proxy_host = @proxy[:host]
      proxy_port = @proxy[:port]
      proxy_user = @proxy[:user]
      proxy_password = @proxy[:password]

      # BUGFIX: the :delay option was documented and stored but never
      # applied; pause between requests as advertised.
      sleep(@delay) if @delay > 0

      Net::HTTP::Proxy(proxy_host,proxy_port,proxy_user,proxy_password).start(host,port) do |sess|
        headers = {}

        headers['User-Agent'] = @user_agent if @user_agent
        headers['Referer'] = @referer if @referer

        new_page = Page.new(url,sess.get(url.path,headers))

        block.call(new_page) if block
        return new_page
      end
    end

  end
end
|
data/lib/spidr/page.rb
ADDED
@@ -0,0 +1,159 @@
|
|
1
|
+
require 'uri'
|
2
|
+
require 'hpricot'
|
3
|
+
|
4
|
+
module Spidr
  # Page wraps a URL together with the Net::HTTP response fetched for it,
  # providing content-type predicates and link extraction via Hpricot.
  class Page

    # URL of the page
    attr_reader :url

    # Body returned for the page
    # NOTE(review): this attr_reader is shadowed by the explicit +body+
    # method defined below, which reads @response.body; @body is never set.
    attr_reader :body

    # Headers returned with the body
    # NOTE(review): @headers is never assigned anywhere in this class, so
    # this reader always returns nil — header access goes through
    # method_missing instead. Confirm whether this reader is intentional.
    attr_reader :headers

    #
    # Creates a new Page object from the specified _url_ and HTTP
    # _response_.
    #
    def initialize(url,response)
      @url = url
      @response = response
      # Parsed Hpricot document, built lazily by #doc.
      @doc = nil
    end

    #
    # Returns the content-type of the page.
    #
    def content_type
      @response['Content-Type']
    end

    #
    # Returns +true+ if the page is a HTML document, returns +false+
    # otherwise.
    #
    def html?
      # == 0 requires the media type at the start of the Content-Type value.
      (content_type =~ /text\/html/) == 0
    end

    #
    # Returns +true+ if the page is a XML document, returns +false+
    # otherwise.
    #
    def xml?
      (content_type =~ /text\/xml/) == 0
    end

    #
    # Returns +true+ if the page is a Javascript file, returns +false+
    # otherwise.
    #
    def javascript?
      (content_type =~ /(text|application)\/javascript/) == 0
    end

    #
    # Returns +true+ if the page is a CSS file, returns +false+
    # otherwise.
    #
    def css?
      (content_type =~ /text\/css/) == 0
    end

    #
    # Returns +true+ if the page is a RSS/RDF feed, returns +false+
    # otherwise.
    #
    def rss?
      (content_type =~ /application\/(rss|rdf)\+xml/) == 0
    end

    #
    # Returns +true+ if the page is a Atom feed, returns +false+
    # otherwise.
    #
    def atom?
      (content_type =~ /application\/atom\+xml/) == 0
    end

    #
    # Returns the body of the page in +String+ form.
    #
    def body
      @response.body
    end

    #
    # Returns an Hpricot::Doc if the page represents a HTML document,
    # returns +nil+ otherwise.
    #
    def doc
      if html?
        # Memoize the parsed document; parsing happens at most once.
        return @doc ||= Hpricot(body)
      end
    end

    #
    # Returns all links from the HTML page.
    #
    def links
      if html?
        return doc.search('a[@href]').map do |a|
          a.attributes['href'].strip
        end
      end

      return []
    end

    #
    # Returns all links from the HTML page as absolute URLs.
    #
    def urls
      links.map { |link| to_absolute(link) }
    end

    protected

    #
    # Converts the specified _link_ into an absolute URL
    # based on the url of the page.
    #
    def to_absolute(link)
      # Drop any fragment (#...) before encoding and parsing.
      link = URI.encode(link.to_s.gsub(/#.*$/,''))
      relative = URI(link)

      # A link with a scheme (http://...) is already absolute.
      if relative.scheme.nil?
        new_url = @url.clone

        if relative.path[0..0] == '/'
          # Absolute path: replace the page's path entirely.
          new_url.path = relative.path
        elsif relative.path[-1..-1] == '/'
          # Trailing-slash relative path: resolve against the page's path.
          new_url.path = File.expand_path(File.join(new_url.path,relative.path))
        elsif !(relative.path.empty?)
          # Plain relative path: resolve against the page's directory.
          new_url.path = File.expand_path(File.join(File.dirname(new_url.path),relative.path))
        end

        return new_url
      end

      return relative
    end

    #
    # Provides transparent access to the values in the +headers+ +Hash+.
    # e.g. page.content_length looks up the 'content-length' header.
    #
    def method_missing(sym,*args,&block)
      if (args.empty? && block.nil?)
        # NOTE(review): sub only replaces the FIRST underscore, so headers
        # with multiple hyphens (e.g. x_forwarded_for) will not resolve —
        # confirm whether gsub was intended.
        name = sym.id2name.sub('_','-')

        return @response[name] if @response.has_key?(name)
      end

      return super(sym,*args,&block)
    end

  end
end
|
data/lib/spidr/rules.rb
ADDED
@@ -0,0 +1,61 @@
|
|
1
|
+
module Spidr
  # Rules evaluates a field against a white-list (+accept+) and a
  # black-list (+reject+) of patterns. Patterns may be Procs, Regexps
  # or literal values.
  class Rules

    # Accept rules
    attr_reader :accept

    # Reject rules
    attr_reader :reject

    def initialize(options={})
      @accept = (options[:accept] || [])
      @reject = (options[:reject] || [])
    end

    #
    # Returns +true+ if the _field_ is accepted by the rules,
    # returns +false+ otherwise.
    #
    # When any accept rules are present they act as a white-list: the
    # field must match at least one of them. Otherwise the reject rules
    # act as a black-list: the field passes unless one of them matches.
    #
    def accept?(field)
      if @accept.empty?
        @reject.none? { |pattern| test_field(field,pattern) }
      else
        @accept.any? { |pattern| test_field(field,pattern) }
      end
    end

    #
    # Returns +true+ if the _field_ is rejected by the rules,
    # returns +false+ otherwise.
    #
    def reject?(field)
      !(accept?(field))
    end

    protected

    #
    # Tests the specified _field_ against the specified _rule_. Returns
    # +true+ when the _rule_ matches the specified _field_, returns
    # +false+ otherwise.
    #
    def test_field(field,rule)
      case rule
      when Proc
        # A Proc rule must return exactly true to match.
        rule.call(field) == true
      when Regexp
        !((field.to_s =~ rule).nil?)
      else
        field == rule
      end
    end

  end
end
|
data/lib/spidr/spidr.rb
ADDED
@@ -0,0 +1,48 @@
|
|
1
|
+
require 'spidr/agent'
|
2
|
+
|
3
|
+
module Spidr
  # Common proxy port.
  COMMON_PROXY_PORT = 8080

  #
  # Returns the +Hash+ of the Spidr proxy information
  # (:host, :port, :user and :password keys).
  #
  # IMPROVEMENT: the original used class variables (@@spidr_proxy), which
  # are shared across the inheritance tree and warn under modern Ruby;
  # module-level instance variables behave identically for callers.
  #
  def Spidr.proxy
    @spidr_proxy ||= {:host => nil, :port => COMMON_PROXY_PORT, :user => nil, :password => nil}
  end

  #
  # Returns the Spidr User-Agent, or +nil+ when none has been set.
  #
  def Spidr.user_agent
    @spidr_user_agent ||= nil
  end

  #
  # Sets the Spidr Web User-Agent to the specified _new_agent_.
  #
  def Spidr.user_agent=(new_agent)
    @spidr_user_agent = new_agent
  end

  #
  # See Agent.start_at.
  #
  def Spidr.start_at(url,options={},&block)
    Agent.start_at(url,options,&block)
  end

  #
  # See Agent.host.
  #
  def Spidr.host(name,options={},&block)
    Agent.host(name,options,&block)
  end

  #
  # See Agent.site.
  #
  def Spidr.site(url,options={},&block)
    Agent.site(url,options,&block)
  end
end
|
data/test/test_spidr.rb
ADDED
File without changes
|
metadata
ADDED
@@ -0,0 +1,84 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: spidr
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 0.1.0
|
5
|
+
platform: ruby
|
6
|
+
authors:
|
7
|
+
- Postmodern Modulus III
|
8
|
+
autorequire:
|
9
|
+
bindir: bin
|
10
|
+
cert_chain: []
|
11
|
+
|
12
|
+
date: 2008-05-23 00:00:00 -07:00
|
13
|
+
default_executable:
|
14
|
+
dependencies:
|
15
|
+
- !ruby/object:Gem::Dependency
|
16
|
+
name: hpricot
|
17
|
+
version_requirement:
|
18
|
+
version_requirements: !ruby/object:Gem::Requirement
|
19
|
+
requirements:
|
20
|
+
- - ">="
|
21
|
+
- !ruby/object:Gem::Version
|
22
|
+
version: "0"
|
23
|
+
version:
|
24
|
+
- !ruby/object:Gem::Dependency
|
25
|
+
name: hoe
|
26
|
+
version_requirement:
|
27
|
+
version_requirements: !ruby/object:Gem::Requirement
|
28
|
+
requirements:
|
29
|
+
- - ">="
|
30
|
+
- !ruby/object:Gem::Version
|
31
|
+
version: 1.5.3
|
32
|
+
version:
|
33
|
+
description: Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
|
34
|
+
email:
|
35
|
+
- postmodern.mod3@gmail.com
|
36
|
+
executables: []
|
37
|
+
|
38
|
+
extensions: []
|
39
|
+
|
40
|
+
extra_rdoc_files:
|
41
|
+
- History.txt
|
42
|
+
- Manifest.txt
|
43
|
+
- README.txt
|
44
|
+
files:
|
45
|
+
- History.txt
|
46
|
+
- Manifest.txt
|
47
|
+
- README.txt
|
48
|
+
- Rakefile
|
49
|
+
- lib/spidr.rb
|
50
|
+
- lib/spidr/page.rb
|
51
|
+
- lib/spidr/rules.rb
|
52
|
+
- lib/spidr/agent.rb
|
53
|
+
- lib/spidr/spidr.rb
|
54
|
+
- lib/spidr/version.rb
|
55
|
+
- test/test_spidr.rb
|
56
|
+
has_rdoc: true
|
57
|
+
homepage: http://spidr.rubyforge.org/
|
58
|
+
post_install_message:
|
59
|
+
rdoc_options:
|
60
|
+
- --main
|
61
|
+
- README.txt
|
62
|
+
require_paths:
|
63
|
+
- lib
|
64
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
65
|
+
requirements:
|
66
|
+
- - ">="
|
67
|
+
- !ruby/object:Gem::Version
|
68
|
+
version: "0"
|
69
|
+
version:
|
70
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
71
|
+
requirements:
|
72
|
+
- - ">="
|
73
|
+
- !ruby/object:Gem::Version
|
74
|
+
version: "0"
|
75
|
+
version:
|
76
|
+
requirements: []
|
77
|
+
|
78
|
+
rubyforge_project: spidr
|
79
|
+
rubygems_version: 1.1.1
|
80
|
+
signing_key:
|
81
|
+
specification_version: 2
|
82
|
+
summary: Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely
|
83
|
+
test_files:
|
84
|
+
- test/test_spidr.rb
|