rcrawl 0.2.5 → 0.2.6
- data/README +49 -22
- data/Rakefile +38 -38
- data/lib/rcrawl.rb +7 -6
- data/lib/robot_rules.rb +78 -80
- metadata +3 -3
data/README
CHANGED
@@ -1,22 +1,49 @@
-Rcrawl
-It's limited right now by the fact that it will stay on the original domain provided.
+= Rcrawl web crawler for Ruby
+
+Rcrawl is a web crawler written entirely in Ruby. It's limited right now by the fact that it will stay on the original domain provided.
+
+Rcrawl uses a modular approach to processing the HTML it receives. The link extraction portion of Rcrawl depends on the scrAPI toolkit by Assaf Arkin (http://labnotes.org).
+
+  gem install scrapi
+
+The structure of the crawling process was inspired by the specs of the Mercator crawler (http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf). A (somewhat) brief overview of the design philosophy follows.
+
+== The Rcrawl process
+
+1. Remove an absolute URL from the URL Server.
+2. Download the corresponding document from the internet, grabbing and processing robots.txt first, if available.
+3. Feed the document into a rewind input stream (ris) to be read/re-read as needed.
+4. Based on MIME type, invoke the process method of the processing module associated with that MIME type. For example, a link extractor or tag counter module for text/html MIME types, or a gif stats module for image/gif. By default, all text/html MIME types will pass through the link extractor. Each link will be converted to an absolute URL and tested against an (ideally user-supplied) URL filter to determine if it should be downloaded.
+5. If the URL passes the filter (currently hard coded as "same domain?"), then call the URL-seen? test.
+6. Has the URL been seen before? Namely, is it in the URL Server or has it been downloaded already? If the URL is new, it is added to the URL Server.
+7. Back to step 1; repeat until the URL Server is empty.
+
+== Examples
+
+  # Instantiate a new Rcrawl object
+  crawler = Rcrawl.new(url)
+
+  # Begin the crawl process
+  crawler.crawl
+
+== After the crawler is done crawling
+
+  # Returns an array of visited links
+  crawler.visited_links
+
+  # Returns a hash where the key is a url and the value is
+  # the raw html from that url
+  crawler.dump
+
+  # Returns a hash where the key is a URL and the value is
+  # the error message from stderr
+  crawler.errors
+
+  # Returns an array of external links
+  crawler.external_links
+
+== License
+
+Copyright © 2006 Shawn Hansen, under MIT License
+
+Developed for http://denomi.net
+
+News, code, and documentation at http://blog.denomi.net
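As a rough illustration of the seven-step process described in the README above, here is a minimal, self-contained Ruby sketch of such a crawl loop. It is illustrative only: the names (url_server, seen, domain), the regex-based link extraction, and the start URL are stand-ins rather than rcrawl's internals, and robots.txt handling and error collection are omitted.

  require "uri"
  require "open-uri"

  start_url  = "http://example.com/"
  domain     = URI.parse(start_url).host
  url_server = [start_url]   # the queue of absolute URLs (step 1)
  seen       = {}            # backs the URL-seen? test (step 6)

  until url_server.empty?
    url = url_server.shift                     # 1. remove an absolute URL
    next if seen[url]
    seen[url] = true

    begin
      html = URI.parse(url).read               # 2. download the document
    rescue StandardError
      next                                     # (error collection omitted)
    end

    # 3-4. crude stand-in for the link extractor (rcrawl uses scrAPI here)
    html.scan(/href="([^"]+)"/i).flatten.each do |link|
      begin
        absolute = URI.join(url, link)
      rescue URI::Error
        next
      end
      if absolute.host == domain                               # 5. URL filter: same domain?
        url_server << absolute.to_s unless seen[absolute.to_s] # 6. URL-seen? test
      end
    end
  end                                          # 7. repeat until the URL Server is empty
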
data/Rakefile
CHANGED
@@ -1,39 +1,39 @@
 require 'rubygems'
 Gem::manage_gems
 require 'rake'
 require 'rake/rdoctask'
 require 'rake/gempackagetask'



 desc "Generate documentation"
 Rake::RDocTask.new(:rdoc) do |rdoc|
   rdoc.rdoc_dir = "rdoc"
   rdoc.title = "Crawler"
   rdoc.options << "--line-numbers"
   rdoc.options << "--inline-source"
   rdoc.rdoc_files.include("README")
   rdoc.rdoc_files.include("lib/**/*.rb")
 end

 spec = Gem::Specification.new do |s|
   s.name = "rcrawl"
-  s.version = "0.2.5"
+  s.version = "0.2.6"
   s.author = "Shawn Hansen"
   s.email = "shawn.hansen@gmail.com"
   s.homepage = "http://blog.denomi.net"
   s.platform = Gem::Platform::RUBY
   s.summary = "A web crawler written in ruby"
   s.files = FileList["{test,lib}/**/*", "README", "MIT-LICENSE", "Rakefile", "TODO"].to_a
   s.require_path = "lib"
   s.autorequire = "rcrawl.rb"
   s.has_rdoc = true
   s.extra_rdoc_files = ["README", "MIT-LICENSE", "TODO"]
   s.add_dependency("scrapi", ">=1.2.0")
   s.rubyforge_project = "rcrawl"
 end

 gem = Rake::GemPackageTask.new(spec) do |pkg|
   pkg.need_tar = true
   pkg.need_zip = true
 end
data/lib/rcrawl.rb
CHANGED
@@ -70,14 +70,15 @@ class Rcrawl
 
   # Rewind Input Stream, for storing and reading of raw HTML
   def ris(document)
+    print "."
     # Store raw HTML into local variable
     # Based on MIME type, invoke the proper processing modules
-
-
-
-
-
-
+    case document.content_type
+    when "text/html"
+      link_extractor(document)
+      process_html(document)
+    else
+      print "... not HTML, skipping..."
     end
   end
 
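The new ris method dispatches on document.content_type with a case/when. Below is a standalone sketch of that dispatch pattern showing how another MIME handler could be slotted in; the Document struct, the process method name, and the gif_stats handler are hypothetical stand-ins, not part of rcrawl 0.2.6.

  # Stand-ins so the sketch runs on its own; not rcrawl's real objects.
  Document = Struct.new(:content_type, :body)

  def link_extractor(doc); puts "extracting links";     end
  def process_html(doc);   puts "processing HTML";      end
  def gif_stats(doc);      puts "collecting GIF stats"; end  # hypothetical extra module

  def process(document)
    case document.content_type
    when "text/html"
      link_extractor(document)
      process_html(document)
    when "image/gif"
      gif_stats(document)
    else
      print "... not HTML, skipping..."
    end
  end

  process(Document.new("text/html", "<html></html>"))
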
data/lib/robot_rules.rb
CHANGED
@@ -1,81 +1,79 @@
 #!/usr/bin/env ruby
 
 # robot_rules.rb
 #
 # Created by James Edward Gray II on 2006-01-31.
 # Copyright 2006 Gray Productions. All rights reserved.
 # Included with rcrawl by permission from James Edward Gray II
 
 require "uri"
 
 # Based on Perl's WWW::RobotRules module, by Gisle Aas.
 class RobotRules
   def initialize( user_agent )
     @user_agent = user_agent.scan(/\S+/).first.sub(%r{/.*},
                                                    "").downcase
     @rules = Hash.new { |rules, rule| rules[rule] = Array.new }
   end
 
   def parse( text_uri, robots_data )
     uri = URI.parse(text_uri)
     location = "#{uri.host}:#{uri.port}"
     @rules.delete(location)
 
     rules = robots_data.split(/[\015\012]+/).
                         map { |rule| rule.sub(/\s*#.*$/, "") }
     anon_rules = Array.new
     my_rules = Array.new
     current = anon_rules
     rules.each do |rule|
       case rule
       when /^\s*User-Agent\s*:\s*(.+?)\s*$/i
         break unless my_rules.empty?
 
         current = if $1 == "*"
                     anon_rules
                   elsif $1.downcase.index(@user_agent)
                     my_rules
                   else
                     nil
                   end
       when /^\s*Disallow\s*:\s*(.*?)\s*$/i
         next if current.nil?
 
         if $1.empty?
           current << nil
         else
           disallow = URI.parse($1)
 
-          next unless disallow.scheme.nil? or disallow.scheme ==
-            uri.
-          next unless disallow.
-
-          disallow = disallow.
-          disallow = "
-
-end
+          next unless disallow.scheme.nil? or disallow.scheme == uri.scheme
+          next unless disallow.port.nil? or disallow.port == uri.port
+          next unless disallow.host.nil? or
+                      disallow.host.downcase == uri.host.downcase
+
+          disallow = disallow.path
+          disallow = "/" if disallow.empty?
+          disallow = "/#{disallow}" unless disallow[0] == ?/
+
+          current << disallow
+        end
+      end
+    end
+
+    @rules[location] = if my_rules.empty?
+                         anon_rules.compact
+                       else
+                         my_rules.compact
+                       end
+  end
+
+  def allowed?( text_uri )
+    uri = URI.parse(text_uri)
+    location = "#{uri.host}:#{uri.port}"
+    path = uri.path
+
+    return true unless %w{http https}.include?(uri.scheme)
+
+    not @rules[location].any? { |rule| path.index(rule) == 0 }
+  end
+end
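A short usage sketch for the RobotRules class above, based on the methods it defines (initialize with a user-agent string, parse, allowed?). Fetching robots.txt with open-uri and the example host are assumptions made for this sketch; rcrawl's own fetch-and-parse step is not shown in this diff.

  require "open-uri"
  require "robot_rules"   # lib/robot_rules.rb from the rcrawl gem

  rules = RobotRules.new("Rcrawl/0.2.6")

  robots_url = "http://example.com/robots.txt"
  begin
    rules.parse(robots_url, URI.parse(robots_url).read)
  rescue OpenURI::HTTPError
    # no robots.txt for this host; allowed? will return true for its paths
  end

  rules.allowed?("http://example.com/some/page.html")   # => true or false
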
metadata
CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.0
 specification_version: 1
 name: rcrawl
 version: !ruby/object:Gem::Version
-  version: 0.2.5
-date: 2006-09-
+  version: 0.2.6
+date: 2006-09-21 00:00:00 -05:00
 summary: A web crawler written in ruby
 require_paths:
 - lib

@@ -29,8 +29,8 @@ post_install_message:

 authors:
 - Shawn Hansen
 files:
-- lib/rcrawl.rb
 - lib/robot_rules.rb
+- lib/rcrawl.rb
 - README
 - MIT-LICENSE
 - Rakefile