rcrawl 0.4.6 → 0.4.7

Files changed (5)
  1. data/README +3 -0
  2. data/Rakefile +1 -1
  3. data/TODO +7 -1
  4. data/lib/rcrawl/crawler.rb +5 -4
  5. metadata +2 -2
data/README CHANGED
@@ -43,6 +43,9 @@ The structure of the crawling process was inspired by the specs of the Mercator
   # Returns an array of external links
   crawler.external_links
 
+  # Set user agent
+  crawler.user_agent = "Your fancy crawler name here"
+
 == License
 Copyright © 2006 Digital Duckies, LLC, under MIT License
 
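For context, here is roughly how the new setter slots into the README's existing example. This is a minimal sketch: the crawl entry point and the example URL are assumptions, while Rcrawl::Crawler.new, user_agent=, and external_links come from the diffs on this page.

  require 'rubygems'
  require 'rcrawl'

  crawler = Rcrawl::Crawler.new("http://digitalduckies.net")

  # Override the default user agent (set in Crawler#initialize as of
  # this release) before crawling.
  crawler.user_agent = "Your fancy crawler name here"

  crawler.crawl                # assumed method that walks the site
  puts crawler.external_links  # array of off-site links found while crawling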
data/Rakefile CHANGED
@@ -18,7 +18,7 @@ end
 
 spec = Gem::Specification.new do |s|
   s.name = "rcrawl"
-  s.version = "0.4.6"
+  s.version = "0.4.7"
   s.author = "Digital Duckies"
   s.email = "rcrawl@digitalduckies.net"
   s.homepage = "http://digitalduckies.net"
data/TODO CHANGED
@@ -1 +1,7 @@
-Lots! TODO will be updated soon.
+Add timeout code
+Add max connections and max connections/second code
+Add referer code
+Add proxy code? Is this high up on anyone's list, or can it be put off for now?
+Logging code
+Store page headers and page metadata
+
data/lib/rcrawl/crawler.rb CHANGED
@@ -1,8 +1,8 @@
 module Rcrawl
 
   class Crawler
-
-    attr_accessor :links_to_visit, :site
+
+    attr_accessor :links_to_visit, :site, :user_agent
     attr_reader :visited_links, :external_links, :raw_html, :rules, :sites,
                 :errors
     # Initializes various variables when a new Crawler object is instantiated
@@ -12,7 +12,8 @@ module Rcrawl
       @visited_links = Array.new
       @external_links = Array.new
       @raw_html = Hash.new
-      @rules = RobotRules.new("rcrawl/#{VERSION}")
+      @rules = RobotRules.new('Rcrawl')
+      @user_agent = "Rcrawl/#{VERSION} (http://rubyforge.org/projects/rcrawl/)"
       @sites = Hash.new
       @errors = Hash.new
       @site = URI.parse(site) || raise("You didn't give me a site to crawl")
@@ -64,7 +65,7 @@ module Rcrawl
       # if not, parse robots.txt then grab document.
       uri = URI.parse(url)
       print "Visiting: #{url}"
-      @document = uri.read
+      @document = uri.read("User-Agent" => @user_agent, "Referer" => url)
       @visited_links << url
     end
 
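The changed fetch line relies on open-uri, where URI#read accepts a hash whose string keys are sent as HTTP request headers. A standalone sketch of the same pattern, with a placeholder URL; note the crawler passes the visited URL itself as the Referer, a first cut at the "Add referer code" item in the TODO.

  require 'open-uri'

  url        = "http://digitalduckies.net/"
  user_agent = "Rcrawl/0.4.7 (http://rubyforge.org/projects/rcrawl/)"

  # String keys in the options hash become request headers, so this
  # GET carries both User-Agent and Referer.
  document = URI.parse(url).read("User-Agent" => user_agent,
                                 "Referer"    => url)
  puts document.size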
metadata CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.0
 specification_version: 1
 name: rcrawl
 version: !ruby/object:Gem::Version
-  version: 0.4.6
-date: 2006-09-26 00:00:00 -05:00
+  version: 0.4.7
+date: 2006-09-27 00:00:00 -05:00
 summary: A web crawler written in ruby
 require_paths:
 - lib