rcrawl 0.4.6 → 0.4.7

Files changed (5)
  1. data/README +3 -0
  2. data/Rakefile +1 -1
  3. data/TODO +7 -1
  4. data/lib/rcrawl/crawler.rb +5 -4
  5. metadata +2 -2
data/README CHANGED
@@ -43,6 +43,9 @@ The structure of the crawling process was inspired by the specs of the Mercator
   # Returns an array of external links
   crawler.external_links
 
+  # Set user agent
+  crawler.user_agent = "Your fancy crawler name here"
+
 == License
 Copyright © 2006 Digital Duckies, LLC, under MIT License
 
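For context, here is roughly how the new setter slots into the README's existing example. This is a minimal sketch: the crawl entry point and the example URL are assumptions, while Rcrawl::Crawler.new, user_agent=, and external_links come from the diffs on this page.

  require 'rubygems'
  require 'rcrawl'

  crawler = Rcrawl::Crawler.new("http://digitalduckies.net")

  # Override the default user agent (set in Crawler#initialize as of
  # this release) before crawling.
  crawler.user_agent = "Your fancy crawler name here"

  crawler.crawl                # assumed method that walks the site
  puts crawler.external_links  # array of off-site links found while crawling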
data/Rakefile CHANGED
@@ -18,7 +18,7 @@ end
 
 spec = Gem::Specification.new do |s|
   s.name = "rcrawl"
-  s.version = "0.4.6"
+  s.version = "0.4.7"
   s.author = "Digital Duckies"
   s.email = "rcrawl@digitalduckies.net"
   s.homepage = "http://digitalduckies.net"
data/TODO CHANGED
@@ -1 +1,7 @@
-Lots! TODO will be updated soon.
+Add timeout code
+Add max connections and max connections/second code
+Add referer code
+Add proxy code? Is this high up on anyone's list, or can it be put off for now?
+Logging code
+Store page headers and page metadata
+
data/lib/rcrawl/crawler.rb CHANGED
@@ -1,8 +1,8 @@
 module Rcrawl
 
   class Crawler
-
-    attr_accessor :links_to_visit, :site
+
+    attr_accessor :links_to_visit, :site, :user_agent
     attr_reader :visited_links, :external_links, :raw_html, :rules, :sites,
                 :errors
     # Initializes various variables when a new Crawler object is instantiated
@@ -12,7 +12,8 @@ module Rcrawl
       @visited_links = Array.new
       @external_links = Array.new
       @raw_html = Hash.new
-      @rules = RobotRules.new("rcrawl/#{VERSION}")
+      @rules = RobotRules.new('Rcrawl')
+      @user_agent = "Rcrawl/#{VERSION} (http://rubyforge.org/projects/rcrawl/)"
       @sites = Hash.new
       @errors = Hash.new
       @site = URI.parse(site) || raise("You didn't give me a site to crawl")
@@ -64,7 +65,7 @@ module Rcrawl
       # if not, parse robots.txt then grab document.
       uri = URI.parse(url)
       print "Visiting: #{url}"
-      @document = uri.read
+      @document = uri.read("User-Agent" => @user_agent, "Referer" => url)
       @visited_links << url
     end
 
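The changed fetch line relies on open-uri, where URI#read accepts a hash whose string keys are sent as HTTP request headers. A standalone sketch of the same pattern, with a placeholder URL; note the crawler passes the visited URL itself as the Referer, a first cut at the "Add referer code" item in the TODO.

  require 'open-uri'

  url        = "http://digitalduckies.net/"
  user_agent = "Rcrawl/0.4.7 (http://rubyforge.org/projects/rcrawl/)"

  # String keys in the options hash become request headers, so this
  # GET carries both User-Agent and Referer.
  document = URI.parse(url).read("User-Agent" => user_agent,
                                 "Referer"    => url)
  puts document.size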
metadata CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.0
 specification_version: 1
 name: rcrawl
 version: !ruby/object:Gem::Version
-  version: 0.4.6
-date: 2006-09-26 00:00:00 -05:00
+  version: 0.4.7
+date: 2006-09-27 00:00:00 -05:00
 summary: A web crawler written in ruby
 require_paths:
 - lib