rcrawl 0.2.5 → 0.2.6

Files changed (5)
  1. data/README +49 -22
  2. data/Rakefile +38 -38
  3. data/lib/rcrawl.rb +7 -6
  4. data/lib/robot_rules.rb +78 -80
  5. metadata +3 -3
data/README CHANGED
@@ -1,22 +1,49 @@
- Rcrawl is intended to be a web crawler written entirely in ruby.
- It's limited right now by the fact that it will stay on the original domain provided.
- I decided to roll my own crawler in ruby after finding only snippets of code on
- various web sites or newsgroups, for crawlers written in ruby.
-
- The structure of the crawling process was inspired by the specs of the Mercator crawler (http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf).
-
- == Examples
- bot = Rcrawl.new(url) # This instantiates a new Rcrawl object
-
- bot.crawl # This will actually crawl the website
-
- == After the bot is done crawling
- bot.visited_links # Returns an array of visited links
-
- bot.dump # Returns a hash where the key is a url and the value is
- # the raw html from that url
-
- bot.errors # Returns a hash where the key is a URL and the value is
- # the error message from stderr
-
- bot.external_links # Returns an array of external links
+ = Rcrawl web crawler for Ruby
+ Rcrawl is a web crawler written entirely in ruby. It's limited right now by the fact that it will stay on the original domain provided.
+
+ Rcrawl uses a modular approach to processing the HTML it receives. The link extraction portion of Rcrawl depends on the scrAPI toolkit by Assaf Arkin (http://labnotes.org).
+ gem install scrapi
+
+ The structure of the crawling process was inspired by the specs of the Mercator crawler (http://www.cindoc.csic.es/cybermetrics/pdf/68.pdf). A (somewhat) brief overview of the design philosophy follows.
+
+ == The Rcrawl process
+ 1. Remove an absolute URL from the URL Server.
+ 2. Download the corresponding document from the internet, grabbing and processing robots.txt first, if available.
+ 3. Feed the document into a rewind input stream (RIS) to be read/re-read as needed.
+ 4. Based on MIME type, invoke the process method of the processing module associated with that MIME type. For example, a link extractor or tag counter module for text/html MIME types, or a gif stats module for image/gif. By default, all text/html MIME types will pass through the link extractor. Each link will be converted to an absolute URL and tested against an (ideally user-supplied) URL filter to determine if it should be downloaded.
+ 5. If the URL passes the filter (currently hard coded as Same Domain?), then call the URL-seen? test.
+ 6. Has the URL been seen before? Namely, is it in the URL Server or has it been downloaded already? If the URL is new, it is added to the URL Server.
+ 7. Back to step 1; repeat until the URL Server is empty.
+
+ == Examples
+ # Instantiate a new Rcrawl object
+ crawler = Rcrawl.new(url)
+
+
+ # Begin the crawl process
+ crawler.crawl
+
+ == After the crawler is done crawling
+ # Returns an array of visited links
+ crawler.visited_links
+
+
+ # Returns a hash where the key is a url and the value is
+ # the raw html from that url
+ crawler.dump
+
+
+ # Returns a hash where the key is a URL and the value is
+ # the error message from stderr
+ crawler.errors
+
+
+ # Returns an array of external links
+ crawler.external_links
+
+ == License
+ Copyright © 2006 Shawn Hansen, under MIT License
+
+ Developed for http://denomi.net
+
+ News, code, and documentation at http://blog.denomi.net
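
The process list in the new README boils down to a small fetch loop. The following is a rough, hypothetical sketch of that loop in plain Ruby, not code from the gem: start_url, the seen hash, and the regex-based extract_links helper are illustrative stand-ins (Rcrawl itself uses the scrAPI-based link extractor and wraps all of this behind Rcrawl.new(url) and crawler.crawl).

    require "net/http"
    require "uri"

    # Illustrative sketch of the Mercator-style loop described in the README.
    # robots.txt handling (step 2) is omitted here; see robot_rules.rb below.
    start_url  = "http://example.com/"      # example seed URL
    url_server = [start_url]                # the URL Server, as a simple queue
    seen       = {}                         # backs the URL-seen? test
    pages      = {}                         # raw HTML keyed by URL, like crawler.dump

    # Crude stand-in for the scrAPI-based link extraction module.
    def extract_links(html)
      html.scan(/href="([^"]+)"/i).flatten
    end

    until url_server.empty?
      url = url_server.shift                           # 1. remove an absolute URL
      next if seen[url]
      seen[url] = true
      html = Net::HTTP.get(URI.parse(url))             # 2. download the document
      pages[url] = html                                # 3. keep it around for re-reading
      extract_links(html).each do |link|               # 4. extract and absolutize links
        absolute = (URI.join(url, link).to_s rescue nil)
        next if absolute.nil?                          # skip malformed or opaque links
        same_domain = URI.parse(absolute).host == URI.parse(start_url).host
        next unless same_domain                        # 5. the same-domain URL filter
        url_server << absolute unless seen[absolute]   # 6. URL-seen? test
      end
    end                                                # 7. repeat until the URL Server is empty

Rcrawl's visited_links, dump, errors, and external_links accessors expose the same bookkeeping this sketch keeps in local variables.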
data/Rakefile CHANGED
@@ -1,39 +1,39 @@
- require 'rubygems'
- Gem::manage_gems
- require 'rake'
- require 'rake/rdoctask'
- require 'rake/gempackagetask'
-
-
-
- desc "Generate documentation"
- Rake::RDocTask.new(:rdoc) do |rdoc|
- rdoc.rdoc_dir = "rdoc"
- rdoc.title = "Crawler"
- rdoc.options << "--line-numbers"
- rdoc.options << "--inline-source"
- rdoc.rdoc_files.include("README")
- rdoc.rdoc_files.include("lib/**/*.rb")
- end
-
- spec = Gem::Specification.new do |s|
- s.name = "rcrawl"
- s.version = "0.2.5"
- s.author = "Shawn Hansen"
- s.email = "shawn.hansen@gmail.com"
- s.homepage = "http://blog.denomi.net"
- s.platform = Gem::Platform::RUBY
- s.summary = "A web crawler written in ruby"
- s.files = FileList["{test,lib}/**/*", "README", "MIT-LICENSE", "Rakefile", "TODO"].to_a
- s.require_path = "lib"
- s.autorequire = "rcrawl.rb"
- s.has_rdoc = true
- s.extra_rdoc_files = ["README", "MIT-LICENSE", "TODO"]
- s.add_dependency("scrapi", ">=1.2.0")
- s.rubyforge_project = "rcrawl"
- end
-
- gem = Rake::GemPackageTask.new(spec) do |pkg|
- pkg.need_tar = true
- pkg.need_zip = true
+ require 'rubygems'
+ Gem::manage_gems
+ require 'rake'
+ require 'rake/rdoctask'
+ require 'rake/gempackagetask'
+
+
+
+ desc "Generate documentation"
+ Rake::RDocTask.new(:rdoc) do |rdoc|
+ rdoc.rdoc_dir = "rdoc"
+ rdoc.title = "Crawler"
+ rdoc.options << "--line-numbers"
+ rdoc.options << "--inline-source"
+ rdoc.rdoc_files.include("README")
+ rdoc.rdoc_files.include("lib/**/*.rb")
+ end
+
+ spec = Gem::Specification.new do |s|
+ s.name = "rcrawl"
+ s.version = "0.2.6"
+ s.author = "Shawn Hansen"
+ s.email = "shawn.hansen@gmail.com"
+ s.homepage = "http://blog.denomi.net"
+ s.platform = Gem::Platform::RUBY
+ s.summary = "A web crawler written in ruby"
+ s.files = FileList["{test,lib}/**/*", "README", "MIT-LICENSE", "Rakefile", "TODO"].to_a
+ s.require_path = "lib"
+ s.autorequire = "rcrawl.rb"
+ s.has_rdoc = true
+ s.extra_rdoc_files = ["README", "MIT-LICENSE", "TODO"]
+ s.add_dependency("scrapi", ">=1.2.0")
+ s.rubyforge_project = "rcrawl"
+ end
+
+ gem = Rake::GemPackageTask.new(spec) do |pkg|
+ pkg.need_tar = true
+ pkg.need_zip = true
  end
data/lib/rcrawl.rb CHANGED
@@ -70,14 +70,15 @@ class Rcrawl
 
  # Rewind Input Stream, for storing and reading of raw HTML
  def ris(document)
+ print "."
  # Store raw HTML into local variable
  # Based on MIME type, invoke the proper processing modules
- if document.content_type == "text/html"
- print "."
- link_extractor(document) # If HTML
- process_html(document) # If HTML
- else
- print "... not HTML, skipping..."
+ case document.content_type
+ when "text/html"
+ link_extractor(document)
+ process_html(document)
+ else
+ print "... not HTML, skipping..."
  end
  end
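
The hunk above replaces the if/else on document.content_type with a case expression, which matches the per-MIME-type dispatch described in the README. As a purely illustrative sketch (process_gif and the image/gif branch are assumptions, not part of rcrawl 0.2.6), another processing module would slot in as an extra when branch:

    # Hypothetical extension of the dispatch inside ris; only the
    # text/html branch exists in the shipped code.
    case document.content_type
    when "text/html"
      link_extractor(document)
      process_html(document)
    when "image/gif"
      process_gif(document)   # e.g. the "gif stats" module the README mentions
    else
      print "... not HTML, skipping..."
    end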
 
data/lib/robot_rules.rb CHANGED
@@ -1,81 +1,79 @@
- #!/usr/bin/env ruby
-
- # robot_rules.rb
- #
- # Created by James Edward Gray II on 2006-01-31.
+ #!/usr/bin/env ruby
+
+ # robot_rules.rb
+ #
+ # Created by James Edward Gray II on 2006-01-31.
  # Copyright 2006 Gray Productions. All rights reserved.
- # Included with rcrawl by permission from James Edward Gray II
-
- require "uri"
-
- # Based on Perl's WWW::RobotRules module, by Gisle Aas.
- class RobotRules
- def initialize( user_agent )
- @user_agent = user_agent.scan(/\S+/).first.sub(%r{/.*},
- "").downcase
- @rules = Hash.new { |rules, rule| rules[rule] = Array.new }
- end
-
- def parse( text_uri, robots_data )
- uri = URI.parse(text_uri)
- location = "#{uri.host}:#{uri.port}"
- @rules.delete(location)
-
- rules = robots_data.split(/[\015\012]+/).
- map { |rule| rule.sub(/\s*#.*$/, "") }
- anon_rules = Array.new
- my_rules = Array.new
- current = anon_rules
- rules.each do |rule|
- case rule
- when /^\s*User-Agent\s*:\s*(.+?)\s*$/i
- break unless my_rules.empty?
-
- current = if $1 == "*"
- anon_rules
- elsif $1.downcase.index(@user_agent)
- my_rules
- else
- nil
- end
- when /^\s*Disallow\s*:\s*(.*?)\s*$/i
- next if current.nil?
-
- if $1.empty?
- current << nil
- else
- disallow = URI.parse($1)
-
- next unless disallow.scheme.nil? or disallow.scheme ==
- uri.scheme
- next unless disallow.port.nil? or disallow.port == uri.port
- next unless disallow.host.nil? or
- disallow.host.downcase == uri.host.downcase
-
- disallow = disallow.path
- disallow = "/" if disallow.empty?
- disallow = "/#{disallow}" unless disallow[0] == ?/
-
- current << disallow
- end
- end
- end
-
- @rules[location] = if my_rules.empty?
- anon_rules.compact
- else
- my_rules.compact
- end
- end
-
- def allowed?( text_uri )
- uri = URI.parse(text_uri)
- location = "#{uri.host}:#{uri.port}"
- path = uri.path
-
- return true unless %w{http https}.include?(uri.scheme)
-
- not @rules[location].any? { |rule| path.index(rule) == 0 }
- end
- end
-
+ # Included with rcrawl by permission from James Edward Gray II
+
+ require "uri"
+
+ # Based on Perl's WWW::RobotRules module, by Gisle Aas.
+ class RobotRules
+ def initialize( user_agent )
+ @user_agent = user_agent.scan(/\S+/).first.sub(%r{/.*},
+ "").downcase
+ @rules = Hash.new { |rules, rule| rules[rule] = Array.new }
+ end
+
+ def parse( text_uri, robots_data )
+ uri = URI.parse(text_uri)
+ location = "#{uri.host}:#{uri.port}"
+ @rules.delete(location)
+
+ rules = robots_data.split(/[\015\012]+/).
+ map { |rule| rule.sub(/\s*#.*$/, "") }
+ anon_rules = Array.new
+ my_rules = Array.new
+ current = anon_rules
+ rules.each do |rule|
+ case rule
+ when /^\s*User-Agent\s*:\s*(.+?)\s*$/i
+ break unless my_rules.empty?
+
+ current = if $1 == "*"
+ anon_rules
+ elsif $1.downcase.index(@user_agent)
+ my_rules
+ else
+ nil
+ end
+ when /^\s*Disallow\s*:\s*(.*?)\s*$/i
+ next if current.nil?
+
+ if $1.empty?
+ current << nil
+ else
+ disallow = URI.parse($1)
+
+ next unless disallow.scheme.nil? or disallow.scheme == uri.scheme
+ next unless disallow.port.nil? or disallow.port == uri.port
+ next unless disallow.host.nil? or
+ disallow.host.downcase == uri.host.downcase
+
+ disallow = disallow.path
+ disallow = "/" if disallow.empty?
+ disallow = "/#{disallow}" unless disallow[0] == ?/
+
+ current << disallow
+ end
+ end
+ end
+
+ @rules[location] = if my_rules.empty?
+ anon_rules.compact
+ else
+ my_rules.compact
+ end
+ end
+
+ def allowed?( text_uri )
+ uri = URI.parse(text_uri)
+ location = "#{uri.host}:#{uri.port}"
+ path = uri.path
+
+ return true unless %w{http https}.include?(uri.scheme)
+
+ not @rules[location].any? { |rule| path.index(rule) == 0 }
+ end
+ end
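
Since RobotRules is self-contained, it can be exercised outside the crawler. A minimal usage sketch follows; the user-agent string, URLs, and the robots.txt fetch are made-up examples, and only the RobotRules methods shown above (new, parse, allowed?) come from this file:

    require "net/http"
    require "uri"
    require "robot_rules"          # lib/robot_rules.rb from this gem

    rules = RobotRules.new("Rcrawl/0.2.6")

    # Fetch and parse a site's robots.txt before crawling it.
    robots_url  = "http://example.com/robots.txt"
    robots_data = Net::HTTP.get(URI.parse(robots_url))
    rules.parse(robots_url, robots_data)

    # Check each candidate URL before downloading it.
    url = "http://example.com/private/page.html"
    puts "robots.txt disallows #{url}" unless rules.allowed?(url)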
metadata CHANGED
@@ -3,8 +3,8 @@ rubygems_version: 0.9.0
  specification_version: 1
  name: rcrawl
  version: !ruby/object:Gem::Version
- version: 0.2.5
- date: 2006-09-20 00:00:00 -05:00
+ version: 0.2.6
+ date: 2006-09-21 00:00:00 -05:00
  summary: A web crawler written in ruby
  require_paths:
  - lib
@@ -29,8 +29,8 @@ post_install_message:
  authors:
  - Shawn Hansen
  files:
- - lib/rcrawl.rb
  - lib/robot_rules.rb
+ - lib/rcrawl.rb
  - README
  - MIT-LICENSE
  - Rakefile