spidr 0.2.7 → 0.3.0

data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --colour --format documentation
data/ChangeLog.md CHANGED
@@ -1,19 +1,44 @@
1
+ ### 0.3.0 / 2011-04-14
2
+
3
+ * Switched from Jeweler to [Ore](http://github.com/ruby-ore/ore).
4
+ * Split all header related methods out of {Spidr::Page} and into
5
+ {Spidr::Headers}.
6
+ * Split all body related methods out of {Spidr::Page} and into
7
+ {Spidr::Body}.
8
+ * Split all link related methods out of {Spidr::Page} and into
9
+ {Spidr::Links}.
10
+ * Added {Spidr::Headers#directory?}.
11
+ * Added {Spidr::Headers#json?}.
12
+ * Added {Spidr::Links#each_url}.
13
+ * Added {Spidr::Links#each_link}.
14
+ * Added {Spidr::Links#each_redirect}.
15
+ * Added {Spidr::Links#each_meta_redirect}.
16
+ * Aliased {Spidr::Headers#raw_cookie} to {Spidr::Headers#cookie}.
17
+ * Aliased {Spidr::Body#to_s} to {Spidr::Body#body}.
18
+ * Also check for `application/xml` in {Spidr::Headers#xml?}.
19
+ * Catch all exceptions when merging URIs in {Spidr::Links#to_absolute}.
20
+ * Always prepend a `/` to all FTP URI paths. Fixes a Ruby 1.8 specific
21
+ bug, where it expects an absolute path for all FTP URIs.
22
+ * Refactored {URI.expand_path}.
23
+ * Start the session in {Spidr::SessionCache#[]} to prevent multiple
24
+ `CONNECT` commands being sent to HTTP Proxies (thanks falaise).
25
+
1
26
  ### 0.2.7 / 2010-08-17
2
27
 
3
28
  * Added {Spidr::CookieJar#cookies_for_host} (thanks zapnap).
4
- * Renamed `Spidr::Page#cookie` to {Spidr::Page#raw_cookie}.
29
+ * Renamed `Spidr::Page#cookie` to `Spidr::Page#raw_cookie`.
5
30
  * Rescue `URI::InvalidComponentError` exceptions in
6
- {Spidr::Page#to_absolute} (thanks zapnap).
31
+ `Spidr::Page#to_absolute` (thanks zapnap).
7
32
 
8
33
  ### 0.2.6 / 2010-07-05
9
34
 
10
- * Fixed a bug in {Spidr::Page#meta_redirect}, by calling
35
+ * Fixed a bug in `Spidr::Page#meta_redirect`, by calling
11
36
  `Nokogiri::XML::Element#get_attribute` instead of `attr`.
12
37
 
13
38
  ### 0.2.5 / 2010-07-02
14
39
 
15
- * Added {Spidr::Page#meta_redirect}.
16
- * Added {Spidr::Page#meta_redirect?}.
40
+ * Added `Spidr::Page#meta_redirect`.
41
+ * Added `Spidr::Page#meta_redirect?`.
17
42
  * Manage development dependencies with Bundler.
18
43
  * Support following "old-school" meta-refresh redirects (thanks zapnap).
19
44
  * Allow {Spidr::CookieJar} to inherit cookies set by a parent domain.
@@ -26,10 +51,10 @@
26
51
  * Added {Spidr::Filters#visit_urls_like}.
27
52
  * Added {Spidr::Filters#ignore_urls}.
28
53
  * Added {Spidr::Filters#ignore_urls_like}.
29
- * Added {Spidr::Page#is_content_type?}.
30
- * Default {Spidr::Page#body} to an empty String.
31
- * Default {Spidr::Page#content_type} to an empty String.
32
- * Default {Spidr::Page#content_types} to an empty Array.
54
+ * Added `Spidr::Page#is_content_type?`.
55
+ * Default `Spidr::Page#body` to an empty String.
56
+ * Default `Spidr::Page#content_type` to an empty String.
57
+ * Default `Spidr::Page#content_types` to an empty Array.
33
58
  * Improved reliability of {Spidr::Page#is_redirect?}.
34
59
  * Improved content type detection in {Spidr::Page} to handle `Content-Type`
35
60
  headers containing charsets (thanks Josh Lindsey).
@@ -47,10 +72,10 @@
47
72
  * Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
48
73
  * Integrated the new WSOC into the specs.
49
74
  * Removed the built-in Web Spider Obstacle Course.
50
- * Added {Spidr::Page#content_types}.
51
- * Added {Spidr::Page#cookie}.
52
- * Added {Spidr::Page#cookies}.
53
- * Added {Spidr::Page#cookie_params}.
75
+ * Added `Spidr::Page#content_types`.
76
+ * Added `Spidr::Page#cookie`.
77
+ * Added `Spidr::Page#cookies`.
78
+ * Added `Spidr::Page#cookie_params`.
54
79
  * Added {Spidr::Sanitizers}.
55
80
  * Added {Spidr::SessionCache}.
56
81
  * Added {Spidr::CookieJar} (thanks Nick Plante).
@@ -93,33 +118,33 @@
93
118
  ### 0.2.0 / 2009-10-10
94
119
 
95
120
  * Added {URI.expand_path}.
96
- * Added {Spidr::Page#search}.
97
- * Added {Spidr::Page#at}.
98
- * Added {Spidr::Page#title}.
121
+ * Added `Spidr::Page#search`.
122
+ * Added `Spidr::Page#at`.
123
+ * Added `Spidr::Page#title`.
99
124
  * Added {Spidr::Agent#failures=}.
100
125
  * Added a HTTP session cache to {Spidr::Agent}, per suggestion of falter.
101
126
  * Added `Spidr::Agent#get_session`.
102
127
  * Added `Spidr::Agent#kill_session`.
103
128
  * Added {Spidr.proxy=}.
104
129
  * Added {Spidr.disable_proxy!}.
105
- * Aliased `Spidr::Page#txt?` to {Spidr::Page#plain_text?}.
106
- * Aliased `Spidr::Page#ok?` to {Spidr::Page#is_ok?}.
107
- * Aliased `Spidr::Page#redirect?` to {Spidr::Page#is_redirect?}.
108
- * Aliased `Spidr::Page#unauthorized?` to {Spidr::Page#is_unauthorized?}.
109
- * Aliased `Spidr::Page#forbidden?` to {Spidr::Page#is_forbidden?}.
110
- * Aliased `Spidr::Page#missing?` to {Spidr::Page#is_missing?}.
130
+ * Aliased `Spidr::Page#txt?` to `Spidr::Page#plain_text?`.
131
+ * Aliased `Spidr::Page#ok?` to `Spidr::Page#is_ok?`.
132
+ * Aliased `Spidr::Page#redirect?` to `Spidr::Page#is_redirect?`.
133
+ * Aliased `Spidr::Page#unauthorized?` to `Spidr::Page#is_unauthorized?`.
134
+ * Aliased `Spidr::Page#forbidden?` to `Spidr::Page#is_forbidden?`.
135
+ * Aliased `Spidr::Page#missing?` to `Spidr::Page#is_missing?`.
111
136
  * Split URL filtering code out of {Spidr::Agent} and into
112
137
  {Spidr::Filters}.
113
138
  * Split URL / Page event code out of {Spidr::Agent} and into
114
139
  {Spidr::Events}.
115
140
  * Split pause! / continue! / skip_link! / skip_page! methods out of
116
141
  {Spidr::Agent} and into {Spidr::Actions}.
117
- * Fixed a bug in {Spidr::Page#code}, where it was not returning an Integer.
118
- * Make sure {Spidr::Page#doc} returns `Nokogiri::XML::Document` objects for
142
+ * Fixed a bug in `Spidr::Page#code`, where it was not returning an Integer.
143
+ * Make sure `Spidr::Page#doc` returns `Nokogiri::XML::Document` objects for
119
144
  RSS/RDF/Atom pages as well.
120
- * Fixed the handling of the Location header in {Spidr::Page#links}
145
+ * Fixed the handling of the Location header in `Spidr::Page#links`
121
146
  (thanks falter).
122
- * Fixed a bug in {Spidr::Page#to_absolute} where trailing `/` characters on
147
+ * Fixed a bug in `Spidr::Page#to_absolute` where trailing `/` characters on
123
148
  URI paths were not being preserved (thanks falter).
124
149
  * Fixed a bug where the URI query was not being sent with the request
125
150
  in {Spidr::Agent#get_page} (thanks Damian Steer).
@@ -169,7 +194,7 @@
169
194
 
170
195
  * Added `Spidr::Agent#all_headers`.
171
196
  * Fixed a bug where {Spidr::Page#headers} was always `nil`.
172
- * {Spidr::Spidr::Agent} will now follow the Location header in HTTP 300,
197
+ * {Spidr::Agent} will now follow the Location header in HTTP 300,
173
198
  301, 302, 303 and 307 Redirects.
174
199
  * {Spidr::Agent} will now follow iframe and frame tags.
175
200
 
@@ -189,8 +214,8 @@
189
214
 
190
215
  ### 0.1.5 / 2009-03-22
191
216
 
192
- * Catch malformed URIs in {Spidr::Page#to_absolute} and return `nil`.
193
- * Filter out `nil` URIs in {Spidr::Page#urls}.
217
+ * Catch malformed URIs in `Spidr::Page#to_absolute` and return `nil`.
218
+ * Filter out `nil` URIs in `Spidr::Page#urls`.
194
219
 
195
220
  ### 0.1.4 / 2009-01-15
196
221
 
@@ -204,9 +229,9 @@
204
229
 
205
230
  ### 0.1.2 / 2008-11-06
206
231
 
207
- * Fixed a bug in {Spidr::Page#to_absolute} where URLs with no path were not
232
+ * Fixed a bug in `Spidr::Page#to_absolute` where URLs with no path were not
208
233
  receiving a default path of `/`.
209
- * Fixed a bug in {Spidr::Page#to_absolute} where URL paths were not being
234
+ * Fixed a bug in `Spidr::Page#to_absolute` where URL paths were not being
210
235
  expanded, in order to remove `..` and `.` directories.
211
236
  * Fixed a bug where absolute URLs could have a blank path, thus causing
212
237
  {Spidr::Agent#get_page} to crash when it performed the HTTP request.
data/Gemfile CHANGED
@@ -1,27 +1,13 @@
1
1
  source 'https://rubygems.org'
2
2
 
3
- group(:runtime) do
4
- gem 'nokogiri', '>= 1.3.0'
5
- end
3
+ gemspec
6
4
 
7
- group(:development) do
8
- gem 'rake', '~> 0.8.7'
9
- gem 'jeweler', '~> 1.4.0', :git => 'git://github.com/technicalpickles/jeweler.git'
10
- end
5
+ group :development do
6
+ gem 'rake', '~> 0.8.7'
11
7
 
12
- group(:doc) do
13
- case RUBY_PLATFORM
14
- when 'java'
15
- gem 'maruku', '~> 0.6.0'
16
- else
17
- gem 'rdiscount', '~> 1.6.3'
18
- end
8
+ gem 'ore-tasks', '~> 0.4'
9
+ gem 'rspec', '~> 2.4'
10
+ gem 'wsoc', '~> 0.1.3'
19
11
 
20
- gem 'yard', '~> 0.5.3'
12
+ gem 'kramdown', '~> 0.12'
21
13
  end
22
-
23
- group(:test) do
24
- gem 'wsoc', '~> 0.1.3'
25
- end
26
-
27
- gem 'rspec', '~> 1.3.0', :group => [:development, :test]
data/LICENSE.txt CHANGED
@@ -1,5 +1,4 @@
1
-
2
- Copyright (c) 2008-2010 Hal Brodigan
1
+ Copyright (c) 2008-2011 Hal Brodigan
3
2
 
4
3
  Permission is hereby granted, free of charge, to any person obtaining
5
4
  a copy of this software and associated documentation files (the
data/README.md CHANGED
@@ -1,9 +1,9 @@
1
1
  # Spidr
2
2
 
3
- * [spidr.rubyforge.org](http://spidr.rubyforge.org/)
4
- * [github.com/postmodern/spidr](http://github.com/postmodern/spidr)
5
- * [github.com/postmodern/spidr/issues](http://github.com/postmodern/spidr/issues)
6
- * [groups.google.com/group/spidr](http://groups.google.com/group/spidr)
3
+ * [Homepage](http://spidr.rubyforge.org/)
4
+ * [Source](http://github.com/postmodern/spidr)
5
+ * [Issues](http://github.com/postmodern/spidr/issues)
6
+ * [Mailing List](http://groups.google.com/group/spidr)
7
7
  * irc.freenode.net #spidr
8
8
 
9
9
  ## Description
@@ -177,7 +177,7 @@ Skip the processing of links:
177
177
 
178
178
  ## Requirements
179
179
 
180
- * [nokogiri](http://nokogiri.rubyforge.org/) >= 1.3.0
180
+ * [nokogiri](http://nokogiri.rubyforge.org/) ~> 1.3
181
181
 
182
182
  ## Install
183
183
 
@@ -185,5 +185,6 @@ Skip the processing of links:
185
185
 
186
186
  ## License
187
187
 
188
- See {file:LICENSE.txt} for license information.
188
+ Copyright (c) 2008-2011 Hal Brodigan
189
189
 
190
+ See {file:LICENSE.txt} for license information.
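
The requirement change above, from `>= 1.3.0` to `~> 1.3`, narrows the accepted nokogiri versions: the pessimistic operator excludes the next major release. A minimal sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes; the constants `OLD_REQ` and `NEW_REQ` and the `accepts?` helper are illustrative names, not part of spidr:

```ruby
require 'rubygems'

# The requirement strings before and after this release.
OLD_REQ = Gem::Requirement.new('>= 1.3.0')
NEW_REQ = Gem::Requirement.new('~> 1.3')

# Hypothetical helper: does +req+ accept the given version string?
def accepts?(req, version)
  req.satisfied_by?(Gem::Version.new(version))
end

# Both forms accept any 1.x release from 1.3 onward.
puts accepts?(OLD_REQ, '1.4.4')
puts accepts?(NEW_REQ, '1.4.4')

# Only the old, open-ended form accepts a 2.x release.
puts accepts?(OLD_REQ, '2.0.0')
puts accepts?(NEW_REQ, '2.0.0')
```

In other words, `~> 1.3` behaves like `>= 1.3` combined with `< 2.0`.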
data/Rakefile CHANGED
@@ -1,8 +1,15 @@
1
1
  require 'rubygems'
2
- require 'bundler'
3
2
 
4
3
  begin
5
- Bundler.setup(:development, :doc)
4
+ require 'bundler'
5
+ rescue LoadError => e
6
+ STDERR.puts e.message
7
+ STDERR.puts "Run `gem install bundler` to install Bundler."
8
+ exit 1
9
+ end
10
+
11
+ begin
12
+ Bundler.setup(:development)
6
13
  rescue Bundler::BundlerError => e
7
14
  STDERR.puts e.message
8
15
  STDERR.puts "Run `bundle install` to install missing gems"
@@ -10,29 +17,12 @@ rescue Bundler::BundlerError => e
10
17
  end
11
18
 
12
19
  require 'rake'
13
- require 'jeweler'
14
- require './lib/spidr/version.rb'
15
-
16
- Jeweler::Tasks.new do |gem|
17
- gem.name = 'spidr'
18
- gem.version = Spidr::VERSION
19
- gem.license = 'MIT'
20
- gem.summary = %Q{A versatile Ruby web spidering library}
21
- gem.description = %Q{Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.}
22
- gem.email = 'postmodern.mod3@gmail.com'
23
- gem.homepage = 'http://github.com/postmodern/spidr'
24
- gem.authors = ['Postmodern']
25
- gem.has_rdoc = 'yard'
26
- end
27
- Jeweler::GemcutterTasks.new
28
20
 
29
- require 'spec/rake/spectask'
30
- Spec::Rake::SpecTask.new(:spec) do |spec|
31
- spec.libs += ['lib', 'spec']
32
- spec.spec_files = FileList['spec/**/*_spec.rb']
33
- spec.spec_opts = ['--options', '.specopts']
34
- end
21
+ require 'ore/tasks'
22
+ Ore::Tasks.new
35
23
 
24
+ require 'rspec/core/rake_task'
25
+ RSpec::Core::RakeTask.new
36
26
  task :default => :spec
37
27
 
38
28
  require 'yard'
data/gemspec.yml ADDED
@@ -0,0 +1,19 @@
1
+ name: spidr
2
+ summary: A versatile Ruby web spidering library
3
+ description:
4
+ Spidr is a versatile Ruby web spidering library that can spider a site,
5
+ multiple domains, certain links or infinitely. Spidr is designed to be
6
+ fast and easy to use.
7
+
8
+ license: MIT
9
+ authors: Postmodern
10
+ email: postmodern.mod3@gmail.com
11
+ homepage: http://github.com/postmodern/spidr
12
+ has_yard: true
13
+
14
+ dependencies:
15
+ nokogiri: ~> 1.3
16
+
17
+ development_dependencies:
18
+ bundler: ~> 1.0.0
19
+ yard: ~> 0.6.0
@@ -4,7 +4,7 @@ require 'spidr/actions/exceptions/skip_page'
4
4
 
5
5
  module Spidr
6
6
  #
7
- # The {Actions} module adds methods to {Agent} for controling the
7
+ # The {Actions} module adds methods to {Agent} for controlling the
8
8
  # spidering of links.
9
9
  #
10
10
  module Actions
data/lib/spidr/agent.rb CHANGED
@@ -48,6 +48,12 @@ module Spidr
48
48
 
49
49
  # Cached cookies
50
50
  attr_reader :cookies
51
+
52
+ # Maximum depth
53
+ attr_reader :max_depth
54
+
55
+ # The visited URLs and their depth within a site
56
+ attr_reader :levels
51
57
 
52
58
  #
53
59
  # Creates a new Agent object.
@@ -91,6 +97,9 @@ module Spidr
91
97
  # @option options [Set, Array] :history
92
98
  # The initial list of visited URLs.
93
99
  #
100
+ # @option options [Integer] :max_depth
101
+ # The maximum link depth to follow.
102
+ #
94
103
  # @yield [agent]
95
104
  # If a block is given, it will be passed the newly created agent
96
105
  # for further configuration.
@@ -119,6 +128,9 @@ module Spidr
119
128
  @failures = Set[]
120
129
  @queue = []
121
130
 
131
+ @levels = Hash.new(0)
132
+ @max_depth = options[:max_depth]
133
+
122
134
  super(options)
123
135
 
124
136
  yield self if block_given?
@@ -450,7 +462,7 @@ module Spidr
450
462
  # @return [Boolean]
451
463
  # Specifies whether the URL was enqueued, or ignored.
452
464
  #
453
- def enqueue(url)
465
+ def enqueue(url,level=0)
454
466
  url = sanitize_url(url)
455
467
 
456
468
  if (!(queued?(url)) && visit?(url))
@@ -477,14 +489,15 @@ module Spidr
477
489
  return false
478
490
  rescue Actions::Action
479
491
  end
480
-
492
+
481
493
  @queue << url
494
+ @levels[url] = level
482
495
  return true
483
496
  end
484
497
 
485
498
  return false
486
499
  end
487
-
500
+
488
501
  #
489
502
  # Requests and creates a new Page object from a given URL.
490
503
  #
@@ -568,7 +581,7 @@ module Spidr
568
581
  # for the page failed, or the page was skipped.
569
582
  #
570
583
  def visit_page(url)
571
- url = URI(url.to_s) unless url.kind_of?(URI)
584
+ url = sanitize_url(url)
572
585
 
573
586
  get_page(url) do |page|
574
587
  @history << page.url
@@ -584,7 +597,7 @@ module Spidr
584
597
  rescue Actions::Action
585
598
  end
586
599
 
587
- page.urls.each do |next_url|
600
+ page.each_url do |next_url|
588
601
  begin
589
602
  @every_link_blocks.each do |link_block|
590
603
  link_block.call(page.url,next_url)
@@ -596,7 +609,9 @@ module Spidr
596
609
  rescue Actions::Action
597
610
  end
598
611
 
599
- enqueue(next_url)
612
+ if (@max_depth.nil? || @max_depth > @levels[url])
613
+ enqueue(next_url,@levels[url] + 1)
614
+ end
600
615
  end
601
616
  end
602
617
  end
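
The `:max_depth` changes above amount to a crawl that records the depth at which each URL was enqueued and stops enqueuing links once the limit is reached. A minimal sketch of that idea, assuming a hypothetical in-memory link graph (`LINKS`) and a standalone `crawl` helper standing in for `Spidr::Agent`; no HTTP requests are made:

```ruby
require 'set'

# Hypothetical site graph: each URL maps to the URLs it links to.
LINKS = {
  '/'      => ['/a', '/b'],
  '/a'     => ['/a/1'],
  '/b'     => [],
  '/a/1'   => ['/a/1/x'],
  '/a/1/x' => []
}

# Visits URLs starting from +start+, following links at most +max_depth+
# levels deep (nil means unlimited). Returns the visited URLs in order.
def crawl(start, max_depth = nil)
  levels  = Hash.new(0) # URL => depth at which it was enqueued
  queue   = [start]
  history = Set[]
  visited = []

  until queue.empty?
    url = queue.shift
    next if history.include?(url)

    history << url
    visited << url

    LINKS.fetch(url, []).each do |next_url|
      next if history.include?(next_url) || queue.include?(next_url)

      # Same guard as in the diff: only enqueue while below the limit.
      if max_depth.nil? || max_depth > levels[url]
        queue << next_url
        levels[next_url] = levels[url] + 1
      end
    end
  end

  visited
end
```

Because `levels` defaults to `0`, the starting URL counts as depth zero, so a `max_depth` of `0` visits only the start page and a `max_depth` of `1` also visits the pages it links to directly.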
@@ -24,7 +24,7 @@ module Spidr
24
24
  # Given a URL, return the most specific matching auth credential.
25
25
  #
26
26
  # @param [URI] url
27
- # A fully qualified url includig optional path.
27
+ # A fully qualified url including optional path.
28
28
  #
29
29
  # @return [AuthCredential, nil]
30
30
  # Closest matching {AuthCredential} values for the URL,
data/lib/spidr/body.rb ADDED
@@ -0,0 +1,99 @@
1
+ require 'nokogiri'
2
+
3
+ module Spidr
4
+ module Body
5
+ #
6
+ # The body of the response.
7
+ #
8
+ # @return [String]
9
+ # The body of the response.
10
+ #
11
+ def body
12
+ (response.body || '')
13
+ end
14
+
15
+ #
16
+ # Returns a parsed document object for HTML, XML, RSS and Atom pages.
17
+ #
18
+ # @return [Nokogiri::HTML::Document, Nokogiri::XML::Document, nil]
19
+ # The document that represents HTML or XML pages.
20
+ # Returns `nil` if the page is not HTML, XML, RSS or Atom, or if
21
+ # the page could not be parsed properly.
22
+ #
23
+ # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html
24
+ # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/HTML/Document.html
25
+ #
26
+ def doc
27
+ return nil if body.empty?
28
+
29
+ begin
30
+ if html?
31
+ return @doc ||= Nokogiri::HTML(body)
32
+ elsif (xml? || xsl? || rss? || atom?)
33
+ return @doc ||= Nokogiri::XML(body)
34
+ end
35
+ rescue
36
+ return nil
37
+ end
38
+ end
39
+
40
+ #
41
+ # Searches the document for XPath or CSS Path paths.
42
+ #
43
+ # @param [Array<String>] paths
44
+ # CSS or XPath expressions to search the document with.
45
+ #
46
+ # @return [Array]
47
+ # The matched nodes from the document.
48
+ # Returns an empty Array if no nodes were matched, or if the page
49
+ # is not an HTML or XML document.
50
+ #
51
+ # @example
52
+ # page.search('//a[@href]')
53
+ #
54
+ # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Node.html#M000239
55
+ #
56
+ def search(*paths)
57
+ if doc
58
+ doc.search(*paths)
59
+ else
60
+ []
61
+ end
62
+ end
63
+
64
+ #
65
+ # Searches for the first occurrence of an XPath or CSS Path expression.
66
+ #
67
+ # @return [Nokogiri::HTML::Node, Nokogiri::XML::Node, nil]
68
+ # The first matched node. Returns `nil` if no nodes could be matched,
69
+ # or if the page is not an HTML or XML document.
70
+ #
71
+ # @example
72
+ # page.at('//title')
73
+ #
74
+ # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Node.html#M000251
75
+ #
76
+ def at(*arguments)
77
+ if doc
78
+ doc.at(*arguments)
79
+ end
80
+ end
81
+
82
+ alias / search
83
+ alias % at
84
+
85
+ #
86
+ # The title of the HTML page.
87
+ #
88
+ # @return [String]
89
+ # The inner-text of the title element of the page.
90
+ #
91
+ def title
92
+ if (node = at('//title'))
93
+ node.inner_text
94
+ end
95
+ end
96
+
97
+ alias to_s body
98
+ end
99
+ end
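
The `body` default and the `to_s` alias in the new module can be exercised without Nokogiri or a live response. A minimal sketch, assuming hypothetical stand-ins (`FakeResponse`, `FakePage`, `BodySketch`) in place of the real response object, `Spidr::Page` and `Spidr::Body`:

```ruby
# Hypothetical stand-in for an HTTP response; only #body is needed here.
FakeResponse = Struct.new(:body)

module BodySketch
  # Mirrors the diff: a missing response body becomes an empty String.
  def body
    (response.body || '')
  end

  # As in the diff, #to_s is an alias of #body.
  alias to_s body
end

# Hypothetical page class that mixes the module in, as Spidr::Page
# is described as doing with Spidr::Body.
class FakePage
  include BodySketch

  attr_reader :response

  def initialize(response)
    @response = response
  end
end

empty = FakePage.new(FakeResponse.new(nil))
html  = FakePage.new(FakeResponse.new('<html></html>'))

puts empty.body.empty?
puts html.to_s
```

Defaulting to an empty String means callers such as `doc` can test `body.empty?` without a separate `nil` check.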