spidr 0.2.7 → 0.3.0

data/.rspec ADDED
@@ -0,0 +1 @@
1
+ --colour --format documentation
data/ChangeLog.md CHANGED
@@ -1,19 +1,44 @@
1
+ ### 0.3.0 / 2011-04-14
2
+
3
+ * Switched from Jeweler to [Ore](http://github.com/ruby-ore/ore).
4
+ * Split all header related methods out of {Spidr::Page} and into
5
+ {Spidr::Headers}.
6
+ * Split all body related methods out of {Spidr::Page} and into
7
+ {Spidr::Body}.
8
+ * Split all link related methods out of {Spidr::Page} and into
9
+ {Spidr::Links}.
10
+ * Added {Spidr::Headers#directory?}.
11
+ * Added {Spidr::Headers#json?}.
12
+ * Added {Spidr::Links#each_url}.
13
+ * Added {Spidr::Links#each_link}.
14
+ * Added {Spidr::Links#each_redirect}.
15
+ * Added {Spidr::Links#each_meta_redirect}.
16
+ * Aliased {Spidr::Headers#raw_cookie} to {Spidr::Headers#cookie}.
17
+ * Aliased {Spidr::Body#to_s} to {Spidr::Body#body}.
18
+ * Also check for `application/xml` in {Spidr::Headers#xml?}.
19
+ * Catch all exceptions when merging URIs in {Spidr::Links#to_absolute}.
20
+ * Always prepend a `/` to all FTP URI paths. Fixes a Ruby 1.8 specific
21
+ bug, where it expects an absolute path for all FTP URIs.
22
+ * Refactored {URI.expand_path}.
23
+ * Start the session in {Spidr::SessionCache#[]} to prevent multiple
24
+ `CONNECT` commands being sent to HTTP Proxies (thanks falaise).
25
+
1
26
  ### 0.2.7 / 2010-08-17
2
27
 
3
28
  * Added {Spidr::CookieJar#cookies_for_host} (thanks zapnap).
4
- * Renamed `Spidr::Page#cookie` to {Spidr::Page#raw_cookie}.
29
+ * Renamed `Spidr::Page#cookie` to `Spidr::Page#raw_cookie`.
5
30
  * Rescue `URI::InvalidComponentError` exceptions in
6
- {Spidr::Page#to_absolute} (thanks zapnap).
31
+ `Spidr::Page#to_absolute` (thanks zapnap).
7
32
 
8
33
  ### 0.2.6 / 2010-07-05
9
34
 
10
- * Fixed a bug in {Spidr::Page#meta_redirect}, by calling
35
+ * Fixed a bug in `Spidr::Page#meta_redirect`, by calling
11
36
  `Nokogiri::XML::Element#get_attribute` instead of `attr`.
12
37
 
13
38
  ### 0.2.5 / 2010-07-02
14
39
 
15
- * Added {Spidr::Page#meta_redirect}.
16
- * Added {Spidr::Page#meta_redirect?}.
40
+ * Added `Spidr::Page#meta_redirect`.
41
+ * Added `Spidr::Page#meta_redirect?`.
17
42
  * Manage development dependencies with Bundler.
18
43
  * Support following "old-school" meta-refresh redirects (thanks zapnap).
19
44
  * Allow {Spidr::CookieJar} inherit cookies set by a parent domain.
@@ -26,10 +51,10 @@
26
51
  * Added {Spidr::Filters#visit_urls_like}.
27
52
  * Added {Spidr::Filters#ignore_urls}.
28
53
  * Added {Spidr::Filters#ignore_urls_like}.
29
- * Added {Spidr::Page#is_content_type?}.
30
- * Default {Spidr::Page#body} to an empty String.
31
- * Default {Spidr::Page#content_type} to an empty String.
32
- * Default {Spidr::Page#content_types} to an empty Array.
54
+ * Added `Spidr::Page#is_content_type?`.
55
+ * Default `Spidr::Page#body` to an empty String.
56
+ * Default `Spidr::Page#content_type` to an empty String.
57
+ * Default `Spidr::Page#content_types` to an empty Array.
33
58
  * Improved reliability of {Spidr::Page#is_redirect?}.
34
59
  * Improved content type detection in {Spidr::Page} to handle `Content-Type`
35
60
  headers containing charsets (thanks Josh Lindsey).
@@ -47,10 +72,10 @@
47
72
  * Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
48
73
  * Integrated the new WSOC into the specs.
49
74
  * Removed the built-in Web Spider Obstacle Course.
50
- * Added {Spidr::Page#content_types}.
51
- * Added {Spidr::Page#cookie}.
52
- * Added {Spidr::Page#cookies}.
53
- * Added {Spidr::Page#cookie_params}.
75
+ * Added `Spidr::Page#content_types`.
76
+ * Added `Spidr::Page#cookie`.
77
+ * Added `Spidr::Page#cookies`.
78
+ * Added `Spidr::Page#cookie_params`.
54
79
  * Added {Spidr::Sanitizers}.
55
80
  * Added {Spidr::SessionCache}.
56
81
  * Added {Spidr::CookieJar} (thanks Nick Plante).
@@ -93,33 +118,33 @@
93
118
  ### 0.2.0 / 2009-10-10
94
119
 
95
120
  * Added {URI.expand_path}.
96
- * Added {Spidr::Page#search}.
97
- * Added {Spidr::Page#at}.
98
- * Added {Spidr::Page#title}.
121
+ * Added `Spidr::Page#search`.
122
+ * Added `Spidr::Page#at`.
123
+ * Added `Spidr::Page#title`.
99
124
  * Added {Spidr::Agent#failures=}.
100
125
  * Added a HTTP session cache to {Spidr::Agent}, per suggestion of falter.
101
126
  * Added `Spidr::Agent#get_session`.
102
127
  * Added `Spidr::Agent#kill_session`.
103
128
  * Added {Spidr.proxy=}.
104
129
  * Added {Spidr.disable_proxy!}.
105
- * Aliased `Spidr::Page#txt?` to {Spidr::Page#plain_text?}.
106
- * Aliased `Spidr::Page#ok?` to {Spidr::Page#is_ok?}.
107
- * Aliased `Spidr::Page#redirect?` to {Spidr::Page#is_redirect?}.
108
- * Aliased `Spidr::Page#unauthorized?` to {Spidr::Page#is_unauthorized?}.
109
- * Aliased `Spidr::Page#forbidden?` to {Spidr::Page#is_forbidden?}.
110
- * Aliased `Spidr::Page#missing?` to {Spidr::Page#is_missing?}.
130
+ * Aliased `Spidr::Page#txt?` to `Spidr::Page#plain_text?`.
131
+ * Aliased `Spidr::Page#ok?` to `Spidr::Page#is_ok?`.
132
+ * Aliased `Spidr::Page#redirect?` to `Spidr::Page#is_redirect?`.
133
+ * Aliased `Spidr::Page#unauthorized?` to `Spidr::Page#is_unauthorized?`.
134
+ * Aliased `Spidr::Page#forbidden?` to `Spidr::Page#is_forbidden?`.
135
+ * Aliased `Spidr::Page#missing?` to `Spidr::Page#is_missing?`.
111
136
  * Split URL filtering code out of {Spidr::Agent} and into
112
137
  {Spidr::Filters}.
113
138
  * Split URL / Page event code out of {Spidr::Agent} and into
114
139
  {Spidr::Events}.
115
140
  * Split pause! / continue! / skip_link! / skip_page! methods out of
116
141
  {Spidr::Agent} and into {Spidr::Actions}.
117
- * Fixed a bug in {Spidr::Page#code}, where it was not returning an Integer.
118
- * Make sure {Spidr::Page#doc} returns `Nokogiri::XML::Document` objects for
142
+ * Fixed a bug in `Spidr::Page#code`, where it was not returning an Integer.
143
+ * Make sure `Spidr::Page#doc` returns `Nokogiri::XML::Document` objects for
119
144
  RSS/RDF/Atom pages as well.
120
- * Fixed the handling of the Location header in {Spidr::Page#links}
145
+ * Fixed the handling of the Location header in `Spidr::Page#links`
121
146
  (thanks falter).
122
- * Fixed a bug in {Spidr::Page#to_absolute} where trailing `/` characters on
147
+ * Fixed a bug in `Spidr::Page#to_absolute` where trailing `/` characters on
123
148
  URI paths were not being preserved (thanks falter).
124
149
  * Fixed a bug where the URI query was not being sent with the request
125
150
  in {Spidr::Agent#get_page} (thanks Damian Steer).
@@ -169,7 +194,7 @@
169
194
 
170
195
  * Added `Spidr::Agent#all_headers`.
171
196
  * Fixed a bug where {Spidr::Page#headers} was always `nil`.
172
- * {Spidr::Spidr::Agent} will now follow the Location header in HTTP 300,
197
+ * {Spidr::Agent} will now follow the Location header in HTTP 300,
173
198
  301, 302, 303 and 307 Redirects.
174
199
  * {Spidr::Agent} will now follow iframe and frame tags.
175
200
 
@@ -189,8 +214,8 @@
189
214
 
190
215
  ### 0.1.5 / 2009-03-22
191
216
 
192
- * Catch malformed URIs in {Spidr::Page#to_absolute} and return `nil`.
193
- * Filter out `nil` URIs in {Spidr::Page#urls}.
217
+ * Catch malformed URIs in `Spidr::Page#to_absolute` and return `nil`.
218
+ * Filter out `nil` URIs in `Spidr::Page#urls`.
194
219
 
195
220
  ### 0.1.4 / 2009-01-15
196
221
 
@@ -204,9 +229,9 @@
204
229
 
205
230
  ### 0.1.2 / 2008-11-06
206
231
 
207
- * Fixed a bug in {Spidr::Page#to_absolute} where URLs with no path were not
232
+ * Fixed a bug in `Spidr::Page#to_absolute` where URLs with no path were not
208
233
  receiving a default path of `/`.
209
- * Fixed a bug in {Spidr::Page#to_absolute} where URL paths were not being
234
+ * Fixed a bug in `Spidr::Page#to_absolute` where URL paths were not being
210
235
  expanded, in order to remove `..` and `.` directories.
211
236
  * Fixed a bug where absolute URLs could have a blank path, thus causing
212
237
  {Spidr::Agent#get_page} to crash when it performed the HTTP request.
data/Gemfile CHANGED
@@ -1,27 +1,13 @@
1
1
  source 'https://rubygems.org'
2
2
 
3
- group(:runtime) do
4
- gem 'nokogiri', '>= 1.3.0'
5
- end
3
+ gemspec
6
4
 
7
- group(:development) do
8
- gem 'rake', '~> 0.8.7'
9
- gem 'jeweler', '~> 1.4.0', :git => 'git://github.com/technicalpickles/jeweler.git'
10
- end
5
+ group :development do
6
+ gem 'rake', '~> 0.8.7'
11
7
 
12
- group(:doc) do
13
- case RUBY_PLATFORM
14
- when 'java'
15
- gem 'maruku', '~> 0.6.0'
16
- else
17
- gem 'rdiscount', '~> 1.6.3'
18
- end
8
+ gem 'ore-tasks', '~> 0.4'
9
+ gem 'rspec', '~> 2.4'
10
+ gem 'wsoc', '~> 0.1.3'
19
11
 
20
- gem 'yard', '~> 0.5.3'
12
+ gem 'kramdown', '~> 0.12'
21
13
  end
22
-
23
- group(:test) do
24
- gem 'wsoc', '~> 0.1.3'
25
- end
26
-
27
- gem 'rspec', '~> 1.3.0', :group => [:development, :test]
data/LICENSE.txt CHANGED
@@ -1,5 +1,4 @@
1
-
2
- Copyright (c) 2008-2010 Hal Brodigan
1
+ Copyright (c) 2008-2011 Hal Brodigan
3
2
 
4
3
  Permission is hereby granted, free of charge, to any person obtaining
5
4
  a copy of this software and associated documentation files (the
data/README.md CHANGED
@@ -1,9 +1,9 @@
1
1
  # Spidr
2
2
 
3
- * [spidr.rubyforge.org](http://spidr.rubyforge.org/)
4
- * [github.com/postmodern/spidr](http://github.com/postmodern/spidr)
5
- * [github.com/postmodern/spidr/issues](http://github.com/postmodern/spidr/issues)
6
- * [groups.google.com/group/spidr](http://groups.google.com/group/spidr)
3
+ * [Homepage](http://spidr.rubyforge.org/)
4
+ * [Source](http://github.com/postmodern/spidr)
5
+ * [Issues](http://github.com/postmodern/spidr/issues)
6
+ * [Mailing List](http://groups.google.com/group/spidr)
7
7
  * irc.freenode.net #spidr
8
8
 
9
9
  ## Description
@@ -177,7 +177,7 @@ Skip the processing of links:
177
177
 
178
178
  ## Requirements
179
179
 
180
- * [nokogiri](http://nokogiri.rubyforge.org/) >= 1.3.0
180
+ * [nokogiri](http://nokogiri.rubyforge.org/) ~> 1.3
181
181
 
182
182
  ## Install
183
183
 
@@ -185,5 +185,6 @@ Skip the processing of links:
185
185
 
186
186
  ## License
187
187
 
188
- See {file:LICENSE.txt} for license information.
188
+ Copyright (c) 2008-2011 Hal Brodigan
189
189
 
190
+ See {file:LICENSE.txt} for license information.
data/Rakefile CHANGED
@@ -1,8 +1,15 @@
1
1
  require 'rubygems'
2
- require 'bundler'
3
2
 
4
3
  begin
5
- Bundler.setup(:development, :doc)
4
+ require 'bundler'
5
+ rescue LoadError => e
6
+ STDERR.puts e.message
7
+ STDERR.puts "Run `gem install bundler` to install Bundler."
8
+ exit e.status_code
9
+ end
10
+
11
+ begin
12
+ Bundler.setup(:development)
6
13
  rescue Bundler::BundlerError => e
7
14
  STDERR.puts e.message
8
15
  STDERR.puts "Run `bundle install` to install missing gems"
@@ -10,29 +17,12 @@ rescue Bundler::BundlerError => e
10
17
  end
11
18
 
12
19
  require 'rake'
13
- require 'jeweler'
14
- require './lib/spidr/version.rb'
15
-
16
- Jeweler::Tasks.new do |gem|
17
- gem.name = 'spidr'
18
- gem.version = Spidr::VERSION
19
- gem.license = 'MIT'
20
- gem.summary = %Q{A versatile Ruby web spidering library}
21
- gem.description = %Q{Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.}
22
- gem.email = 'postmodern.mod3@gmail.com'
23
- gem.homepage = 'http://github.com/postmodern/spidr'
24
- gem.authors = ['Postmodern']
25
- gem.has_rdoc = 'yard'
26
- end
27
- Jeweler::GemcutterTasks.new
28
20
 
29
- require 'spec/rake/spectask'
30
- Spec::Rake::SpecTask.new(:spec) do |spec|
31
- spec.libs += ['lib', 'spec']
32
- spec.spec_files = FileList['spec/**/*_spec.rb']
33
- spec.spec_opts = ['--options', '.specopts']
34
- end
21
+ require 'ore/tasks'
22
+ Ore::Tasks.new
35
23
 
24
+ require 'rspec/core/rake_task'
25
+ RSpec::Core::RakeTask.new
36
26
  task :default => :spec
37
27
 
38
28
  require 'yard'
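
The new Rakefile wraps `require 'bundler'` in a `rescue LoadError` block. A minimal sketch of that guarded-require pattern follows. The helper `require_or_status` is hypothetical and not part of spidr; it returns a status code instead of calling `exit`, so it can be exercised without ending the process. One caveat for readers of the hunk above: plain `LoadError` does not define `status_code` (that method comes from `Bundler::BundlerError`), so the sketch uses a fixed non-zero status.

```ruby
# Hypothetical helper illustrating the guarded-require pattern.
# Returns 0 when the library loads and 1 when it is missing.
def require_or_status(library, hint)
  require library
  0
rescue LoadError => e
  # Report the failure and a suggested fix, as the Rakefile does.
  STDERR.puts e.message
  STDERR.puts hint
  1
end
```

In a Rakefile, the returned value could be passed to `exit` when it is non-zero.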
data/gemspec.yml ADDED
@@ -0,0 +1,19 @@
1
+ name: spidr
2
+ summary: A versatile Ruby web spidering library
3
+ description:
4
+ Spidr is a versatile Ruby web spidering library that can spider a site,
5
+ multiple domains, certain links or infinitely. Spidr is designed to be
6
+ fast and easy to use.
7
+
8
+ license: MIT
9
+ authors: Postmodern
10
+ email: postmodern.mod3@gmail.com
11
+ homepage: http://github.com/postmodern/spidr
12
+ has_yard: true
13
+
14
+ dependencies:
15
+ nokogiri: ~> 1.3
16
+
17
+ development_dependencies:
18
+ bundler: ~> 1.0.0
19
+ yard: ~> 0.6.0
@@ -4,7 +4,7 @@ require 'spidr/actions/exceptions/skip_page'
4
4
 
5
5
  module Spidr
6
6
  #
7
- # The {Actions} module adds methods to {Agent} for controling the
7
+ # The {Actions} module adds methods to {Agent} for controlling the
8
8
  # spidering of links.
9
9
  #
10
10
  module Actions
data/lib/spidr/agent.rb CHANGED
@@ -48,6 +48,12 @@ module Spidr
48
48
 
49
49
  # Cached cookies
50
50
  attr_reader :cookies
51
+
52
+ # Maximum depth
53
+ attr_reader :max_depth
54
+
55
+ # The visited URLs and their depth within a site
56
+ attr_reader :levels
51
57
 
52
58
  #
53
59
  # Creates a new Agent object.
@@ -91,6 +97,9 @@ module Spidr
91
97
  # @option options [Set, Array] :history
92
98
  # The initial list of visited URLs.
93
99
  #
100
+ # @option options [Integer] :max_depth
101
+ # The maximum link depth to follow.
102
+ #
94
103
  # @yield [agent]
95
104
  # If a block is given, it will be passed the newly created agent
96
105
  # for further configuration.
@@ -119,6 +128,9 @@ module Spidr
119
128
  @failures = Set[]
120
129
  @queue = []
121
130
 
131
+ @levels = Hash.new(0)
132
+ @max_depth = options[:max_depth]
133
+
122
134
  super(options)
123
135
 
124
136
  yield self if block_given?
@@ -450,7 +462,7 @@ module Spidr
450
462
  # @return [Boolean]
451
463
  # Specifies whether the URL was enqueued, or ignored.
452
464
  #
453
- def enqueue(url)
465
+ def enqueue(url,level=0)
454
466
  url = sanitize_url(url)
455
467
 
456
468
  if (!(queued?(url)) && visit?(url))
@@ -477,14 +489,15 @@ module Spidr
477
489
  return false
478
490
  rescue Actions::Action
479
491
  end
480
-
492
+
481
493
  @queue << url
494
+ @levels[url] = level
482
495
  return true
483
496
  end
484
497
 
485
498
  return false
486
499
  end
487
-
500
+
488
501
  #
489
502
  # Requests and creates a new Page object from a given URL.
490
503
  #
@@ -568,7 +581,7 @@ module Spidr
568
581
  # for the page failed, or the page was skipped.
569
582
  #
570
583
  def visit_page(url)
571
- url = URI(url.to_s) unless url.kind_of?(URI)
584
+ url = sanitize_url(url)
572
585
 
573
586
  get_page(url) do |page|
574
587
  @history << page.url
@@ -584,7 +597,7 @@ module Spidr
584
597
  rescue Actions::Action
585
598
  end
586
599
 
587
- page.urls.each do |next_url|
600
+ page.each_url do |next_url|
588
601
  begin
589
602
  @every_link_blocks.each do |link_block|
590
603
  link_block.call(page.url,next_url)
@@ -596,7 +609,9 @@ module Spidr
596
609
  rescue Actions::Action
597
610
  end
598
611
 
599
- enqueue(next_url)
612
+ if (@max_depth.nil? || @max_depth > @levels[url])
613
+ enqueue(next_url,@levels[url] + 1)
614
+ end
600
615
  end
601
616
  end
602
617
  end
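
The depth-tracking change in `lib/spidr/agent.rb` can be illustrated in isolation. The sketch below is not Spidr's code: `DepthLimitedQueue` and the `LINKS` graph are hypothetical stand-ins that mimic the `@levels` / `@max_depth` bookkeeping from the hunks above, with an in-memory link table in place of HTTP requests.

```ruby
require 'set'

# Hypothetical in-memory link graph standing in for real pages.
LINKS = {
  '/'      => ['/a', '/b'],
  '/a'     => ['/a/1'],
  '/b'     => [],
  '/a/1'   => ['/a/1/x'],
  '/a/1/x' => []
}

# Hypothetical class mirroring the bookkeeping shown in the diff.
class DepthLimitedQueue
  attr_reader :levels, :history

  def initialize(max_depth = nil)
    @max_depth = max_depth
    @levels    = Hash.new(0) # URL => depth, defaulting to 0
    @queue     = []
    @history   = Set[]
  end

  # Queues a URL once and records the depth it was discovered at.
  def enqueue(url, level = 0)
    return false if @history.include?(url) || @queue.include?(url)

    @queue << url
    @levels[url] = level
    true
  end

  # Visits URLs breadth-first, following links only while the
  # current page is shallower than max_depth (nil means no limit).
  def run(start)
    enqueue(start)

    until @queue.empty?
      url = @queue.shift
      @history << url

      LINKS.fetch(url, []).each do |next_url|
        if @max_depth.nil? || @max_depth > @levels[url]
          enqueue(next_url, @levels[url] + 1)
        end
      end
    end

    @history.to_a
  end
end
```

With this comparison, a `max_depth` of 0 visits only the starting URL, and a `max_depth` of 1 also visits the pages it links to directly.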
@@ -24,7 +24,7 @@ module Spidr
24
24
  # Given a URL, return the most specific matching auth credential.
25
25
  #
26
26
  # @param [URI] url
27
- # A fully qualified url includig optional path.
27
+ # A fully qualified url including optional path.
28
28
  #
29
29
  # @return [AuthCredential, nil]
30
30
  # Closest matching {AuthCredential} values for the URL,
data/lib/spidr/body.rb ADDED
@@ -0,0 +1,99 @@
1
+ require 'nokogiri'
2
+
3
+ module Spidr
4
+ module Body
5
+ #
6
+ # The body of the response.
7
+ #
8
+ # @return [String]
9
+ # The body of the response.
10
+ #
11
+ def body
12
+ (response.body || '')
13
+ end
14
+
15
+ #
16
+ # Returns a parsed document object for HTML, XML, RSS and Atom pages.
17
+ #
18
+ # @return [Nokogiri::HTML::Document, Nokogiri::XML::Document, nil]
19
+ # The document that represents HTML or XML pages.
20
+ # Returns `nil` if the page is neither HTML, XML, RSS, Atom or if
21
+ # the page could not be parsed properly.
22
+ #
23
+ # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html
24
+ # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/HTML/Document.html
25
+ #
26
+ def doc
27
+ return nil if body.empty?
28
+
29
+ begin
30
+ if html?
31
+ return @doc ||= Nokogiri::HTML(body)
32
+ elsif (xml? || xsl? || rss? || atom?)
33
+ return @doc ||= Nokogiri::XML(body)
34
+ end
35
+ rescue
36
+ return nil
37
+ end
38
+ end
39
+
40
+ #
41
+ # Searches the document for XPath or CSS Path paths.
42
+ #
43
+ # @param [Array<String>] paths
44
+ # CSS or XPath expressions to search the document with.
45
+ #
46
+ # @return [Array]
47
+ # The matched nodes from the document.
48
+ # Returns an empty Array if no nodes were matched, or if the page
49
+ # is not an HTML or XML document.
50
+ #
51
+ # @example
52
+ # page.search('//a[@href]')
53
+ #
54
+ # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Node.html#M000239
55
+ #
56
+ def search(*paths)
57
+ if doc
58
+ doc.search(*paths)
59
+ else
60
+ []
61
+ end
62
+ end
63
+
64
+ #
65
+ # Searches for the first occurrence an XPath or CSS Path expression.
66
+ #
67
+ # @return [Nokogiri::HTML::Node, Nokogiri::XML::Node, nil]
68
+ # The first matched node. Returns `nil` if no nodes could be matched,
69
+ # or if the page is not a HTML or XML document.
70
+ #
71
+ # @example
72
+ # page.at('//title')
73
+ #
74
+ # @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Node.html#M000251
75
+ #
76
+ def at(*arguments)
77
+ if doc
78
+ doc.at(*arguments)
79
+ end
80
+ end
81
+
82
+ alias / search
83
+ alias % at
84
+
85
+ #
86
+ # The title of the HTML page.
87
+ #
88
+ # @return [String]
89
+ # The inner-text of the title element of the page.
90
+ #
91
+ def title
92
+ if (node = at('//title'))
93
+ node.inner_text
94
+ end
95
+ end
96
+
97
+ alias to_s body
98
+ end
99
+ end
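
`Spidr::Body` is written as a mixin: its methods call `response` (and predicates such as `html?`), which the including class has to supply. The sketch below shows only that dependency and the `to_s` alias. `FakeResponse`, `BodySketch` and `FakePage` are hypothetical names, and the Nokogiri-backed methods are left out so the example runs without extra gems.

```ruby
# Hypothetical stand-in for an HTTP response; only #body is needed.
FakeResponse = Struct.new(:body)

# Trimmed version of the mixin pattern: the module relies on a
# #response method provided by whatever class includes it.
module BodySketch
  # Falls back to an empty String when the response has no body.
  def body
    (response.body || '')
  end

  alias to_s body
end

# Hypothetical host class that supplies #response.
class FakePage
  include BodySketch

  attr_reader :response

  def initialize(response)
    @response = response
  end
end
```

Because `to_s` is aliased to `body`, interpolating a page into a String yields the response body.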