spidr 0.2.7 → 0.3.0
- data/.rspec +1 -0
- data/ChangeLog.md +56 -31
- data/Gemfile +7 -21
- data/LICENSE.txt +1 -2
- data/README.md +7 -6
- data/Rakefile +13 -23
- data/gemspec.yml +19 -0
- data/lib/spidr/actions/actions.rb +1 -1
- data/lib/spidr/agent.rb +21 -6
- data/lib/spidr/auth_store.rb +1 -1
- data/lib/spidr/body.rb +99 -0
- data/lib/spidr/extensions/uri.rb +14 -7
- data/lib/spidr/headers.rb +323 -0
- data/lib/spidr/links.rb +229 -0
- data/lib/spidr/page.rb +32 -536
- data/lib/spidr/sanitizers.rb +3 -3
- data/lib/spidr/session_cache.rb +1 -0
- data/lib/spidr/version.rb +1 -1
- data/spec/actions_spec.rb +6 -8
- data/spec/auth_store_spec.rb +28 -28
- data/spec/cookie_jar_spec.rb +49 -60
- data/spec/extensions/uri_spec.rb +4 -0
- data/spec/filters_spec.rb +8 -0
- data/spec/page_spec.rb +0 -7
- data/spec/rules_spec.rb +8 -6
- data/spec/sanitizers_spec.rb +10 -16
- data/spec/spec_helper.rb +1 -12
- data/spec/spidr_spec.rb +11 -11
- data/spidr.gemspec +11 -110
- metadata +24 -52
- data/.gitignore +0 -9
- data/.specopts +0 -1
- data/Gemfile.lock +0 -39
data/.rspec
ADDED
@@ -0,0 +1 @@
|
|
1
|
+
--colour --format documentation
|
data/ChangeLog.md
CHANGED
@@ -1,19 +1,44 @@
|
|
1
|
+
### 0.3.0 / 2011-04-14
|
2
|
+
|
3
|
+
* Switched from Jeweler to [Ore](http://github.com/ruby-ore/ore).
|
4
|
+
* Split all header related methods out of {Spidr::Page} and into
|
5
|
+
{Spidr::Headers}.
|
6
|
+
* Split all body related methods out of {Spidr::Page} and into
|
7
|
+
{Spidr::Body}.
|
8
|
+
* Split all link related methods out of {Spidr::Page} and into
|
9
|
+
{Spidr::Links}.
|
10
|
+
* Added {Spidr::Headers#directory?}.
|
11
|
+
* Added {Spidr::Headers#json?}.
|
12
|
+
* Added {Spidr::Links#each_url}.
|
13
|
+
* Added {Spidr::Links#each_link}.
|
14
|
+
* Added {Spidr::Links#each_redirect}.
|
15
|
+
* Added {Spidr::Links#each_meta_redirect}.
|
16
|
+
* Aliased {Spidr::Headers#raw_cookie} to {Spidr::Headers#cookie}.
|
17
|
+
* Aliased {Spidr::Body#to_s} to {Spidr::Body#body}.
|
18
|
+
* Also check for `application/xml` in {Spidr::Headers#xml?}.
|
19
|
+
* Catch all exceptions when merging URIs in {Spidr::Links#to_absolute}.
|
20
|
+
* Always prepend a `/` to all FTP URI paths. Fixes a Ruby 1.8 specific
|
21
|
+
bug, where it expects an absolute path for all FTP URIs.
|
22
|
+
* Refactored {URI.expand_path}.
|
23
|
+
* Start the session in {Spidr::SessionCache#[]} to prevent multiple
|
24
|
+
`CONNECT` commands being sent to HTTP Proxies (thanks falaise).
|
25
|
+
|
1
26
|
### 0.2.7 / 2010-08-17
|
2
27
|
|
3
28
|
* Added {Spidr::CookieJar#cookies_for_host} (thanks zapnap).
|
4
|
-
* Renamed `Spidr::Page#cookie` to
|
29
|
+
* Renamed `Spidr::Page#cookie` to `Spidr::Page#raw_cookie`.
|
5
30
|
* Rescue `URI::InvalidComponentError` exceptions in
|
6
|
-
|
31
|
+
`Spidr::Page#to_absolute` (thanks zapnap).
|
7
32
|
|
8
33
|
### 0.2.6 / 2010-07-05
|
9
34
|
|
10
|
-
* Fixed a bug in
|
35
|
+
* Fixed a bug in `Spidr::Page#meta_redirect`, by calling
|
11
36
|
`Nokogiri::XML::Element#get_attribute` instead of `attr`.
|
12
37
|
|
13
38
|
### 0.2.5 / 2010-07-02
|
14
39
|
|
15
|
-
* Added
|
16
|
-
* Added
|
40
|
+
* Added `Spidr::Page#meta_redirect`.
|
41
|
+
* Added `Spidr::Page#meta_redirect?`.
|
17
42
|
* Manage development dependencies with Bundler.
|
18
43
|
* Support following "old-school" meta-refresh redirects (thanks zapnap).
|
19
44
|
* Allow {Spidr::CookieJar} inherit cookies set by a parent domain.
|
@@ -26,10 +51,10 @@
|
|
26
51
|
* Added {Spidr::Filters#visit_urls_like}.
|
27
52
|
* Added {Spidr::Filters#ignore_urls}.
|
28
53
|
* Added {Spidr::Filters#ignore_urls_like}.
|
29
|
-
* Added
|
30
|
-
* Default
|
31
|
-
* Default
|
32
|
-
* Default
|
54
|
+
* Added `Spidr::Page#is_content_type?`.
|
55
|
+
* Default `Spidr::Page#body` to an empty String.
|
56
|
+
* Default `Spidr::Page#content_type` to an empty String.
|
57
|
+
* Default `Spidr::Page#content_types` to an empty Array.
|
33
58
|
* Improved reliability of {Spidr::Page#is_redirect?}.
|
34
59
|
* Improved content type detection in {Spidr::Page} to handle `Content-Type`
|
35
60
|
headers containing charsets (thanks Josh Lindsey).
|
@@ -47,10 +72,10 @@
|
|
47
72
|
* Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
|
48
73
|
* Integrated the new WSOC into the specs.
|
49
74
|
* Removed the built-in Web Spider Obstacle Course.
|
50
|
-
* Added
|
51
|
-
* Added
|
52
|
-
* Added
|
53
|
-
* Added
|
75
|
+
* Added `Spidr::Page#content_types`.
|
76
|
+
* Added `Spidr::Page#cookie`.
|
77
|
+
* Added `Spidr::Page#cookies`.
|
78
|
+
* Added `Spidr::Page#cookie_params`.
|
54
79
|
* Added {Spidr::Sanitizers}.
|
55
80
|
* Added {Spidr::SessionCache}.
|
56
81
|
* Added {Spidr::CookieJar} (thanks Nick Plante).
|
@@ -93,33 +118,33 @@
|
|
93
118
|
### 0.2.0 / 2009-10-10
|
94
119
|
|
95
120
|
* Added {URI.expand_path}.
|
96
|
-
* Added
|
97
|
-
* Added
|
98
|
-
* Added
|
121
|
+
* Added `Spidr::Page#search`.
|
122
|
+
* Added `Spidr::Page#at`.
|
123
|
+
* Added `Spidr::Page#title`.
|
99
124
|
* Added {Spidr::Agent#failures=}.
|
100
125
|
* Added a HTTP session cache to {Spidr::Agent}, per suggestion of falter.
|
101
126
|
* Added `Spidr::Agent#get_session`.
|
102
127
|
* Added `Spidr::Agent#kill_session`.
|
103
128
|
* Added {Spidr.proxy=}.
|
104
129
|
* Added {Spidr.disable_proxy!}.
|
105
|
-
* Aliased `Spidr::Page#txt?` to
|
106
|
-
* Aliased `Spidr::Page#ok?` to
|
107
|
-
* Aliased `Spidr::Page#redirect?` to
|
108
|
-
* Aliased `Spidr::Page#unauthorized?` to
|
109
|
-
* Aliased `Spidr::Page#forbidden?` to
|
110
|
-
* Aliased `Spidr::Page#missing?` to
|
130
|
+
* Aliased `Spidr::Page#txt?` to `Spidr::Page#plain_text?`.
|
131
|
+
* Aliased `Spidr::Page#ok?` to `Spidr::Page#is_ok?`.
|
132
|
+
* Aliased `Spidr::Page#redirect?` to `Spidr::Page#is_redirect?`.
|
133
|
+
* Aliased `Spidr::Page#unauthorized?` to `Spidr::Page#is_unauthorized?`.
|
134
|
+
* Aliased `Spidr::Page#forbidden?` to `Spidr::Page#is_forbidden?`.
|
135
|
+
* Aliased `Spidr::Page#missing?` to `Spidr::Page#is_missing?`.
|
111
136
|
* Split URL filtering code out of {Spidr::Agent} and into
|
112
137
|
{Spidr::Filters}.
|
113
138
|
* Split URL / Page event code out of {Spidr::Agent} and into
|
114
139
|
{Spidr::Events}.
|
115
140
|
* Split pause! / continue! / skip_link! / skip_page! methods out of
|
116
141
|
{Spidr::Agent} and into {Spidr::Actions}.
|
117
|
-
* Fixed a bug in
|
118
|
-
* Make sure
|
142
|
+
* Fixed a bug in `Spidr::Page#code`, where it was not returning an Integer.
|
143
|
+
* Make sure `Spidr::Page#doc` returns `Nokogiri::XML::Document` objects for
|
119
144
|
RSS/RDF/Atom pages as well.
|
120
|
-
* Fixed the handling of the Location header in
|
145
|
+
* Fixed the handling of the Location header in `Spidr::Page#links`
|
121
146
|
(thanks falter).
|
122
|
-
* Fixed a bug in
|
147
|
+
* Fixed a bug in `Spidr::Page#to_absolute` where trailing `/` characters on
|
123
148
|
URI paths were not being preserved (thanks falter).
|
124
149
|
* Fixed a bug where the URI query was not being sent with the request
|
125
150
|
in {Spidr::Agent#get_page} (thanks Damian Steer).
|
@@ -169,7 +194,7 @@
|
|
169
194
|
|
170
195
|
* Added `Spidr::Agent#all_headers`.
|
171
196
|
* Fixed a bug where {Spidr::Page#headers} was always `nil`.
|
172
|
-
* {Spidr::
|
197
|
+
* {Spidr::Agent} will now follow the Location header in HTTP 300,
|
173
198
|
301, 302, 303 and 307 Redirects.
|
174
199
|
* {Spidr::Agent} will now follow iframe and frame tags.
|
175
200
|
|
@@ -189,8 +214,8 @@
|
|
189
214
|
|
190
215
|
### 0.1.5 / 2009-03-22
|
191
216
|
|
192
|
-
* Catch malformed URIs in
|
193
|
-
* Filter out `nil` URIs in
|
217
|
+
* Catch malformed URIs in `Spidr::Page#to_absolute` and return `nil`.
|
218
|
+
* Filter out `nil` URIs in `Spidr::Page#urls`.
|
194
219
|
|
195
220
|
### 0.1.4 / 2009-01-15
|
196
221
|
|
@@ -204,9 +229,9 @@
|
|
204
229
|
|
205
230
|
### 0.1.2 / 2008-11-06
|
206
231
|
|
207
|
-
* Fixed a bug in
|
232
|
+
* Fixed a bug in `Spidr::Page#to_absolute` where URLs with no path were not
|
208
233
|
receiving a default path of `/`.
|
209
|
-
* Fixed a bug in
|
234
|
+
* Fixed a bug in `Spidr::Page#to_absolute` where URL paths were not being
|
210
235
|
expanded, in order to remove `..` and `.` directories.
|
211
236
|
* Fixed a bug where absolute URLs could have a blank path, thus causing
|
212
237
|
{Spidr::Agent#get_page} to crash when it performed the HTTP request.
|
data/Gemfile
CHANGED
@@ -1,27 +1,13 @@
|
|
1
1
|
source 'https://rubygems.org'
|
2
2
|
|
3
|
-
|
4
|
-
gem 'nokogiri', '>= 1.3.0'
|
5
|
-
end
|
3
|
+
gemspec
|
6
4
|
|
7
|
-
group
|
8
|
-
gem 'rake',
|
9
|
-
gem 'jeweler', '~> 1.4.0', :git => 'git://github.com/technicalpickles/jeweler.git'
|
10
|
-
end
|
5
|
+
group :development do
|
6
|
+
gem 'rake', '~> 0.8.7'
|
11
7
|
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
gem 'maruku', '~> 0.6.0'
|
16
|
-
else
|
17
|
-
gem 'rdiscount', '~> 1.6.3'
|
18
|
-
end
|
8
|
+
gem 'ore-tasks', '~> 0.4'
|
9
|
+
gem 'rspec', '~> 2.4'
|
10
|
+
gem 'wsoc', '~> 0.1.3'
|
19
11
|
|
20
|
-
gem '
|
12
|
+
gem 'kramdown', '~> 0.12'
|
21
13
|
end
|
22
|
-
|
23
|
-
group(:test) do
|
24
|
-
gem 'wsoc', '~> 0.1.3'
|
25
|
-
end
|
26
|
-
|
27
|
-
gem 'rspec', '~> 1.3.0', :group => [:development, :test]
|
data/LICENSE.txt
CHANGED
data/README.md
CHANGED
@@ -1,9 +1,9 @@
|
|
1
1
|
# Spidr
|
2
2
|
|
3
|
-
* [
|
4
|
-
* [
|
5
|
-
* [
|
6
|
-
* [
|
3
|
+
* [Homepage](http://spidr.rubyforge.org/)
|
4
|
+
* [Source](http://github.com/postmodern/spidr)
|
5
|
+
* [Issues](http://github.com/postmodern/spidr/issues)
|
6
|
+
* [Mailing List](http://groups.google.com/group/spidr)
|
7
7
|
* irc.freenode.net #spidr
|
8
8
|
|
9
9
|
## Description
|
@@ -177,7 +177,7 @@ Skip the processing of links:
|
|
177
177
|
|
178
178
|
## Requirements
|
179
179
|
|
180
|
-
* [nokogiri](http://nokogiri.rubyforge.org/)
|
180
|
+
* [nokogiri](http://nokogiri.rubyforge.org/) ~> 1.3
|
181
181
|
|
182
182
|
## Install
|
183
183
|
|
@@ -185,5 +185,6 @@ Skip the processing of links:
|
|
185
185
|
|
186
186
|
## License
|
187
187
|
|
188
|
-
|
188
|
+
Copyright (c) 2008-2011 Hal Brodigan
|
189
189
|
|
190
|
+
See {file:LICENSE.txt} for license information.
|
data/Rakefile
CHANGED
@@ -1,8 +1,15 @@
|
|
1
1
|
require 'rubygems'
|
2
|
-
require 'bundler'
|
3
2
|
|
4
3
|
begin
|
5
|
-
|
4
|
+
require 'bundler'
|
5
|
+
rescue LoadError => e
|
6
|
+
STDERR.puts e.message
|
7
|
+
STDERR.puts "Run `gem install bundler` to install Bundler."
|
8
|
+
exit e.status_code
|
9
|
+
end
|
10
|
+
|
11
|
+
begin
|
12
|
+
Bundler.setup(:development)
|
6
13
|
rescue Bundler::BundlerError => e
|
7
14
|
STDERR.puts e.message
|
8
15
|
STDERR.puts "Run `bundle install` to install missing gems"
|
@@ -10,29 +17,12 @@ rescue Bundler::BundlerError => e
|
|
10
17
|
end
|
11
18
|
|
12
19
|
require 'rake'
|
13
|
-
require 'jeweler'
|
14
|
-
require './lib/spidr/version.rb'
|
15
|
-
|
16
|
-
Jeweler::Tasks.new do |gem|
|
17
|
-
gem.name = 'spidr'
|
18
|
-
gem.version = Spidr::VERSION
|
19
|
-
gem.license = 'MIT'
|
20
|
-
gem.summary = %Q{A versatile Ruby web spidering library}
|
21
|
-
gem.description = %Q{Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.}
|
22
|
-
gem.email = 'postmodern.mod3@gmail.com'
|
23
|
-
gem.homepage = 'http://github.com/postmodern/spidr'
|
24
|
-
gem.authors = ['Postmodern']
|
25
|
-
gem.has_rdoc = 'yard'
|
26
|
-
end
|
27
|
-
Jeweler::GemcutterTasks.new
|
28
20
|
|
29
|
-
require '
|
30
|
-
|
31
|
-
spec.libs += ['lib', 'spec']
|
32
|
-
spec.spec_files = FileList['spec/**/*_spec.rb']
|
33
|
-
spec.spec_opts = ['--options', '.specopts']
|
34
|
-
end
|
21
|
+
require 'ore/tasks'
|
22
|
+
Ore::Tasks.new
|
35
23
|
|
24
|
+
require 'rspec/core/rake_task'
|
25
|
+
RSpec::Core::RakeTask.new
|
36
26
|
task :default => :spec
|
37
27
|
|
38
28
|
require 'yard'
|
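Note on the Rakefile hunk above: the new guard wraps `require 'bundler'` so that a missing Bundler produces an install hint rather than a bare stack trace. A minimal sketch of that pattern follows. `require_or_explain` and its parameters are hypothetical names, not part of the gem. As far as plain Ruby goes, `LoadError` does not define `status_code` (that method belongs to `Bundler::BundlerError`), so this sketch returns a boolean and leaves the choice of exit status to the caller.

```ruby
require 'stringio'

# Hypothetical helper (not part of spidr): try to load a library and,
# if it is missing, print the error plus an install hint.
def require_or_explain(lib, hint, err = $stderr)
  require lib
  true
rescue LoadError => e
  err.puts e.message
  err.puts hint
  false
end

# Usage: a missing library is reported instead of raising.
log = StringIO.new
loaded = require_or_explain('no_such_library_for_this_sketch',
                            'Run `gem install bundler` to install Bundler.',
                            log)
```

A Rakefile using this helper could call `exit 1` when it returns `false`.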
data/gemspec.yml
ADDED
@@ -0,0 +1,19 @@
|
|
1
|
+
name: spidr
|
2
|
+
summary: A versatile Ruby web spidering library
|
3
|
+
description:
|
4
|
+
Spidr is a versatile Ruby web spidering library that can spider a site,
|
5
|
+
multiple domains, certain links or infinitely. Spidr is designed to be
|
6
|
+
fast and easy to use.
|
7
|
+
|
8
|
+
license: MIT
|
9
|
+
authors: Postmodern
|
10
|
+
email: postmodern.mod3@gmail.com
|
11
|
+
homepage: http://github.com/postmodern/spidr
|
12
|
+
has_yard: true
|
13
|
+
|
14
|
+
dependencies:
|
15
|
+
nokogiri: ~> 1.3
|
16
|
+
|
17
|
+
development_dependencies:
|
18
|
+
bundler: ~> 1.0.0
|
19
|
+
yard: ~> 0.6.0
|
data/lib/spidr/agent.rb
CHANGED
@@ -48,6 +48,12 @@ module Spidr
|
|
48
48
|
|
49
49
|
# Cached cookies
|
50
50
|
attr_reader :cookies
|
51
|
+
|
52
|
+
# Maximum depth
|
53
|
+
attr_reader :max_depth
|
54
|
+
|
55
|
+
# The visited URLs and their depth within a site
|
56
|
+
attr_reader :levels
|
51
57
|
|
52
58
|
#
|
53
59
|
# Creates a new Agent object.
|
@@ -91,6 +97,9 @@ module Spidr
|
|
91
97
|
# @option options [Set, Array] :history
|
92
98
|
# The initial list of visited URLs.
|
93
99
|
#
|
100
|
+
# @option options [Integer] :max_depth
|
101
|
+
# The maximum link depth to follow.
|
102
|
+
#
|
94
103
|
# @yield [agent]
|
95
104
|
# If a block is given, it will be passed the newly created agent
|
96
105
|
# for further configuration.
|
@@ -119,6 +128,9 @@ module Spidr
|
|
119
128
|
@failures = Set[]
|
120
129
|
@queue = []
|
121
130
|
|
131
|
+
@levels = Hash.new(0)
|
132
|
+
@max_depth = options[:max_depth]
|
133
|
+
|
122
134
|
super(options)
|
123
135
|
|
124
136
|
yield self if block_given?
|
@@ -450,7 +462,7 @@ module Spidr
|
|
450
462
|
# @return [Boolean]
|
451
463
|
# Specifies whether the URL was enqueued, or ignored.
|
452
464
|
#
|
453
|
-
def enqueue(url)
|
465
|
+
def enqueue(url,level=0)
|
454
466
|
url = sanitize_url(url)
|
455
467
|
|
456
468
|
if (!(queued?(url)) && visit?(url))
|
@@ -477,14 +489,15 @@ module Spidr
|
|
477
489
|
return false
|
478
490
|
rescue Actions::Action
|
479
491
|
end
|
480
|
-
|
492
|
+
|
481
493
|
@queue << url
|
494
|
+
@levels[url] = level
|
482
495
|
return true
|
483
496
|
end
|
484
497
|
|
485
498
|
return false
|
486
499
|
end
|
487
|
-
|
500
|
+
|
488
501
|
#
|
489
502
|
# Requests and creates a new Page object from a given URL.
|
490
503
|
#
|
@@ -568,7 +581,7 @@ module Spidr
|
|
568
581
|
# for the page failed, or the page was skipped.
|
569
582
|
#
|
570
583
|
def visit_page(url)
|
571
|
-
url =
|
584
|
+
url = sanitize_url(url)
|
572
585
|
|
573
586
|
get_page(url) do |page|
|
574
587
|
@history << page.url
|
@@ -584,7 +597,7 @@ module Spidr
|
|
584
597
|
rescue Actions::Action
|
585
598
|
end
|
586
599
|
|
587
|
-
page.
|
600
|
+
page.each_url do |next_url|
|
588
601
|
begin
|
589
602
|
@every_link_blocks.each do |link_block|
|
590
603
|
link_block.call(page.url,next_url)
|
@@ -596,7 +609,9 @@ module Spidr
|
|
596
609
|
rescue Actions::Action
|
597
610
|
end
|
598
611
|
|
599
|
-
|
612
|
+
if (@max_depth.nil? || @max_depth > @levels[url])
|
613
|
+
enqueue(next_url,@levels[url] + 1)
|
614
|
+
end
|
600
615
|
end
|
601
616
|
end
|
602
617
|
end
|
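Note on the agent.rb hunks above: `enqueue` now records a depth for each queued URL in `@levels`, and `visit_page` only enqueues further links while `@max_depth` is `nil` or greater than the current page's depth. The sketch below reproduces that guard on an in-memory link graph. `MiniAgent`, `LINKS` and `run` are hypothetical stand-ins for illustration, not Spidr's own classes or API, and no HTTP requests are made.

```ruby
# Hypothetical in-memory link graph standing in for real HTTP requests.
LINKS = {
  '/'      => ['/a', '/b'],
  '/a'     => ['/a/1'],
  '/b'     => [],
  '/a/1'   => ['/a/1/x'],
  '/a/1/x' => []
}

# MiniAgent is an illustrative stand-in, not Spidr::Agent.
class MiniAgent
  attr_reader :levels, :history

  def initialize(max_depth = nil)
    @max_depth = max_depth
    @levels    = Hash.new(0) # URL => depth at which it was enqueued
    @queue     = []
    @history   = []
  end

  # Queue a URL once and remember how deep it sits in the crawl.
  def enqueue(url, level = 0)
    return false if @queue.include?(url) || @history.include?(url)

    @queue << url
    @levels[url] = level
    true
  end

  def visit_page(url)
    @history << url

    LINKS.fetch(url, []).each do |next_url|
      # Same guard as in the diff: follow links only while below max_depth.
      if @max_depth.nil? || @max_depth > @levels[url]
        enqueue(next_url, @levels[url] + 1)
      end
    end
  end

  # Breadth-first crawl from a starting URL until the queue is empty.
  def run(start)
    enqueue(start)
    visit_page(@queue.shift) until @queue.empty?
    self
  end
end

shallow = MiniAgent.new(1).run('/')
shallow.history # the start page plus its direct links only
```

With a `max_depth` of `nil` the guard always passes, so the whole graph is visited; with `0` only the starting URL is visited.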
data/lib/spidr/auth_store.rb
CHANGED
@@ -24,7 +24,7 @@ module Spidr
|
|
24
24
|
# Given a URL, return the most specific matching auth credential.
|
25
25
|
#
|
26
26
|
# @param [URI] url
|
27
|
-
# A fully qualified url
|
27
|
+
# A fully qualified url including optional path.
|
28
28
|
#
|
29
29
|
# @return [AuthCredential, nil]
|
30
30
|
# Closest matching {AuthCredential} values for the URL,
|
data/lib/spidr/body.rb
ADDED
@@ -0,0 +1,99 @@
|
|
1
|
+
require 'nokogiri'
|
2
|
+
|
3
|
+
module Spidr
|
4
|
+
module Body
|
5
|
+
#
|
6
|
+
# The body of the response.
|
7
|
+
#
|
8
|
+
# @return [String]
|
9
|
+
# The body of the response.
|
10
|
+
#
|
11
|
+
def body
|
12
|
+
(response.body || '')
|
13
|
+
end
|
14
|
+
|
15
|
+
#
|
16
|
+
# Returns a parsed document object for HTML, XML, RSS and Atom pages.
|
17
|
+
#
|
18
|
+
# @return [Nokogiri::HTML::Document, Nokogiri::XML::Document, nil]
|
19
|
+
# The document that represents HTML or XML pages.
|
20
|
+
# Returns `nil` if the page is neither HTML, XML, RSS, Atom or if
|
21
|
+
# the page could not be parsed properly.
|
22
|
+
#
|
23
|
+
# @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Document.html
|
24
|
+
# @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/HTML/Document.html
|
25
|
+
#
|
26
|
+
def doc
|
27
|
+
return nil if body.empty?
|
28
|
+
|
29
|
+
begin
|
30
|
+
if html?
|
31
|
+
return @doc ||= Nokogiri::HTML(body)
|
32
|
+
elsif (xml? || xsl? || rss? || atom?)
|
33
|
+
return @doc ||= Nokogiri::XML(body)
|
34
|
+
end
|
35
|
+
rescue
|
36
|
+
return nil
|
37
|
+
end
|
38
|
+
end
|
39
|
+
|
40
|
+
#
|
41
|
+
# Searches the document for XPath or CSS Path paths.
|
42
|
+
#
|
43
|
+
# @param [Array<String>] paths
|
44
|
+
# CSS or XPath expressions to search the document with.
|
45
|
+
#
|
46
|
+
# @return [Array]
|
47
|
+
# The matched nodes from the document.
|
48
|
+
# Returns an empty Array if no nodes were matched, or if the page
|
49
|
+
# is not an HTML or XML document.
|
50
|
+
#
|
51
|
+
# @example
|
52
|
+
# page.search('//a[@href]')
|
53
|
+
#
|
54
|
+
# @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Node.html#M000239
|
55
|
+
#
|
56
|
+
def search(*paths)
|
57
|
+
if doc
|
58
|
+
doc.search(*paths)
|
59
|
+
else
|
60
|
+
[]
|
61
|
+
end
|
62
|
+
end
|
63
|
+
|
64
|
+
#
|
65
|
+
# Searches for the first occurrence an XPath or CSS Path expression.
|
66
|
+
#
|
67
|
+
# @return [Nokogiri::HTML::Node, Nokogiri::XML::Node, nil]
|
68
|
+
# The first matched node. Returns `nil` if no nodes could be matched,
|
69
|
+
# or if the page is not a HTML or XML document.
|
70
|
+
#
|
71
|
+
# @example
|
72
|
+
# page.at('//title')
|
73
|
+
#
|
74
|
+
# @see http://nokogiri.rubyforge.org/nokogiri/Nokogiri/XML/Node.html#M000251
|
75
|
+
#
|
76
|
+
def at(*arguments)
|
77
|
+
if doc
|
78
|
+
doc.at(*arguments)
|
79
|
+
end
|
80
|
+
end
|
81
|
+
|
82
|
+
alias / search
|
83
|
+
alias % at
|
84
|
+
|
85
|
+
#
|
86
|
+
# The title of the HTML page.
|
87
|
+
#
|
88
|
+
# @return [String]
|
89
|
+
# The inner-text of the title element of the page.
|
90
|
+
#
|
91
|
+
def title
|
92
|
+
if (node = at('//title'))
|
93
|
+
node.inner_text
|
94
|
+
end
|
95
|
+
end
|
96
|
+
|
97
|
+
alias to_s body
|
98
|
+
end
|
99
|
+
end
|
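Note on the body.rb file above: `Spidr::Body` is a mixin that expects the including class to provide a `response` method, and its `body` falls back to an empty String when the response body is `nil`. The sketch below shows just that contract and the `to_s` alias, without Nokogiri. `BodySketch`, `FakeResponse` and `SketchPage` are hypothetical names used for illustration only.

```ruby
# Hypothetical stand-in for a Net::HTTP response; only #body is needed here.
FakeResponse = Struct.new(:body)

# Trimmed-down illustration of the mixin contract, not the gem's module.
module BodySketch
  # The response body, or an empty String when the response has none.
  def body
    (response.body || '')
  end

  alias to_s body
end

# The including class supplies #response, as the mixin requires.
class SketchPage
  include BodySketch

  attr_reader :response

  def initialize(response)
    @response = response
  end
end

empty_page = SketchPage.new(FakeResponse.new(nil))
html_page  = SketchPage.new(FakeResponse.new('<html></html>'))
```

Because `body` never returns `nil`, callers such as `doc` can test `body.empty?` directly without a separate `nil` check.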