spidr 0.2.2 → 0.2.3

data/.gitignore ADDED
@@ -0,0 +1,8 @@
+ pkg
+ doc
+ web
+ tmp
+ .DS_Store
+ .yardoc
+ *.swp
+ *~
data/.specopts ADDED
@@ -0,0 +1 @@
+ --colour --format specdoc
data/.yardopts ADDED
@@ -0,0 +1 @@
+ --markup markdown --title 'Spidr Documentation' --protected --files ChangeLog.md,LICENSE.txt
@@ -1,4 +1,12 @@
- === 0.2.2 / 2010-01-06
+ ### 0.2.3 / 2010-02-27
+
+ * Migrated to Jeweler for packaging and releasing RubyGems.
+ * Switched to Markdown-formatted YARD documentation.
+ * Added {Spidr::Events#every_link}.
+ * Added {Spidr::SessionCache#active?}.
+ * Added specs for {Spidr::SessionCache}.
+
+ ### 0.2.2 / 2010-01-06
 
  * Require Web Spider Obstacle Course (WSOC) >= 0.1.1.
  * Integrated the new WSOC into the specs.
@@ -12,10 +20,10 @@
  * Added {Spidr::CookieJar} (thanks Nick Plante).
  * Added {Spidr::AuthStore} (thanks Nick Plante).
  * Added {Spidr::Agent#post_page} (thanks Nick Plante).
- * Renamed Spidr::Agent#get_session to {Spidr::SessionCache#[]}.
- * Renamed Spidr::Agent#kill_session to {Spidr::SessionCache#kill!}.
+ * Renamed `Spidr::Agent#get_session` to {Spidr::SessionCache#[]}.
+ * Renamed `Spidr::Agent#kill_session` to {Spidr::SessionCache#kill!}.
 
- === 0.2.1 / 2009-11-25
+ ### 0.2.1 / 2009-11-25
 
  * Added {Spidr::Events#every_ok_page}.
  * Added {Spidr::Events#every_redirect_page}.
@@ -44,9 +52,9 @@
  * Added {Spidr::Events#every_zip_page}.
  * Fixed a bug where {Spidr::Agent#delay} was not being used to delay
    requesting pages.
- * Spider +link+ and +script+ tags in HTML pages (thanks Nick Plante).
+ * Spider `link` and `script` tags in HTML pages (thanks Nick Plante).
 
- === 0.2.0 / 2009-10-10
+ ### 0.2.0 / 2009-10-10
 
  * Added {URI.expand_path}.
  * Added {Spidr::Page#search}.
@@ -54,16 +62,16 @@
  * Added {Spidr::Page#title}.
  * Added {Spidr::Agent#failures=}.
  * Added an HTTP session cache to {Spidr::Agent}, per suggestion of falter.
- * Added Spidr::Agent#get_session.
- * Added Spidr::Agent#kill_session.
+ * Added `Spidr::Agent#get_session`.
+ * Added `Spidr::Agent#kill_session`.
  * Added {Spidr.proxy=}.
  * Added {Spidr.disable_proxy!}.
- * Aliased Spidr::Page#txt? to {Spidr::Page#plain_text?}.
- * Aliased Spidr::Page#ok? to {Spidr::Page#is_ok?}.
- * Aliased Spidr::Page#redirect? to {Spidr::Page#is_redirect?}.
- * Aliased Spidr::Page#unauthorized? to {Spidr::Page#is_unauthorized?}.
- * Aliased Spidr::Page#forbidden? to {Spidr::Page#is_forbidden?}.
- * Aliased Spidr::Page#missing? to {Spidr::Page#is_missing?}.
+ * Aliased `Spidr::Page#txt?` to {Spidr::Page#plain_text?}.
+ * Aliased `Spidr::Page#ok?` to {Spidr::Page#is_ok?}.
+ * Aliased `Spidr::Page#redirect?` to {Spidr::Page#is_redirect?}.
+ * Aliased `Spidr::Page#unauthorized?` to {Spidr::Page#is_unauthorized?}.
+ * Aliased `Spidr::Page#forbidden?` to {Spidr::Page#is_forbidden?}.
+ * Aliased `Spidr::Page#missing?` to {Spidr::Page#is_missing?}.
  * Split URL filtering code out of {Spidr::Agent} and into
    {Spidr::Filters}.
  * Split URL / Page event code out of {Spidr::Agent} and into
@@ -71,11 +79,11 @@
  * Split pause! / continue! / skip_link! / skip_page! methods out of
    {Spidr::Agent} and into {Spidr::Actions}.
  * Fixed a bug in {Spidr::Page#code}, where it was not returning an Integer.
- * Make sure {Spidr::Page#doc} returns Nokogiri::XML::Document objects for
+ * Make sure {Spidr::Page#doc} returns `Nokogiri::XML::Document` objects for
    RSS/RDF/Atom pages as well.
  * Fixed the handling of the Location header in {Spidr::Page#links}
    (thanks falter).
- * Fixed a bug in {Spidr::Page#to_absolute} where trailing '/' characters on
+ * Fixed a bug in {Spidr::Page#to_absolute} where trailing `/` characters on
    URI paths were not being preserved (thanks falter).
  * Fixed a bug where the URI query was not being sent with the request
    in {Spidr::Agent#get_page} (thanks Damian Steer).
@@ -86,17 +94,17 @@
  * Switched {Spidr::Agent#failures} to a Set.
  * Allow a block to be passed to {Spidr::Agent#run}, which will receive all
    pages visited.
- * Allow Spidr::Agent#start_at and Spidr::Agent#continue! to pass blocks
+ * Allow `Spidr::Agent#start_at` and `Spidr::Agent#continue!` to pass blocks
    to {Spidr::Agent#run}.
  * Made {Spidr::Agent#visit_page} public.
  * Moved to YARD-based documentation.
 
- === 0.1.9 / 2009-06-13
+ ### 0.1.9 / 2009-06-13
 
  * Upgraded to Hoe 2.0.0.
  * Use Hoe.spec instead of Hoe.new.
  * Use the Hoe signing task for signed gems.
- * Added the Spidr::Agent#schemes and Spidr::Agent#schemes= methods.
+ * Added the `Spidr::Agent#schemes` and `Spidr::Agent#schemes=` methods.
  * Added a warning message if 'net/https' cannot be loaded.
  * Allow the list of acceptable URL schemes to be passed into
    {Spidr::Agent#initialize}.
@@ -108,10 +116,10 @@
    could not be loaded.
  * Removed Spidr::Agent::SCHEMES.
 
- === 0.1.8 / 2009-05-27
+ ### 0.1.8 / 2009-05-27
 
- * Added the Spidr::Agent#pause! and Spidr::Agent#continue! methods.
- * Added the Spidr::Agent#running? and Spidr::Agent#paused? methods.
+ * Added the `Spidr::Agent#pause!` and `Spidr::Agent#continue!` methods.
+ * Added the `Spidr::Agent#running?` and `Spidr::Agent#paused?` methods.
  * Added an alias for pending_urls to the queue methods.
  * Added {Spidr::Agent#queue} to provide read access to the queue.
  * Added {Spidr::Agent#queue=} and {Spidr::Agent#history=} for setting the
@@ -121,49 +129,49 @@
  * Made {Spidr::Agent#enqueue} and {Spidr::Agent#queued?} public.
  * Added more specs.
 
- === 0.1.7 / 2009-04-24
+ ### 0.1.7 / 2009-04-24
 
- * Added Spidr::Agent#all_headers.
- * Fixed a bug where Page#headers was always +nil+.
+ * Added `Spidr::Agent#all_headers`.
+ * Fixed a bug where {Spidr::Page#headers} was always `nil`.
  * {Spidr::Agent} will now follow the Location header in HTTP 300,
    301, 302, 303 and 307 Redirects.
  * {Spidr::Agent} will now follow iframe and frame tags.
 
- === 0.1.6 / 2009-04-14
+ ### 0.1.6 / 2009-04-14
 
  * Added {Spidr::Agent#failures}, a list of URLs which could not be visited.
  * Added {Spidr::Agent#failed?}.
- * Added Spidr::Agent#every_failed_url.
+ * Added `Spidr::Agent#every_failed_url`.
  * Added {Spidr::Agent#clear}, which clears the history and failures URL
    lists.
  * Improved fault tolerance in {Spidr::Agent#get_page}.
  * If a Network or HTTP error is encountered, the URL will be added to
    the failures list and the next URL will be visited.
- * Fixed a typo in Spidr::Agent#ignore_exts_like.
+ * Fixed a typo in `Spidr::Agent#ignore_exts_like`.
  * Updated the Web Spider Obstacle Course with links that always fail to be
    visited.
 
- === 0.1.5 / 2009-03-22
+ ### 0.1.5 / 2009-03-22
 
- * Catch malformed URIs in {Spidr::Page#to_absolute} and return +nil+.
- * Filter out +nil+ URIs in {Spidr::Page#urls}.
+ * Catch malformed URIs in {Spidr::Page#to_absolute} and return `nil`.
+ * Filter out `nil` URIs in {Spidr::Page#urls}.
 
- === 0.1.4 / 2009-01-15
+ ### 0.1.4 / 2009-01-15
 
  * Use Nokogiri for HTML and XML parsing.
 
- === 0.1.3 / 2009-01-10
+ ### 0.1.3 / 2009-01-10
 
- * Added the :host options to {Spidr::Agent#initialize}.
+ * Added the `:host` option to {Spidr::Agent#initialize}.
  * Added the Web Spider Obstacle Course files to the Manifest.
  * Aliased {Spidr::Agent#visited_urls} to {Spidr::Agent#history}.
 
- === 0.1.2 / 2008-11-06
+ ### 0.1.2 / 2008-11-06
 
  * Fixed a bug in {Spidr::Page#to_absolute} where URLs with no path were not
-   receiving a default path of <tt>/</tt>.
+   receiving a default path of `/`.
  * Fixed a bug in {Spidr::Page#to_absolute} where URL paths were not being
-   expanded, in order to remove <tt>..</tt> and <tt>.</tt> directories.
+   expanded, in order to remove `..` and `.` directories.
  * Fixed a bug where absolute URLs could have a blank path, thus causing
    {Spidr::Agent#get_page} to crash when it performed the HTTP request.
  * Added RSpec spec tests.
@@ -171,12 +179,12 @@
    (http://spidr.rubyforge.org/course/start.html) which is used in the spec
    tests.
 
- === 0.1.1 / 2008-10-04
+ ### 0.1.1 / 2008-10-04
 
  * Added a reader method for the response instance variable in Page.
  * Fixed a bug in {Spidr::Page#method_missing}.
 
- === 0.1.0 / 2008-05-23
+ ### 0.1.0 / 2008-05-23
 
  * Initial release.
  * Black-list or white-list URLs based upon:
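The 0.2.3 changelog above introduces {Spidr::Events#every_link}, which yields each link's origin and destination URL during a crawl. A minimal offline sketch of the reverse-link map this callback enables — the `(origin, dest)` pairs below are simulated stand-ins for what a real spider run would yield, and the URLs are hypothetical:

```ruby
# Build a reverse link map the way an every_link callback would:
# for each destination URL, record which pages link to it.
# Hash.new with a block gives every new key its own empty Array.
url_map = Hash.new { |hash, key| hash[key] = [] }

# Simulated (origin, dest) pairs, standing in for a real crawl.
links = [
  ['http://intranet.com/',      'http://intranet.com/about'],
  ['http://intranet.com/',      'http://intranet.com/missing'],
  ['http://intranet.com/about', 'http://intranet.com/missing']
]

links.each { |origin, dest| url_map[dest] << origin }

# How many pages link to a given destination:
puts url_map['http://intranet.com/missing'].length  # prints 2
```

In a real crawl the `links.each` loop is replaced by `spider.every_link do |origin, dest| ... end` inside a `Spidr.site` block, as the README's URL-map example shows.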
data/LICENSE.txt ADDED
@@ -0,0 +1,21 @@
+
+ Copyright (c) 2008-2010 Hal Brodigan
+
+ Permission is hereby granted, free of charge, to any person obtaining
+ a copy of this software and associated documentation files (the
+ 'Software'), to deal in the Software without restriction, including
+ without limitation the rights to use, copy, modify, merge, publish,
+ distribute, sublicense, and/or sell copies of the Software, and to
+ permit persons to whom the Software is furnished to do so, subject to
+ the following conditions:
+
+ The above copyright notice and this permission notice shall be
+ included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
+ EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
+ IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
+ CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
+ TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
+ SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -1,18 +1,18 @@
- = Spidr
+ # Spidr
 
- * http://spidr.rubyforge.org
- * http://github.com/postmodern/spidr
- * http://github.com/postmodern/spidr/issues
- * http://groups.google.com/group/spidr
+ * [spidr.rubyforge.org](http://spidr.rubyforge.org/)
+ * [github.com/postmodern/spidr](http://github.com/postmodern/spidr)
+ * [github.com/postmodern/spidr/issues](http://github.com/postmodern/spidr/issues)
+ * [groups.google.com/group/spidr](http://groups.google.com/group/spidr)
  * irc.freenode.net #spidr
 
- == DESCRIPTION:
+ ## Description
 
  Spidr is a versatile Ruby web spidering library that can spider a site,
  multiple domains, certain links or infinitely. Spidr is designed to be fast
  and easy to use.
 
- == FEATURES:
+ ## Features
 
  * Follows:
    * a tags.
@@ -31,6 +31,7 @@ and easy to use.
    * Every visited Page.
    * Every visited URL.
    * Every visited URL that matches a specified pattern.
+   * Every origin and destination URI of a link.
    * Every URL that failed to be visited.
  * Provides action methods to:
    * Pause spidering.
@@ -39,22 +40,23 @@ and easy to use.
  * Restore the spidering queue and history from a previous session.
  * Custom User-Agent strings.
  * Custom proxy settings.
+ * HTTPS support.
 
- == EXAMPLES:
+ ## Examples
 
- * Start spidering from a URL:
+ Start spidering from a URL:
 
      Spidr.start_at('http://tenderlovemaking.com/')
 
- * Spider a host:
+ Spider a host:
 
      Spidr.host('coderrr.wordpress.com')
 
- * Spider a site:
+ Spider a site:
 
      Spidr.site('http://rubyflow.com/')
 
- * Spider multiple hosts:
+ Spider multiple hosts:
 
      Spidr.start_at(
        'http://company.com/',
@@ -64,30 +66,56 @@ and easy to use.
        ]
      )
 
- * Do not spider certain links:
+ Do not spider certain links:
 
      Spidr.site('http://matasano.com/', :ignore_links => [/log/])
 
- * Do not spider links on certain ports:
+ Do not spider links on certain ports:
 
      Spidr.site(
        'http://sketchy.content.com/',
        :ignore_ports => [8000, 8010, 8080]
      )
 
- * Print out visited URLs:
+ Print out visited URLs:
 
      Spidr.site('http://rubyinside.org/') do |spider|
        spider.every_url { |url| puts url }
      end
 
- * Print out the URLs that could not be requested:
+ Build a URL map of a site:
+
+     url_map = Hash.new { |hash,key| hash[key] = [] }
+
+     Spidr.site('http://intranet.com/') do |spider|
+       spider.every_link do |origin,dest|
+         url_map[dest] << origin
+       end
+     end
+
+ Print out the URLs that could not be requested:
 
      Spidr.site('http://sketchy.content.com/') do |spider|
        spider.every_failed_url { |url| puts url }
      end
 
- * Search HTML and XML pages:
+ Find all pages which have broken links:
+
+     url_map = Hash.new { |hash,key| hash[key] = [] }
+
+     spider = Spidr.site('http://intranet.com/') do |spider|
+       spider.every_link do |origin,dest|
+         url_map[dest] << origin
+       end
+     end
+
+     spider.failures.each do |url|
+       puts "Broken link #{url} found in:"
+
+       url_map[url].each { |page| puts "  #{page}" }
+     end
+
+ Search HTML and XML pages:
 
      Spidr.site('http://company.withablog.com/') do |spider|
        spider.every_page do |page|
@@ -98,11 +126,11 @@ and easy to use.
          value = meta.attributes['content']
 
          puts " #{name} = #{value}"
-       end
+         end
        end
      end
 
- * Print out the titles from every page:
+ Print out the titles from every page:
 
      Spidr.site('http://www.rubypulse.com/') do |spider|
        spider.every_html_page do |page|
@@ -110,7 +138,7 @@ and easy to use.
        end
      end
 
- * Find what kinds of web servers a host is using, by accessing the headers:
+ Find what kinds of web servers a host is using, by accessing the headers:
 
      servers = Set[]
 
@@ -120,7 +148,7 @@ and easy to use.
        end
      end
 
- * Pause the spider on a forbidden page:
+ Pause the spider on a forbidden page:
 
      spider = Spidr.host('overnight.startup.com') do |spider|
        spider.every_forbidden_page do |page|
@@ -128,7 +156,7 @@ and easy to use.
        end
      end
 
- * Skip the processing of a page:
+ Skip the processing of a page:
 
      Spidr.host('sketchy.content.com') do |spider|
        spider.every_missing_page do |page|
@@ -136,7 +164,7 @@ and easy to use.
        end
      end
 
- * Skip the processing of links:
+ Skip the processing of links:
 
      Spidr.host('sketchy.content.com') do |spider|
        spider.every_url do |url|
@@ -146,35 +174,15 @@ and easy to use.
        end
      end
 
- == REQUIREMENTS:
-
- * {nokogiri}[http://nokogiri.rubyforge.org/] >= 1.2.0
-
- == INSTALL:
-
-   $ sudo gem install spidr
+ ## Requirements
 
- == LICENSE:
+ * [nokogiri](http://nokogiri.rubyforge.org/) >= 1.2.0
 
- The MIT License
+ ## Install
 
- Copyright (c) 2008-2010 Hal Brodigan
+     $ sudo gem install spidr
 
- Permission is hereby granted, free of charge, to any person obtaining
- a copy of this software and associated documentation files (the
- 'Software'), to deal in the Software without restriction, including
- without limitation the rights to use, copy, modify, merge, publish,
- distribute, sublicense, and/or sell copies of the Software, and to
- permit persons to whom the Software is furnished to do so, subject to
- the following conditions:
+ ## License
 
- The above copyright notice and this permission notice shall be
- included in all copies or substantial portions of the Software.
+ See {file:LICENSE.txt} for license information.
 
- THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
- EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/Rakefile CHANGED
@@ -1,29 +1,43 @@
- # -*- ruby -*-
-
  require 'rubygems'
- require 'hoe'
- require 'hoe/signing'
- require './tasks/spec.rb'
- require './tasks/yard.rb'
+ require 'rake'
+ require './lib/spidr/version.rb'
 
- Hoe.spec('spidr') do
-   self.developer('Postmodern', 'postmodern.mod3@gmail.com')
+ begin
+   require 'jeweler'
+   Jeweler::Tasks.new do |gem|
+     gem.name = 'spidr'
+     gem.version = Spidr::VERSION
+     gem.summary = %Q{A versatile Ruby web spidering library}
+     gem.description = %Q{Spidr is a versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.}
+     gem.email = 'postmodern.mod3@gmail.com'
+     gem.homepage = 'http://github.com/postmodern/spidr'
+     gem.authors = ['Postmodern']
+     gem.add_dependency 'nokogiri', '>= 1.2.0'
+     gem.add_development_dependency 'rspec', '>= 1.3.0'
+     gem.add_development_dependency 'yard', '>= 0.5.3'
+     gem.add_development_dependency 'wsoc', '>= 0.1.1'
+     gem.has_rdoc = 'yard'
+   end
+ rescue LoadError
+   puts "Jeweler (or a dependency) not available. Install it with: gem install jeweler"
+ end
 
-   self.readme_file = 'README.rdoc'
-   self.history_file = 'History.rdoc'
-   self.remote_rdoc_dir = 'docs'
+ require 'spec/rake/spectask'
+ Spec::Rake::SpecTask.new(:spec) do |spec|
+   spec.libs += ['lib', 'spec']
+   spec.spec_files = FileList['spec/**/*_spec.rb']
+   spec.spec_opts = ['--options', '.specopts']
+ end
 
-   self.extra_deps = [
-     ['nokogiri', '>=1.2.0']
-   ]
+ task :spec => :check_dependencies
+ task :default => :spec
 
-   self.extra_dev_deps = [
-     ['rspec', '>=1.2.8'],
-     ['yard', '>=0.4.0'],
-     ['wsoc', '>=0.1.1']
-   ]
+ begin
+   require 'yard'
 
-   self.spec_extras = {:has_rdoc => 'yard'}
+   YARD::Rake::YardocTask.new
+ rescue LoadError
+   task :yard do
+     abort "YARD is not available. In order to run yard, you must: gem install yard"
+   end
  end
-
- # vim: syntax=Ruby
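The rewritten Rakefile wraps its optional tooling (Jeweler, YARD) in `begin` / `require` / `rescue LoadError` blocks, so the remaining tasks still load when a development gem is not installed. A standalone sketch of that guard pattern — `some_optional_gem_that_is_not_installed` is a hypothetical gem name used only for illustration:

```ruby
# Attempt to load a library; report whether it was available instead of
# letting a missing gem abort the whole file (as the Rakefile does for
# Jeweler and YARD).
def load_optional_tool(name)
  require name
  :loaded
rescue LoadError
  :missing
end

puts load_optional_tool('some_optional_gem_that_is_not_installed')  # prints "missing"
```

In the Rakefile itself, the `rescue LoadError` branch defines a stub task (or prints install instructions) rather than returning a symbol, but the control flow is the same.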