mechanize 2.0.pre.1 → 2.0.pre.2
- data.tar.gz.sig +2 -2
- data/CHANGELOG.rdoc +24 -2
- data/Manifest.txt +15 -19
- data/Rakefile +6 -3
- data/lib/mechanize.rb +168 -28
- data/lib/mechanize/form.rb +14 -2
- data/lib/mechanize/page.rb +43 -14
- data/lib/mechanize/page/link.rb +10 -0
- data/lib/mechanize/redirect_not_get_or_head_error.rb +2 -1
- data/lib/mechanize/robots_disallowed_error.rb +29 -0
- data/lib/mechanize/util.rb +30 -6
- data/test/helper.rb +6 -0
- data/test/htdocs/canonical_uri.html +9 -0
- data/test/htdocs/nofollow.html +9 -0
- data/test/htdocs/noindex.html +9 -0
- data/test/htdocs/norobots.html +8 -0
- data/test/htdocs/rel_nofollow.html +8 -0
- data/test/htdocs/robots.html +8 -0
- data/test/htdocs/robots.txt +2 -0
- data/test/htdocs/tc_links.html +3 -3
- data/test/test_links.rb +9 -0
- data/test/test_mechanize.rb +617 -2
- data/test/{test_forms.rb → test_mechanize_form.rb} +45 -1
- data/test/test_mechanize_form_check_box.rb +37 -0
- data/test/test_mechanize_form_encoding.rb +118 -0
- data/test/{test_field_precedence.rb → test_mechanize_form_field.rb} +4 -16
- data/test/test_mechanize_page.rb +60 -1
- data/test/test_mechanize_redirect_not_get_or_head_error.rb +18 -0
- data/test/test_mechanize_subclass.rb +22 -0
- data/test/test_mechanize_util.rb +87 -2
- data/test/test_robots.rb +87 -0
- metadata +51 -43
- metadata.gz.sig +0 -0
- data/lib/mechanize/uri_resolver.rb +0 -82
- data/test/test_authenticate.rb +0 -71
- data/test/test_bad_links.rb +0 -25
- data/test/test_blank_form.rb +0 -16
- data/test/test_checkboxes.rb +0 -61
- data/test/test_content_type.rb +0 -13
- data/test/test_encoded_links.rb +0 -20
- data/test/test_errors.rb +0 -49
- data/test/test_follow_meta.rb +0 -119
- data/test/test_get_headers.rb +0 -52
- data/test/test_gzipping.rb +0 -22
- data/test/test_hash_api.rb +0 -45
- data/test/test_mech.rb +0 -283
- data/test/test_mech_proxy.rb +0 -16
- data/test/test_mechanize_uri_resolver.rb +0 -29
- data/test/test_redirect_verb_handling.rb +0 -49
- data/test/test_subclass.rb +0 -30
data.tar.gz.sig
CHANGED
@@ -1,2 +1,2 @@
(binary gem signature; old and new contents are unprintable and omitted)
data/CHANGELOG.rdoc
CHANGED
@@ -1,11 +1,11 @@
 = Mechanize CHANGELOG

-=== 2.0.pre.
+=== 2.0.pre.2 / 2011-04-17

 Mechanize is now under the MIT license

 * API changes
-  * WWW::Mechanize has been removed.
+  * WWW::Mechanize has been removed. Use Mechanize.
   * Pre connect hooks are now called with the agent and the request. See
     Mechanize#pre_connect_hooks.
   * Post connect hooks are now called with the agent and the response. See
@@ -23,6 +23,16 @@ Mechanize is now under the MIT license
   * The User-Agent header has changed. It no longer includes the WWW- prefix
     and now includes the ruby version. The URL has been updated as well.
   * Mechanize now requires ruby 1.8.7 or newer.
+  * Hpricot support has been removed as webrobots requires nokogiri.
+  * Mechanize#get no longer accepts the referer as the second argument.
+  * Mechanize#get no longer allows the HTTP method to be changed (:verb
+    option).
+
+* Deprecations
+  * Mechanize#get with an options hash is deprecated and will be removed after
+    October, 2011.
+  * Mechanize::Util::to_native_charset is deprecated as it is no longer used
+    by Mechanize.

 * New Features

@@ -39,6 +49,17 @@ Mechanize is now under the MIT license
   * Mechanize now allows a certificate and key to be passed directly. GH #71
   * Mechanize::Form::MultiSelectList now implements #option_with and
     #options_with. GH #42
+  * Add Mechanize::Page::Link#rel and #rel?(kind) to read and test the rel
+    attribute.
+  * Add Mechanize::Page#canonical_uri to read a <tt><link
+    rel="canonical"></tt> tag.
+  * Add support for Robots Exclusion Protocol (i.e. robots.txt) and
+    nofollow/noindex in meta tags and the rel attribute. Automatic
+    exclusion can be turned on by setting:
+      agent.robots = true
+  * Manual robots.txt test can be performed with
+    Mechanize#robots_allowed? and #robots_disallowed?.
+  * Mechanize::Form now supports the accept-charset attribute. GH #96

 * Bug Fixes:

@@ -63,6 +84,7 @@ Mechanize is now under the MIT license
   * Content-Encoding: x-gzip is now treated like gzip per RFC 2616.
   * Mechanize now unescapes URIs for meta refresh. GH #68
   * Mechanize now has more robust HTML charset detection. GH #43
+  * Mechanize::Form::Textarea is now created from a textarea element. GH #94

 === 1.0.0

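The CHANGELOG above adds Mechanize::Page::Link#rel and #rel?(kind) for reading and testing a link's rel attribute. A minimal stand-alone sketch of the likely semantics (the rel attribute as a space-separated token list, with #rel? as a membership test) — illustrative only, not the gem's actual implementation:

```ruby
# Hypothetical stand-ins for Link#rel and Link#rel?(kind):
# split the rel attribute into tokens, then test membership.
def rel_tokens(rel_attribute)
  (rel_attribute || '').split(' ')
end

def rel?(rel_attribute, kind)
  rel_tokens(rel_attribute).include? kind
end

rel?('nofollow', 'nofollow')          # => true
rel?('external nofollow', 'nofollow') # => true
rel?(nil, 'nofollow')                 # => false
```

With semantics like these, `link.rel?('nofollow')` lets the robots support skip links marked rel="nofollow".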
data/Manifest.txt
CHANGED
@@ -46,8 +46,8 @@ lib/mechanize/pluggable_parsers.rb
 lib/mechanize/redirect_limit_reached_error.rb
 lib/mechanize/redirect_not_get_or_head_error.rb
 lib/mechanize/response_code_error.rb
+lib/mechanize/robots_disallowed_error.rb
 lib/mechanize/unsupported_scheme_error.rb
-lib/mechanize/uri_resolver.rb
 lib/mechanize/util.rb
 test/data/htpasswd
 test/data/server.crt
@@ -58,6 +58,7 @@ test/helper.rb
 test/htdocs/alt_text.html
 test/htdocs/bad_form_test.html
 test/htdocs/button.jpg
+test/htdocs/canonical_uri.html
 test/htdocs/dir with spaces/foo.html
 test/htdocs/empty_form.html
 test/htdocs/file_upload.html
@@ -79,8 +80,14 @@ test/htdocs/index.html
 test/htdocs/link with space.html
 test/htdocs/meta_cookie.html
 test/htdocs/no_title_test.html
+test/htdocs/nofollow.html
+test/htdocs/noindex.html
+test/htdocs/norobots.html
 test/htdocs/rails_3_encoding_hack_form_test.html
+test/htdocs/rel_nofollow.html
 test/htdocs/relative/tc_relative_links.html
+test/htdocs/robots.html
+test/htdocs/robots.txt
 test/htdocs/tc_bad_charset.html
 test/htdocs/tc_bad_links.html
 test/htdocs/tc_base_images.html
@@ -106,25 +113,12 @@ test/htdocs/test_click.html
 test/htdocs/unusual______.html
 test/servlets.rb
 test/ssl_server.rb
-test/test_authenticate.rb
-test/test_bad_links.rb
-test/test_blank_form.rb
-test/test_checkboxes.rb
-test/test_content_type.rb
 test/test_cookies.rb
-test/test_encoded_links.rb
-test/test_errors.rb
-test/test_field_precedence.rb
-test/test_follow_meta.rb
 test/test_form_action.rb
 test/test_form_as_hash.rb
 test/test_form_button.rb
 test/test_form_no_inputname.rb
-test/test_forms.rb
 test/test_frames.rb
-test/test_get_headers.rb
-test/test_gzipping.rb
-test/test_hash_api.rb
 test/test_headers.rb
 test/test_history.rb
 test/test_history_added.rb
@@ -132,17 +126,20 @@ test/test_html_unscape_forms.rb
 test/test_if_modified_since.rb
 test/test_images.rb
 test/test_links.rb
-test/test_mech.rb
-test/test_mech_proxy.rb
 test/test_mechanize.rb
 test/test_mechanize_cookie.rb
 test/test_mechanize_cookie_jar.rb
 test/test_mechanize_file.rb
 test/test_mechanize_file_request.rb
 test/test_mechanize_file_response.rb
+test/test_mechanize_form.rb
+test/test_mechanize_form_check_box.rb
+test/test_mechanize_form_encoding.rb
+test/test_mechanize_form_field.rb
 test/test_mechanize_form_image_button.rb
 test/test_mechanize_page.rb
-test/
+test/test_mechanize_redirect_not_get_or_head_error.rb
+test/test_mechanize_subclass.rb
 test/test_mechanize_util.rb
 test/test_meta.rb
 test/test_multi_select.rb
@@ -153,11 +150,11 @@ test/test_post_form.rb
 test/test_pretty_print.rb
 test/test_radiobutton.rb
 test/test_redirect_limit_reached.rb
-test/test_redirect_verb_handling.rb
 test/test_referer.rb
 test/test_relative_links.rb
 test/test_request.rb
 test/test_response_code.rb
+test/test_robots.rb
 test/test_save_file.rb
 test/test_scheme.rb
 test/test_select.rb
@@ -166,7 +163,6 @@ test/test_select_none.rb
 test/test_select_noopts.rb
 test/test_set_fields.rb
 test/test_ssl_server.rb
-test/test_subclass.rb
 test/test_textarea.rb
 test/test_upload.rb
 test/test_verbs.rb
data/Rakefile
CHANGED
@@ -12,9 +12,12 @@ Hoe.spec 'mechanize' do
   self.readme_file = 'README.rdoc'
   self.history_file = 'CHANGELOG.rdoc'
   self.extra_rdoc_files += Dir['*.rdoc']
-
-  self.extra_deps
-  self.extra_deps
+
+  self.extra_deps << ['nokogiri', '~> 1.4']
+  self.extra_deps << ['net-http-persistent', '~> 1.6']
+  self.extra_deps << ['net-http-digest_auth', '~> 1.1', '>= 1.1.1']
+  self.extra_deps << ['webrobots', '~> 0.0', '>= 0.0.6']
+
   self.spec_extras[:required_ruby_version] = '>= 1.8.7'
 end

data/lib/mechanize.rb
CHANGED
@@ -12,7 +12,6 @@ require 'uri'
 require 'webrick/httputils'
 require 'zlib'

-
 # = Synopsis
 # The Mechanize library is used for automating interaction with a website. It
 # can follow links, and submit forms. Form fields can be populated and
@@ -71,7 +70,7 @@ class Mechanize
   attr_accessor :read_timeout

   # The identification string for the client initiating a web request
-
+  attr_reader :user_agent

   # The value of watch_for_set is passed to pluggable parsers for retrieved
   # content
@@ -96,6 +95,15 @@ class Mechanize
   # redirects are followed.
   attr_accessor :redirect_ok

+  # Says this agent should consult the site's robots.txt for each access.
+  attr_reader :robots
+
+  def robots=(value)
+    require 'webrobots' if value
+    @webrobots = nil if value != @robots
+    @robots = value
+  end
+
   # Disables HTTP/1.1 gzip compression (enabled by default)
   attr_accessor :gzip_enabled

@@ -198,6 +206,9 @@ class Mechanize
     @follow_meta_refresh = false
     @redirection_limit = 20

+    @robots = false
+    @webrobots = nil
+
     # Connection Cache & Keep alive
     @keep_alive_time = 300
     @keep_alive = true
@@ -208,8 +219,16 @@ class Mechanize
     @proxy_user = nil
     @proxy_pass = nil

-    @
-
+    @scheme_handlers = Hash.new { |h, scheme|
+      h[scheme] = lambda { |link, page|
+        raise Mechanize::UnsupportedSchemeError, scheme
+      }
+    }
+
+    @scheme_handlers['http'] = lambda { |link, page| link }
+    @scheme_handlers['https'] = @scheme_handlers['http']
+    @scheme_handlers['relative'] = @scheme_handlers['http']
+    @scheme_handlers['file'] = @scheme_handlers['http']

     @pre_connect_hooks = []
     @post_connect_hooks = []
@@ -243,6 +262,11 @@ class Mechanize
     nil
   end

+  def user_agent=(value)
+    @webrobots = nil if value != @user_agent
+    @user_agent = value
+  end
+
   # Set the user agent for the Mechanize object.
   # See AGENT_ALIASES
   def user_agent_alias=(al)
@@ -263,17 +287,15 @@ class Mechanize
   alias :basic_auth :auth

   # Fetches the URL passed in and returns a page.
-  def get(
+  def get(uri, parameters = [], referer = nil, headers = {})
     method = :get

-
-
-
-
-
-
-    else
-      raise ArgumentError, "url must be specified" unless url = options[:url]
+    if Hash === uri then
+      options = uri
+      location = Gem.location_of_caller.join ':'
+      warn "#{location}: Mechanize#get with options hash is deprecated and will be removed October 2011"
+
+      raise ArgumentError, "url must be specified" unless uri = options[:url]
       parameters = options[:params] || []
       referer = options[:referer]
       headers = options[:headers]
@@ -281,7 +303,7 @@ class Mechanize
     end

     unless referer
-      if
+      if uri.to_s =~ %r{\Ahttps?://}
         referer = Page.new(nil, {'content-type'=>'text/html'})
       else
         referer = current_page || Page.new(nil, {'content-type'=>'text/html'})
@@ -299,7 +321,7 @@ class Mechanize

     # fetch the page
     headers ||= {}
-    page = fetch_page
+    page = fetch_page uri, method, headers, parameters, referer
     add_to_history(page)
     yield page if block_given?
     page
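The rewritten Mechanize#get above uses a blank referer page for an explicit http(s) URL and falls back to the current page otherwise. A stdlib-only sketch of just that check, with a made-up helper name standing in for the inline condition:

```ruby
# The same pattern Mechanize#get uses to decide whether the request
# URI is an absolute http(s) URL (and so needs no inherited referer).
ABSOLUTE_HTTP = %r{\Ahttps?://}

def needs_blank_referer?(uri)
  # =~ returns a match position or nil; coerce to a boolean
  uri.to_s =~ ABSOLUTE_HTTP ? true : false
end

needs_blank_referer?('http://example.com/')  # => true
needs_blank_referer?('/relative/path')       # => false
```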
@@ -347,6 +369,14 @@ class Mechanize
   # Mechanize::Page::Link object passed in.  Returns the page fetched.
   def click(link)
     case link
+    when Page::Link
+      referer = link.page || current_page()
+      if robots
+        if (referer.is_a?(Page) && referer.parser.nofollow?) || link.rel?('nofollow')
+          raise RobotsDisallowedError.new(link.href)
+        end
+      end
+      get link.href, [], referer
     when String, Regexp
       if real_link = page.link_with(:text => link)
         click real_link
@@ -359,10 +389,10 @@ class Mechanize
        submit form, button if form
      end
    else
-      referer =
+      referer = current_page()
       href = link.respond_to?(:href) ? link.href :
              (link['href'] || link['src'])
-      get
+      get href, [], referer
     end
   end

@@ -421,11 +451,10 @@ class Mechanize
     when 'POST'
       post_form(form.action, form, headers)
     when 'GET'
-      get(
-
-
-
-      )
+      get(form.action.gsub(/\?[^\?]*$/, ''),
+          form.build_query,
+          form.page,
+          headers)
     else
       raise ArgumentError, "unsupported method: #{form.method.upcase}"
     end
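For GET form submission, the code above strips any query string already present on the form action before the form's own fields are appended. The same gsub in isolation (the helper name is hypothetical):

```ruby
# Remove a trailing "?query" portion from a form action, as the GET
# branch of form submission does before rebuilding the query string.
def strip_query(action)
  action.gsub(/\?[^\?]*$/, '')
end

strip_query('/search?q=old')  # => "/search"
strip_query('/search')        # => "/search"
```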
@@ -473,6 +502,36 @@ class Mechanize
     end
   end

+  # Tests if this agent is allowed to access +url+, consulting the
+  # site's robots.txt.
+  def robots_allowed?(uri)
+    return true if uri.request_uri == '/robots.txt'
+
+    webrobots.allowed?(uri)
+  end
+
+  # Equivalent to !robots_allowed?(url).
+  def robots_disallowed?(url)
+    !webrobots.allowed?(url)
+  end
+
+  # Returns an error object if there is an error in fetching or
+  # parsing robots.txt of the site +url+.
+  def robots_error(url)
+    webrobots.error(url)
+  end
+
+  # Raises the error if there is an error in fetching or parsing
+  # robots.txt of the site +url+.
+  def robots_error!(url)
+    webrobots.error!(url)
+  end
+
+  # Removes robots.txt cache for the site +url+.
+  def robots_reset(url)
+    webrobots.reset(url)
+  end
+
   alias :page :current_page

   def connection_for uri
@@ -608,6 +667,69 @@ class Mechanize
     request['User-Agent'] = @user_agent if @user_agent
   end

+  def resolve(uri, referer = current_page())
+    uri = uri.dup if uri.is_a?(URI)
+
+    unless uri.is_a?(URI)
+      uri = uri.to_s.strip.gsub(/[^#{0.chr}-#{126.chr}]/o) { |match|
+        if RUBY_VERSION >= "1.9.0"
+          Mechanize::Util.uri_escape(match)
+        else
+          sprintf('%%%X', match.unpack($KCODE == 'UTF8' ? 'U' : 'C')[0])
+        end
+      }
+
+      unescaped = uri.split(/(?:%[0-9A-Fa-f]{2})+|#/)
+      escaped = uri.scan(/(?:%[0-9A-Fa-f]{2})+|#/)
+
+      escaped_uri = Mechanize::Util.html_unescape(
+        unescaped.zip(escaped).map { |x,y|
+          "#{WEBrick::HTTPUtils.escape(x)}#{y}"
+        }.join('')
+      )
+
+      begin
+        uri = URI.parse(escaped_uri)
+      rescue
+        uri = URI.parse(WEBrick::HTTPUtils.escape(escaped_uri))
+      end
+    end
+
+    scheme = uri.relative? ? 'relative' : uri.scheme.downcase
+    uri = @scheme_handlers[scheme].call(uri, referer)
+
+    if referer && referer.uri
+      if uri.path.length == 0 && uri.relative?
+        uri.path = referer.uri.path
+      end
+    end
+
+    uri.path = '/' if uri.path.length == 0
+
+    if uri.relative?
+      raise ArgumentError, "absolute URL needed (not #{uri})" unless
+        referer && referer.uri
+
+      base = nil
+      if referer.respond_to?(:bases) && referer.parser
+        base = referer.bases.last
+      end
+
+      uri = ((base && base.uri && base.uri.absolute?) ?
+             base.uri :
+             referer.uri) + uri
+      uri = referer.uri + uri
+      # Strip initial "/.." bits from the path
+      uri.path.sub!(/^(\/\.\.)+(?=\/)/, '')
+    end
+
+    unless ['http', 'https', 'file'].include?(uri.scheme.downcase)
+      raise ArgumentError, "unsupported scheme: #{uri.scheme}"
+    end
+
+    uri
+  end
+
   def resolve_parameters uri, method, parameters
     case method
     when :head, :get, :delete, :trace then
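The resolve method added above merges a relative URI onto the referer's (or a <base> tag's) absolute URI and strips leading "/.." path segments. A stdlib-only sketch of that core merge step, omitting the charset escaping and scheme handlers (the helper name is made up):

```ruby
require 'uri'

# Merge a relative reference onto an absolute base with URI#+, then
# strip any leading "/.." segments left in the path, as resolve does.
def resolve_against(base, relative)
  uri = URI.parse(base) + relative
  uri.path = uri.path.sub(/^(\/\.\.)+(?=\/)/, '')
  uri
end

resolve_against('http://example.com/a/b.html', 'c.html').to_s
# => "http://example.com/a/c.html"
```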
@@ -707,7 +829,7 @@ class Mechanize
     response.read_body { |part|
       total += part.length
       body.write(part)
-      log.debug("Read #{total}
+      log.debug("Read #{part.length} bytes (#{total} total)") if log
     }

     body.rewind
@@ -814,8 +936,15 @@ class Mechanize

   private

-  def
-
+  def webrobots_http_get(uri)
+    get_file(uri)
+  rescue Mechanize::ResponseCodeError => e
+    return '' if e.response_code == '404'
+    raise e
+  end
+
+  def webrobots
+    @webrobots ||= WebRobots.new(@user_agent, :http_get => method(:webrobots_http_get))
   end

   def set_http proxy = nil
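The private webrobots accessor above lazily builds and memoizes its client, handing it webrobots_http_get as a Method object so robots.txt is fetched through Mechanize itself. The same lazy-initialization pattern with stand-in names (no webrobots dependency; all names here are illustrative):

```ruby
# Hedged sketch of the memoized-client pattern: built once on first use,
# with a fetcher passed in as a bound Method object.
class RobotsClientHolder
  def initialize
    @client = nil   # reset to nil elsewhere to force a rebuild
  end

  def http_get(uri)
    "GET #{uri}"    # stand-in for a real HTTP fetch
  end

  def client
    # ||= builds the client on the first call and reuses it afterwards
    @client ||= { :http_get => method(:http_get) }
  end
end

holder = RobotsClientHolder.new
holder.client[:http_get].call('http://example.com/robots.txt')
# => "GET http://example.com/robots.txt"
```

Invalidating the memo on user_agent or robots changes (as the setters above do with `@webrobots = nil`) ensures the client is rebuilt with the new identity.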
@@ -868,7 +997,7 @@ class Mechanize
     referer = current_page, redirects = 0
     referer_uri = referer ? referer.uri : nil

-    uri =
+    uri = resolve uri, referer

     uri, params = resolve_parameters uri, method, params

@@ -889,6 +1018,11 @@ class Mechanize

     pre_connect request

+    # Consult robots.txt
+    if robots && uri.is_a?(URI::HTTP)
+      robots_allowed?(uri) or raise RobotsDisallowedError.new(uri)
+    end
+
     # Add If-Modified-Since if page is in history
     if (page = visited_page(uri)) and page.response['Last-Modified']
       request['If-Modified-Since'] = page.response['Last-Modified']
@@ -921,7 +1055,13 @@ class Mechanize
     return meta if meta

     case response
-    when Net::HTTPSuccess
+    when Net::HTTPSuccess
+      if robots && page.is_a?(Page)
+        page.parser.noindex? and raise RobotsDisallowedError.new(uri)
+      end
+
+      page
+    when Mechanize::FileResponse
       page
     when Net::HTTPNotModified
       log.debug("Got cached page") if log
@@ -958,7 +1098,7 @@ require 'mechanize/pluggable_parsers'
 require 'mechanize/redirect_limit_reached_error'
 require 'mechanize/redirect_not_get_or_head_error'
 require 'mechanize/response_code_error'
+require 'mechanize/robots_disallowed_error'
 require 'mechanize/unsupported_scheme_error'
-require 'mechanize/uri_resolver'
 require 'mechanize/util'