mechanize 0.5.4 → 0.6.0

data/CHANGELOG CHANGED
@@ -1,5 +1,17 @@
  = Mechanize CHANGELOG
 
+ == 0.6.0
+
+ * Changed main parser to use hpricot
+ * Made WWW::Mechanize::Page class searchable like hpricot
+ * Updated WWW::Mechanize#click to support hpricot links like this:
+     @agent.click (page/"a").first
+ * Clicking a Frame is now possible:
+     @agent.click (page/"frame").first
+ * Removed deprecated attr_finder
+ * Removed REXML helper methods since the main parser is now hpricot
+ * Overhauled cookie parser to use WEBrick::Cookie
+
  == 0.5.4
 
  * Added WWW::Mechanize#transact for saving history state between in a
data/GUIDE ADDED
@@ -0,0 +1,125 @@
+ = Getting Started With WWW::Mechanize
+ This guide is meant to get you started using Mechanize. By the end of this
+ guide, you should be able to fetch pages, click links, fill out and submit
+ forms, scrape data, and do many other hopefully useful things. This guide
+ really just scratches the surface of what is available, but should be enough
+ information to get you going!
+
+ == Let's Fetch a Page!
+ First things first. Make sure that you've required mechanize and that you
+ instantiate a new mechanize object:
+   require 'rubygems'
+   require 'mechanize'
+
+   agent = WWW::Mechanize.new
+ Now we'll use the agent we've created to fetch a page. Let's fetch google
+ with our mechanize agent:
+   page = agent.get('http://google.com/')
+ What just happened? We told mechanize to go pick up google's main page.
+ Mechanize stored any cookies that were set, and followed any redirects that
+ google may have sent. The agent gave us back a page that we can use to
+ scrape data, find links to click, or find forms to fill out.
+
+ Next, let's try finding some links to click.
+
+ == Finding Links
+ Mechanize returns a page object whenever you get a page, post, or submit a
+ form. When a page is fetched, the agent will parse the page and put a list
+ of links on the page object.
+
+ Now that we've fetched google's homepage, let's try listing all of the links:
+   page.links.each do |link|
+     puts link.text
+   end
+ We can list the links, but Mechanize gives a few shortcuts to help us find a
+ link to click on. Let's say we wanted to click the link whose text is 'News'.
+ Normally, we would have to do this:
+   page = agent.click page.links.find { |l| l.name == 'News' }
+ But Mechanize gives us a shortcut. Instead we can say this:
+   page = agent.click page.links.name('News')
+ That shortcut says "find all links with the name 'News'". You're probably
+ thinking "there could be multiple links with that text!", and you would be
+ correct! If you pass a list of links to the "click" method, Mechanize will
+ click on the first one. If you wanted to click on the second news link, you
+ could do this:
+   agent.click page.links.name('News')[1]
+ We can even find a link with a certain href like so:
+   page.links.href('/something')
+ Or chain them together to find a link with certain text and certain href:
+   page.links.name('News').href('/something')
+
+ These shortcuts that mechanize provides are available on any list that you
+ can fetch like frames, iframes, or forms. Now that we know how to find and
+ click links, let's try something more complicated like filling out a form.
+
+ == Filling Out Forms
+ Let's continue with our google example. Here's the code we have so far:
+   require 'rubygems'
+   require 'mechanize'
+
+   agent = WWW::Mechanize.new
+   page = agent.get('http://google.com/')
+ If we pretty print the page, we can see that there is one form named 'f',
+ that has a couple of buttons and a few fields:
+   pp page
+ Now that we know the name of the form, let's fetch it off the page:
+   google_form = page.form('f')
+ Mechanize lets you access form input fields in a few different ways, but the
+ most convenient is that you can access input fields as accessors on the
+ object. So let's set the form field named 'q' on the form to 'ruby mechanize':
+   google_form.q = 'ruby mechanize'
+ To make sure that we set the value, let's pretty print the form, and you should
+ see a line similar to this:
+   #<WWW::Mechanize::Field:0x1403488 @name="q", @value="ruby mechanize">
+ If you saw that the value of 'q' changed, you're on the right track! Now we
+ can submit the form and 'press' the submit button and print the results:
+   page = agent.submit(google_form, google_form.buttons.first)
+   pp page
+ What we just did was equivalent to putting text in the search field and
+ clicking the 'Google Search' button. If we had submitted the form without
+ a button, it would be like typing in the text field and pressing the return
+ key.
+
+ Let's take a look at the code all together:
+   require 'rubygems'
+   require 'mechanize'
+
+   agent = WWW::Mechanize.new
+   page = agent.get('http://google.com/')
+   google_form = page.form('f')
+   google_form.q = 'ruby mechanize'
+   page = agent.submit(google_form)
+   pp page
+
+ Before we go on to screen scraping, let's take a look at forms a little more
+ in depth. Unless you want to skip ahead!
+
+ == Advanced Form Techniques
+ In this section, I want to touch on using the different types of input fields
+ possible with a form. Password and textarea fields can be treated just like
+ text input fields. Select fields are very similar to text fields, but they
+ have many options associated with them. If you select one option, mechanize
+ will deselect the other options (unless it is a multi select!).
+
+ For example, let's select an option on a list:
+   form.fields.name('list').options[0].select
+
+ Now let's take a look at checkboxes and radio buttons. To select a checkbox,
+ just check it like this:
+   form.checkboxes.name('box').check
+ Radio buttons are very similar to checkboxes, but they know how to uncheck
+ other radio buttons of the same name. Just check a radio button like you
+ would a checkbox:
+   form.radiobuttons.name('box')[1].check
+ Mechanize also makes file uploads easy! Just find the file upload field, and
+ tell it what file name you want to upload:
+   form.file_uploads.file_name = "somefile.jpg"
+
+ == Scraping Data
+ Mechanize uses hpricot[http://code.whytheluckystiff.net/hpricot/] to parse
+ html. What does this mean for you? You can treat a mechanize page like
+ an hpricot object. After you have used Mechanize to navigate to the page
+ that you need to scrape, then scrape it using hpricot methods:
+   agent.get('http://someurl.com/').search("//p[@class='posted']")
+ For more information on this powerful scraper, take a look at
+ HpricotBasics[http://code.whytheluckystiff.net/hpricot/wiki/HpricotBasics]
data/NOTES CHANGED
@@ -1,5 +1,33 @@
  = Mechanize Release Notes
 
+ == 0.6.0 (Rufus)
+
+ WWW::Mechanize 0.6.0 aka Rufus is ready! This hpricot flavored pie has
+ finished cooling on the window sill and is ready for you to eat. But if you
+ don't want to eat it, you can just download it and use it. I would
+ understand that.
+
+ The best new feature in this release in my opinion is the hpricot flavoring
+ packed inside. Mechanize now uses hpricot as its html parser. This means
+ mechanize gets a huge speed boost, and you can use the power of hpricot for
+ scraping data. Page objects returned from mechanize will allow you to use
+ hpricot search methods:
+   agent.get('http://rubyforge.org').search("//strong")
+ or
+   agent.get('http://rubyforge.org')/"strong"
+
+ The click method on mechanize has been updated so that you can click on links
+ you find using hpricot methods:
+   agent.click (page/"a").first
+ Or click on frames:
+   agent.click (page/"frame").first
+
+ The cookie parser has been overhauled to be more RFC 2109 compliant and to
+ use WEBrick cookies. Dependencies on ruby-web and mime-types have been
+ removed in favor of using hpricot and WEBrick respectively.
+
+ attr_finder and REXML helper methods have been removed.
+
  == 0.5.4 (Sylvester)
 
  WWW::Mechanize 0.5.4 aka Sylvester is fresh out of the frying pan and in to
data/README CHANGED
@@ -1,20 +1,23 @@
  = WWW::Mechanize
 
- The Mechanize library is used for automating interaction with a website. It
+ The Mechanize library is used for automating interaction with websites.
+ Mechanize automatically stores and sends cookies, follows redirects,
  can follow links, and submit forms. Form fields can be populated and
- submitted. A history of URL's is maintained and can be queried.
+ submitted. Mechanize also keeps track of the sites that you have visited as
+ a history.
 
  == Dependencies
 
  * ruby 1.8.2
+ * hpricot[http://code.whytheluckystiff.net/hpricot/]
 
  Note that the files in the net-overrides/ directory are taken from Ruby 1.9.0.
 
- * ruby-web 1.1.0 (http://rubyforge.org/projects/ruby-web/)
 
  == Examples
 
- See the EXAMPLES[link://files/EXAMPLES.html] file
+ If you are just starting, check out the GUIDE[link://files/GUIDE.html].
+ Also, check out the EXAMPLES[link://files/EXAMPLES.html] file.
 
  == Authors
 
@@ -24,7 +27,8 @@ Copyright (c) 2005 by Michael Neumann (mneumann@ntecs.de)
  New Code:
  Copyright (c) 2006 by Aaron Patterson (aaronp@rubyforge.org)
 
- This library comes with a shameless plug for employing me (Aaron) programming
+ This library comes with a shameless plug for employing me
+ (Aaron[http://tenderlovemaking.com/]) programming
  Ruby, my favorite language!
 
  == License
data/lib/mechanize.rb CHANGED
@@ -15,11 +15,10 @@ require 'net/http'
  require 'net/https'
 
  require 'uri'
- require 'webrick'
+ require 'webrick/httputils'
  require 'zlib'
  require 'stringio'
- require 'web/htmltools/xmltree' # narf
- require 'mechanize/module'
+ require 'mechanize/hpricot'
  require 'mechanize/mech_version'
  require 'mechanize/cookie'
  require 'mechanize/errors'
@@ -29,7 +28,6 @@ require 'mechanize/form_elements'
  require 'mechanize/list'
  require 'mechanize/page'
  require 'mechanize/page_elements'
- require 'mechanize/parsing'
  require 'mechanize/inspect'
 
  module WWW
@@ -132,7 +130,7 @@ class Mechanize
 
  # Fetches the URL passed in and returns a page.
  def get(url)
-   cur_page = current_page() || Page.new
+   cur_page = current_page || Page.new( nil, {'content-type'=>'text/html'})
 
    # fetch the page
    abs_uri = to_absolute_uri(url, cur_page)
@@ -151,7 +149,9 @@ class Mechanize
  # Clicks the WWW::Mechanize::Link object passed in and returns the
  # page fetched.
  def click(link)
-   uri = to_absolute_uri(link.href.strip)
+   uri = to_absolute_uri(
+     link.attributes['href'] || link.attributes['src'] || link.href
+   )
    get(uri)
  end
 
@@ -168,11 +168,10 @@ class Mechanize
  # or
  #   agent.post('http://example.com/', [ ["foo", "bar"] ])
  def post(url, query={})
-   cur_page = current_page() || Page.new
-
-   node = REXML::Element.new
-   node.add_attribute('method', 'POST')
-   node.add_attribute('enctype', 'application/x-www-form-urlencoded')
+   node = Hpricot::Elem.new(Hpricot::STag.new('form'))
+   node.attributes = {}
+   node.attributes['method'] = 'POST'
+   node.attributes['enctype'] = 'application/x-www-form-urlencoded'
 
    form = Form.new(node)
    query.each { |k,v|
@@ -246,7 +245,7 @@ class Mechanize
  end
 
  def post_form(url, form)
-   cur_page = current_page() || Page.new
+   cur_page = current_page || Page.new(nil, {'content-type'=>'text/html'})
 
    request_data = form.request_data
 
@@ -279,7 +278,7 @@ class Mechanize
 
  log.info("#{ request.class }: #{ uri.to_s }") if log
 
- page = Page.new(uri)
+ page = nil
 
  http_obj = Net::HTTP.new( uri.host,
                            uri.port,
@@ -323,7 +322,7 @@ class Mechanize
  # Add User-Agent header to request
  request.add_field('User-Agent', @user_agent) if @user_agent
 
- request.basic_auth(@user, @password) if @user
+ request.basic_auth(@user, @password) if @user || @password
 
  # Log specified headers for the request
  if log
@@ -348,7 +347,7 @@ class Mechanize
  (response.get_fields('Set-Cookie')||[]).each do |cookie|
    Cookie::parse(uri, cookie) { |c|
      log.debug("saved cookie: #{c}") if log
-     @cookie_jar.add(c)
+     @cookie_jar.add(uri, c)
    }
  end
 
@@ -1,69 +1,48 @@
  require 'yaml'
  require 'time'
+ require 'webrick/cookie'
 
  module WWW
    class Mechanize
      # This class is used to represent an HTTP Cookie.
-     class Cookie
-       attr_reader :name, :value, :path, :domain, :expires, :secure
-       def initialize(cookie)
-         @name = cookie[:name]
-         @value = cookie[:value]
-         @path = cookie[:path]
-         @domain = cookie[:domain]
-         @expires = cookie[:expires]
-         @secure = cookie[:secure]
-       end
-
-       def Cookie::parse(uri, raw_cookie, &block)
-         esc = raw_cookie.gsub(/(expires=[^,]*),([^;]*(;|$))/i) { "#{$1}#{$2}" }
-         esc.split(/,/).each do |cookie_text|
-           cookie = Hash.new
-           valid_cookie = true
-           cookie_text.split(/; ?/).each do |data|
-             name, value = data.split('=', 2)
-             next unless name
-
-             name.strip!
-
-             # Set the cookie to invalid if the domain is incorrect
-             case name.downcase
-             when 'path'
-               cookie[:path] = value
+     class Cookie < WEBrick::Cookie
+       def self.parse(uri, str)
+         cookies = []
+         str.gsub(/(,([^;,]*=)|,$)/) { "\r\n#{$2}" }.split(/\r\n/).each { |c|
+           cookie_elem = c.split(/;/)
+           first_elem = cookie_elem.shift
+           first_elem.strip!
+           key, value = first_elem.split(/=/, 2)
+           cookie = new(key, WEBrick::HTTPUtils.dequote(value))
+           cookie_elem.each{|pair|
+             pair.strip!
+             key, value = pair.split(/=/, 2)
+             if value
+               value = WEBrick::HTTPUtils.dequote(value.strip)
+             end
+             case key.downcase
+             when "domain" then cookie.domain = value.sub(/^\./, '')
+             when "path" then cookie.path = value
              when 'expires'
-               cookie[:expires] = begin
+               cookie.expires = begin
                  Time::parse(value)
                rescue
                  Time.now
                end
-             when 'secure'
-               cookie[:secure] = true
-             when 'domain' # Reject the cookie if it isn't for this domain
-               cookie[:domain] = value.sub(/^\./, '')
-
-               # Reject cookies not for this domain
-               # TODO Move the logic to reject based on host to the jar
-               unless uri.host =~ /#{cookie[:domain]}$/
-                 valid_cookie = false
-               end
-             when 'httponly'
-               # do nothing
-               # http://msdn.microsoft.com/workshop/author/dhtml/httponly_cookies.asp
-             else
-               cookie[:name] = name
-               cookie[:value] = value
+             when "max-age" then cookie.max_age = Integer(value)
+             when "comment" then cookie.comment = value
+             when "version" then cookie.version = Integer(value)
+             when "secure" then cookie.secure = true
              end
-           end
-
-           # Don't yield this cookie if it is invalid
-           next unless valid_cookie
-
-           cookie[:path] ||= uri.path
-           cookie[:secure] ||= false
-           cookie[:domain] ||= uri.host
-
-           yield Cookie.new(cookie)
-         end
+           }
+           cookie.path ||= uri.path
+           cookie.secure ||= false
+           cookie.domain ||= uri.host
+           # Move this in to the cookie jar
+           yield cookie if block_given?
+           cookies << cookie
+         }
+         return cookies
        end
 
        def to_s
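The header-splitting step in the new `Cookie.parse` can be tried in isolation. The sketch below reuses the regular expression from the added lines; the helper name `split_set_cookie` and the sample header are assumptions made for illustration and are not part of the gem. It is meant to show that a comma counts as a cookie separator only when a `name=` run follows it, so the comma inside an `expires` date is left alone.

```ruby
# Hypothetical helper (not part of mechanize): isolates the splitting step
# of Cookie.parse. The regular expression is copied from the diff above.
def split_set_cookie(header)
  # A comma followed by "name=" (with no ';' or ',' in between) starts a new
  # cookie; a trailing comma is dropped. Any other comma is kept as-is.
  header.gsub(/(,([^;,]*=)|,$)/) { "\r\n#{$2}" }.split(/\r\n/).map { |c| c.strip }
end

# Made-up sample header that carries two cookies in one value.
header = "a=1; expires=Wed, 01 Jan 2030 00:00:00 GMT; path=/, b=2; path=/x"
split_set_cookie(header).each { |cookie_text| puts cookie_text }
```

Each returned string would then go through the attribute loop (`domain`, `path`, `expires`, and so on) shown in the hunk.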
@@ -81,7 +60,8 @@ module WWW
  end
 
  # Add a cookie to the Jar.
- def add(cookie)
+ def add(uri, cookie)
+   return unless uri.host =~ /#{cookie.domain}$/
    unless @jar.has_key?(cookie.domain)
      @jar[cookie.domain] = Hash.new
    end
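The guard moved into the jar's `add` method is a tail match of the request host against the cookie domain. A minimal sketch of that check follows, assuming a hypothetical helper name `host_accepts_domain?` and made-up hosts; only the regular expression pattern comes from the diff.

```ruby
require 'uri'

# Hypothetical helper (not part of mechanize): mirrors the guard added to
# the jar's add method, which interpolates the cookie domain into a regular
# expression anchored at the end of the host.
def host_accepts_domain?(uri, cookie_domain)
  !(uri.host =~ /#{cookie_domain}$/).nil?
end

puts host_accepts_domain?(URI.parse('http://www.example.com/'), 'example.com')
puts host_accepts_domain?(URI.parse('http://example.org/'), 'example.com')
# A plain tail match also accepts an unrelated host that shares the suffix.
puts host_accepts_domain?(URI.parse('http://badexample.com/'), 'example.com')
```

Because the domain is interpolated unescaped, each `.` in it acts as a regex wildcard, and the match is not anchored on a label boundary. A stricter check would be a design change beyond what this diff does.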
@@ -1,5 +1,3 @@
- require 'mime/types'
-
  module WWW
    class Mechanize
      # =Synopsis
@@ -26,12 +24,13 @@ module WWW
  attr_reader :form_node, :elements_node
  attr_accessor :method, :action, :name
 
- attr_finder :fields, :buttons, :file_uploads, :radiobuttons, :checkboxes
+ attr_reader :fields, :buttons, :file_uploads, :radiobuttons, :checkboxes
  attr_reader :enctype
 
  def initialize(form_node, elements_node)
    @form_node, @elements_node = form_node, elements_node
 
+   @form_node.attributes ||= {}
    @method = (@form_node.attributes['method'] || 'GET').upcase
    @action = @form_node.attributes['action']
    @name = @form_node.attributes['name']
@@ -41,22 +40,6 @@ module WWW
    parse
  end
 
- # In the case of malformed HTML, fields of multiple forms might occure in this forms'
- # field array. If the fields have the same name, posterior fields overwrite former fields.
- # To avoid this, this method rejects all posterior duplicate fields.
-
- def uniq_fields!
-   names_in = {}
-   fields.reject! {|f|
-     if names_in.include?(f.name)
-       true
-     else
-       names_in[f.name] = true
-       false
-     end
-   }
- end
-
  # This method builds an array of arrays that represent the query
  # parameters to be used with this form. The return value can then
  # be used to create a query string for this form.
@@ -130,38 +113,45 @@ module WWW
    @radiobuttons = WWW::Mechanize::List.new
    @checkboxes = WWW::Mechanize::List.new
 
-   @elements_node.each_recursive {|node|
+   # Find all input tags
+   (@elements_node/'input').each do |node|
+     node.attributes ||= {}
      type = (node.attributes['type'] || 'text').downcase
+     name = node.attributes['name']
+     next if type != 'submit' && name.nil?
+     case type
+     when 'text', 'password', 'hidden', 'int'
+       @fields << Field.new(node.attributes['name'], node.attributes['value'] || '')
+     when 'radio'
+       @radiobuttons << RadioButton.new(node.attributes['name'], node.attributes['value'], node.attributes.has_key?('checked'), self)
+     when 'checkbox'
+       @checkboxes << CheckBox.new(node.attributes['name'], node.attributes['value'], node.attributes.has_key?('checked'), self)
+     when 'file'
+       @file_uploads << FileUpload.new(node.attributes['name'], nil)
+     when 'submit'
+       @buttons << Button.new(node.attributes['name'], node.attributes['value'])
+     when 'image'
+       @buttons << ImageButton.new(node.attributes['name'], node.attributes['value'])
+     end
+   end
 
-     # Don't add fields that don't have a name
-     next if type != 'submit' && node.attributes['name'].nil?
+   # Find all textarea tags
+   (@elements_node/'textarea').each do |node|
+     next if node.attributes.nil?
+     next if node.attributes['name'].nil?
+     @fields << Field.new(node.attributes['name'], node.all_text)
+   end
 
-     case node.name.downcase
-     when 'input'
-       case type
-       when 'text', 'password', 'hidden', 'int'
-         @fields << Field.new(node.attributes['name'], node.attributes['value'] || '')
-       when 'radio'
-         @radiobuttons << RadioButton.new(node.attributes['name'], node.attributes['value'], node.attributes.has_key?('checked'), self)
-       when 'checkbox'
-         @checkboxes << CheckBox.new(node.attributes['name'], node.attributes['value'], node.attributes.has_key?('checked'), self)
-       when 'file'
-         @file_uploads << FileUpload.new(node.attributes['name'], nil)
-       when 'submit'
-         @buttons << Button.new(node.attributes['name'], node.attributes['value'])
-       when 'image'
-         @buttons << ImageButton.new(node.attributes['name'], node.attributes['value'])
-       end
-     when 'textarea'
-       @fields << Field.new(node.attributes['name'], node.all_text)
-     when 'select'
-       if node.attributes.has_key? 'multiple'
-         @fields << MultiSelectList.new(node.attributes['name'], node)
-       else
-         @fields << SelectList.new(node.attributes['name'], node)
-       end
+   # Find all select tags
+   (@elements_node/'select').each do |node|
+     next if node.attributes.nil?
+     next if node.attributes['name'].nil?
+     if node.attributes.has_key? 'multiple'
+       @fields << MultiSelectList.new(node.attributes['name'], node)
+     else
+       @fields << SelectList.new(node.attributes['name'], node)
      end
-   }
+   end
  end
 
  def rand_string(len = 10)
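The rewritten input loop above buckets each `input` tag by its `type` attribute. The sketch below reproduces only that dispatch, with plain hashes standing in for hpricot nodes. The helper name `classify_inputs`, the bucket keys, and the sample data are assumptions for illustration; the real code builds `Field`, `RadioButton`, `CheckBox`, `FileUpload`, `Button`, and `ImageButton` objects instead of collecting names.

```ruby
# Hypothetical stand-in (not part of mechanize): each input is a plain Hash
# of its HTML attributes, and each bucket collects the input names.
def classify_inputs(inputs)
  buckets = { fields: [], radiobuttons: [], checkboxes: [],
              file_uploads: [], buttons: [] }
  inputs.each do |attrs|
    type = (attrs['type'] || 'text').downcase   # a missing type means 'text'
    name = attrs['name']
    next if type != 'submit' && name.nil?       # unnamed inputs are skipped, except submit
    case type
    when 'text', 'password', 'hidden', 'int' then buckets[:fields] << name
    when 'radio'                             then buckets[:radiobuttons] << name
    when 'checkbox'                          then buckets[:checkboxes] << name
    when 'file'                              then buckets[:file_uploads] << name
    when 'submit', 'image'                   then buckets[:buttons] << name
    end
  end
  buckets
end

# Made-up sample: a text field, an unnamed submit, an unnamed checkbox, a radio.
sample = [{ 'name' => 'q' }, { 'type' => 'SUBMIT' },
          { 'type' => 'checkbox' }, { 'type' => 'radio', 'name' => 'r' }]
p classify_inputs(sample)
```

As in the diff, an unnamed submit button is still kept, while any other unnamed input is dropped before the `case` runs.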
@@ -189,7 +179,8 @@ module WWW
 
  if file.file_data.nil? and ! file.file_name.nil?
    file.file_data = ::File.open(file.file_name, "rb") { |f| f.read }
-   file.mime_type = MIME::Types.type_for(file.file_name).first
+   file.mime_type = WEBrick::HTTPUtils.mime_type(file.file_name,
+                      WEBrick::HTTPUtils::DefaultMimeTypes)
  end
 
  if file.mime_type != nil