RubyGems - mechanize - Versions diffs - 0.5.4 → 0.6.0 - Mend

mechanize 0.5.4 → 0.6.0

Potentially problematic release.

This version of mechanize might be problematic.

Files changed (30)

data/CHANGELOG +12 -0
data/GUIDE +125 -0
data/NOTES +28 -0
data/README +9 -5
data/lib/mechanize.rb +14 -15
data/lib/mechanize/cookie.rb +35 -55
data/lib/mechanize/form.rb +39 -48
data/lib/mechanize/form_elements.rb +7 -9
data/lib/mechanize/hpricot.rb +12 -0
data/lib/mechanize/inspect.rb +0 -6
data/lib/mechanize/mech_version.rb +1 -3
data/lib/mechanize/page.rb +70 -115
data/lib/mechanize/page_elements.rb +10 -6
data/test/htdocs/frame_test.html +1 -1
data/test/htdocs/tc_no_attributes.html +16 -0
data/test/tc_checkboxes.rb +8 -8
data/test/tc_cookie_jar.rb +36 -28
data/test/tc_mech.rb +21 -1
data/test/tc_no_attributes.rb +20 -0
data/test/tc_page.rb +1 -1
data/test/tc_pluggable_parser.rb +31 -17
data/test/tc_pretty_print.rb +1 -1
data/test/tc_radiobutton.rb +4 -4
data/test/ts_mech.rb +1 -1
metadata +126 -134
data/lib/mechanize/module.rb +0 -27
data/lib/mechanize/parsing.rb +0 -224
data/test/parse.rb +0 -39
data/test/tc_parsing.rb +0 -64
data/test/test_mech.rb +0 -27

data/CHANGELOG CHANGED

@@ -1,5 +1,17 @@
 = Mechanize CHANGELOG
+== 0.6.0
+* Changed main parser to use hpricot
+* Made WWW::Mechanize::Page class searchable like hpricot
+* Updated WWW::Mechanize#click to support hpricot links like this:
+  @agent.click (page/"a").first
+* Clicking a Frame is now possible:
+  @agent.click (page/"frame").first
+* Removed deprecated attr_finder
+* Removed REXML helper methods since the main parser is now hpricot
+* Overhauled cookie parser to use WEBrick::Cookie
 == 0.5.4
 * Added WWW::Mechanize#trasact for saving history state between in a

data/GUIDE ADDED

@@ -0,0 +1,125 @@
+= Getting Started With WWW::Mechanize
+This guide is meant to get you started using Mechanize.  By the end of this
+guide, you should be able to fetch pages, click links, fill out and submit
+forms, scrape data, and many other hopefully useful things.  This guide
+really just scratches the surface of what is available, but should be enough
+information to get you really going!
+== Let's Fetch a Page!
+First things first.  Make sure that you've required mechanize and that you
+instantiate a new mechanize object:
+ require 'rubygems'
+ require 'mechanize'
+ agent = WWW::Mechanize.new
+Now we'll use the agent we've created to fetch a page.  Let's fetch google
+with our mechanize agent:
+ page = agent.get('http://google.com/')
+What just happened?  We told mechanize to go pick up google's main page.
+Mechanize stored any cookies that were set, and followed any redirects that
+google may have sent.  The agent gave us back a page that we can use to
+scrape data, find links to click, or find forms to fill out.
+Next, let's try finding some links to click.
+== Finding Links
+Mechanize returns a page object whenever you get a page, post, or submit a
+form.  When a page is fetched, the agent will parse the page and put a list
+of links on the page object.
+Now that we've fetched google's homepage, let's try listing all of the links:
+ page.links.each do |link|
+   puts link.text
+ end
+We can list the links, but Mechanize gives a few shortcuts to help us find a
+link to click on.  Let's say we wanted to click the link whose text is 'News'.
+Normally, we would have to do this:
+ page = agent.click page.links.find { |l| l.name == 'News' }
+But Mechanize gives us a shortcut.  Instead we can say this:
+ page = agent.click page.links.name('News')
+That shortcut says "find all links with the name 'News'".  You're probably
+thinking "there could be multiple links with that text!", and you would be
+correct!  If you pass a list of links to the "click" method, Mechanize will
+click on the first one.  If you wanted to click on the second news link, you
+could do this:
+ agent.click page.links.name('News')[1]
+We can even find a link with a certain href like so:
+ page.links.href('/something')
+Or chain them together to find a link with certain text and certain href:
+ page.links.name('News').href('/something')
+These shortcuts that mechanize provides are available on any list that you
+can fetch like frames, iframes, or forms.  Now that we know how to find and
+click links, let's try something more complicated like filling out a form.
+== Filling Out Forms
+Let's continue with our google example.  Here's the code we have so far:
+ require 'rubygems'
+ require 'mechanize'
+ agent = WWW::Mechanize.new
+ page = agent.get('http://google.com/')
+If we pretty print the page, we can see that there is one form named 'f'
+that has a couple of buttons and a few fields:
+ pp page
+Now that we know the name of the form, let's fetch it off the page:
+ google_form = page.form('f')
+Mechanize lets you access form input fields in a few different ways, but the
+most convenient is that you can access input fields as accessors on the
+object.  So let's set the form field named 'q' on the form to 'ruby mechanize':
+ google_form.q = 'ruby mechanize'
+To make sure that we set the value, let's pretty print the form, and you should
+see a line similar to this:
+ #<WWW::Mechanize::Field:0x1403488 @name="q", @value="ruby mechanize">
+If you saw that the value of 'q' changed, you're on the right track!  Now we
+can submit the form and 'press' the submit button and print the results:
+ page = agent.submit(google_form, google_form.buttons.first)
+ pp page
+What we just did was equivalent to putting text in the search field and
+clicking the 'Google Search' button.  If we had submitted the form without
+a button, it would be like typing in the text field and hitting the return
+key.
+Let's take a look at the code all together:
+ require 'rubygems'
+ require 'mechanize'
+ agent = WWW::Mechanize.new
+ page = agent.get('http://google.com/')
+ google_form = page.form('f')
+ google_form.q = 'ruby mechanize'
+ page = agent.submit(google_form)
+ pp page
+Before we go on to screen scraping, let's take a look at forms a little more
+in depth.  Unless you want to skip ahead!
+== Advanced Form Techniques
+In this section, I want to touch on using the different types of input fields
+possible with a form.  Password and textarea fields can be treated just like
+text input fields.  Select fields are very similar to text fields, but they
+have many options associated with them.  If you select one option, mechanize
+will deselect the other options (unless it is a multi select!).
+For example, let's select an option on a list:
+ form.fields.name('list').options[0].select
+Now let's take a look at checkboxes and radio buttons.  To select a checkbox,
+just check it like this:
+ form.checkboxes.name('box').check
+Radio buttons are very similar to checkboxes, but they know how to uncheck
+other radio buttons of the same name.  Just check a radio button like you
+would a checkbox:
+ form.radiobuttons.name('box')[1].check
+Mechanize also makes file uploads easy!  Just find the file upload field, and
+tell it what file name you want to upload:
+  form.file_uploads.file_name = "somefile.jpg"
+== Scraping Data
+Mechanize uses hpricot[http://code.whytheluckystiff.net/hpricot/] to parse
+html.  What does this mean for you?  You can treat a mechanize page like
+an hpricot object.  After you have used Mechanize to navigate to the page
+that you need to scrape, then scrape it using hpricot methods:
+  agent.get('http://someurl.com/').search("//p[@class='posted']")
+For more information on this powerful scraper, take a look at
+HpricotBasics[http://code.whytheluckystiff.net/hpricot/wiki/HpricotBasics]
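Reviewer note on the GUIDE's list shortcuts: a call such as `page.links.name('News').href('/something')` can be pictured as a chain of filters, each returning another list. The block below is a minimal sketch of that idea in plain Ruby. It is not Mechanize's implementation; `Link`, `FilterList`, `LinkList`, and `filterable` are hypothetical names made up for illustration.

```ruby
# Hypothetical stand-in for a link object; only the two attributes
# used by the guide's examples are modelled.
Link = Struct.new(:name, :href)

# An Array subclass whose filter methods return the same list type,
# so that filters can be chained.
class FilterList < Array
  # Define one filter method per attribute name (assumed helper).
  def self.filterable(*attrs)
    attrs.each do |attr|
      define_method(attr) do |wanted|
        # `===` lets callers pass either an exact String or a Regexp.
        self.class.new(select { |item| wanted === item.public_send(attr) })
      end
    end
  end
end

class LinkList < FilterList
  filterable :name, :href
end

links = LinkList.new([
  Link.new('News',   '/news'),
  Link.new('News',   '/something'),
  Link.new('Images', '/imghp')
])

news   = links.name('News')                     # both 'News' links
second = links.name('News')[1]                  # the second 'News' link
both   = links.name('News').href('/something')  # chained filters
```

Because every filter returns a `LinkList`, indexing with `[1]` and further chaining both work on the result, which mirrors how the guide uses the shortcuts.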

data/NOTES CHANGED

@@ -1,5 +1,33 @@
 = Mechanize Release Notes
+== 0.6.0 (Rufus)
+WWW::Mechanize 0.6.0 aka Rufus is ready!  This hpricot flavored pie has
+finished cooling on the window sill and is ready for you to eat.  But if you
+don't want to eat it, you can just download it and use it.  I would
+understand that.
+The best new feature in this release in my opinion is the hpricot flavoring
+packed inside.  Mechanize now uses hpricot as its html parser.  This means
+mechanize gets a huge speed boost, and you can use the power of hpricot for
+scraping data.  Page objects returned from mechanize will allow you to use
+hpricot search methods:
+ agent.get('http://rubyforge.org').search("//strong")
+or
+ agent.get('http://rubyforge.org')/"strong"
+The click method on mechanize has been updated so that you can click on links
+you find using hpricot methods:
+ agent.click (page/"a").first
+Or click on frames:
+ agent.click (page/"frame").first
+The cookie parser has been overhauled to be more RFC 2109 compliant and to
+use WEBrick cookies.  Dependencies on ruby-web and mime-types have been
+removed in favor of using hpricot and WEBrick respectively.
+attr_finder and REXML helper methods have been removed.
 == 0.5.4 (Sylvester)
 WWW::Mechanize 0.5.4 aka Sylvester is fresh out the frying pan and in to

data/README CHANGED

@@ -1,20 +1,23 @@
 = WWW::Mechanize
-The Mechanize library is used for automating interaction with a website.  It
+The Mechanize library is used for automating interaction with websites.
+Mechanize automatically stores and sends cookies, follows redirects,
 can follow links, and submit forms.  Form fields can be populated and
-submitted.  A history of URL's is maintained and can be queried.
+submitted.  Mechanize also keeps track of the sites that you have visited as
+a history.
 == Dependencies
 * ruby 1.8.2
+* hpricot[http://code.whytheluckystiff.net/hpricot/]
 Note that the files in the net-overrides/ directory are taken from Ruby 1.9.0.
-* ruby-web 1.1.0 (http://rubyforge.org/projects/ruby-web/)
 == Examples
-See the EXAMPLES[link://files/EXAMPLES.html] file
+If you are just starting, check out the GUIDE[link://files/GUIDE.html].
+Also, check out the EXAMPLES[link://files/EXAMPLES.html] file.
 == Authors
@@ -24,7 +27,8 @@ Copyright (c) 2005 by Michael Neumann (mneumann@ntecs.de)
 New Code:
 Copyright (c) 2006 by Aaron Patterson (aaronp@rubyforge.org)
-This library comes with a shameless plug for employing me (Aaron) programming
+This library comes with a shameless plug for employing me
+(Aaron[http://tenderlovemaking.com/]) programming
 Ruby, my favorite language!
 == License

data/lib/mechanize.rb CHANGED

@@ -15,11 +15,10 @@ require 'net/http'
 require 'net/https'
 require 'uri'
-require 'webrick'
+require 'webrick/httputils'
 require 'zlib'
 require 'stringio'
-require 'web/htmltools/xmltree'   # narf
-require 'mechanize/module'
+require 'mechanize/hpricot'
 require 'mechanize/mech_version'
 require 'mechanize/cookie'
 require 'mechanize/errors'
@@ -29,7 +28,6 @@ require 'mechanize/form_elements'
 require 'mechanize/list'
 require 'mechanize/page'
 require 'mechanize/page_elements'
-require 'mechanize/parsing'
 require 'mechanize/inspect'
 module WWW
@@ -132,7 +130,7 @@ class Mechanize
   # Fetches the URL passed in and returns a page.
   def get(url)
-    cur_page = current_page() || Page.new
+    cur_page = current_page || Page.new( nil, {'content-type'=>'text/html'})
     # fetch the page
     abs_uri = to_absolute_uri(url, cur_page)
@@ -151,7 +149,9 @@ class Mechanize
   # Clicks the WWW::Mechanize::Link object passed in and returns the
   # page fetched.
   def click(link)
-    uri = to_absolute_uri(link.href.strip)
+    uri = to_absolute_uri(
+      link.attributes['href'] || link.attributes['src'] || link.href
+    )
     get(uri)
   end
@@ -168,11 +168,10 @@ class Mechanize
   # or
   #  agent.post('http://example.com/', [ ["foo", "bar"] ])
   def post(url, query={})
-    cur_page = current_page() || Page.new
-    node = REXML::Element.new
-    node.add_attribute('method', 'POST')
-    node.add_attribute('enctype', 'application/x-www-form-urlencoded')
+    node = Hpricot::Elem.new(Hpricot::STag.new('form'))
+    node.attributes = {}
+    node.attributes['method'] = 'POST'
+    node.attributes['enctype'] = 'application/x-www-form-urlencoded'
     form = Form.new(node)
     query.each { |k,v|
@@ -246,7 +245,7 @@ class Mechanize
   end
   def post_form(url, form)
-    cur_page = current_page() || Page.new
+    cur_page = current_page || Page.new(nil, {'content-type'=>'text/html'})
     request_data = form.request_data
@@ -279,7 +278,7 @@ class Mechanize
     log.info("#{ request.class }: #{ uri.to_s }") if log
-    page = Page.new(uri)
+    page = nil
     http_obj = Net::HTTP.new( uri.host,
                           uri.port,
@@ -323,7 +322,7 @@ class Mechanize
     # Add User-Agent header to request
     request.add_field('User-Agent', @user_agent) if @user_agent
-    request.basic_auth(@user, @password) if @user
+    request.basic_auth(@user, @password) if @user || @password
     # Log specified headers for the request
     if log
@@ -348,7 +347,7 @@ class Mechanize
         (response.get_fields('Set-Cookie')||[]).each do |cookie|
           Cookie::parse(uri, cookie) { |c|
             log.debug("saved cookie: #{c}") if log
-            @cookie_jar.add(c)
+            @cookie_jar.add(uri, c)
           }
         end
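Reviewer note on the `click` change above: the new code looks up `href` first and falls back to `src`, which is what lets a frame element be clicked like a link. The block below is a small sketch of that fallback using only the standard library. `Node` and `click_target` are hypothetical names, and `URI.join` stands in for the library's own `to_absolute_uri`, whose behaviour is not shown in this diff.

```ruby
require 'uri'

# Hypothetical stand-in for a parsed element: just an attributes hash.
Node = Struct.new(:attributes)

# Pick the target the way the diff does (href first, then src) and
# resolve it against the URI of the current page.
def click_target(base_uri, node)
  ref = node.attributes['href'] || node.attributes['src']
  raise ArgumentError, 'node has neither href nor src' if ref.nil?
  URI.join(base_uri, ref)
end

anchor = Node.new({ 'href' => '/news' })
frame  = Node.new({ 'src'  => 'frames/left.html' })

link_uri  = click_target('http://example.com/index.html', anchor)
frame_uri = click_target('http://example.com/index.html', frame)
```

An element that carries neither attribute raises in this sketch; the diff itself falls back to `link.href` in that case.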

data/lib/mechanize/cookie.rb CHANGED

@@ -1,69 +1,48 @@
 require 'yaml'
 require 'time'
+require 'webrick/cookie'
 module WWW
   class Mechanize
   # This class is used to represent an HTTP Cookie.
-    class Cookie
-      attr_reader :name, :value, :path, :domain, :expires, :secure
-      def initialize(cookie)
-        @name     = cookie[:name]
-        @value    = cookie[:value]
-        @path     = cookie[:path]
-        @domain   = cookie[:domain]
-        @expires  = cookie[:expires]
-        @secure   = cookie[:secure]
-      end
-      def Cookie::parse(uri, raw_cookie, &block)
-        esc = raw_cookie.gsub(/(expires=[^,]*),([^;]*(;|$))/i) { "#{$1}#{$2}" }
-        esc.split(/,/).each do |cookie_text|
-          cookie = Hash.new
-          valid_cookie = true
-          cookie_text.split(/; ?/).each do |data|
-            name, value = data.split('=', 2)
-            next unless name
-            name.strip!
-            # Set the cookie to invalid if the domain is incorrect
-            case name.downcase
-            when 'path'
-              cookie[:path] = value
+    class Cookie < WEBrick::Cookie
+      def self.parse(uri, str)
+        cookies = []
+        str.gsub(/(,([^;,]*=)|,$)/) { "\r\n#{$2}" }.split(/\r\n/).each { |c|
+          cookie_elem = c.split(/;/)
+          first_elem = cookie_elem.shift
+          first_elem.strip!
+          key, value = first_elem.split(/=/, 2)
+          cookie = new(key, WEBrick::HTTPUtils.dequote(value))
+          cookie_elem.each{|pair|
+            pair.strip!
+            key, value = pair.split(/=/, 2)
+            if value
+              value = WEBrick::HTTPUtils.dequote(value.strip)
+            end
+            case key.downcase
+            when "domain"  then cookie.domain  = value.sub(/^\./, '')
+            when "path"    then cookie.path    = value
             when 'expires'
-              cookie[:expires] = begin
+              cookie.expires = begin
                 Time::parse(value)
               rescue
                 Time.now
               end
-            when 'secure'
-              cookie[:secure] = true
-            when 'domain' # Reject the cookie if it isn't for this domain
-              cookie[:domain] = value.sub(/^\./, '')
-              # Reject cookies not for this domain
-              # TODO Move the logic to reject based on host to the jar
-              unless uri.host =~ /#{cookie[:domain]}$/
-                valid_cookie = false
-              end
-            when 'httponly'
-              # do nothing
-          # http://msdn.microsoft.com/workshop/author/dhtml/httponly_cookies.asp
-            else
-              cookie[:name]  = name
-              cookie[:value] = value
+            when "max-age" then cookie.max_age = Integer(value)
+            when "comment" then cookie.comment = value
+            when "version" then cookie.version = Integer(value)
+            when "secure"  then cookie.secure = true
             end
-          end
-          # Don't yield this cookie if it is invalid
-          next unless valid_cookie
-          cookie[:path]    ||= uri.path
-          cookie[:secure]  ||= false
-          cookie[:domain]  ||= uri.host
-          yield Cookie.new(cookie)
-        end
+          }
+          cookie.path    ||= uri.path
+          cookie.secure  ||= false
+          cookie.domain  ||= uri.host
+          # Move this in to the cookie jar
+          yield cookie if block_given?
+          cookies << cookie
+        }
+        return cookies
       end
       def to_s
@@ -81,7 +60,8 @@ module WWW
       end
       # Add a cookie to the Jar.
-      def add(cookie)
+      def add(uri, cookie)
+        return unless uri.host =~ /#{cookie.domain}$/
         unless @jar.has_key?(cookie.domain)
           @jar[cookie.domain] = Hash.new
         end
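Reviewer note on the new cookie code: two expressions carry most of the logic, the `gsub`/`split` that separates several cookies packed into one `Set-Cookie` string, and the host check that moved into the jar's `add`. The sketch below copies both expressions from the diff into standalone helpers so they can be tried without WEBrick. `split_set_cookie` and `domain_allowed?` are hypothetical names, and the header string is made-up sample data.

```ruby
require 'uri'

# Split a combined Set-Cookie string into one string per cookie.
# A comma only counts as a separator when it is followed by "name="
# before the next ';' or ',', so the comma inside an expires date
# is left alone.
def split_set_cookie(header)
  header.gsub(/(,([^;,]*=)|,$)/) { "\r\n#{$2}" }.split(/\r\n/).map(&:strip)
end

# The jar's host check from the diff: the cookie domain is interpolated
# into a regexp that is anchored only at the end of the host.
def domain_allowed?(uri, cookie_domain)
  !!(uri.host =~ /#{cookie_domain}$/)
end

header = 'a=1; expires=Wed, 01 Jan 2020 00:00:00 GMT; path=/, ' \
         'b=2; domain=.example.com'
parts = split_set_cookie(header)

uri     = URI.parse('http://www.example.com/')
same    = domain_allowed?(uri, 'example.com')
foreign = domain_allowed?(uri, 'other.org')

# Caveat worth checking in review: the pattern is unescaped and has no
# left anchor, so an unrelated host with the same suffix also passes.
lookalike = domain_allowed?(URI.parse('http://badexample.com/'), 'example.com')
```

If stricter matching is wanted, one option would be to compare against `Regexp.escape(cookie_domain)` with a leading dot or start-of-string anchor; that is a suggestion, not something this release does.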

data/lib/mechanize/form.rb CHANGED

@@ -1,5 +1,3 @@
-require 'mime/types'
 module WWW
   class Mechanize
     # =Synopsis
@@ -26,12 +24,13 @@ module WWW
       attr_reader :form_node, :elements_node
       attr_accessor :method, :action, :name
-      attr_finder :fields, :buttons, :file_uploads, :radiobuttons, :checkboxes
+      attr_reader :fields, :buttons, :file_uploads, :radiobuttons, :checkboxes
       attr_reader :enctype
       def initialize(form_node, elements_node)
         @form_node, @elements_node = form_node, elements_node
+        @form_node.attributes ||= {}
         @method = (@form_node.attributes['method'] || 'GET').upcase
         @action = @form_node.attributes['action']
         @name = @form_node.attributes['name']
@@ -41,22 +40,6 @@ module WWW
         parse
       end
-      # In the case of malformed HTML, fields of multiple forms might occure in this forms'
-      # field array. If the fields have the same name, posterior fields overwrite former fields.
-      # To avoid this, this method rejects all posterior duplicate fields.
-      def uniq_fields!
-        names_in = {}
-        fields.reject! {|f|
-          if names_in.include?(f.name)
-            true
-          else
-            names_in[f.name] = true
-            false
-          end
-        }
-      end
       # This method builds an array of arrays that represent the query
       # parameters to be used with this form.  The return value can then
       # be used to create a query string for this form.
@@ -130,38 +113,45 @@ module WWW
         @radiobuttons = WWW::Mechanize::List.new
         @checkboxes   = WWW::Mechanize::List.new
-        @elements_node.each_recursive {|node|
+        # Find all input tags
+        (@elements_node/'input').each do |node|
+          node.attributes ||= {}
           type = (node.attributes['type'] || 'text').downcase
+          name = node.attributes['name']
+          next if type != 'submit' && name.nil?
+          case type
+          when 'text', 'password', 'hidden', 'int'
+            @fields << Field.new(node.attributes['name'], node.attributes['value'] || '')
+          when 'radio'
+            @radiobuttons << RadioButton.new(node.attributes['name'], node.attributes['value'], node.attributes.has_key?('checked'), self)
+          when 'checkbox'
+            @checkboxes << CheckBox.new(node.attributes['name'], node.attributes['value'], node.attributes.has_key?('checked'), self)
+          when 'file'
+            @file_uploads << FileUpload.new(node.attributes['name'], nil)
+          when 'submit'
+            @buttons << Button.new(node.attributes['name'], node.attributes['value'])
+          when 'image'
+            @buttons << ImageButton.new(node.attributes['name'], node.attributes['value'])
+          end
+        end
-          # Don't add fields that don't have a name
-          next if type != 'submit' && node.attributes['name'].nil?
+        # Find all textarea tags
+        (@elements_node/'textarea').each do |node|
+          next if node.attributes.nil?
+          next if node.attributes['name'].nil?
+          @fields << Field.new(node.attributes['name'], node.all_text)
+        end
-          case node.name.downcase
-          when 'input'
-            case type
-            when 'text', 'password', 'hidden', 'int'
-              @fields << Field.new(node.attributes['name'], node.attributes['value'] || '')
-            when 'radio'
-              @radiobuttons << RadioButton.new(node.attributes['name'], node.attributes['value'], node.attributes.has_key?('checked'), self)
-            when 'checkbox'
-              @checkboxes << CheckBox.new(node.attributes['name'], node.attributes['value'], node.attributes.has_key?('checked'), self)
-            when 'file'
-              @file_uploads << FileUpload.new(node.attributes['name'], nil)
-            when 'submit'
-              @buttons << Button.new(node.attributes['name'], node.attributes['value'])
-            when 'image'
-              @buttons << ImageButton.new(node.attributes['name'], node.attributes['value'])
-            end
-          when 'textarea'
-            @fields << Field.new(node.attributes['name'], node.all_text)
-          when 'select'
-            if node.attributes.has_key? 'multiple'
-              @fields << MultiSelectList.new(node.attributes['name'], node)
-            else
-              @fields << SelectList.new(node.attributes['name'], node)
-            end
+        # Find all select tags
+        (@elements_node/'select').each do |node|
+          next if node.attributes.nil?
+          next if node.attributes['name'].nil?
+          if node.attributes.has_key? 'multiple'
+            @fields << MultiSelectList.new(node.attributes['name'], node)
+          else
+            @fields << SelectList.new(node.attributes['name'], node)
           end
-        }
+        end
       end
       def rand_string(len = 10)
@@ -189,7 +179,8 @@ module WWW
         if file.file_data.nil? and ! file.file_name.nil?
           file.file_data = ::File.open(file.file_name, "rb") { |f| f.read }
-          file.mime_type = MIME::Types.type_for(file.file_name).first
+          file.mime_type = WEBrick::HTTPUtils.mime_type(file.file_name,
+                                          WEBrick::HTTPUtils::DefaultMimeTypes)
         end
         if file.mime_type != nil