mechanize 2.1 → 2.1.1

Potentially problematic release.

Files changed (32)

data.tar.gz.sig +0 -0
data/CHANGELOG.rdoc +28 -0
data/Manifest.txt +1 -1
data/README.rdoc +1 -1
data/Rakefile +1 -1
data/examples/wikipedia_links_to_philosophy.rb +159 -0
data/lib/mechanize.rb +68 -5
data/lib/mechanize/download.rb +9 -8
data/lib/mechanize/form.rb +8 -0
data/lib/mechanize/form/field.rb +8 -0
data/lib/mechanize/http/agent.rb +107 -65
data/lib/mechanize/http/www_authenticate_parser.rb +14 -0
data/lib/mechanize/page.rb +8 -10
data/lib/mechanize/page/meta_refresh.rb +8 -1
data/lib/mechanize/parser.rb +1 -1
data/lib/mechanize/response_read_error.rb +15 -4
data/lib/mechanize/test_case.rb +10 -0
data/lib/mechanize/util.rb +23 -15
data/test/htdocs/tc_referer.html +1 -1
data/test/test_mechanize.rb +48 -2
data/test/test_mechanize_download.rb +11 -1
data/test/test_mechanize_file.rb +7 -0
data/test/test_mechanize_form.rb +16 -1
data/test/test_mechanize_http_agent.rb +155 -26
data/test/test_mechanize_page_encoding.rb +6 -0
data/test/test_mechanize_page_meta_refresh.rb +10 -0
data/test/test_mechanize_parser.rb +10 -0
data/test/test_mechanize_response_read_error.rb +28 -0
data/test/test_mechanize_util.rb +5 -0
metadata +47 -30
metadata.gz.sig +0 -0
data/FAQ.rdoc +0 -11

data.tar.gz.sig CHANGED

Binary file

data/CHANGELOG.rdoc CHANGED

@@ -1,5 +1,33 @@
 = Mechanize CHANGELOG
+=== 2.1.1 / 2010-02-03
+* Bug fixes
+  * Set missing idle_timeout default.  Issue #196
+  * Meta refresh URIs are now escaped (excluding %).  Issue #177
+  * Fix charset name extraction.  Issue #180
+  * A Referer URI sent on request no longer includes user information
+    or fragment part.
+  * Tempfiles for storing response bodies are unlinked upon creation to avoid
+    possible lack of finalization.  Issue #183
+  * The default maximum history size is now 50 pages to avoid filling up a
+    disk with tempfiles accidentally.  Related to Issue #183
+  * Errors in bodies with deflate and gzip responses now result in a
+    Mechanize::Error instead of silently being ignored and causing future
+    errors.  Issue #185
+  * Mechanize now raises an UnauthorizedError instead of crashing when a 403
+    response does not contain a www-authenticate header.  Issue #181
+  * Mechanize gives a useful exception when attempting to click buttons across
+    pages.  Issue #186
+  * Added note to Mechanize#cert_store describing how to add certificates in
+    case your system does not come with a default set.  Issue #179
+  * Invalid content-disposition headers are now ignored.  Issue #191
+  * Fix NTLM by recognizing the "Negotiation" challenge instead of endlessly
+    looping.  Issue #192
+  * Allow specification of the NTLM domain through Mechanize#auth.  Issue #193
+  * Documented how to convert a Mechanize::ResponseReadError into a File or
+    Page, along with a new method #force_parse.  Issue #176
 === 2.1 / 2011-12-20
 * Deprecations

data/Manifest.txt CHANGED

@@ -1,7 +1,6 @@
 .autotest
 CHANGELOG.rdoc
 EXAMPLES.rdoc
-FAQ.rdoc
 GUIDE.rdoc
 LICENSE.rdoc
 Manifest.txt
@@ -12,6 +11,7 @@ examples/mech-dump.rb
 examples/proxy_req.rb
 examples/rubyforge.rb
 examples/spider.rb
+examples/wikipedia_links_to_philosophy.rb
 lib/mechanize.rb
 lib/mechanize/content_type_error.rb
 lib/mechanize/cookie.rb

data/README.rdoc CHANGED

@@ -13,7 +13,7 @@ a history.
 == Dependencies
-* ruby 1.8.7
+* ruby 1.8.7, 1.9.2, or 1.9.3
 * nokogiri[http://nokogiri.rubyforge.org]
 == SUPPORT:

data/Rakefile CHANGED

@@ -17,7 +17,7 @@ hoe = Hoe.spec 'mechanize' do
   rdoc_locations << 'drbrain@rubyforge.org:/var/www/gforge-projects/mechanize/'
   self.extra_deps << ['net-http-digest_auth', '~> 1.1', '>= 1.1.1']
-  self.extra_deps << ['net-http-persistent',  '~> 2.3', '>= 2.3.2']
+  self.extra_deps << ['net-http-persistent',  '~> 2.4', '>= 2.4.1']
   self.extra_deps << ['nokogiri',             '~> 1.4']
   self.extra_deps << ['ntlm-http',            '~> 0.1', '>= 0.1.1']
   self.extra_deps << ['webrobots',            '~> 0.0', '>= 0.0.9']

data/examples/wikipedia_links_to_philosophy.rb ADDED

@@ -0,0 +1,159 @@
+require 'mechanize'
+require 'tsort'
+##
+# This example implements the alt-text of http://xkcd.com/903/ which states:
+#
+# Wikipedia trivia: if you take any article, click on the first link in the
+# article text not in parentheses or italics, and then repeat, you will
+# eventually end up at "Philosophy".
+class WikipediaLinksToPhilosophy
+  def initialize
+    @agent = Mechanize.new
+    @agent.user_agent_alias = 'Mac Safari' # Wikipedia blocks "mechanize"
+    @history = @agent.history
+    @wiki_url = URI 'http://en.wikipedia.org'
+    @search_url = @wiki_url + '/w/index.php'
+    @random_url = @wiki_url + '/wiki/Special:Random'
+    @title = nil
+    @seen = nil
+  end
+  ##
+  # Retrieves the title of the current page
+  def extract_title
+    @page.title =~ /(.*) - Wikipedia/
+    @title = $1
+  end
+  ##
+  # Retrieves the initial page.  If +query+ is not given a random page is
+  # chosen
+  def fetch_first_page query
+    if query then
+      search query
+    else
+      random
+    end
+  end
+  ##
+  # The search is finished if we've seen the page before or we've reached
+  # Philosophy
+  def finished?
+    @seen or @title == 'Philosophy'
+  end
+  ##
+  # Follows the first non-parenthetical, non-italic link in the main body of
+  # the article.
+  def follow_first_link
+    puts @title
+    # > p > a rejects italics
+    links = @page.root.css('.mw-content-ltr > p > a[href^="/wiki/"]')
+    # reject disambiguation and special pages, images and files
+    links = links.reject do |link_node|
+      link_node['href'] =~ %r%/wiki/\w+:|\(disambiguation\)%
+    end
+    links = links.reject do |link_node|
+      in_parenthetical? link_node
+    end
+    link = links.first
+    unless link then
+      # disambiguation page? try the first item in the list
+      link =
+        @page.root.css('.mw-content-ltr > ul > li > a[href^="/wiki/"]').first
+    end
+    # convert a Nokogiri HTML element back to a mechanize link
+    link = Mechanize::Page::Link.new link, @agent, @page
+    return if @seen = @agent.visited?(link)
+    @page = link.click
+    extract_title
+  end
+  ##
+  # Is +link_node+ in an open parenthetical section?
+  def in_parenthetical? link_node
+    siblings = link_node.parent.children
+    seen = false
+    before = siblings.reject do |node|
+      seen or (seen = node == link_node)
+    end
+    preceding_text = before.map { |node| node.text }.join
+    open  = preceding_text.count '('
+    close = preceding_text.count ')'
+    open > close
+  end
+  ##
+  # Prints the result of the search
+  def print_result
+    if @seen then
+      puts "[Loop detected]"
+    else
+      puts @title
+    end
+    puts
+    # subtract initial search or Special:Random
+    puts "After #{@agent.history.length - 1} pages"
+  end
+  ##
+  # Retrieves a random page from wikipedia
+  def random
+    @page = @agent.get @random_url
+    extract_title
+  end
+  ##
+  # Entry point
+  def run query = nil
+    fetch_first_page query
+    follow_first_link until finished?
+    print_result
+  end
+  ##
+  # Searches for +query+ on wikipedia
+  def search query
+    @page = @agent.get @search_url, search: query
+    extract_title
+  end
+end
+WikipediaLinksToPhilosophy.new.run ARGV.shift if $0 == __FILE__

data/lib/mechanize.rb CHANGED

@@ -4,7 +4,6 @@ require 'iconv' if RUBY_VERSION < '1.9.2'
 require 'mutex_m'
 require 'net/http/digest_auth'
 require 'net/http/persistent'
-require 'nkf'
 require 'nokogiri'
 require 'openssl'
 require 'pp'
@@ -16,7 +15,7 @@ require 'zlib'
 ##
 # The Mechanize library is used for automating interactions with a website.  It
 # can follow links and submit forms.  Form fields can be populated and
-# submitted.  A history of URL's is maintained and can be queried.
+# submitted.  A history of URLs is maintained and can be queried.
 #
 # == Example
 #
@@ -33,13 +32,47 @@ require 'zlib'
 #
 #   search_results = agent.submit search_form
 #   puts search_results.body
+#
+# == Issues with mechanize
+#
+# If you think you have a bug with mechanize, but aren't sure, please file a
+# ticket at https://github.com/tenderlove/mechanize/issues
+#
+# Here are some common problems you may experience with mechanize
+#
+# === Problems connecting to SSL sites
+#
+# Mechanize defaults to validating SSL certificates using the default CA
+# certificates for your platform.  At this time, Windows users do not have
+# integration between the OS default CA certificates and OpenSSL.  #cert_store
+# explains how to download and use Mozilla's CA certificates to allow SSL
+# sites to work.
+#
+# === Problems with content-length
+#
+# Some sites return an incorrect content-length value.  Unlike a browser,
+# mechanize raises an error when the content-length header does not match the
+# response length since it does not know if there was a connection problem or
+# if the mismatch is a server bug.
+#
+# The error raised, Mechanize::ResponseReadError, can be converted to a parsed
+# Page, File, etc. depending upon the content-type:
+#
+#   agent = Mechanize.new
+#   uri = URI 'http://example/invalid_content_length'
+#
+#   begin
+#     page = agent.get uri
+#   rescue Mechanize::ResponseReadError => e
+#     page = e.force_parse
+#   end
 class Mechanize
   ##
   # The version of Mechanize you are using.
-  VERSION = '2.1'
+  VERSION = '2.1.1'
   ##
   # Base mechanize error class
@@ -137,6 +170,9 @@ class Mechanize
     @default_encoding = nil
     @force_default_encoding = false
+    # defaults
+    @agent.max_history = 50
     yield self if block_given?
     @agent.set_proxy @proxy_addr, @proxy_port, @proxy_user, @proxy_pass
@@ -179,6 +215,11 @@ class Mechanize
   ##
   # Sets the maximum number of items allowed in the history to +length+.
+  #
+  # Setting the maximum history length to nil will make the history size
+  # unlimited.  Take care when doing this, mechanize stores page bodies in the
+  # temporary files directory for pages in the history.  For a long-running
+  # mechanize program this can be quite large.
   def max_history= length
     @agent.history.max_size = length
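
The effect of a bounded history can be sketched without mechanize. This is a minimal illustration only: `BoundedHistory` is a hypothetical stand-in, not Mechanize::History, and it assumes the oldest entry is dropped first once the limit is exceeded. A `max_size` of nil means unlimited, matching the comment above.

```ruby
# Hypothetical stand-in for illustration; not the Mechanize::History class.
class BoundedHistory
  attr_accessor :max_size

  def initialize(max_size = 50)
    @max_size = max_size
    @pages = []
  end

  # Appends a page, then drops the oldest entries while over the limit.
  # A nil max_size disables trimming entirely.
  def push(page)
    @pages << page
    @pages.shift while @max_size && @pages.length > @max_size
    self
  end

  def length
    @pages.length
  end

  def to_a
    @pages.dup
  end
end

history = BoundedHistory.new 3
(1..5).each { |n| history.push "page #{n}" }
puts history.to_a.inspect
```

With an unlimited history every pushed entry is retained, which is why the comment warns about long-running programs.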
@@ -518,10 +559,12 @@ class Mechanize
   ##
   # Sets the user and password to be used for HTTP authentication.
+  # sets the optional domain for NTLM authentication
-  def auth(user, password)
+  def auth(user, password, domain = nil)
     @agent.user     = user
     @agent.password = password
+    @agent.domain = domain
   end
   alias basic_auth auth
@@ -869,7 +912,25 @@ class Mechanize
   ##
   # An OpenSSL certificate store for verifying server certificates.  This
-  # defaults to the default certificate store.
+  # defaults to the default certificate store for your system.
+  #
+  # If your system does not ship with a default set of certificates you can
+  # retrieve a copy of the set from Mozilla here:
+  # http://curl.haxx.se/docs/caextract.html
+  #
+  # (Note that this set does not have an HTTPS download option so you may
+  # wish to use the firefox-db2pem.sh script to extract the certificates
+  # from a local install to avoid man-in-the-middle attacks.)
+  #
+  # After downloading or generating a cacert.pem from the above link you
+  # can create a certificate store from the pem file like this:
+  #
+  #   cert_store = OpenSSL::X509::Store.new
+  #   cert_store.add_file 'cacert.pem'
+  #
+  # And have mechanize use it with:
+  #
+  #   agent.cert_store = cert_store
   def cert_store
     @agent.cert_store
@@ -877,6 +938,8 @@ class Mechanize
   ##
   # Sets the OpenSSL certificate store to +store+.
+  #
+  # See also #cert_store
   def cert_store= cert_store
     @agent.cert_store = cert_store

data/lib/mechanize/download.rb CHANGED

@@ -9,6 +9,12 @@ class Mechanize::Download
   include Mechanize::Parser
+  ##
+  # The filename for this file based on the content-disposition of the
+  # response or the basename of the URL
+  attr_accessor :filename
   ##
   # Accessor for the IO-like that contains the body
@@ -43,15 +49,10 @@ class Mechanize::Download
     dirname = File.dirname filename
     FileUtils.mkdir_p dirname
-    # Ruby 1.8.7 implements StringIO#path, can't use respond_to? :path
-    if StringIO === @body_io then
-      open filename, 'wb' do |io|
-        until @body_io.eof? do
-          io.write @body_io.read 16384
-        end
+    open filename, 'wb' do |io|
+      until @body_io.eof? do
+        io.write @body_io.read 16384
       end
-    else
-      FileUtils.mv @body_io.path, filename
     end
   end

data/lib/mechanize/form.rb CHANGED

@@ -255,6 +255,14 @@ class Mechanize::Form
   # This method adds a button to the query.  If the form needs to be
   # submitted with multiple buttons, pass each button to this method.
   def add_button_to_query(button)
+    unless button.node.document == @form_node.document then
+      message =
+        "#{button.inspect} does not belong to the same page as " \
+        "the form #{@name.inspect} in #{@page.uri}"
+      raise ArgumentError, message
+    end
     @clicked_buttons << button
   end

data/lib/mechanize/form/field.rb CHANGED

@@ -50,5 +50,13 @@ class Mechanize::Form::Field
   def dom_class
     node['class']
   end
+  def inspect # :nodoc:
+    "[%s:0x%x type: %s name: %s value: %s]" % [
+      self.class.name.sub(/Mechanize::Form::/, '').downcase,
+      object_id, @type, @name, @value
+    ]
+  end
 end

data/lib/mechanize/http/agent.rb CHANGED

@@ -47,6 +47,7 @@ class Mechanize::HTTP::Agent
   attr_reader :digest_challenges # :nodoc:
   attr_accessor :user
   attr_accessor :password
+  attr_accessor :domain
   # :section: Redirection
@@ -156,7 +157,7 @@ class Mechanize::HTTP::Agent
     @follow_meta_refresh_self = false
     @gzip_enabled             = true
     @history                  = Mechanize::History.new
-    @idle_timeout             = nil
+    @idle_timeout             = 5
     @keep_alive               = true
     @keep_alive_time          = 300
     @max_file_buffer          = 10240
@@ -184,6 +185,7 @@ class Mechanize::HTTP::Agent
     @digest_challenges    = {}
     @password             = nil # HTTP auth password
     @user                 = nil # HTTP auth user
+    @domain               = nil # NTLM HTTP domain
     # SSL
     @ca_file         = nil
@@ -264,7 +266,7 @@ class Mechanize::HTTP::Agent
     response = connection.request(uri, request) { |res|
       response_log res
-      response_body_io = response_read res, request
+      response_body_io = response_read res, request, uri
       res
     }
@@ -392,6 +394,62 @@ class Mechanize::HTTP::Agent
     end
   end
+  ##
+  # Decodes a gzip-encoded +body_io+.  If it cannot be decoded, inflate is
+  # tried followed by raising an error.
+  def content_encoding_gunzip body_io
+    log.debug('gzip response') if log
+    zio = Zlib::GzipReader.new body_io
+    out_io = Tempfile.new 'mechanize-decode'
+    out_io.unlink
+    out_io.binmode
+    until zio.eof? do
+      out_io.write zio.read 16384
+    end
+    zio.finish
+    return out_io
+  rescue Zlib::Error
+    log.error('unable to gunzip response, trying raw inflate') if log
+    body_io.rewind
+    body_io.read 10
+    begin
+      return inflate body_io, -Zlib::MAX_WBITS
+    rescue Zlib::Error => e
+      log.error("unable to gunzip response: #{e}") if log
+      raise
+    end
+  ensure
+    zio.close if zio and not zio.closed?
+  end
+  ##
+  # Decodes a deflate-encoded +body_io+.  If it cannot be decoded, raw inflate
+  # is tried followed by raising an error.
+  def content_encoding_inflate body_io
+    log.debug('deflate body') if log
+    return inflate body_io
+  rescue Zlib::Error
+    log.error('unable to inflate response, trying raw deflate') if log
+    body_io.rewind
+    begin
+      return inflate body_io, -Zlib::MAX_WBITS
+    rescue Zlib::Error => e
+      log.error("unable to inflate response: #{e}") if log
+      raise
+    end
+  end
   def disable_keep_alive request
     request['connection'] = 'close' unless @keep_alive
   end
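
The "try zlib-wrapped, then raw deflate" fallback added above can be reproduced with the standard zlib library alone. The sketch below is not the agent's code: `inflate_with_fallback` is a hypothetical helper that works on strings rather than IO objects, and it assumes the whole body fits in memory.

```ruby
require 'zlib'

# Hypothetical helper: inflates +data+ as a zlib-wrapped stream first and,
# if that fails, retries it as a raw deflate stream (no zlib header).
# A second failure propagates as a Zlib::Error to the caller.
def inflate_with_fallback(data)
  Zlib::Inflate.inflate data
rescue Zlib::Error
  raw = Zlib::Inflate.new(-Zlib::MAX_WBITS) # negative window bits: raw deflate
  begin
    raw.inflate data
  ensure
    raw.close
  end
end

# Some servers send raw deflate although "deflate" should be zlib-wrapped.
deflater = Zlib::Deflate.new Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS
raw_body = deflater.deflate 'hello', Zlib::FINISH
deflater.close

puts inflate_with_fallback(raw_body)
puts inflate_with_fallback(Zlib::Deflate.deflate('hello'))
```

Passing a negative window size to `Zlib::Inflate.new` is what selects raw deflate, the same value the diff passes to `inflate` on the retry path.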
@@ -491,11 +549,17 @@ class Mechanize::HTTP::Agent
     end
   end
+  # Sets a Referer header.  Fragment part is removed as demanded by
+  # RFC 2616 14.36, and user information part is removed just like
+  # major browsers do.
   def request_referer request, uri, referer
     return unless referer
     return if 'https' == referer.scheme.downcase and
               'https' != uri.scheme.downcase
+    if referer.fragment || referer.user || referer.password
+      referer = referer.dup
+      referer.fragment = referer.user = referer.password = nil
+    end
     request['Referer'] = referer
   end
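
The Referer cleanup above relies only on Ruby's standard URI class, so it can be tried in isolation. This is a sketch under that assumption: `sanitized_referer` is a hypothetical helper name and the example URL is made up.

```ruby
require 'uri'

# Hypothetical helper mirroring the idea in request_referer: returns a copy
# of +referer+ without fragment or user information, leaving the original
# URI object untouched.
def sanitized_referer(referer)
  return referer unless referer.fragment || referer.user || referer.password

  clean = referer.dup
  clean.fragment = clean.user = clean.password = nil
  clean
end

referer = URI.parse 'http://user:secret@example.com/page?q=1#section'
puts sanitized_referer(referer)
puts referer # the original still carries its credentials and fragment
```

Working on a `dup` matters because the referer is typically the URI of a page still held in the history, which should keep its original form.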
@@ -602,7 +666,11 @@ class Mechanize::HTTP::Agent
                             referer)
     raise Mechanize::UnauthorizedError, page unless @user || @password
-    challenges = @authenticate_parser.parse response['www-authenticate']
+    www_authenticate = response['www-authenticate']
+    raise Mechanize::UnauthorizedError, page unless www_authenticate
+    challenges = @authenticate_parser.parse www_authenticate
     if challenge = challenges.find { |c| c.scheme =~ /^Digest$/i } then
       realm = challenge.realm uri
@@ -631,7 +699,7 @@ class Mechanize::HTTP::Agent
       if challenge.params then
         type_2 = Net::NTLM::Message.decode64 challenge.params
-        type_3 = type_2.response({ :user => @user, :password => @password, },
+        type_3 = type_2.response({ :user => @user, :password => @password, :domain => @domain },
                                  { :ntlmv2 => true }).encode64
         headers['Authorization'] = "NTLM #{type_3}"
@@ -656,71 +724,42 @@ class Mechanize::HTTP::Agent
   end
   def response_content_encoding response, body_io
-    length = response.content_length
-    length = case body_io
-             when IO, Tempfile then
-               body_io.stat.size
-             else
-               body_io.length
-             end unless length
-    out_io = nil
-    case response['Content-Encoding']
-    when nil, 'none', '7bit' then
-      out_io = body_io
-    when 'deflate' then
-      log.debug('deflate body') if log
-      return if length.zero?
-      begin
-        out_io = inflate body_io
-      rescue Zlib::BufError, Zlib::DataError
-        log.error('Unable to inflate page, retrying with raw deflate') if log
-        body_io.rewind
-        begin
-          out_io = inflate body_io, -Zlib::MAX_WBITS
-        rescue Zlib::BufError, Zlib::DataError
-          log.error("unable to inflate page: #{$!}") if log
-          nil
-        end
+    length = response.content_length ||
+      case body_io
+      when Tempfile, IO then
+        body_io.stat.size
+      else
+        body_io.length
       end
-    when 'gzip', 'x-gzip' then
-      log.debug('gzip body') if log
-      return if length.zero?
+    return body_io if length.zero?
-      begin
-        zio = Zlib::GzipReader.new body_io
-        out_io = Tempfile.new 'mechanize-decode'
-        out_io.binmode
-        until zio.eof? do
-          out_io.write zio.read 16384
-        end
-      rescue Zlib::BufError, Zlib::GzipFile::Error
-        log.error('Unable to gunzip body, trying raw inflate') if log
-        body_io.rewind
-        body_io.read 10
-        out_io = inflate body_io, -Zlib::MAX_WBITS
-      rescue Zlib::DataError
-        log.error("unable to gunzip page: #{$!}") if log
-        ''
-      ensure
-        zio.close if zio and not zio.closed?
-      end
-    else
-      raise Mechanize::Error,
-            "Unsupported Content-Encoding: #{response['Content-Encoding']}"
-    end
+    out_io = case response['Content-Encoding']
+             when nil, 'none', '7bit' then
+               body_io
+             when 'deflate' then
+               content_encoding_inflate body_io
+             when 'gzip', 'x-gzip' then
+               content_encoding_gunzip body_io
+             else
+               raise Mechanize::Error,
+                 "unsupported content-encoding: #{response['Content-Encoding']}"
+             end
     out_io.flush
     out_io.rewind
     out_io
+  rescue Zlib::Error => e
+    message = "error handling content-encoding #{response['Content-Encoding']}:"
+    message << " #{e.message} (#{e.class})"
+    raise Mechanize::Error, message
+  ensure
+    begin
+      body_io.close! if Tempfile === body_io and out_io.path != body_io.path
+    rescue IOError
+      # HACK ruby 1.8 raises IOError when closing the stream
+    end
   end
   def response_cookies response, uri, page
@@ -778,11 +817,12 @@ class Mechanize::HTTP::Agent
     @context.parse uri, response, body_io
   end
-  def response_read response, request
+  def response_read response, request, uri
     content_length = response.content_length
     if content_length and content_length > @max_file_buffer then
       body_io = Tempfile.new 'mechanize-raw'
+      body_io.unlink
       body_io.binmode if defined? body_io.binmode
     else
       body_io = StringIO.new
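
Unlinking a Tempfile immediately after creating it, as this hunk does, keeps the open handle usable on POSIX systems while ensuring nothing is left on disk if finalization never runs. A minimal sketch using only the standard library follows; `anonymous_tempfile` is a hypothetical helper, and on Windows the unlink may be skipped silently.

```ruby
require 'tempfile'

# Hypothetical helper: creates a scratch file whose directory entry is removed
# right away. On POSIX the data stays reachable through the open handle and
# the space is reclaimed once the handle is closed.
def anonymous_tempfile(prefix = 'example-raw')
  io = Tempfile.new prefix
  io.unlink
  io.binmode
  io
end

io = anonymous_tempfile
io.write 'response body'
io.rewind
puts io.read
io.close
```

One consequence worth keeping in mind is that an unlinked Tempfile no longer has a usable path, so code that previously moved the file by path has to copy from the IO instead, as the download.rb change in this release does.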
@@ -797,7 +837,8 @@ class Mechanize::HTTP::Agent
         if StringIO === body_io and total > @max_file_buffer then
           new_io = Tempfile.new 'mechanize-raw'
-          new_io.binmode if defined? binmode
+          new_io.unlink
+          new_io.binmode
           new_io.write body_io.string
@@ -809,7 +850,8 @@ class Mechanize::HTTP::Agent
       }
     rescue Net::HTTP::Persistent::Error => e
       body_io.rewind
-      raise Mechanize::ResponseReadError.new(e, response, body_io)
+      raise Mechanize::ResponseReadError.new(e, response, body_io, uri,
+                                             @context)
     end
     body_io.flush