nokogumbo 0.5.2 → 0.6 (RubyGems)

Files changed (3)

data/README.md CHANGED

@@ -13,18 +13,31 @@ require 'nokogumbo'
 doc = Nokogiri::HTML5(string)
 ```
+Because HTML is often fetched via the web, a convenience interface is also
+provided:
+```ruby
+require 'nokogumbo'
+doc = Nokogiri::HTML5.get(uri)
+```
 Notes:
 -----
-* The `Nokogumbo.parse` function takes a string and passes it to the
+* The `Nokogiri::HTML5.parse` function takes a string and passes it to the
 <code>gumbo_parse_with_options</code> method, using the default options.
-The resulting Gumbo parse tree is the walked, producing a Nokogiri parse tree.
-The original Gumbo parse tree is then destroyed, and the Nokogiri parse tree
-is returned.
+The resulting Gumbo parse tree is then walked, producing a libxml2 parse tree.
+The original Gumbo parse tree is then destroyed, and a single Nokogiri Ruby
+object is constructed to wrap the libxml2 parse tree.  Nokogiri only produces
+Ruby objects as necessary, so all scanning is done using the underlying
+libxml2 libraries.
+* The `Nokogiri::HTML5.get` function takes care of following redirects,
+https, and determining the character encoding of the result, based on the
+rules defined in the HTML5 specification for doing so.
 * Instead of uppercase element names, lowercase element names are produced.
-* Instead of returning 'unknown' as the element name for unknown tags, the
+* Instead of returning `unknown` as the element name for unknown tags, the
 original tag name is returned verbatim.
 * The gem itself includes a copy of the Gumbo HTML5 parser.
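
The redirect handling that the README notes attribute to `Nokogiri::HTML5.get` can be sketched without network access. This is a minimal sketch, not the gem's code: `follow_redirects` and the injected `fetch` callable are hypothetical names, and unlike the diffed implementation it resolves relative `Location` values and raises `ArgumentError` when the limit is exhausted.

```ruby
require 'uri'

# Hypothetical helper, not part of nokogumbo. `fetch` is any callable that
# takes a URI and returns [status, headers, body]; injecting it keeps the
# sketch runnable offline.
def follow_redirects(uri, fetch, limit = 10)
  uri = URI(uri) unless uri.is_a?(URI)
  raise ArgumentError, 'too many redirects' if limit <= 0

  status, headers, body = fetch.call(uri)
  case status
  when 200..299
    body
  when 300..399
    # Location may be relative, so resolve it against the current URI.
    follow_redirects(uri.merge(headers.fetch('location')), fetch, limit - 1)
  else
    raise "unexpected HTTP status #{status}"
  end
end

# Canned responses standing in for a real server (assumed data).
pages = {
  'http://example.test/old' => [301, { 'location' => '/new' }, ''],
  'http://example.test/new' => [200, {}, '<p>hello</p>']
}
html = follow_redirects('http://example.test/old', ->(u) { pages.fetch(u.to_s) })
```

Decrementing `limit` on every hop, as the diffed `get` also does, is what keeps a redirect loop from recursing forever.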

data/lib/nokogumbo.rb CHANGED

@@ -8,12 +8,92 @@ module Nokogiri
   module HTML5
     def self.parse(string)
+      if string.respond_to? :read
+        string = string.read
+      end
       # convert to UTF-8 (Ruby 1.9+)
       if string.respond_to?(:encoding) and string.encoding != Encoding::UTF_8
-        string = string.encode(Encoding::UTF_8)
+        string = reencode(string)
       end
       Nokogumbo.parse(string)
     end
+    def self.get(uri, limit=10)
+      require 'net/http'
+      uri = URI(uri) unless URI === uri
+      http = Net::HTTP.new(uri.host, uri.port)
+      if uri.scheme == 'https'
+        http.use_ssl = true
+        http.verify_mode = OpenSSL::SSL::VERIFY_NONE
+      end
+      request = Net::HTTP::Get.new(uri.request_uri)
+      response = http.request(request)
+      case response
+      when Net::HTTPSuccess
+        parse(reencode(response.body, response['content-type']))
+      when Net::HTTPRedirection
+        response.value if limit <= 1
+        get(response['location'], limit-1)
+      else
+        response.value
+      end
+    end
+  private
+    # Charset sniffing is a complex and controversial topic that understandably
+    # isn't done _by default_ by the Ruby Net::HTTP library.  This being said,
+    # it is a very real problem for consumers of HTML as the default for HTML
+    # is iso-8859-1, most "good" producers use utf-8, and the Gumbo parser
+    # *only* supports utf-8.
+    #
+    # Accordingly, Nokogiri::HTML::Document.parse provides limited encoding
+    # detection.  Following this lead, Nokogiri::HTML5 attempts to do likewise,
+    # while attempting to more closely follow the HTML5 standard.
+    #
+    # http://bugs.ruby-lang.org/issues/2567
+    # http://www.w3.org/TR/html5/syntax.html#determining-the-character-encoding
+    #
+    def self.reencode(body, content_type=nil)
+      return body unless body.respond_to? :encoding
+      if body.encoding == Encoding::ASCII_8BIT
+        encoding = nil
+        # look for a Byte Order Mark (BOM)
+        if body[0..1] == "\xFE\xFF"
+          encoding = 'utf-16be'
+        elsif body[0..1] == "\xFF\xFE"
+          encoding = 'utf-16le'
+        elsif body[0..2] == "\xEF\xBB\xBF"
+          encoding = 'utf-8'
+        end
+        # look for a charset in the Content-Type header
+        if content_type
+          encoding ||= content_type[/charset=(.*?)($|\s|;)/i, 1]
+        end
+        # look for a charset in a meta tag in the first 1024 bytes
+        if not encoding
+          data = body[0..1023].gsub(/<!--.*?(-->|\Z)/m, '')
+          data.scan(/<meta.*?>/m).each do |meta|
+            encoding ||= meta[/charset="?(.*?)($|"|\s|>)/im, 1]
+          end
+        end
+        # if all else fails, default to the official default encoding for HTML
+        encoding ||= Encoding::ISO_8859_1
+        # change the encoding to match the detected or inferred encoding
+        body.force_encoding(encoding)
+      end
+      body.encode(Encoding::UTF_8)
+    end
   end
 end
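
The detection order used by `reencode` above (BOM, then the `Content-Type` charset, then a `<meta>` tag in the first 1024 bytes, then ISO-8859-1) can be exercised in isolation. The following is a sketch under stated assumptions: `sniff_encoding` is a hypothetical name, it returns the chosen encoding name instead of re-encoding, and it marks the BOM literals as binary with `String#b` (Ruby 2.0+), because comparing strings of different encodings that contain non-ASCII bytes can return false.

```ruby
# Hypothetical helper, not part of nokogumbo: returns the name of the
# encoding that would be picked for a raw HTTP body.
def sniff_encoding(body, content_type = nil)
  bytes = body.b # binary copy, so byte comparisons ignore the source encoding

  # 1. Byte Order Mark
  return 'utf-16be' if bytes.start_with?("\xFE\xFF".b)
  return 'utf-16le' if bytes.start_with?("\xFF\xFE".b)
  return 'utf-8'    if bytes.start_with?("\xEF\xBB\xBF".b)

  # 2. charset parameter of the Content-Type header
  if content_type
    charset = content_type[/charset=(.*?)($|\s|;)/i, 1]
    return charset.downcase if charset && !charset.empty?
  end

  # 3. <meta ... charset=...> in the first 1024 bytes, with comments removed
  head = bytes[0..1023].gsub(/<!--.*?(-->|\Z)/m, '')
  head.scan(/<meta.*?>/mi).each do |meta|
    charset = meta[/charset="?(.*?)($|"|\s|>)/im, 1]
    return charset.downcase if charset && !charset.empty?
  end

  # 4. Fall back to the default assumed by the diffed code
  'iso-8859-1'
end

# Usage: label the raw bytes with the detected encoding, then transcode.
raw  = "caf\xE9".b
utf8 = raw.force_encoding(sniff_encoding(raw)).encode(Encoding::UTF_8)
```

Returning a name keeps the detection step separate from `force_encoding` and `encode`, so each can be checked on its own.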

metadata CHANGED

@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: nokogumbo
 version: !ruby/object:Gem::Version
-  version: 0.5.2
+  version: '0.6'
   prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-08-21 00:00:00.000000000 Z
+date: 2013-08-22 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri