nokogumbo 0.6 → 0.7
- data/README.md +7 -4
- data/lib/nokogumbo.rb +9 -1
- metadata +2 -2
data/README.md
CHANGED
@@ -3,7 +3,8 @@ Nokogumbo - a Nokogiri interface to the Gumbo HTML5 parser.
 
 Nokogumbo provides the ability for a Ruby program to invoke the
 [Gumbo HTML5 parser](https://github.com/google/gumbo-parser#readme)
-and to access the result as a
+and to access the result as a
+[Nokogiri::HTML::Document](http://nokogiri.org/Nokogiri/HTML/Document.html).
 
 Usage:
 -----
@@ -25,10 +26,12 @@ Notes:
 
 * The `Nokogiri::HTML5.parse` function takes a string and passes it to the
 <code>gumbo_parse_with_options</code> method, using the default options.
-The resulting Gumbo parse tree is the walked, producing a
+The resulting Gumbo parse tree is the walked, producing a
+[libxml2](http://xmlsoft.org/html/)
+[xmlDoc](http://xmlsoft.org/html/libxml-tree.html#xmlDoc).
 The original Gumbo parse tree is then destroyed, and single Nokogiri Ruby
-object is constructed to wrap the
-Ruby objects as necessary, so all
+object is constructed to wrap the xmlDoc structure. Nokogiri only produces
+Ruby objects as necessary, so all searching is done using the underlying
 libxml2 libraries.
 
 * The `Nokogiri::HTML5.get` function takes care of following redirects,
data/lib/nokogumbo.rb
CHANGED
@@ -2,11 +2,15 @@ require 'nokogiri'
 require 'nokogumboc'
 
 module Nokogiri
+# Parse an HTML document. +string+ contains the document. +string+
+# may also be an IO-like object. Returns a +Nokogiri::HTML::Document+.
 def self.HTML5(string)
 Nokogiri::HTML5.parse(string)
 end
 
 module HTML5
+# Parse an HTML document. +string+ contains the document. +string+
+# may also be an IO-like object. Returns a +Nokogiri::HTML::Document+.
 def self.parse(string)
 if string.respond_to? :read
 string = string.read
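The new comments say `+string+` may also be an IO-like object, and the context lines show the `respond_to? :read` check that makes this work. A minimal sketch of that duck-typing step follows. It assumes a hypothetical helper named `read_source`, which is not part of nokogumbo and stops short of calling the parser:

```ruby
require 'stringio'

# Hypothetical helper (not in the gem): mirrors the respond_to?(:read)
# check from the hunk above and returns the document text as a String.
def read_source(input)
  input.respond_to?(:read) ? input.read : input
end

html = '<!DOCTYPE html><title>hi</title>'

from_string = read_source(html)                # plain String passes through
from_io     = read_source(StringIO.new(html))  # IO-like object is read first

puts from_string == from_io
```

Any object that responds to `read`, such as an open `File`, should take the same path as the `StringIO` here.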
@@ -20,6 +24,10 @@ module Nokogiri
 Nokogumbo.parse(string)
 end
 
+# Fetch and parse a HTML document from the web, following redirects,
+# handling https, and determining the character encoding using HTML5
+# rules. +uri+ may be a +String+ or a +URI+. +limit+ controls the
+# number of redirects that will be followed.
 def self.get(uri, limit=10)
 require 'net/http'
 uri = URI(uri) unless URI === uri
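The added comment describes `+limit+` as the number of redirects that will be followed, but the body of `get` is not visible in this hunk. The sketch below shows one way such a limit can work. It is not the gem's implementation: the `follow` method, the `fetch` block, and the canned `pages` table are all hypothetical, and no network access is involved.

```ruby
require 'uri'

# Hypothetical sketch: the block stands in for the HTTP request and returns
# either [:redirect, location] or [:ok, body].
def follow(uri, limit = 10, &fetch)
  uri = URI(uri) unless URI === uri   # same coercion as the context line above
  status, value = fetch.call(uri)
  return value unless status == :redirect
  raise ArgumentError, 'too many redirects' if limit <= 0
  # URI#merge resolves relative Location values against the current URI.
  follow(uri.merge(value), limit - 1, &fetch)
end

# Canned responses used instead of real requests.
pages = {
  'http://example.com/a' => [:redirect, '/b'],
  'http://example.com/b' => [:redirect, 'http://example.com/c'],
  'http://example.com/c' => [:ok, '<p>done</p>']
}

body = follow('http://example.com/a') { |u| pages.fetch(u.to_s) }
puts body
```

With these canned pages, passing a limit of 1 raises `ArgumentError`, because reaching the final page takes two redirects.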
@@ -82,7 +90,7 @@ module Nokogiri
 if not encoding
 data = body[0..1023].gsub(/<!--.*?(-->|\Z)/m, '')
 data.scan(/<meta.*?>/m).each do |meta|
-encoding ||= meta[/charset="?(
+encoding ||= meta[/charset=["']?([^>]*?)($|["'\s>])/im, 1]
 end
 end
 
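The added line accepts either quote character, or none, around the charset value and matches case-insensitively. The snippet below applies that pattern, copied from the `+` line, to a few sample tags. The tags are made up for illustration and are not taken from the gem's tests.

```ruby
# Pattern copied verbatim from the added line in the hunk above.
CHARSET = /charset=["']?([^>]*?)($|["'\s>])/im

# Hypothetical sample tags covering double-quoted, single-quoted,
# unquoted, and http-equiv forms.
samples = [
  '<meta charset="utf-8">',
  "<meta charset='ISO-8859-1'>",
  '<meta charset=windows-1252>',
  '<meta http-equiv="Content-Type" content="text/html; charset=Shift_JIS">'
]

# String#[] with a regexp and a group index returns that capture or nil.
encodings = samples.map { |meta| meta[CHARSET, 1] }
encodings.each { |name| puts name }
```

Each sample yields the bare encoding name with no surrounding quotes.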
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: nokogumbo
 version: !ruby/object:Gem::Version
-version: '0.
+version: '0.7'
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-08-
+date: 2013-08-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
 name: nokogiri