nokogumbo 0.6 → 0.7
- data/README.md +7 -4
- data/lib/nokogumbo.rb +9 -1
- metadata +2 -2
data/README.md
CHANGED
@@ -3,7 +3,8 @@ Nokogumbo - a Nokogiri interface to the Gumbo HTML5 parser.
 
 Nokogumbo provides the ability for a Ruby program to invoke the
 [Gumbo HTML5 parser](https://github.com/google/gumbo-parser#readme)
-and to access the result as a
+and to access the result as a
+[Nokogiri::HTML::Document](http://nokogiri.org/Nokogiri/HTML/Document.html).
 
 Usage:
 -----
@@ -25,10 +26,12 @@ Notes:
 
 * The `Nokogiri::HTML5.parse` function takes a string and passes it to the
 <code>gumbo_parse_with_options</code> method, using the default options.
-The resulting Gumbo parse tree is the walked, producing a
+The resulting Gumbo parse tree is the walked, producing a
+[libxml2](http://xmlsoft.org/html/)
+[xmlDoc](http://xmlsoft.org/html/libxml-tree.html#xmlDoc).
 The original Gumbo parse tree is then destroyed, and single Nokogiri Ruby
-object is constructed to wrap the
-Ruby objects as necessary, so all
+object is constructed to wrap the xmlDoc structure. Nokogiri only produces
+Ruby objects as necessary, so all searching is done using the underlying
 libxml2 libraries.
 
 * The `Nokogiri::HTML5.get` function takes care of following redirects,
data/lib/nokogumbo.rb
CHANGED
@@ -2,11 +2,15 @@ require 'nokogiri'
 require 'nokogumboc'
 
 module Nokogiri
+  # Parse an HTML document. +string+ contains the document. +string+
+  # may also be an IO-like object. Returns a +Nokogiri::HTML::Document+.
   def self.HTML5(string)
     Nokogiri::HTML5.parse(string)
   end
 
   module HTML5
+    # Parse an HTML document. +string+ contains the document. +string+
+    # may also be an IO-like object. Returns a +Nokogiri::HTML::Document+.
     def self.parse(string)
       if string.respond_to? :read
         string = string.read
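The hunk above documents that `parse` accepts either a string or an IO-like object, and its first lines duck-type on `read`. A minimal sketch of that normalisation step, using only the standard library; `html_source` is a hypothetical helper name introduced here for illustration and is not part of nokogumbo:

```ruby
require 'stringio'

# Hypothetical helper (not part of nokogumbo): mirrors the
# `respond_to? :read` check at the top of Nokogiri::HTML5.parse.
def html_source(string)
  # Anything that responds to #read (File, StringIO, ...) is treated as IO-like.
  string = string.read if string.respond_to?(:read)
  string
end

markup      = '<!DOCTYPE html><p>hello</p>'
from_string = html_source(markup)
from_io     = html_source(StringIO.new(markup))
```

Both calls yield the same string, which is what lets a caller hand either form to the parser.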
@@ -20,6 +24,10 @@ module Nokogiri
       Nokogumbo.parse(string)
     end
 
+    # Fetch and parse a HTML document from the web, following redirects,
+    # handling https, and determining the character encoding using HTML5
+    # rules. +uri+ may be a +String+ or a +URI+. +limit+ controls the
+    # number of redirects that will be followed.
     def self.get(uri, limit=10)
       require 'net/http'
       uri = URI(uri) unless URI === uri
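The new comment says `get` follows redirects up to `limit` and accepts a `String` or a `URI`. The body of `get` is not shown in this diff, so the following is only a sketch of one way bounded redirect following can work. `FakeResponse`, `fetch_following_redirects`, and the block-based fetcher are all assumptions made here so the example runs without network access; the real method uses `net/http`.

```ruby
require 'uri'

# Hypothetical stand-in for an HTTP response (assumption, not a Net::HTTP class).
FakeResponse = Struct.new(:status, :location, :body)

# Sketch: follow at most +limit+ redirects, calling the block to fetch each URI.
def fetch_following_redirects(uri, limit = 10, &fetch)
  uri = URI(uri) unless URI === uri # accept a String or a URI, as in the diff
  response = fetch.call(uri)
  redirect = (300..399).cover?(response.status) && response.location
  return response unless redirect
  raise ArgumentError, 'too many redirects' if limit <= 0

  # Resolve a relative Location value against the current URI before retrying.
  fetch_following_redirects(uri.merge(response.location), limit - 1, &fetch)
end

pages = {
  'http://example.com/'       => FakeResponse.new(301, '/home', ''),
  'http://example.com/home'   => FakeResponse.new(302, 'https://example.org/final', ''),
  'https://example.org/final' => FakeResponse.new(200, nil, '<p>ok</p>')
}
result = fetch_following_redirects('http://example.com/') { |u| pages.fetch(u.to_s) }
```

Decrementing the limit on every hop guarantees termination even when two URLs redirect to each other.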
@@ -82,7 +90,7 @@ module Nokogiri
         if not encoding
           data = body[0..1023].gsub(/<!--.*?(-->|\Z)/m, '')
           data.scan(/<meta.*?>/m).each do |meta|
-            encoding ||= meta[/charset="?(
+            encoding ||= meta[/charset=["']?([^>]*?)($|["'\s>])/im, 1]
           end
         end
 
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: nokogumbo
 version: !ruby/object:Gem::Version
-  version: '0.
+  version: '0.7'
 prerelease:
 platform: ruby
 authors:
@@ -9,7 +9,7 @@ authors:
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2013-08-
+date: 2013-08-25 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri