nokogiri-html5-inference 0.2.0 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +12 -4
- data/README.md +21 -43
- data/lib/nokogiri/html5/inference/version.rb +1 -1
- data/lib/nokogiri/html5/inference.rb +29 -43
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz:
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: '068c7d6b9bea47be40be5ac4429d673f9eed2d6666a0f30fa0ffa52d71bd7d96'
|
4
|
+
data.tar.gz: 0d265567d61bc23fd08997772651ec15c2438c25b969380e9bce82a63f961310
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 15998fdfc4c3a2207ca150c47b4ec2d9d9c0012e1288634d092f1dc76d0c729c730c6a56a591d49ea303a38f4652b498940b338972f905f493e09b78d2fce879
|
7
|
+
data.tar.gz: 55ecdad7efa27e7580aa5fed39c5f9ce3860a6992582e7dfb21fb0a32c4ad355591af75c44e71a43327d75c4afe20ff8da832d5079fd39678f91df26ece89b06
|
data/CHANGELOG.md
CHANGED
@@ -1,15 +1,23 @@
|
|
1
|
-
##
|
1
|
+
## Unreleased
|
2
2
|
|
3
|
-
|
3
|
+
|
4
|
+
## v0.3.0 - 2024-05-05
|
5
|
+
|
6
|
+
- Use a `<template>` tag as the context node for the majority of fragment parsing, which greatly simplifies this gem. #7 @flavorjones @stevecheckoway
|
7
|
+
- Clean up the README. @marcoroth
|
8
|
+
- `Nokogiri::HTML5::Inference.parse` always returns a `Nokogiri::XML::NodeSet` for fragments. Previously this method sometimes returned a `Nokogiri::HTML5::DocumentFragment`, but some API inconsistencies between `DocumentFragment` and `NodeSet` made using the returned object tricky. We hope this provides a more consistent development experience. @flavorjones
|
9
|
+
|
10
|
+
|
11
|
+
## v0.2.0 - 2024-04-26
|
4
12
|
|
5
13
|
- When a `<head>` tag is seen first in the input string, include the `<body>` tag in the returned fragment or node set. (#3, #4) @flavorjones
|
6
14
|
|
7
15
|
|
8
|
-
##
|
16
|
+
## v0.1.1 - 2024-04-24
|
9
17
|
|
10
18
|
- Make protected methods `#context` and `#pluck_path` public, but keep them undocumented.
|
11
19
|
|
12
20
|
|
13
|
-
##
|
21
|
+
## v0.1.0 - 2024-04-24
|
14
22
|
|
15
23
|
- Initial release
|
data/README.md
CHANGED
@@ -1,23 +1,16 @@
|
|
1
|
-
# Nokogiri::
|
1
|
+
# Nokogiri::HTML5::Inference
|
2
2
|
|
3
3
|
Given HTML5 input, make a reasonable guess at how to parse it correctly.
|
4
4
|
|
5
|
-
Nokogiri::HTML5::Inference makes reasonable inferences that work for both HTML5 documents and HTML5
|
6
|
-
fragments, and for all the different HTML5 tags that a web developer might need in a view library.
|
5
|
+
`Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a web developer might need in a view library.
|
7
6
|
|
8
7
|
This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
|
9
8
|
|
10
9
|
## The problem this library solves
|
11
10
|
|
12
|
-
The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
|
13
|
-
context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
|
14
|
-
without knowing the parent node -- also called the "context node" -- in which it will be inserted.
|
11
|
+
The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML without knowing the parent node -- also called the "context node" -- in which it will be inserted.
|
15
12
|
|
16
|
-
Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
|
17
|
-
["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
|
18
|
-
but there are some notable exceptions. Perhaps the most problematic to web developers are the
|
19
|
-
table-related tags, which will not be parsed properly unless the parser is in the
|
20
|
-
["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
|
13
|
+
Most content in an HTML5 document can be parsed assuming the parser's mode will be in the ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there are some notable exceptions. Perhaps the most problematic to web developers are the table-related tags, which will not be parsed properly unless the parser is in the ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
|
21
14
|
|
22
15
|
For example:
|
23
16
|
|
@@ -26,9 +19,7 @@ Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
|
|
26
19
|
# => "foo" # where did the tag go!?
|
27
20
|
```
|
28
21
|
|
29
|
-
In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
|
30
|
-
and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
|
31
|
-
parse properly.
|
22
|
+
In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here", and drop the tag. This particular fragment must be parsed "in the context" of a table in order to parse properly.
|
32
23
|
|
33
24
|
Thankfully, libgumbo and Nokogiri allow us to set the context node:
|
34
25
|
|
@@ -41,22 +32,20 @@ Nokogiri::HTML5::DocumentFragment.new(
|
|
41
32
|
# => "<tbody><tr><td>foo</td></tr></tbody>"
|
42
33
|
```
|
43
34
|
|
44
|
-
This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
|
45
|
-
_intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
|
46
|
-
the `<td>` tag must be wrapped in `<tbody><tr>` tags.
|
35
|
+
This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
|
47
36
|
|
48
|
-
We can
|
37
|
+
We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:
|
49
38
|
|
50
39
|
``` ruby
|
51
40
|
Nokogiri::HTML5::DocumentFragment.new(
|
52
41
|
Nokogiri::HTML5::Document.new,
|
53
42
|
"<td>foo</td>",
|
54
|
-
"
|
55
|
-
).
|
43
|
+
"template" # <--- this is the context node
|
44
|
+
).to_html
|
56
45
|
# => "<td>foo</td>"
|
57
46
|
```
|
58
47
|
|
59
|
-
Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
|
48
|
+
Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
|
60
49
|
|
61
50
|
``` ruby
|
62
51
|
Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
|
@@ -68,7 +57,7 @@ Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
|
|
68
57
|
|
69
58
|
Given an input String containing HTML5, infer the best way to parse it by calling `Nokogiri::HTML5::Inference.parse`.
|
70
59
|
|
71
|
-
If the input is a document, you'll get a Nokogiri::HTML5::Document back:
|
60
|
+
If the input is a document, you'll get a `Nokogiri::HTML5::Document` back:
|
72
61
|
|
73
62
|
``` ruby
|
74
63
|
html = <<~HTML
|
@@ -103,19 +92,7 @@ Nokogiri::HTML5::Inference.parse(html)
|
|
103
92
|
# })
|
104
93
|
```
|
105
94
|
|
106
|
-
If the input is a fragment
|
107
|
-
|
108
|
-
``` ruby
|
109
|
-
Nokogiri::HTML5::Inference.parse("<div>hello,</div><div>world!</div>")
|
110
|
-
# => #(DocumentFragment:0x34f8 {
|
111
|
-
# name = "#document-fragment",
|
112
|
-
# children = [
|
113
|
-
# #(Element:0x3624 { name = "div", children = [ #(Text "hello,")] }),
|
114
|
-
# #(Element:0x3804 { name = "div", children = [ #(Text "world!")] })]
|
115
|
-
# })
|
116
|
-
```
|
117
|
-
|
118
|
-
or, if there are intermediate parent tags that need to be removed, you'll get a Nokogiri::XML::NodeSet:
|
95
|
+
If the input is a fragment, you'll get back a `Nokogiri::XML::NodeSet`:
|
119
96
|
|
120
97
|
``` ruby
|
121
98
|
Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
|
@@ -128,14 +105,12 @@ Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
|
|
128
105
|
# ]
|
129
106
|
```
|
130
107
|
|
131
|
-
|
132
|
-
methods like `#children`, and serialization methods like `#to_html`.
|
108
|
+
Both of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal methods like `#children`, and serialization methods like `#to_html`.
|
133
109
|
|
134
110
|
|
135
111
|
## Caveats
|
136
112
|
|
137
|
-
The implementation is currently pretty hacky and only looks at the first tag in the input to make
|
138
|
-
decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
|
113
|
+
The implementation is currently pretty hacky and only looks at the first tag in the input to make decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
|
139
114
|
|
140
115
|
The implementation is also almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.
|
141
116
|
|
@@ -148,12 +123,15 @@ We would welcome bug reports and pull requests improving this library!
|
|
148
123
|
|
149
124
|
Install the gem and add to the application's Gemfile by executing:
|
150
125
|
|
151
|
-
|
126
|
+
```bash
|
127
|
+
bundle add nokogiri-html5-inference
|
128
|
+
```
|
152
129
|
|
153
130
|
If bundler is not being used to manage dependencies, install the gem by executing:
|
154
131
|
|
155
|
-
|
156
|
-
|
132
|
+
```bash
|
133
|
+
gem install nokogiri-html5-inference
|
134
|
+
```
|
157
135
|
|
158
136
|
## Development
|
159
137
|
|
@@ -174,4 +152,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
|
|
174
152
|
|
175
153
|
## Code of Conduct
|
176
154
|
|
177
|
-
Everyone interacting in the Nokogiri::
|
155
|
+
Everyone interacting in the `Nokogiri::HTML5::Inference` project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/flavorjones/nokogiri-html5-inference/blob/main/CODE_OF_CONDUCT.md).
|
data/lib/nokogiri/html5/inference.rb
CHANGED
@@ -12,15 +12,18 @@ else
|
|
12
12
|
module HTML5
|
13
13
|
# :markup: markdown
|
14
14
|
#
|
15
|
-
# The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
|
16
|
-
# context-dependent parsing rules which can make it challenging to "just parse" a
|
17
|
-
# without knowing the parent node -- also called the "context node" -- in
|
15
|
+
# The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
|
16
|
+
# precise context-dependent parsing rules which can make it challenging to "just parse" a
|
17
|
+
# fragment of HTML without knowing the parent node -- also called the "context node" -- in
|
18
|
+
# which it will be inserted.
|
18
19
|
#
|
19
20
|
# Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
|
20
|
-
# ["in body" insertion
|
21
|
-
# but there
|
22
|
-
#
|
23
|
-
# ["in
|
21
|
+
# ["in body" insertion
|
22
|
+
# mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there
|
23
|
+
# are some notable exceptions. Perhaps the most problematic to web developers are the
|
24
|
+
# table-related tags, which will not be parsed properly unless the parser is in the ["in
|
25
|
+
# table" insertion
|
26
|
+
# mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
|
24
27
|
#
|
25
28
|
# For example:
|
26
29
|
#
|
@@ -29,9 +32,9 @@ else
|
|
29
32
|
# # => "foo" # where did the tag go!?
|
30
33
|
# ```
|
31
34
|
#
|
32
|
-
# In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
|
33
|
-
# and drop the tag. This particular fragment must be parsed "in the context" of a
|
34
|
-
# parse properly.
|
35
|
+
# In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
|
36
|
+
# here", and drop the tag. This particular fragment must be parsed "in the context" of a
|
37
|
+
# table in order to parse properly.
|
35
38
|
#
|
36
39
|
# Thankfully, libgumbo and Nokogiri allow us to set the context node:
|
37
40
|
#
|
@@ -44,63 +47,52 @@ else
|
|
44
47
|
# # => "<tbody><tr><td>foo</td></tr></tbody>"
|
45
48
|
# ```
|
46
49
|
#
|
47
|
-
# This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
|
48
|
-
# _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
|
49
|
-
# the `<td>` tag must be wrapped in `<tbody><tr>` tags.
|
50
|
+
# This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
|
51
|
+
# there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
|
52
|
+
# parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
|
50
53
|
#
|
51
|
-
# We can
|
54
|
+
# We can fix this to only return the tags we provided by using the `<template>` tag as the
|
55
|
+
# context node, which the HTML5 spec provides exactly for this purpose:
|
52
56
|
#
|
53
57
|
# ``` ruby
|
54
58
|
# Nokogiri::HTML5::DocumentFragment.new(
|
55
59
|
# Nokogiri::HTML5::Document.new,
|
56
60
|
# "<td>foo</td>",
|
57
|
-
# "
|
58
|
-
# ).
|
61
|
+
# "template" # <--- this is the context node
|
62
|
+
# ).to_html
|
59
63
|
# # => "<td>foo</td>"
|
60
64
|
# ```
|
61
65
|
#
|
62
|
-
# Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
|
66
|
+
# Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
|
63
67
|
#
|
64
68
|
# ``` ruby
|
65
69
|
# Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
|
66
70
|
# # => "<td>foo</td>"
|
67
71
|
# ```
|
72
|
+
#
|
68
73
|
module Inference
|
69
74
|
# Tags that must be parsed in a specific HTML5 insertion mode, for which we must use a
|
70
75
|
# context node.
|
71
76
|
module ContextTags # :nodoc:
|
72
|
-
TABLE = %w[thead tbody tfoot tr td th col colgroup caption].freeze
|
73
77
|
HTML = %w[head body].freeze
|
74
78
|
end
|
75
79
|
|
76
80
|
# Regular expressions used to determine if we need to use a context node.
|
77
81
|
module ContextRegexp # :nodoc:
|
78
82
|
DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
|
79
|
-
TABLE = /\A\s*<(#{ContextTags::TABLE.join("|")})\b/i
|
80
83
|
HTML = /\A\s*<(#{ContextTags::HTML.join("|")})\b/i
|
81
84
|
end
|
82
85
|
|
83
|
-
# Tags that get an intermediate parent created for them according to the HTML5 spec.
|
84
|
-
module PluckTags # :nodoc:
|
85
|
-
TBODY = %w[tr].freeze
|
86
|
-
TBODY_TR = %w[td th].freeze
|
87
|
-
COLGROUP = %w[col].freeze
|
88
|
-
end
|
89
|
-
|
90
86
|
# Regular expressions used to determine if we will need to skip an intermediate parent or
|
91
87
|
# otherwise narrow the fragment DOM that is returned.
|
92
88
|
module PluckRegexp # :nodoc:
|
93
|
-
TBODY = /\A\s*<(#{PluckTags::TBODY.join("|")})\b/i
|
94
|
-
TBODY_TR = /\A\s*<(#{PluckTags::TBODY_TR.join("|")})\b/i
|
95
|
-
COLGROUP = /\A\s*<(#{PluckTags::COLGROUP.join("|")})\b/i
|
96
|
-
HTML_INNER = /\A\s*<(head)\b/i
|
97
89
|
BODY_OUTER = /\A\s*<(body)\b/i
|
98
90
|
end
|
99
91
|
|
100
92
|
class << self
|
101
93
|
#
|
102
94
|
# call-seq:
|
103
|
-
# parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::
|
95
|
+
# parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
|
104
96
|
#
|
105
97
|
# Based on the start of the input HTML5 string, guess whether it's a full document or a
|
106
98
|
# fragment and, using the fragment context node if necessary, parse it properly and
|
@@ -118,14 +110,13 @@ else
|
|
118
110
|
#
|
119
111
|
# [Keyword Parameters]
|
120
112
|
# - +pluck+ (Boolean) Default: +true+. Set to +false+ if you want the method to always
|
121
|
-
# return
|
122
|
-
#
|
123
|
-
#
|
113
|
+
# return what Nokogiri parsed, without attempting to remove any sibling or intermediate
|
114
|
+
# parent nodes. This shouldn't be necessary if the library is working properly, but may
|
115
|
+
# be useful to allow users to work around a bad guess.
|
124
116
|
#
|
125
117
|
# [Returns]
|
126
118
|
# - A +Nokogiri::HTML5::Document+ if the input appears to represent a full document.
|
127
|
-
# - A +Nokogiri::
|
128
|
-
# appears to be a fragment.
|
119
|
+
# - A +Nokogiri::XML::NodeSet+ if the input appears to be a fragment.
|
129
120
|
#
|
130
121
|
def parse(input, pluck: true)
|
131
122
|
context = Nokogiri::HTML5::Inference.context(input)
|
@@ -136,7 +127,7 @@ else
|
|
136
127
|
if pluck && (path = pluck_path(input))
|
137
128
|
fragment.xpath(path)
|
138
129
|
else
|
139
|
-
fragment
|
130
|
+
fragment.children
|
140
131
|
end
|
141
132
|
end
|
142
133
|
end
|
@@ -158,9 +149,8 @@ else
|
|
158
149
|
def context(input) # :nodoc:
|
159
150
|
case input
|
160
151
|
when ContextRegexp::DOCUMENT then nil
|
161
|
-
when ContextRegexp::TABLE then "table"
|
162
152
|
when ContextRegexp::HTML then "html"
|
163
|
-
else "
|
153
|
+
else "template"
|
164
154
|
end
|
165
155
|
end
|
166
156
|
|
@@ -181,10 +171,6 @@ else
|
|
181
171
|
#
|
182
172
|
def pluck_path(input) # :nodoc:
|
183
173
|
case input
|
184
|
-
when PluckRegexp::TBODY then "tbody/*"
|
185
|
-
when PluckRegexp::TBODY_TR then "tbody/tr/*"
|
186
|
-
when PluckRegexp::COLGROUP then "colgroup/*"
|
187
|
-
when PluckRegexp::HTML_INNER then "./*"
|
188
174
|
when PluckRegexp::BODY_OUTER then "body"
|
189
175
|
end
|
190
176
|
end
|
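The net effect of the `inference.rb` changes above can be sketched in plain Ruby, without loading Nokogiri. This is only an illustration pieced together from the lines the diff keeps or adds. The module name `InferenceSketch` is hypothetical and not part of the gem. The constants are flattened out of the gem's `ContextTags`, `ContextRegexp`, and `PluckRegexp` modules for brevity.

```ruby
# Hypothetical standalone sketch of the 0.3.0 inference rules.
# Not the gem's code: names are flattened and Nokogiri is not involved.
module InferenceSketch
  # Tags that still need a specific context node in 0.3.0.
  HTML_TAGS = %w[head body].freeze

  # Input that looks like a full document.
  DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
  # Input that starts with <head> or <body>.
  HTML = /\A\s*<(#{HTML_TAGS.join("|")})\b/i
  # Input that starts with <body>, where only the body node is kept.
  BODY_OUTER = /\A\s*<(body)\b/i

  # Returns nil for a full document, otherwise the name of the context tag.
  def self.context(input)
    case input
    when DOCUMENT then nil
    when HTML then "html"
    else "template"
    end
  end

  # Returns an XPath used to narrow the parsed fragment, or nil if none applies.
  def self.pluck_path(input)
    case input
    when BODY_OUTER then "body"
    end
  end
end

puts InferenceSketch.context("<!DOCTYPE html><html></html>").inspect
puts InferenceSketch.context("<head></head>").inspect
puts InferenceSketch.context("<td>foo</td>").inspect
puts InferenceSketch.pluck_path("<body>hi</body>").inspect
```

Because the `TABLE` constants and the table-specific pluck paths are removed in this release, table-related input such as `<td>foo</td>` falls through to the `"template"` branch, which matches the CHANGELOG entry about using `<template>` as the context node for most fragment parsing.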
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: nokogiri-html5-inference
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.3.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Mike Dalessio
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2024-
|
11
|
+
date: 2024-05-05 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: nokogiri
|