nokogiri-html5-inference 0.1.1 → 0.3.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +16 -3
- data/README.md +35 -47
- data/lib/nokogiri/html5/inference/version.rb +1 -1
- data/lib/nokogiri/html5/inference.rb +55 -65
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
---
|
2
2
|
SHA256:
|
3
|
-
metadata.gz: '
|
4
|
-
data.tar.gz:
|
3
|
+
metadata.gz: '068c7d6b9bea47be40be5ac4429d673f9eed2d6666a0f30fa0ffa52d71bd7d96'
|
4
|
+
data.tar.gz: 0d265567d61bc23fd08997772651ec15c2438c25b969380e9bce82a63f961310
|
5
5
|
SHA512:
|
6
|
-
metadata.gz:
|
7
|
-
data.tar.gz:
|
6
|
+
metadata.gz: 15998fdfc4c3a2207ca150c47b4ec2d9d9c0012e1288634d092f1dc76d0c729c730c6a56a591d49ea303a38f4652b498940b338972f905f493e09b78d2fce879
|
7
|
+
data.tar.gz: 55ecdad7efa27e7580aa5fed39c5f9ce3860a6992582e7dfb21fb0a32c4ad355591af75c44e71a43327d75c4afe20ff8da832d5079fd39678f91df26ece89b06
|
data/CHANGELOG.md
CHANGED
@@ -1,10 +1,23 @@
|
|
1
|
-
##
|
1
|
+
## Unreleased
|
2
2
|
|
3
|
-
|
3
|
+
|
4
|
+
## v0.3.0 - 2024-05-05
|
5
|
+
|
6
|
+
- Use a `<template>` tag as the context node for the majority of fragment parsing, which greatly simplifies this gem. #7 @flavorjones @stevecheckoway
|
7
|
+
- Clean up the README. @marcoroth
|
8
|
+
- `Nokogiri::HTML5::Inference.parse` always returns a `Nokogiri::XML::NodeSet` for fragments. Previously this method sometimes returned a `Nokogiri::HTML5::DocumentFragment`, but some API inconsistencies between `DocumentFragment` and `NodeSet` made using the returned object tricky. We hope this provides a more consistent development experience. @flavorjones
|
9
|
+
|
10
|
+
|
11
|
+
## v0.2.0 - 2024-04-26
|
12
|
+
|
13
|
+
- When a `<head>` tag is seen first in the input string, include the `<body>` tag in the returned fragment or node set. (#3, #4) @flavorjones
|
14
|
+
|
15
|
+
|
16
|
+
## v0.1.1 - 2024-04-24
|
4
17
|
|
5
18
|
- Make protected methods `#context` and `#pluck_path` public, but keep them undocumented.
|
6
19
|
|
7
20
|
|
8
|
-
##
|
21
|
+
## v0.1.0 - 2024-04-24
|
9
22
|
|
10
23
|
- Initial release
|
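To illustrate what the v0.3.0 entry means by "greatly simplifies this gem", here is a minimal standalone sketch of the intermediate-parent "pluck" rules that this diff removes from `inference.rb`. The regular expressions and XPath strings are copied from the removed lines shown later in this diff; the module name `LegacyPluckSketch` is hypothetical and is not part of the gem, and the sketch does not call Nokogiri at all.

```ruby
# Sketch only: mirrors the pluck rules removed between 0.1.1 and 0.3.0.
# `LegacyPluckSketch` is a hypothetical name used for illustration.
module LegacyPluckSketch
  # Each pattern looks only at the first tag in the input string.
  TBODY      = /\A\s*<(tr)\b/i
  TBODY_TR   = /\A\s*<(td|th)\b/i
  COLGROUP   = /\A\s*<(col)\b/i
  HEAD_OUTER = /\A\s*<(head)\b/i
  BODY_OUTER = /\A\s*<(body)\b/i

  # Returns the XPath used to skip parser-inserted parent tags,
  # or nil when nothing needs to be skipped.
  def self.pluck_path(input)
    case input
    when TBODY      then "tbody/*"
    when TBODY_TR   then "tbody/tr/*"
    when COLGROUP   then "colgroup/*"
    when HEAD_OUTER then "head"
    when BODY_OUTER then "body"
    end
  end
end

LegacyPluckSketch.pluck_path("<td>foo</td>")   # => "tbody/tr/*"
LegacyPluckSketch.pluck_path("<div>foo</div>") # => nil
```

With a `<template>` context node the parser does not insert `<tbody>`, `<tr>`, or `<colgroup>` wrappers, which is presumably why the table-related rules could be dropped.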
data/README.md
CHANGED
@@ -1,73 +1,70 @@
|
|
1
|
-
# Nokogiri::
|
1
|
+
# Nokogiri::HTML5::Inference
|
2
2
|
|
3
3
|
Given HTML5 input, make a reasonable guess at how to parse it correctly.
|
4
4
|
|
5
|
-
|
5
|
+
`Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a web developer might need in a view library.
|
6
|
+
|
7
|
+
This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
|
6
8
|
|
7
9
|
## The problem this library solves
|
8
10
|
|
9
|
-
The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
|
10
|
-
context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
|
11
|
-
without knowing the parent node -- also called the "context node" -- in which it will be inserted.
|
11
|
+
The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML without knowing the parent node -- also called the "context node" -- in which it will be inserted.
|
12
12
|
|
13
|
-
Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
|
14
|
-
["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
|
15
|
-
but there are some notable exceptions. Perhaps the most problematic to web developers are the
|
16
|
-
table-related tags, which will not be parsed properly unless the parser is in the
|
17
|
-
["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
|
13
|
+
Most content in an HTML5 document can be parsed assuming the parser's mode will be in the ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there are some notable exceptions. Perhaps the most problematic to web developers are the table-related tags, which will not be parsed properly unless the parser is in the ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
|
18
14
|
|
19
15
|
For example:
|
20
16
|
|
21
17
|
``` ruby
|
22
18
|
Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
|
23
|
-
# => "foo"
|
19
|
+
# => "foo" # where did the tag go!?
|
24
20
|
```
|
25
21
|
|
26
|
-
In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
|
27
|
-
|
28
|
-
|
22
|
+
In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here", and drop the tag. This particular fragment must be parsed "in the context" of a table in order to parse properly.
|
23
|
+
|
24
|
+
Thankfully, libgumbo and Nokogiri allow us to set the context node:
|
29
25
|
|
30
26
|
``` ruby
|
31
27
|
Nokogiri::HTML5::DocumentFragment.new(
|
32
28
|
Nokogiri::HTML5::Document.new,
|
33
29
|
"<td>foo</td>",
|
34
|
-
"table" # this is the context node
|
30
|
+
"table" # <--- this is the context node
|
35
31
|
).to_html
|
36
32
|
# => "<tbody><tr><td>foo</td></tr></tbody>"
|
37
33
|
```
|
38
34
|
|
39
|
-
This is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
|
40
|
-
_intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
|
41
|
-
the `<td>` tag must be wrapped in `<tbody><tr>` tags.
|
35
|
+
This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
|
42
36
|
|
43
|
-
We can
|
37
|
+
We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:
|
44
38
|
|
45
39
|
``` ruby
|
46
40
|
Nokogiri::HTML5::DocumentFragment.new(
|
47
41
|
Nokogiri::HTML5::Document.new,
|
48
42
|
"<td>foo</td>",
|
49
|
-
"
|
50
|
-
).
|
43
|
+
"template" # <--- this is the context node
|
44
|
+
).to_html
|
51
45
|
# => "<td>foo</td>"
|
52
46
|
```
|
53
47
|
|
54
|
-
|
55
|
-
|
56
|
-
|
48
|
+
Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
|
49
|
+
|
50
|
+
``` ruby
|
51
|
+
Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
|
52
|
+
# => "<td>foo</td>"
|
53
|
+
```
|
57
54
|
|
58
55
|
|
59
56
|
## Usage
|
60
57
|
|
61
58
|
Given an input String containing HTML5, infer the best way to parse it by calling `Nokogiri::HTML5::Inference.parse`.
|
62
59
|
|
63
|
-
If the input is a document, you'll get a Nokogiri::HTML5::Document back:
|
60
|
+
If the input is a document, you'll get a `Nokogiri::HTML5::Document` back:
|
64
61
|
|
65
62
|
``` ruby
|
66
63
|
html = <<~HTML
|
67
64
|
<!doctype html>
|
68
65
|
<html lang="en">
|
69
66
|
<head>
|
70
|
-
<meta
|
67
|
+
<meta charset="utf-8">
|
71
68
|
</head>
|
72
69
|
<body>
|
73
70
|
<h1>Hello, world!</h1>
|
@@ -95,19 +92,7 @@ Nokogiri::HTML5::Inference.parse(html)
|
|
95
92
|
# })
|
96
93
|
```
|
97
94
|
|
98
|
-
If the input is a fragment
|
99
|
-
|
100
|
-
``` ruby
|
101
|
-
Nokogiri::HTML5::Inference.parse("<div>hello,</div><div>world!</div>")
|
102
|
-
# => #(DocumentFragment:0x34f8 {
|
103
|
-
# name = "#document-fragment",
|
104
|
-
# children = [
|
105
|
-
# #(Element:0x3624 { name = "div", children = [ #(Text "hello,")] }),
|
106
|
-
# #(Element:0x3804 { name = "div", children = [ #(Text "world!")] })]
|
107
|
-
# })
|
108
|
-
```
|
109
|
-
|
110
|
-
or, if there are intermediate parent tags that need to be removed, you'll get a Nokogiri::XML::NodeSet:
|
95
|
+
If the input is a fragment, you'll get back a `Nokogiri::XML::NodeSet`:
|
111
96
|
|
112
97
|
``` ruby
|
113
98
|
Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
|
@@ -120,17 +105,17 @@ Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
|
|
120
105
|
# ]
|
121
106
|
```
|
122
107
|
|
123
|
-
|
124
|
-
methods like `#children`, and serialization methods like `#to_html`.
|
108
|
+
Both of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal methods like `#children`, and serialization methods like `#to_html`.
|
125
109
|
|
126
110
|
|
127
111
|
## Caveats
|
128
112
|
|
129
|
-
The implementation is currently pretty hacky and only looks at the first tag in the input to make
|
130
|
-
decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
|
113
|
+
The implementation is currently pretty hacky and only looks at the first tag in the input to make decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
|
131
114
|
|
132
115
|
The implementation is also almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.
|
133
116
|
|
117
|
+
This implementation is probably OK for handling untrusted content, but it's still new and I haven't really thought very hard about it yet. If you want to use it on untrusted content, open an issue and talk with us about your use case so we can help keep you secure!
|
118
|
+
|
134
119
|
We would welcome bug reports and pull requests improving this library!
|
135
120
|
|
136
121
|
|
@@ -138,12 +123,15 @@ We would welcome bug reports and pull requests improving this library!
|
|
138
123
|
|
139
124
|
Install the gem and add to the application's Gemfile by executing:
|
140
125
|
|
141
|
-
|
126
|
+
```bash
|
127
|
+
bundle add nokogiri-html5-inference
|
128
|
+
```
|
142
129
|
|
143
130
|
If bundler is not being used to manage dependencies, install the gem by executing:
|
144
131
|
|
145
|
-
|
146
|
-
|
132
|
+
```bash
|
133
|
+
gem install nokogiri-html5-inference
|
134
|
+
```
|
147
135
|
|
148
136
|
## Development
|
149
137
|
|
@@ -164,4 +152,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
|
|
164
152
|
|
165
153
|
## Code of Conduct
|
166
154
|
|
167
|
-
Everyone interacting in the Nokogiri::
|
155
|
+
Everyone interacting in the `Nokogiri::HTML5::Inference` project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/flavorjones/nokogiri-html5-inference/blob/main/CODE_OF_CONDUCT.md).
|
data/lib/nokogiri/html5/inference.rb
CHANGED
@@ -12,91 +12,87 @@ else
|
|
12
12
|
module HTML5
|
13
13
|
# :markup: markdown
|
14
14
|
#
|
15
|
-
#
|
16
|
-
# context-dependent parsing rules which can make it challenging to "just parse" a
|
17
|
-
# without knowing the parent node -- also called the "context node" -- in
|
15
|
+
# The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
|
16
|
+
# precise context-dependent parsing rules which can make it challenging to "just parse" a
|
17
|
+
# fragment of HTML without knowing the parent node -- also called the "context node" -- in
|
18
|
+
# which it will be inserted.
|
18
19
|
#
|
19
|
-
#
|
20
|
-
#
|
21
|
-
# but there
|
22
|
-
#
|
23
|
-
# ["in
|
20
|
+
# Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
|
21
|
+
# ["in body" insertion
|
22
|
+
# mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there
|
23
|
+
# are some notable exceptions. Perhaps the most problematic to web developers are the
|
24
|
+
# table-related tags, which will not be parsed properly unless the parser is in the ["in
|
25
|
+
# table" insertion
|
26
|
+
# mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
|
24
27
|
#
|
25
|
-
#
|
28
|
+
# For example:
|
26
29
|
#
|
27
|
-
#
|
28
|
-
#
|
29
|
-
#
|
30
|
-
#
|
30
|
+
# ``` ruby
|
31
|
+
# Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
|
32
|
+
# # => "foo" # where did the tag go!?
|
33
|
+
# ```
|
31
34
|
#
|
32
|
-
#
|
33
|
-
# and drop the tag. This fragment must be parsed "in the context" of a
|
34
|
-
#
|
35
|
+
# In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
|
36
|
+
# here", and drop the tag. This particular fragment must be parsed "in the context" of a
|
37
|
+
# table in order to parse properly.
|
35
38
|
#
|
36
|
-
#
|
37
|
-
# Nokogiri::HTML5::DocumentFragment.new(
|
38
|
-
# Nokogiri::HTML5::Document.new,
|
39
|
-
# "<td>foo</td>",
|
40
|
-
# "table" # this is the context node
|
41
|
-
# ).to_html
|
42
|
-
# # => "<tbody><tr><td>foo</td></tr></tbody>"
|
43
|
-
# ```
|
39
|
+
# Thankfully, libgumbo and Nokogiri allow us to set the context node:
|
44
40
|
#
|
45
|
-
#
|
46
|
-
#
|
47
|
-
#
|
41
|
+
# ``` ruby
|
42
|
+
# Nokogiri::HTML5::DocumentFragment.new(
|
43
|
+
# Nokogiri::HTML5::Document.new,
|
44
|
+
# "<td>foo</td>",
|
45
|
+
# "table" # <--- this is the context node
|
46
|
+
# ).to_html
|
47
|
+
# # => "<tbody><tr><td>foo</td></tr></tbody>"
|
48
|
+
# ```
|
48
49
|
#
|
49
|
-
#
|
50
|
+
# This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
|
51
|
+
# there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
|
52
|
+
# parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
|
50
53
|
#
|
51
|
-
#
|
52
|
-
#
|
53
|
-
# Nokogiri::HTML5::Document.new,
|
54
|
-
# "<td>foo</td>",
|
55
|
-
# "table" # this is the context node
|
56
|
-
# ).xpath("tbody/tr/*").to_html
|
57
|
-
# # => "<td>foo</td>"
|
58
|
-
# ```
|
54
|
+
# We can fix this to only return the tags we provided by using the `<template>` tag as the
|
55
|
+
# context node, which the HTML5 spec provides exactly for this purpose:
|
59
56
|
#
|
60
|
-
#
|
61
|
-
#
|
62
|
-
#
|
57
|
+
# ``` ruby
|
58
|
+
# Nokogiri::HTML5::DocumentFragment.new(
|
59
|
+
# Nokogiri::HTML5::Document.new,
|
60
|
+
# "<td>foo</td>",
|
61
|
+
# "template" # <--- this is the context node
|
62
|
+
# ).to_html
|
63
|
+
# # => "<td>foo</td>"
|
64
|
+
# ```
|
65
|
+
#
|
66
|
+
# Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
|
67
|
+
#
|
68
|
+
# ``` ruby
|
69
|
+
# Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
|
70
|
+
# # => "<td>foo</td>"
|
71
|
+
# ```
|
63
72
|
#
|
64
73
|
module Inference
|
65
74
|
# Tags that must be parsed in a specific HTML5 insertion mode, for which we must use a
|
66
75
|
# context node.
|
67
76
|
module ContextTags # :nodoc:
|
68
|
-
TABLE = %w[thead tbody tfoot tr td th col colgroup caption].freeze
|
69
77
|
HTML = %w[head body].freeze
|
70
78
|
end
|
71
79
|
|
72
80
|
# Regular expressions used to determine if we need to use a context node.
|
73
81
|
module ContextRegexp # :nodoc:
|
74
82
|
DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
|
75
|
-
TABLE = /\A\s*<(#{ContextTags::TABLE.join("|")})\b/i
|
76
83
|
HTML = /\A\s*<(#{ContextTags::HTML.join("|")})\b/i
|
77
84
|
end
|
78
85
|
|
79
|
-
# Tags that get an intermediate parent created for them according to the HTML5 spec.
|
80
|
-
module PluckTags # :nodoc:
|
81
|
-
TBODY = %w[tr].freeze
|
82
|
-
TBODY_TR = %w[td th].freeze
|
83
|
-
COLGROUP = %w[col].freeze
|
84
|
-
end
|
85
|
-
|
86
86
|
# Regular expressions used to determine if we will need to skip an intermediate parent or
|
87
87
|
# otherwise narrow the fragment DOM that is returned.
|
88
88
|
module PluckRegexp # :nodoc:
|
89
|
-
TBODY = /\A\s*<(#{PluckTags::TBODY.join("|")})\b/i
|
90
|
-
TBODY_TR = /\A\s*<(#{PluckTags::TBODY_TR.join("|")})\b/i
|
91
|
-
COLGROUP = /\A\s*<(#{PluckTags::COLGROUP.join("|")})\b/i
|
92
|
-
HEAD_OUTER = /\A\s*<(head)\b/i
|
93
89
|
BODY_OUTER = /\A\s*<(body)\b/i
|
94
90
|
end
|
95
91
|
|
96
92
|
class << self
|
97
93
|
#
|
98
94
|
# call-seq:
|
99
|
-
# parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::
|
95
|
+
# parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
|
100
96
|
#
|
101
97
|
# Based on the start of the input HTML5 string, guess whether it's a full document or a
|
102
98
|
# fragment and, using the fragment context node if necessary, parse it properly and
|
@@ -114,14 +110,13 @@ else
|
|
114
110
|
#
|
115
111
|
# [Keyword Parameters]
|
116
112
|
# - +pluck+ (Boolean) Default: +true+. Set to +false+ if you want the method to always
|
117
|
-
# return
|
118
|
-
#
|
119
|
-
#
|
113
|
+
# return what Nokogiri parsed, without attempting to remove any sibling or intermediate
|
114
|
+
# parent nodes. This shouldn't be necessary if the library is working properly, but may
|
115
|
+
# be useful to allow users to work around a bad guess.
|
120
116
|
#
|
121
117
|
# [Returns]
|
122
118
|
# - A +Nokogiri::HTML5::Document+ if the input appears to represent a full document.
|
123
|
-
# - A +Nokogiri::
|
124
|
-
# appears to be a fragment.
|
119
|
+
# - A +Nokogiri::XML::NodeSet+ if the input appears to be a fragment.
|
125
120
|
#
|
126
121
|
def parse(input, pluck: true)
|
127
122
|
context = Nokogiri::HTML5::Inference.context(input)
|
@@ -132,7 +127,7 @@ else
|
|
132
127
|
if pluck && (path = pluck_path(input))
|
133
128
|
fragment.xpath(path)
|
134
129
|
else
|
135
|
-
fragment
|
130
|
+
fragment.children
|
136
131
|
end
|
137
132
|
end
|
138
133
|
end
|
@@ -154,9 +149,8 @@ else
|
|
154
149
|
def context(input) # :nodoc:
|
155
150
|
case input
|
156
151
|
when ContextRegexp::DOCUMENT then nil
|
157
|
-
when ContextRegexp::TABLE then "table"
|
158
152
|
when ContextRegexp::HTML then "html"
|
159
|
-
else "
|
153
|
+
else "template"
|
160
154
|
end
|
161
155
|
end
|
162
156
|
|
@@ -177,10 +171,6 @@ else
|
|
177
171
|
#
|
178
172
|
def pluck_path(input) # :nodoc:
|
179
173
|
case input
|
180
|
-
when PluckRegexp::TBODY then "tbody/*"
|
181
|
-
when PluckRegexp::TBODY_TR then "tbody/tr/*"
|
182
|
-
when PluckRegexp::COLGROUP then "colgroup/*"
|
183
|
-
when PluckRegexp::HEAD_OUTER then "head"
|
184
174
|
when PluckRegexp::BODY_OUTER then "body"
|
185
175
|
end
|
186
176
|
end
|
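Putting the surviving lines of this hunk together, the 0.3.0 decision logic can be sketched as a small standalone module. The regular expressions and return values are taken from the lines that remain after this diff; the module name `InferenceSketch` is hypothetical, and the sketch only reproduces the string inspection, not the Nokogiri parsing that `parse` performs with the result.

```ruby
# Sketch only: reproduces the first-tag inspection visible in this diff.
# `InferenceSketch` is a hypothetical name, not the gem's own module.
module InferenceSketch
  DOCUMENT   = /\A\s*(<!doctype\s+html\b|<html\b)/i
  HTML       = /\A\s*<(head|body)\b/i
  BODY_OUTER = /\A\s*<(body)\b/i

  # Returns nil for a full document, otherwise the name of the
  # context node to use for fragment parsing.
  def self.context(input)
    case input
    when DOCUMENT then nil
    when HTML     then "html"
    else "template"
    end
  end

  # Returns an XPath for narrowing the parsed fragment, or nil
  # when the fragment's children can be returned as-is.
  def self.pluck_path(input)
    case input
    when BODY_OUTER then "body"
    end
  end
end

InferenceSketch.context("<!doctype html><html></html>") # => nil
InferenceSketch.context("<head></head><body></body>")   # => "html"
InferenceSketch.context("<td>foo</td>")                 # => "template"
InferenceSketch.pluck_path("<body><p>hi</p></body>")    # => "body"
```

Because everything other than documents and `<head>`/`<body>` input now falls through to `"template"`, table fragments no longer need a dedicated context or pluck rule.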
metadata
CHANGED
@@ -1,14 +1,14 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: nokogiri-html5-inference
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 0.
|
4
|
+
version: 0.3.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Mike Dalessio
|
8
8
|
autorequire:
|
9
9
|
bindir: bin
|
10
10
|
cert_chain: []
|
11
|
-
date: 2024-
|
11
|
+
date: 2024-05-05 00:00:00.000000000 Z
|
12
12
|
dependencies:
|
13
13
|
- !ruby/object:Gem::Dependency
|
14
14
|
name: nokogiri
|