nokogiri-html5-inference 0.1.1 → 0.2.0
- checksums.yaml +4 -4
- data/CHANGELOG.md +5 -0
- data/README.md +20 -10
- data/lib/nokogiri/html5/inference/version.rb +1 -1
- data/lib/nokogiri/html5/inference.rb +45 -41
- metadata +2 -2
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: bd21e524361107504a10fa58bd991b9b89bd40f968b7550b90d38454156e65c2
+ data.tar.gz: ddbedefb2798713c3ae46d27df75efa5c0bfe10d1782dc419d8587dd6a155a74
  SHA512:
- metadata.gz:
- data.tar.gz:
+ metadata.gz: c46997b8c93033fb53e5c2b141a34bdbe93d6c1f55a3cd3de76abd6116c807b60ae2a26dfb02e8ce06adf441308fbf0e2522eb0b302ef11fedb1231c33df0300
+ data.tar.gz: e24dc98db23641a8edb55824a7a4c0f7d858f6c0d2fe99f854a56c62ebc97a8e4f0f79df849798bb2097b9e0253213ebbbf3adbd988a3580b55ca891f8375a63
data/CHANGELOG.md
CHANGED
@@ -1,5 +1,10 @@
  ## [Unreleased]

+ ## [0.2.0] - 2024-04-26
+
+ - When a `<head>` tag is seen first in the input string, include the `<body>` tag in the returned fragment or node set. (#3, #4) @flavorjones
+
+
  ## [0.1.1] - 2024-04-24

  - Make protected methods `#context` and `#pluck_path` public, but keeping them undocumented.
data/README.md
CHANGED
@@ -2,7 +2,10 @@

  Given HTML5 input, make a reasonable guess at how to parse it correctly.

-
+ Nokogiri::HTML5::Inference makes reasonable inferences that work for both HTML5 documents and HTML5
+ fragments, and for all the different HTML5 tags that a web developer might need in a view library.
+
+ This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.

  ## The problem this library solves

@@ -20,23 +23,25 @@ For example:

  ``` ruby
  Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
- # => "foo"
+ # => "foo" # where did the tag go!?
  ```

  In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
- and drop the tag. This fragment must be parsed "in the context" of a table in order to
- properly.
+ and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
+ parse properly.
+
+ Thankfully, libgumbo and Nokogiri allow us to set the context node:

  ``` ruby
  Nokogiri::HTML5::DocumentFragment.new(
  Nokogiri::HTML5::Document.new,
  "<td>foo</td>",
- "table" # this is the context node
+ "table" # <--- this is the context node
  ).to_html
  # => "<tbody><tr><td>foo</td></tr></tbody>"
  ```

- This is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
+ This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
  _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
  the `<td>` tag must be wrapped in `<tbody><tr>` tags.

@@ -51,9 +56,12 @@ Nokogiri::HTML5::DocumentFragment.new(
  # => "<td>foo</td>"
  ```

-
-
-
+ Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
+
+ ``` ruby
+ Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
+ # => "<td>foo</td>"
+ ```


  ## Usage
@@ -67,7 +75,7 @@ html = <<~HTML
  <!doctype html>
  <html lang="en">
  <head>
- <meta
+ <meta charset="utf-8">
  </head>
  <body>
  <h1>Hello, world!</h1>
@@ -131,6 +139,8 @@ decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do

  The implementation also is almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.

+ This implementation is probably OK for handling untrusted content, but it's still new and I haven't really thought very hard about it yet. If you want to use it on untrusted content, open an issue and talk with us about your use case so we can help keep you secure!
+
  We would welcome bug reports and pull requests improving this library!


data/lib/nokogiri/html5/inference.rb
CHANGED
@@ -12,55 +12,59 @@ else
  module HTML5
  # :markup: markdown
  #
- #
- #
- #
+ # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
+ # context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
+ # without knowing the parent node -- also called the "context node" -- in which it will be inserted.
  #
- #
- #
- #
- #
- #
+ # Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
+ # ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
+ # but there are some notable exceptions. Perhaps the most problematic to web developers are the
+ # table-related tags, which will not be parsed properly unless the parser is in the
+ # ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
  #
- #
+ # For example:
  #
- #
- #
- #
- #
+ # ``` ruby
+ # Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
+ # # => "foo" # where did the tag go!?
+ # ```
  #
- #
- #
- # properly.
+ # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
+ # and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
+ # parse properly.
  #
- #
- # Nokogiri::HTML5::DocumentFragment.new(
- # Nokogiri::HTML5::Document.new,
- # "<td>foo</td>",
- # "table" # this is the context node
- # ).to_html
- # # => "<tbody><tr><td>foo</td></tr></tbody>"
- # ```
+ # Thankfully, libgumbo and Nokogiri allow us to set the context node:
  #
- #
- #
- #
+ # ``` ruby
+ # Nokogiri::HTML5::DocumentFragment.new(
+ # Nokogiri::HTML5::Document.new,
+ # "<td>foo</td>",
+ # "table" # <--- this is the context node
+ # ).to_html
+ # # => "<tbody><tr><td>foo</td></tr></tbody>"
+ # ```
  #
- #
+ # This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
+ # _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
+ # the `<td>` tag must be wrapped in `<tbody><tr>` tags.
  #
- #
- # Nokogiri::HTML5::DocumentFragment.new(
- # Nokogiri::HTML5::Document.new,
- # "<td>foo</td>",
- # "table" # this is the context node
- # ).xpath("tbody/tr/*").to_html
- # # => "<td>foo</td>"
- # ```
+ # We can narrow down the result set with an XPath query to get back only the intended tags:
  #
- #
- #
- #
+ # ``` ruby
+ # Nokogiri::HTML5::DocumentFragment.new(
+ # Nokogiri::HTML5::Document.new,
+ # "<td>foo</td>",
+ # "table" # this is the context node
+ # ).xpath("tbody/tr/*").to_html
+ # # => "<td>foo</td>"
+ # ```
  #
+ # Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
+ #
+ # ``` ruby
+ # Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
+ # # => "<td>foo</td>"
+ # ```
  module Inference
  # Tags that must be parsed in a specific HTML5 insertion mode, for which we must use a
  # context node.
@@ -89,7 +93,7 @@ else
  TBODY = /\A\s*<(#{PluckTags::TBODY.join("|")})\b/i
  TBODY_TR = /\A\s*<(#{PluckTags::TBODY_TR.join("|")})\b/i
  COLGROUP = /\A\s*<(#{PluckTags::COLGROUP.join("|")})\b/i
-
+ HTML_INNER = /\A\s*<(head)\b/i
  BODY_OUTER = /\A\s*<(body)\b/i
  end

@@ -180,7 +184,7 @@ else
  when PluckRegexp::TBODY then "tbody/*"
  when PluckRegexp::TBODY_TR then "tbody/tr/*"
  when PluckRegexp::COLGROUP then "colgroup/*"
- when PluckRegexp::
+ when PluckRegexp::HTML_INNER then "./*"
  when PluckRegexp::BODY_OUTER then "body"
  end
  end
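The two hunks above pair a leading-tag regexp with an XPath "pluck path". As a rough, standalone sketch of that dispatch (not the gem's actual source), the snippet below rebuilds the regexps with plain Ruby and no Nokogiri dependency. The module name `PluckSketch` and the three `*_TAGS` lists are assumptions made for illustration, since the diff does not show the contents of `PluckTags`. The regexp shapes and the returned paths are copied from the hunks.

```ruby
# Illustrative sketch only: PluckSketch and the *_TAGS lists are hypothetical
# stand-ins; the diff does not show what PluckTags actually contains.
module PluckSketch
  TBODY_TAGS    = %w[tr].freeze     # assumed: tags the parser wraps in <tbody>
  TBODY_TR_TAGS = %w[td th].freeze  # assumed: tags the parser wraps in <tbody><tr>
  COLGROUP_TAGS = %w[col].freeze    # assumed: tags the parser wraps in <colgroup>

  # Same shape as the regexps in the diff: optional leading whitespace,
  # then the first start tag, matched case-insensitively.
  TBODY      = /\A\s*<(#{TBODY_TAGS.join("|")})\b/i
  TBODY_TR   = /\A\s*<(#{TBODY_TR_TAGS.join("|")})\b/i
  COLGROUP   = /\A\s*<(#{COLGROUP_TAGS.join("|")})\b/i
  HTML_INNER = /\A\s*<(head)\b/i
  BODY_OUTER = /\A\s*<(body)\b/i

  # Returns the XPath used to pluck the intended nodes out of the parsed
  # fragment, or nil when the leading tag needs no plucking.
  def self.pluck_path(input)
    case input
    when TBODY then "tbody/*"
    when TBODY_TR then "tbody/tr/*"
    when COLGROUP then "colgroup/*"
    when HTML_INNER then "./*"
    when BODY_OUTER then "body"
    end
  end
end

puts PluckSketch.pluck_path("<td>foo</td>")                 # tbody/tr/*
puts PluckSketch.pluck_path("<head></head><body></body>")   # ./*
p    PluckSketch.pluck_path("<div>foo</div>")               # nil
```

Under these assumptions, the new `HTML_INNER` branch means input that starts with `<head>` maps to `./*`, which reads as "every child of the parsed fragment" and so is consistent with the changelog note about `<body>` now being included alongside `<head>`.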
metadata
CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: nokogiri-html5-inference
  version: !ruby/object:Gem::Version
- version: 0.
+ version: 0.2.0
  platform: ruby
  authors:
  - Mike Dalessio
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-04-
+ date: 2024-04-26 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: nokogiri