nokogiri-html5-inference 0.1.1 → 0.2.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
- metadata.gz: '0942dd8a89c2d930794a10583ed84746da2c434af00ccb3c3d52c4932fa1b812'
- data.tar.gz: c09d3bf45f24570c4c4d80571a16e35c7fdefa82be467fead21334194788938f
+ metadata.gz: bd21e524361107504a10fa58bd991b9b89bd40f968b7550b90d38454156e65c2
+ data.tar.gz: ddbedefb2798713c3ae46d27df75efa5c0bfe10d1782dc419d8587dd6a155a74
  SHA512:
- metadata.gz: 5e931d3e3c4a3a516c046bf04ad352c58acd0118fca531fc3ce7f5a9ce3a93f1918ac96447c53814185e58d3d51c578ea4e45dbd15f68e9d290a696a6d6a0783
- data.tar.gz: 49d6050376b44467248e3c7c85251add238773c954914d874c282108c5569249394852ce3028693fef547ea60e41346e2be328a105e9076106250449d1d02b29
+ metadata.gz: c46997b8c93033fb53e5c2b141a34bdbe93d6c1f55a3cd3de76abd6116c807b60ae2a26dfb02e8ce06adf441308fbf0e2522eb0b302ef11fedb1231c33df0300
+ data.tar.gz: e24dc98db23641a8edb55824a7a4c0f7d858f6c0d2fe99f854a56c62ebc97a8e4f0f79df849798bb2097b9e0253213ebbbf3adbd988a3580b55ca891f8375a63
data/CHANGELOG.md CHANGED
@@ -1,5 +1,10 @@
  ## [Unreleased]
 
+ ## [0.2.0] - 2024-04-26
+
+ - When a `<head>` tag is seen first in the input string, include the `<body>` tag in the returned fragment or node set. (#3, #4) @flavorjones
+
+
  ## [0.1.1] - 2024-04-24
 
  - Make protected methods `#context` and `#pluck_path` public, but keeping them undocumented.
data/README.md CHANGED
@@ -2,7 +2,10 @@
 
  Given HTML5 input, make a reasonable guess at how to parse it correctly.
 
- Infer from the HTML5 input whether it's a fragment or a document, and if it's a fragment what the proper context node should be. This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
+ Nokogiri::HTML5::Inference makes reasonable inferences that work for both HTML5 documents and HTML5
+ fragments, and for all the different HTML5 tags that a web developer might need in a view library.
+
+ This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
 
  ## The problem this library solves
 
@@ -20,23 +23,25 @@ For example:
 
  ``` ruby
  Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
- # => "foo"
+ # => "foo" # where did the tag go!?
  ```
 
  In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
- and drop the tag. This fragment must be parsed "in the context" of a table in order to parse
- properly. Thankfully, libgumbo and Nokogiri allow us to do this:
+ and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
+ parse properly.
+
+ Thankfully, libgumbo and Nokogiri allow us to set the context node:
 
  ``` ruby
  Nokogiri::HTML5::DocumentFragment.new(
  Nokogiri::HTML5::Document.new,
  "<td>foo</td>",
- "table" # this is the context node
+ "table" # <--- this is the context node
  ).to_html
  # => "<tbody><tr><td>foo</td></tr></tbody>"
  ```
 
- This is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
+ This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
  _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
  the `<td>` tag must be wrapped in `<tbody><tr>` tags.
 
@@ -51,9 +56,12 @@ Nokogiri::HTML5::DocumentFragment.new(
  # => "<td>foo</td>"
  ```
 
- Hurrah! This is precisely what Nokogiri::HTML5::Inference.parse does: make reasonable inferences
- that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a
- web developer might need in a view library.
+ Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
+
+ ``` ruby
+ Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
+ # => "<td>foo</td>"
+ ```
 
 
  ## Usage
@@ -67,7 +75,7 @@ html = <<~HTML
  <!doctype html>
  <html lang="en">
  <head>
- <meta encoding="UTF-8">
+ <meta charset="utf-8">
  </head>
  <body>
  <h1>Hello, world!</h1>
@@ -131,6 +139,8 @@ decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do
 
  The implementation also is almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.
 
+ This implementation is probably OK for handling untrusted content, but it's still new and I haven't really thought very hard about it yet. If you want to use it on untrusted content, open an issue and talk with us about your use case so we can help keep you secure!
+
  We would welcome bug reports and pull requests improving this library!
 
 
@@ -3,7 +3,7 @@
  module Nokogiri
  module HTML5
  module Inference
- VERSION = "0.1.1"
+ VERSION = "0.2.0"
  end
  end
  end
@@ -12,55 +12,59 @@ else
  module HTML5
  # :markup: markdown
  #
- # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
- # context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
- # without knowing the parent node -- also called the "context node" -- in which it will be inserted.
+ # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
+ # context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
+ # without knowing the parent node -- also called the "context node" -- in which it will be inserted.
  #
- # Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
- # ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
- # but there are some notable exceptions. Perhaps the most problematic to web developers are the
- # table-related tags, which will not be parsed properly unless the parser is in the
- # ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
+ # Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
+ # ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
+ # but there are some notable exceptions. Perhaps the most problematic to web developers are the
+ # table-related tags, which will not be parsed properly unless the parser is in the
+ # ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
  #
- # For example:
+ # For example:
  #
- # ``` ruby
- # Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
- # # => "foo"
- # ```
+ # ``` ruby
+ # Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
+ # # => "foo" # where did the tag go!?
+ # ```
  #
- # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
- # and drop the tag. This fragment must be parsed "in the context" of a table in order to parse
- # properly. Thankfully, libgumbo and Nokogiri allow us to do this:
+ # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
+ # and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
+ # parse properly.
  #
- # ``` ruby
- # Nokogiri::HTML5::DocumentFragment.new(
- # Nokogiri::HTML5::Document.new,
- # "<td>foo</td>",
- # "table" # this is the context node
- # ).to_html
- # # => "<tbody><tr><td>foo</td></tr></tbody>"
- # ```
+ # Thankfully, libgumbo and Nokogiri allow us to set the context node:
  #
- # This is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
- # _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
- # the `<td>` tag must be wrapped in `<tbody><tr>` tags.
+ # ``` ruby
+ # Nokogiri::HTML5::DocumentFragment.new(
+ # Nokogiri::HTML5::Document.new,
+ # "<td>foo</td>",
+ # "table" # <--- this is the context node
+ # ).to_html
+ # # => "<tbody><tr><td>foo</td></tr></tbody>"
+ # ```
  #
- # We can narrow down the result set with an XPath query to get back only the intended tags:
+ # This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
+ # _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
+ # the `<td>` tag must be wrapped in `<tbody><tr>` tags.
  #
- # ``` ruby
- # Nokogiri::HTML5::DocumentFragment.new(
- # Nokogiri::HTML5::Document.new,
- # "<td>foo</td>",
- # "table" # this is the context node
- # ).xpath("tbody/tr/*").to_html
- # # => "<td>foo</td>"
- # ```
+ # We can narrow down the result set with an XPath query to get back only the intended tags:
  #
- # Hurrah! This is precisely what Nokogiri::HTML5::Inference.parse does: make reasonable inferences
- # that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a
- # web developer might need in a view library.
+ # ``` ruby
+ # Nokogiri::HTML5::DocumentFragment.new(
+ # Nokogiri::HTML5::Document.new,
+ # "<td>foo</td>",
+ # "table" # this is the context node
+ # ).xpath("tbody/tr/*").to_html
+ # # => "<td>foo</td>"
+ # ```
  #
+ # Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
+ #
+ # ``` ruby
+ # Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
+ # # => "<td>foo</td>"
+ # ```
  module Inference
  # Tags that must be parsed in a specific HTML5 insertion mode, for which we must use a
  # context node.
@@ -89,7 +93,7 @@ else
  TBODY = /\A\s*<(#{PluckTags::TBODY.join("|")})\b/i
  TBODY_TR = /\A\s*<(#{PluckTags::TBODY_TR.join("|")})\b/i
  COLGROUP = /\A\s*<(#{PluckTags::COLGROUP.join("|")})\b/i
- HEAD_OUTER = /\A\s*<(head)\b/i
+ HTML_INNER = /\A\s*<(head)\b/i
  BODY_OUTER = /\A\s*<(body)\b/i
  end
 
@@ -180,7 +184,7 @@ else
  when PluckRegexp::TBODY then "tbody/*"
  when PluckRegexp::TBODY_TR then "tbody/tr/*"
  when PluckRegexp::COLGROUP then "colgroup/*"
- when PluckRegexp::HEAD_OUTER then "head"
+ when PluckRegexp::HTML_INNER then "./*"
  when PluckRegexp::BODY_OUTER then "body"
  end
  end
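
Note on the hunk above: the rename from `HEAD_OUTER` to `HTML_INNER` leaves the regexp itself unchanged; only the XPath it maps to changes, from `"head"` to `"./*"`. The following is a minimal stdlib-only sketch of that dispatch, not the library's actual method. The name `pluck_path_sketch` is hypothetical, and the table-related branches are left out because the `PluckTags` lists are not shown in this diff. Presumably `"./*"` selects every top-level child of the parsed fragment rather than only `<head>`, which would match the CHANGELOG entry about including `<body>`.

```ruby
# Hypothetical sketch: mirrors only the two regexps shown in full in this diff.
HTML_INNER = /\A\s*<(head)\b/i
BODY_OUTER = /\A\s*<(body)\b/i

# Returns the XPath used to pluck nodes out of the parsed fragment,
# or nil when the input does not start with a <head> or <body> tag.
def pluck_path_sketch(input)
  case input
  when HTML_INNER then "./*"  # 0.2.0 behavior; 0.1.1 returned "head" here
  when BODY_OUTER then "body"
  end
end

pluck_path_sketch("<head><title>x</title></head><body>y</body>") # => "./*"
pluck_path_sketch("  <BODY class='a'>y</BODY>")                  # => "body"
pluck_path_sketch("<header>z</header>")                          # => nil
```

The `\b` word boundary is what keeps `<header>` from matching the `head` pattern, and the `i` flag makes the tag match case-insensitive.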
metadata CHANGED
@@ -1,14 +1,14 @@
  --- !ruby/object:Gem::Specification
  name: nokogiri-html5-inference
  version: !ruby/object:Gem::Version
- version: 0.1.1
+ version: 0.2.0
  platform: ruby
  authors:
  - Mike Dalessio
  autorequire:
  bindir: bin
  cert_chain: []
- date: 2024-04-24 00:00:00.000000000 Z
+ date: 2024-04-26 00:00:00.000000000 Z
  dependencies:
  - !ruby/object:Gem::Dependency
  name: nokogiri