nokogiri-html5-inference 0.1.1 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: '0942dd8a89c2d930794a10583ed84746da2c434af00ccb3c3d52c4932fa1b812'
4
- data.tar.gz: c09d3bf45f24570c4c4d80571a16e35c7fdefa82be467fead21334194788938f
3
+ metadata.gz: '068c7d6b9bea47be40be5ac4429d673f9eed2d6666a0f30fa0ffa52d71bd7d96'
4
+ data.tar.gz: 0d265567d61bc23fd08997772651ec15c2438c25b969380e9bce82a63f961310
5
5
  SHA512:
6
- metadata.gz: 5e931d3e3c4a3a516c046bf04ad352c58acd0118fca531fc3ce7f5a9ce3a93f1918ac96447c53814185e58d3d51c578ea4e45dbd15f68e9d290a696a6d6a0783
7
- data.tar.gz: 49d6050376b44467248e3c7c85251add238773c954914d874c282108c5569249394852ce3028693fef547ea60e41346e2be328a105e9076106250449d1d02b29
6
+ metadata.gz: 15998fdfc4c3a2207ca150c47b4ec2d9d9c0012e1288634d092f1dc76d0c729c730c6a56a591d49ea303a38f4652b498940b338972f905f493e09b78d2fce879
7
+ data.tar.gz: 55ecdad7efa27e7580aa5fed39c5f9ce3860a6992582e7dfb21fb0a32c4ad355591af75c44e71a43327d75c4afe20ff8da832d5079fd39678f91df26ece89b06
data/CHANGELOG.md CHANGED
@@ -1,10 +1,23 @@
1
- ## [Unreleased]
1
+ ## Unreleased
2
2
 
3
- ## [0.1.1] - 2024-04-24
3
+
4
+ ## v0.3.0 - 2024-05-05
5
+
6
+ - Use a `<template>` tag as the context node for the majority of fragment parsing, which greatly simplifies this gem. #7 @flavorjones @stevecheckoway
7
+ - Clean up the README. @marcoroth
8
+ - `Nokogiri::HTML5::Inference.parse` always returns a `Nokogiri::XML::NodeSet` for fragments. Previously this method sometimes returned a `Nokogiri::HTML5::DocumentFragment`, but some API inconsistencies between `DocumentFragment` and `NodeSet` made using the returned object tricky. We hope this provides a more consistent development experience. @flavorjones
9
+
10
+
11
+ ## v0.2.0 - 2024-04-26
12
+
13
+ - When a `<head>` tag is seen first in the input string, include the `<body>` tag in the returned fragment or node set. (#3, #4) @flavorjones
14
+
15
+
16
+ ## v0.1.1 - 2024-04-24
4
17
 
5
18
  - Make protected methods `#context` and `#pluck_path` public, but keep them undocumented.
6
19
 
7
20
 
8
- ## [0.1.0] - 2024-04-24
21
+ ## v0.1.0 - 2024-04-24
9
22
 
10
23
  - Initial release
data/README.md CHANGED
@@ -1,73 +1,70 @@
1
- # Nokogiri::Html5::Inference
1
+ # Nokogiri::HTML5::Inference
2
2
 
3
3
  Given HTML5 input, make a reasonable guess at how to parse it correctly.
4
4
 
5
- Infer from the HTML5 input whether it's a fragment or a document, and if it's a fragment what the proper context node should be. This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
5
+ `Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a web developer might need in a view library.
6
+
7
+ This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
6
8
 
7
9
  ## The problem this library solves
8
10
 
9
- The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
10
- context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
11
- without knowing the parent node -- also called the "context node" -- in which it will be inserted.
11
+ The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML without knowing the parent node -- also called the "context node" -- in which it will be inserted.
12
12
 
13
- Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
14
- ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
15
- but there are some notable exceptions. Perhaps the most problematic to web developers are the
16
- table-related tags, which will not be parsed properly unless the parser is in the
17
- ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
13
+ Most content in an HTML5 document can be parsed assuming the parser's mode will be in the ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there are some notable exceptions. Perhaps the most problematic to web developers are the table-related tags, which will not be parsed properly unless the parser is in the ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
18
14
 
19
15
  For example:
20
16
 
21
17
  ``` ruby
22
18
  Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
23
- # => "foo"
19
+ # => "foo" # where did the tag go!?
24
20
  ```
25
21
 
26
- In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
27
- and drop the tag. This fragment must be parsed "in the context" of a table in order to parse
28
- properly. Thankfully, libgumbo and Nokogiri allow us to do this:
22
+ In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here", and drop the tag. This particular fragment must be parsed "in the context" of a table in order to parse properly.
23
+
24
+ Thankfully, libgumbo and Nokogiri allow us to set the context node:
29
25
 
30
26
  ``` ruby
31
27
  Nokogiri::HTML5::DocumentFragment.new(
32
28
  Nokogiri::HTML5::Document.new,
33
29
  "<td>foo</td>",
34
- "table" # this is the context node
30
+ "table" # <--- this is the context node
35
31
  ).to_html
36
32
  # => "<tbody><tr><td>foo</td></tr></tbody>"
37
33
  ```
38
34
 
39
- This is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
40
- _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
41
- the `<td>` tag must be wrapped in `<tbody><tr>` tags.
35
+ This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
42
36
 
43
- We can narrow down the result set with an XPath query to get back only the intended tags:
37
+ We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:
44
38
 
45
39
  ``` ruby
46
40
  Nokogiri::HTML5::DocumentFragment.new(
47
41
  Nokogiri::HTML5::Document.new,
48
42
  "<td>foo</td>",
49
- "table" # this is the context node
50
- ).xpath("tbody/tr/*").to_html
43
+ "template" # <--- this is the context node
44
+ ).to_html
51
45
  # => "<td>foo</td>"
52
46
  ```
53
47
 
54
- Hurrah! This is precisely what Nokogiri::HTML5::Inference.parse does: make reasonable inferences
55
- that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a
56
- web developer might need in a view library.
48
+ Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
49
+
50
+ ``` ruby
51
+ Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
52
+ # => "<td>foo</td>"
53
+ ```
57
54
 
58
55
 
59
56
  ## Usage
60
57
 
61
58
  Given an input String containing HTML5, infer the best way to parse it by calling `Nokogiri::HTML5::Inference.parse`.
62
59
 
63
- If the input is a document, you'll get a Nokogiri::HTML5::Document back:
60
+ If the input is a document, you'll get a `Nokogiri::HTML5::Document` back:
64
61
 
65
62
  ``` ruby
66
63
  html = <<~HTML
67
64
  <!doctype html>
68
65
  <html lang="en">
69
66
  <head>
70
- <meta encoding="UTF-8">
67
+ <meta charset="utf-8">
71
68
  </head>
72
69
  <body>
73
70
  <h1>Hello, world!</h1>
@@ -95,19 +92,7 @@ Nokogiri::HTML5::Inference.parse(html)
95
92
  # })
96
93
  ```
97
94
 
98
- If the input is a fragment that is parsed normally, you'll either get a Nokogiri::HTML5::DocumentFragment back:
99
-
100
- ``` ruby
101
- Nokogiri::HTML5::Inference.parse("<div>hello,</div><div>world!</div>")
102
- # => #(DocumentFragment:0x34f8 {
103
- # name = "#document-fragment",
104
- # children = [
105
- # #(Element:0x3624 { name = "div", children = [ #(Text "hello,")] }),
106
- # #(Element:0x3804 { name = "div", children = [ #(Text "world!")] })]
107
- # })
108
- ```
109
-
110
- or, if there are intermediate parent tags that need to be removed, you'll get a Nokogiri::XML::NodeSet:
95
+ If the input is a fragment, you'll get back a `Nokogiri::XML::NodeSet`:
111
96
 
112
97
  ``` ruby
113
98
  Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
@@ -120,17 +105,17 @@ Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
120
105
  # ]
121
106
  ```
122
107
 
123
- All of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal
124
- methods like `#children`, and serialization methods like `#to_html`.
108
+ Both of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal methods like `#children`, and serialization methods like `#to_html`.
125
109
 
126
110
 
127
111
  ## Caveats
128
112
 
129
- The implementation is currently pretty hacky and only looks at the first tag in the input to make
130
- decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
113
+ The implementation is currently pretty hacky and only looks at the first tag in the input to make decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
131
114
 
132
115
  The implementation also is almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.
133
116
 
117
+ This implementation is probably OK for handling untrusted content, but it's still new and I haven't really thought very hard about it yet. If you want to use it on untrusted content, open an issue and talk with us about your use case so we can help keep you secure!
118
+
134
119
  We would welcome bug reports and pull requests improving this library!
135
120
 
136
121
 
@@ -138,12 +123,15 @@ We would welcome bug reports and pull requests improving this library!
138
123
 
139
124
  Install the gem and add to the application's Gemfile by executing:
140
125
 
141
- $ bundle add nokogiri-html5-inference
126
+ ```bash
127
+ bundle add nokogiri-html5-inference
128
+ ```
142
129
 
143
130
  If bundler is not being used to manage dependencies, install the gem by executing:
144
131
 
145
- $ gem install nokogiri-html5-inference
146
-
132
+ ```bash
133
+ gem install nokogiri-html5-inference
134
+ ```
147
135
 
148
136
  ## Development
149
137
 
@@ -164,4 +152,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
164
152
 
165
153
  ## Code of Conduct
166
154
 
167
- Everyone interacting in the Nokogiri::Html5::Inference project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/flavorjones/nokogiri-html5-inference/blob/main/CODE_OF_CONDUCT.md).
155
+ Everyone interacting in the `Nokogiri::HTML5::Inference` project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/flavorjones/nokogiri-html5-inference/blob/main/CODE_OF_CONDUCT.md).
@@ -3,7 +3,7 @@
3
3
  module Nokogiri
4
4
  module HTML5
5
5
  module Inference
6
- VERSION = "0.1.1"
6
+ VERSION = "0.3.0"
7
7
  end
8
8
  end
9
9
  end
@@ -12,91 +12,87 @@ else
12
12
  module HTML5
13
13
  # :markup: markdown
14
14
  #
15
- # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
16
- # context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
17
- # without knowing the parent node -- also called the "context node" -- in which it will be inserted.
15
+ # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
16
+ # precise context-dependent parsing rules which can make it challenging to "just parse" a
17
+ # fragment of HTML without knowing the parent node -- also called the "context node" -- in
18
+ # which it will be inserted.
18
19
  #
19
- # Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
20
- # ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
21
- # but there are some notable exceptions. Perhaps the most problematic to web developers are the
22
- # table-related tags, which will not be parsed properly unless the parser is in the
23
- # ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
20
+ # Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
21
+ # ["in body" insertion
22
+ # mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there
23
+ # are some notable exceptions. Perhaps the most problematic to web developers are the
24
+ # table-related tags, which will not be parsed properly unless the parser is in the ["in
25
+ # table" insertion
26
+ # mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
24
27
  #
25
- # For example:
28
+ # For example:
26
29
  #
27
- # ``` ruby
28
- # Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
29
- # # => "foo"
30
- # ```
30
+ # ``` ruby
31
+ # Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
32
+ # # => "foo" # where did the tag go!?
33
+ # ```
31
34
  #
32
- # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
33
- # and drop the tag. This fragment must be parsed "in the context" of a table in order to parse
34
- # properly. Thankfully, libgumbo and Nokogiri allow us to do this:
35
+ # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
36
+ # here", and drop the tag. This particular fragment must be parsed "in the context" of a
37
+ # table in order to parse properly.
35
38
  #
36
- # ``` ruby
37
- # Nokogiri::HTML5::DocumentFragment.new(
38
- # Nokogiri::HTML5::Document.new,
39
- # "<td>foo</td>",
40
- # "table" # this is the context node
41
- # ).to_html
42
- # # => "<tbody><tr><td>foo</td></tr></tbody>"
43
- # ```
39
+ # Thankfully, libgumbo and Nokogiri allow us to set the context node:
44
40
  #
45
- # This is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
46
- # _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
47
- # the `<td>` tag must be wrapped in `<tbody><tr>` tags.
41
+ # ``` ruby
42
+ # Nokogiri::HTML5::DocumentFragment.new(
43
+ # Nokogiri::HTML5::Document.new,
44
+ # "<td>foo</td>",
45
+ # "table" # <--- this is the context node
46
+ # ).to_html
47
+ # # => "<tbody><tr><td>foo</td></tr></tbody>"
48
+ # ```
48
49
  #
49
- # We can narrow down the result set with an XPath query to get back only the intended tags:
50
+ # This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
51
+ # there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
52
+ # parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
50
53
  #
51
- # ``` ruby
52
- # Nokogiri::HTML5::DocumentFragment.new(
53
- # Nokogiri::HTML5::Document.new,
54
- # "<td>foo</td>",
55
- # "table" # this is the context node
56
- # ).xpath("tbody/tr/*").to_html
57
- # # => "<td>foo</td>"
58
- # ```
54
+ # We can fix this to only return the tags we provided by using the `<template>` tag as the
55
+ # context node, which the HTML5 spec provides exactly for this purpose:
59
56
  #
60
- # Hurrah! This is precisely what Nokogiri::HTML5::Inference.parse does: make reasonable inferences
61
- # that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a
62
- # web developer might need in a view library.
57
+ # ``` ruby
58
+ # Nokogiri::HTML5::DocumentFragment.new(
59
+ # Nokogiri::HTML5::Document.new,
60
+ # "<td>foo</td>",
61
+ # "template" # <--- this is the context node
62
+ # ).to_html
63
+ # # => "<td>foo</td>"
64
+ # ```
65
+ #
66
+ # Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
67
+ #
68
+ # ``` ruby
69
+ # Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
70
+ # # => "<td>foo</td>"
71
+ # ```
63
72
  #
64
73
  module Inference
65
74
  # Tags that must be parsed in a specific HTML5 insertion mode, for which we must use a
66
75
  # context node.
67
76
  module ContextTags # :nodoc:
68
- TABLE = %w[thead tbody tfoot tr td th col colgroup caption].freeze
69
77
  HTML = %w[head body].freeze
70
78
  end
71
79
 
72
80
  # Regular expressions used to determine if we need to use a context node.
73
81
  module ContextRegexp # :nodoc:
74
82
  DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
75
- TABLE = /\A\s*<(#{ContextTags::TABLE.join("|")})\b/i
76
83
  HTML = /\A\s*<(#{ContextTags::HTML.join("|")})\b/i
77
84
  end
78
85
 
79
- # Tags that get an intermediate parent created for them according to the HTML5 spec.
80
- module PluckTags # :nodoc:
81
- TBODY = %w[tr].freeze
82
- TBODY_TR = %w[td th].freeze
83
- COLGROUP = %w[col].freeze
84
- end
85
-
86
86
  # Regular expressions used to determine if we will need to skip an intermediate parent or
87
87
  # otherwise narrow the fragment DOM that is returned.
88
88
  module PluckRegexp # :nodoc:
89
- TBODY = /\A\s*<(#{PluckTags::TBODY.join("|")})\b/i
90
- TBODY_TR = /\A\s*<(#{PluckTags::TBODY_TR.join("|")})\b/i
91
- COLGROUP = /\A\s*<(#{PluckTags::COLGROUP.join("|")})\b/i
92
- HEAD_OUTER = /\A\s*<(head)\b/i
93
89
  BODY_OUTER = /\A\s*<(body)\b/i
94
90
  end
95
91
 
96
92
  class << self
97
93
  #
98
94
  # call-seq:
99
- # parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::HTML5::DocumentFragment | Nokogiri::XML::NodeSet)
95
+ # parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
100
96
  #
101
97
  # Based on the start of the input HTML5 string, guess whether it's a full document or a
102
98
  # fragment and, using the fragment context node if necessary, parse it properly and
@@ -114,14 +110,13 @@ else
114
110
  #
115
111
  # [Keyword Parameters]
116
112
  # - +pluck+ (Boolean) Default: +true+. Set to +false+ if you want the method to always
117
- # return <tt>DocumentFragment</tt>s as-parsed, without attempting to remove
118
- # intermediate parent nodes. This shouldn't be necessary if the library is working
119
- # properly, but may be useful to allow user to work around a bad guess.
113
+ # return what Nokogiri parsed, without attempting to remove any sibling or intermediate
114
+ # parent nodes. This shouldn't be necessary if the library is working properly, but may
115
+ # be useful to allow the user to work around a bad guess.
120
116
  #
121
117
  # [Returns]
122
118
  # - A +Nokogiri::HTML5::Document+ if the input appears to represent a full document.
123
- # - A +Nokogiri::HTML5::DocumentFragment+ or a +Nokogiri::XML::NodeSet+ if the input
124
- # appears to be a fragment.
119
+ # - A +Nokogiri::XML::NodeSet+ if the input appears to be a fragment.
125
120
  #
126
121
  def parse(input, pluck: true)
127
122
  context = Nokogiri::HTML5::Inference.context(input)
@@ -132,7 +127,7 @@ else
132
127
  if pluck && (path = pluck_path(input))
133
128
  fragment.xpath(path)
134
129
  else
135
- fragment
130
+ fragment.children
136
131
  end
137
132
  end
138
133
  end
@@ -154,9 +149,8 @@ else
154
149
  def context(input) # :nodoc:
155
150
  case input
156
151
  when ContextRegexp::DOCUMENT then nil
157
- when ContextRegexp::TABLE then "table"
158
152
  when ContextRegexp::HTML then "html"
159
- else "body"
153
+ else "template"
160
154
  end
161
155
  end
162
156
 
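The `context` hunk drops the table-specific branch and changes the fallback context node from `body` to `template`. As a rough standalone sketch of that change (not the gem's code: the method names `context_v011` and `context_v030` and the top-level constant names are hypothetical, and only the tag lists, regular expressions, and return values are copied from the diff), the two behaviours can be compared without loading Nokogiri:

```ruby
# Standalone sketch. Tag lists and regexes are copied from the diff;
# the constant and method names here are hypothetical.
TABLE_TAGS = %w[thead tbody tfoot tr td th col colgroup caption].freeze
HTML_TAGS = %w[head body].freeze

DOCUMENT_RE = /\A\s*(<!doctype\s+html\b|<html\b)/i
TABLE_RE = /\A\s*<(#{TABLE_TAGS.join("|")})\b/i
HTML_RE = /\A\s*<(#{HTML_TAGS.join("|")})\b/i

# Context selection following the removed lines (0.1.1).
def context_v011(input)
  case input
  when DOCUMENT_RE then nil
  when TABLE_RE then "table"
  when HTML_RE then "html"
  else "body"
  end
end

# Context selection following the added lines (0.3.0).
def context_v030(input)
  case input
  when DOCUMENT_RE then nil
  when HTML_RE then "html"
  else "template"
  end
end

["<!doctype html><html></html>", "<head></head>", "<td>foo</td>", "<div>hi</div>"].each do |input|
  puts "#{input.inspect}: #{context_v011(input).inspect} -> #{context_v030(input).inspect}"
end
```

In this sketch a `nil` context stands for "parse as a full document", matching the `DOCUMENT` branch in the hunk. Everything that is neither a document nor a `<head>`/`<body>` fragment now shares the single `template` context, which is why the table tag list is no longer needed.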
@@ -177,10 +171,6 @@ else
177
171
  #
178
172
  def pluck_path(input) # :nodoc:
179
173
  case input
180
- when PluckRegexp::TBODY then "tbody/*"
181
- when PluckRegexp::TBODY_TR then "tbody/tr/*"
182
- when PluckRegexp::COLGROUP then "colgroup/*"
183
- when PluckRegexp::HEAD_OUTER then "head"
184
174
  when PluckRegexp::BODY_OUTER then "body"
185
175
  end
186
176
  end
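The `pluck_path` hunk removes every branch except the `<body>` one. A minimal sketch of the before and after, assuming nothing beyond what the diff shows (the method names `pluck_path_v011` and `pluck_path_v030` are hypothetical; the regular expressions and XPath strings are copied from the removed and kept lines):

```ruby
# Standalone sketch; method names are hypothetical, patterns and paths come from the diff.

# Branches as they appear before the change (0.1.1).
def pluck_path_v011(input)
  case input
  when /\A\s*<(tr)\b/i then "tbody/*"
  when /\A\s*<(td|th)\b/i then "tbody/tr/*"
  when /\A\s*<(col)\b/i then "colgroup/*"
  when /\A\s*<(head)\b/i then "head"
  when /\A\s*<(body)\b/i then "body"
  end
end

# Branches as they appear after the change (0.3.0).
def pluck_path_v030(input)
  case input
  when /\A\s*<(body)\b/i then "body"
  end
end

["<tr><td>a</td></tr>", "<td>a</td>", "<col>", "<head></head>", "<body></body>", "<div></div>"].each do |input|
  puts "#{input.inspect}: #{pluck_path_v011(input).inspect} -> #{pluck_path_v030(input).inspect}"
end
```

A `case` with no matching branch evaluates to `nil`, so in the 0.3.0 shape every input except one starting with `<body>` yields no pluck path. Per the `parse` hunk, that case now returns `fragment.children` rather than the fragment itself. For input starting with `<head>`, this appears consistent with the v0.2.0 changelog entry about including the `<body>` tag in the result.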
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: nokogiri-html5-inference
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.1.1
4
+ version: 0.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Mike Dalessio
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2024-04-24 00:00:00.000000000 Z
11
+ date: 2024-05-05 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri