nokogiri-html5-inference 0.2.0 → 0.3.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA256:
3
- metadata.gz: bd21e524361107504a10fa58bd991b9b89bd40f968b7550b90d38454156e65c2
4
- data.tar.gz: ddbedefb2798713c3ae46d27df75efa5c0bfe10d1782dc419d8587dd6a155a74
3
+ metadata.gz: '068c7d6b9bea47be40be5ac4429d673f9eed2d6666a0f30fa0ffa52d71bd7d96'
4
+ data.tar.gz: 0d265567d61bc23fd08997772651ec15c2438c25b969380e9bce82a63f961310
5
5
  SHA512:
6
- metadata.gz: c46997b8c93033fb53e5c2b141a34bdbe93d6c1f55a3cd3de76abd6116c807b60ae2a26dfb02e8ce06adf441308fbf0e2522eb0b302ef11fedb1231c33df0300
7
- data.tar.gz: e24dc98db23641a8edb55824a7a4c0f7d858f6c0d2fe99f854a56c62ebc97a8e4f0f79df849798bb2097b9e0253213ebbbf3adbd988a3580b55ca891f8375a63
6
+ metadata.gz: 15998fdfc4c3a2207ca150c47b4ec2d9d9c0012e1288634d092f1dc76d0c729c730c6a56a591d49ea303a38f4652b498940b338972f905f493e09b78d2fce879
7
+ data.tar.gz: 55ecdad7efa27e7580aa5fed39c5f9ce3860a6992582e7dfb21fb0a32c4ad355591af75c44e71a43327d75c4afe20ff8da832d5079fd39678f91df26ece89b06
data/CHANGELOG.md CHANGED
@@ -1,15 +1,23 @@
1
- ## [Unreleased]
1
+ ## Unreleased
2
2
 
3
- ## [0.2.0] - 2024-04-26
3
+
4
+ ## v0.3.0 - 2024-05-05
5
+
6
+ - Use a `<template>` tag as the context node for the majority of fragment parsing, which greatly simplifies this gem. #7 @flavorjones @stevecheckoway
7
+ - Clean up the README. @marcoroth
8
+ - `Nokogiri::HTML5::Inference.parse` always returns a `Nokogiri::XML::Nodeset` for fragments. Previously this method sometimes returns a `Nokogiri::HTML5::DocumentFragment`, but some API inconsistencies between `DocumentFragment` and `NodeSet` made using the returned object tricky. We hope this provides a more consistent development experience. @flavorjones
9
+
10
+
11
+ ## v0.2.0 - 2024-04-26
4
12
 
5
13
  - When a `<head>` tag is seen first in the input string, include the `<body>` tag in the returned fragment or node set. (#3, #4) @flavorjones
6
14
 
7
15
 
8
- ## [0.1.1] - 2024-04-24
16
+ ## v0.1.1 - 2024-04-24
9
17
 
10
18
  - Make protected methods `#context` and `#pluck_path` public, but keeping them undocumented.
11
19
 
12
20
 
13
- ## [0.1.0] - 2024-04-24
21
+ ## v0.1.0 - 2024-04-24
14
22
 
15
23
  - Initial release
data/README.md CHANGED
@@ -1,23 +1,16 @@
1
- # Nokogiri::Html5::Inference
1
+ # Nokogiri::HTML5::Inference
2
2
 
3
3
  Given HTML5 input, make a reasonable guess at how to parse it correctly.
4
4
 
5
- Nokogiri::HTML5::Inference makes reasonable inferences that work for both HTML5 documents and HTML5
6
- fragments, and for all the different HTML5 tags that a web developer might need in a view library.
5
+ `Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a web developer might need in a view library.
7
6
 
8
7
  This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
9
8
 
10
9
  ## The problem this library solves
11
10
 
12
- The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
13
- context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
14
- without knowing the parent node -- also called the "context node" -- in which it will be inserted.
11
+ The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML without knowing the parent node -- also called the "context node" -- in which it will be inserted.
15
12
 
16
- Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
17
- ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
18
- but there are some notable exceptions. Perhaps the most problematic to web developers are the
19
- table-related tags, which will not be parsed properly unless the parser is in the
20
- ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
13
+ Most content in an HTML5 document can be parsed assuming the parser's mode will be in the ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there are some notable exceptions. Perhaps the most problematic to web developers are the table-related tags, which will not be parsed properly unless the parser is in the ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
21
14
 
22
15
  For example:
23
16
 
@@ -26,9 +19,7 @@ Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
26
19
  # => "foo" # where did the tag go!?
27
20
  ```
28
21
 
29
- In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
30
- and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
31
- parse properly.
22
+ In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here", and drop the tag. This particular fragment must be parsed "in the context" of a table in order to parse properly.
32
23
 
33
24
  Thankfully, libgumbo and Nokogiri allow us to set the context node:
34
25
 
@@ -41,22 +32,20 @@ Nokogiri::HTML5::DocumentFragment.new(
41
32
  # => "<tbody><tr><td>foo</td></tr></tbody>"
42
33
  ```
43
34
 
44
- This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
45
- _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
46
- the `<td>` tag must be wrapped in `<tbody><tr>` tags.
35
+ This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
47
36
 
48
- We can narrow down the result set with an XPath query to get back only the intended tags:
37
+ We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:
49
38
 
50
39
  ``` ruby
51
40
  Nokogiri::HTML5::DocumentFragment.new(
52
41
  Nokogiri::HTML5::Document.new,
53
42
  "<td>foo</td>",
54
- "table" # this is the context node
55
- ).xpath("tbody/tr/*").to_html
43
+ "template" # <--- this is the context node
44
+ ).to_html
56
45
  # => "<td>foo</td>"
57
46
  ```
58
47
 
59
- Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
48
+ Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
60
49
 
61
50
  ``` ruby
62
51
  Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
@@ -68,7 +57,7 @@ Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
68
57
 
69
58
  Given an input String containing HTML5, infer the best way to parse it by calling `Nokogiri::HTML5::Inference.parse`.
70
59
 
71
- If the input is a document, you'll get a Nokogiri::HTML5::Document back:
60
+ If the input is a document, you'll get a `Nokogiri::HTML5::Document` back:
72
61
 
73
62
  ``` ruby
74
63
  html = <<~HTML
@@ -103,19 +92,7 @@ Nokogiri::HTML5::Inference.parse(html)
103
92
  # })
104
93
  ```
105
94
 
106
- If the input is a fragment that is parsed normally, you'll either get a Nokogiri::HTML5::DocumentFragment back:
107
-
108
- ``` ruby
109
- Nokogiri::HTML5::Inference.parse("<div>hello,</div><div>world!</div>")
110
- # => #(DocumentFragment:0x34f8 {
111
- # name = "#document-fragment",
112
- # children = [
113
- # #(Element:0x3624 { name = "div", children = [ #(Text "hello,")] }),
114
- # #(Element:0x3804 { name = "div", children = [ #(Text "world!")] })]
115
- # })
116
- ```
117
-
118
- or, if there are intermediate parent tags that need to be removed, you'll get a Nokogiri::XML::NodeSet:
95
+ If the input is a fragment, you'll get back a `Nokogiri::XML::NodeSet`:
119
96
 
120
97
  ``` ruby
121
98
  Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
@@ -128,14 +105,12 @@ Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
128
105
  # ]
129
106
  ```
130
107
 
131
- All of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal
132
- methods like `#children`, and serialization methods like `#to_html`.
108
+ Both of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal methods like `#children`, and serialization methods like `#to_html`.
133
109
 
134
110
 
135
111
  ## Caveats
136
112
 
137
- The implementation is currently pretty hacky and only looks at the first tag in the input to make
138
- decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
113
+ The implementation is currently pretty hacky and only looks at the first tag in the input to make decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
139
114
 
140
115
  The implementation also is almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.
141
116
 
@@ -148,12 +123,15 @@ We would welcome bug reports and pull requests improving this library!
148
123
 
149
124
  Install the gem and add to the application's Gemfile by executing:
150
125
 
151
- $ bundle add nokgiri-html5-inference
126
+ ```bash
127
+ bundle add nokgiri-html5-inference
128
+ ```
152
129
 
153
130
  If bundler is not being used to manage dependencies, install the gem by executing:
154
131
 
155
- $ gem install nokgiri-html5-inference
156
-
132
+ ```bash
133
+ gem install nokgiri-html5-inference
134
+ ```
157
135
 
158
136
  ## Development
159
137
 
@@ -174,4 +152,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
174
152
 
175
153
  ## Code of Conduct
176
154
 
177
- Everyone interacting in the Nokogiri::Html5::Inference project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/flavorjones/nokogiri-html5-inference/blob/main/CODE_OF_CONDUCT.md).
155
+ Everyone interacting in the `Nokogiri::HTML5::Inference` project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/flavorjones/nokogiri-html5-inference/blob/main/CODE_OF_CONDUCT.md).
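The README's caveat that the library "only looks at the first tag in the input" can be illustrated without Nokogiri, using only the two regular expressions visible on the 0.3.0 side of this diff. The helper name `first_tag_guess` and its symbol return values are hypothetical, introduced only for this sketch; the gem itself returns a context string or `nil`.

```ruby
# Regular expressions as they appear on the 0.3.0 side of the diff
# (the HTML one is written out instead of being built from ContextTags::HTML).
DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
HTML = /\A\s*<(head|body)\b/i

# Hypothetical helper, not part of the gem: names the branch an input would take.
def first_tag_guess(input)
  case input
  when DOCUMENT then :document          # parsed as a full document
  when HTML then :html_context          # fragment parsed with an "html" context
  else :template_context                # fragment parsed with a "template" context
  end
end

# Leading whitespace is skipped, and matching is case-insensitive.
puts first_tag_guess("  <!DOCTYPE html><html></html>")  # document
# Table cells no longer need a special case.
puts first_tag_guess("<td>foo</td>")                    # template_context
# Anything before the first tag, such as a comment, hides a document from the guess.
puts first_tag_guess("<!-- note --><html></html>")      # template_context
# The word boundary keeps <header> from being mistaken for <head>.
puts first_tag_guess("<header>hi</header>")             # template_context
```

The third case is the kind of "bad guess" the `pluck: false` keyword and the Caveats section warn about: the decision is made purely from the start of the string.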
@@ -3,7 +3,7 @@
3
3
  module Nokogiri
4
4
  module HTML5
5
5
  module Inference
6
- VERSION = "0.2.0"
6
+ VERSION = "0.3.0"
7
7
  end
8
8
  end
9
9
  end
@@ -12,15 +12,18 @@ else
12
12
  module HTML5
13
13
  # :markup: markdown
14
14
  #
15
- # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
16
- # context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
17
- # without knowing the parent node -- also called the "context node" -- in which it will be inserted.
15
+ # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
16
+ # precise context-dependent parsing rules which can make it challenging to "just parse" a
17
+ # fragment of HTML without knowing the parent node -- also called the "context node" -- in
18
+ # which it will be inserted.
18
19
  #
19
20
  # Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
20
- # ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
21
- # but there are some notable exceptions. Perhaps the most problematic to web developers are the
22
- # table-related tags, which will not be parsed properly unless the parser is in the
23
- # ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
21
+ # ["in body" insertion
22
+ # mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there
23
+ # are some notable exceptions. Perhaps the most problematic to web developers are the
24
+ # table-related tags, which will not be parsed properly unless the parser is in the ["in
25
+ # table" insertion
26
+ # mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
24
27
  #
25
28
  # For example:
26
29
  #
@@ -29,9 +32,9 @@ else
29
32
  # # => "foo" # where did the tag go!?
30
33
  # ```
31
34
  #
32
- # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
33
- # and drop the tag. This particular fragment must be parsed "in the context" of a table in order to
34
- # parse properly.
35
+ # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
36
+ # here", and drop the tag. This particular fragment must be parsed "in the context" of a
37
+ # table in order to parse properly.
35
38
  #
36
39
  # Thankfully, libgumbo and Nokogiri allow us to set the context node:
37
40
  #
@@ -44,63 +47,52 @@ else
44
47
  # # => "<tbody><tr><td>foo</td></tr></tbody>"
45
48
  # ```
46
49
  #
47
- # This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
48
- # _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
49
- # the `<td>` tag must be wrapped in `<tbody><tr>` tags.
50
+ # This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
51
+ # there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
52
+ # parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
50
53
  #
51
- # We can narrow down the result set with an XPath query to get back only the intended tags:
54
+ # We can fix this to only return the tags we provided by using the `<template>` tag as the
55
+ # context node, which the HTML5 spec provides exactly for this purpose:
52
56
  #
53
57
  # ``` ruby
54
58
  # Nokogiri::HTML5::DocumentFragment.new(
55
59
  # Nokogiri::HTML5::Document.new,
56
60
  # "<td>foo</td>",
57
- # "table" # this is the context node
58
- # ).xpath("tbody/tr/*").to_html
61
+ # "template" # <--- this is the context node
62
+ # ).to_html
59
63
  # # => "<td>foo</td>"
60
64
  # ```
61
65
  #
62
- # Huzzah! That works. And it's precisely what Nokogiri::HTML5::Inference.parse does:
66
+ # Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
63
67
  #
64
68
  # ``` ruby
65
69
  # Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
66
70
  # # => "<td>foo</td>"
67
71
  # ```
72
+ #
68
73
  module Inference
69
74
  # Tags that must be parsed in a specific HTML5 insertion mode, for which we must use a
70
75
  # context node.
71
76
  module ContextTags # :nodoc:
72
- TABLE = %w[thead tbody tfoot tr td th col colgroup caption].freeze
73
77
  HTML = %w[head body].freeze
74
78
  end
75
79
 
76
80
  # Regular expressions used to determine if we need to use a context node.
77
81
  module ContextRegexp # :nodoc:
78
82
  DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
79
- TABLE = /\A\s*<(#{ContextTags::TABLE.join("|")})\b/i
80
83
  HTML = /\A\s*<(#{ContextTags::HTML.join("|")})\b/i
81
84
  end
82
85
 
83
- # Tags that get an intermediate parent created for them according to the HTML5 spec.
84
- module PluckTags # :nodoc:
85
- TBODY = %w[tr].freeze
86
- TBODY_TR = %w[td th].freeze
87
- COLGROUP = %w[col].freeze
88
- end
89
-
90
86
  # Regular expressions used to determine if we will need to skip an intermediate parent or
91
87
  # otherwise narrow the fragment DOM that is returned.
92
88
  module PluckRegexp # :nodoc:
93
- TBODY = /\A\s*<(#{PluckTags::TBODY.join("|")})\b/i
94
- TBODY_TR = /\A\s*<(#{PluckTags::TBODY_TR.join("|")})\b/i
95
- COLGROUP = /\A\s*<(#{PluckTags::COLGROUP.join("|")})\b/i
96
- HTML_INNER = /\A\s*<(head)\b/i
97
89
  BODY_OUTER = /\A\s*<(body)\b/i
98
90
  end
99
91
 
100
92
  class << self
101
93
  #
102
94
  # call-seq:
103
- # parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::HTML5::DocumentFragment | Nokogiri::XML::NodeSet)
95
+ # parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
104
96
  #
105
97
  # Based on the start of the input HTML5 string, guess whether it's a full document or a
106
98
  # fragment and, using the fragment context node if necessary, parse it properly and
@@ -118,14 +110,13 @@ else
118
110
  #
119
111
  # [Keyword Parameters]
120
112
  # - +pluck+ (Boolean) Default: +true+. Set to +false+ if you want the method to always
121
- # return <tt>DocumentFragment</tt>s as-parsed, without attempting to remove
122
- # intermediate parent nodes. This shouldn't be necessary if the library is working
123
- # properly, but may be useful to allow user to work around a bad guess.
113
+ # return what Nokogiri parsed, without attempting to remove any sibling or intermediate
114
+ # parent nodes. This shouldn't be necessary if the library is working properly, but may
115
+ # be useful to allow user to work around a bad guess.
124
116
  #
125
117
  # [Returns]
126
118
  # - A +Nokogiri::HTML5::Document+ if the input appears to represent a full document.
127
- # - A +Nokogiri::HTML5::DocumentFragment+ or a +Nokogiri::XML::NodeSet+ if the input
128
- # appears to be a fragment.
119
+ # - A +Nokogiri::XML::NodeSet+ if the input appears to be a fragment.
129
120
  #
130
121
  def parse(input, pluck: true)
131
122
  context = Nokogiri::HTML5::Inference.context(input)
@@ -136,7 +127,7 @@ else
136
127
  if pluck && (path = pluck_path(input))
137
128
  fragment.xpath(path)
138
129
  else
139
- fragment
130
+ fragment.children
140
131
  end
141
132
  end
142
133
  end
@@ -158,9 +149,8 @@ else
158
149
  def context(input) # :nodoc:
159
150
  case input
160
151
  when ContextRegexp::DOCUMENT then nil
161
- when ContextRegexp::TABLE then "table"
162
152
  when ContextRegexp::HTML then "html"
163
- else "body"
153
+ else "template"
164
154
  end
165
155
  end
166
156
 
@@ -181,10 +171,6 @@ else
181
171
  #
182
172
  def pluck_path(input) # :nodoc:
183
173
  case input
184
- when PluckRegexp::TBODY then "tbody/*"
185
- when PluckRegexp::TBODY_TR then "tbody/tr/*"
186
- when PluckRegexp::COLGROUP then "colgroup/*"
187
- when PluckRegexp::HTML_INNER then "./*"
188
174
  when PluckRegexp::BODY_OUTER then "body"
189
175
  end
190
176
  end
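To see what this hunk set changes in practice, the sketch below reconstructs the context and pluck-path decisions from the removed (0.2.0) and added (0.3.0) lines above and runs them side by side. It is a standalone approximation that does not load Nokogiri or the gem; the `_020` and `_030` method suffixes and the `_RE` constant names are hypothetical, and the regular expressions are written out rather than interpolated from the tag lists.

```ruby
# Patterns reconstructed from the diff; names are illustrative only.
TABLE_TAGS = %w[thead tbody tfoot tr td th col colgroup caption].freeze
DOCUMENT_RE = /\A\s*(<!doctype\s+html\b|<html\b)/i
TABLE_RE = /\A\s*<(#{TABLE_TAGS.join("|")})\b/i
HTML_RE = /\A\s*<(head|body)\b/i

# 0.2.0: table-related tags get a "table" context, everything else "body".
def context_020(input)
  case input
  when DOCUMENT_RE then nil
  when TABLE_RE then "table"
  when HTML_RE then "html"
  else "body"
  end
end

# 0.2.0: XPath used to strip the intermediate parents the parser inserted.
def pluck_path_020(input)
  case input
  when /\A\s*<(tr)\b/i then "tbody/*"
  when /\A\s*<(td|th)\b/i then "tbody/tr/*"
  when /\A\s*<(col)\b/i then "colgroup/*"
  when /\A\s*<(head)\b/i then "./*"
  when /\A\s*<(body)\b/i then "body"
  end
end

# 0.3.0: "template" replaces both the "table" and the "body" contexts.
def context_030(input)
  case input
  when DOCUMENT_RE then nil
  when HTML_RE then "html"
  else "template"
  end
end

# 0.3.0: only the <body> case still needs plucking.
def pluck_path_030(input)
  "body" if input.match?(/\A\s*<(body)\b/i)
end

["<td>foo</td>", "<tr></tr>", "<div>hi</div>", "<head></head>", "<body></body>"].each do |html|
  old = [context_020(html), pluck_path_020(html)].inspect
  new = [context_030(html), pluck_path_030(html)].inspect
  puts "#{html.ljust(16)} 0.2.0: #{old.ljust(24)} 0.3.0: #{new}"
end
```

For `<td>foo</td>`, the 0.2.0 branch yields `["table", "tbody/tr/*"]` while the 0.3.0 branch yields `["template", nil]`, which is why `parse` now falls through to `fragment.children` for most fragments.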
metadata CHANGED
@@ -1,14 +1,14 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: nokogiri-html5-inference
3
3
  version: !ruby/object:Gem::Version
4
- version: 0.2.0
4
+ version: 0.3.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Mike Dalessio
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2024-04-26 00:00:00.000000000 Z
11
+ date: 2024-05-05 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri