RubyGems - nokogiri-html5-inference - Versions diffs - 0.1.1 → 0.3.0 - Mend

nokogiri-html5-inference 0.1.1 → 0.3.0

Files changed (6)

checksums.yaml +4 -4
data/CHANGELOG.md +16 -3
data/README.md +35 -47
data/lib/nokogiri/html5/inference/version.rb +1 -1
data/lib/nokogiri/html5/inference.rb +55 -65
metadata +2 -2

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: '0942dd8a89c2d930794a10583ed84746da2c434af00ccb3c3d52c4932fa1b812'
-  data.tar.gz: c09d3bf45f24570c4c4d80571a16e35c7fdefa82be467fead21334194788938f
+  metadata.gz: '068c7d6b9bea47be40be5ac4429d673f9eed2d6666a0f30fa0ffa52d71bd7d96'
+  data.tar.gz: 0d265567d61bc23fd08997772651ec15c2438c25b969380e9bce82a63f961310
 SHA512:
-  metadata.gz: 5e931d3e3c4a3a516c046bf04ad352c58acd0118fca531fc3ce7f5a9ce3a93f1918ac96447c53814185e58d3d51c578ea4e45dbd15f68e9d290a696a6d6a0783
-  data.tar.gz: 49d6050376b44467248e3c7c85251add238773c954914d874c282108c5569249394852ce3028693fef547ea60e41346e2be328a105e9076106250449d1d02b29
+  metadata.gz: 15998fdfc4c3a2207ca150c47b4ec2d9d9c0012e1288634d092f1dc76d0c729c730c6a56a591d49ea303a38f4652b498940b338972f905f493e09b78d2fce879
+  data.tar.gz: 55ecdad7efa27e7580aa5fed39c5f9ce3860a6992582e7dfb21fb0a32c4ad355591af75c44e71a43327d75c4afe20ff8da832d5079fd39678f91df26ece89b06

data/CHANGELOG.md CHANGED

@@ -1,10 +1,23 @@
-## [Unreleased]
+## Unreleased
-## [0.1.1] - 2024-04-24
+## v0.3.0 - 2024-05-05
+- Use a `<template>` tag as the context node for the majority of fragment parsing, which greatly simplifies this gem. #7 @flavorjones @stevecheckoway
+- Clean up the README. @marcoroth
+- `Nokogiri::HTML5::Inference.parse` always returns a `Nokogiri::XML::Nodeset` for fragments. Previously this method sometimes returns a `Nokogiri::HTML5::DocumentFragment`, but some API inconsistencies between `DocumentFragment` and `NodeSet` made using the returned object tricky. We hope this provides a more consistent development experience. @flavorjones
+## v0.2.0 - 2024-04-26
+- When a `<head>` tag is seen first in the input string, include the `<body>` tag in the returned fragment or node set. (#3, #4) @flavorjones
+## v0.1.1 - 2024-04-24
 - Make protected methods `#context` and `#pluck_path` public, but keeping them undocumented.
-## [0.1.0] - 2024-04-24
+## v0.1.0 - 2024-04-24
 - Initial release

data/README.md CHANGED

@@ -1,73 +1,70 @@
-# Nokogiri::Html5::Inference
+# Nokogiri::HTML5::Inference
 Given HTML5 input, make a reasonable guess at how to parse it correctly.
-Infer from the HTML5 input whether it's a fragment or a document, and if it's a fragment what the proper context node should be. This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
+`Nokogiri::HTML5::Inference` makes reasonable inferences that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a web developer might need in a view library.
+This is useful for parsing trusted content like view snippets, particularly for morphing cases like StimulusReflex.
 ## The problem this library solves
-The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
-context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
-without knowing the parent node -- also called the "context node" -- in which it will be inserted.
+The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML without knowing the parent node -- also called the "context node" -- in which it will be inserted.
-Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
-["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
-but there are some notable exceptions. Perhaps the most problematic to web developers are the
-table-related tags, which will not be parsed properly unless the parser is in the
-["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
+Most content in an HTML5 document can be parsed assuming the parser's mode will be in the ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there are some notable exceptions. Perhaps the most problematic to web developers are the table-related tags, which will not be parsed properly unless the parser is in the ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
 For example:
 ``` ruby
 Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
-# => "foo"
+# => "foo" # where did the tag go!?
 ```
-In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
-and drop the tag. This fragment must be parsed "in the context" of a table in order to parse
-properly. Thankfully, libgumbo and Nokogiri allow us to do this:
+In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here", and drop the tag. This particular fragment must be parsed "in the context" of a table in order to parse properly.
+Thankfully, libgumbo and Nokogiri allow us to set the context node:
 ``` ruby
 Nokogiri::HTML5::DocumentFragment.new(
   Nokogiri::HTML5::Document.new,
   "<td>foo</td>",
-  "table"  # this is the context node
+  "table"  # <--- this is the context node
 ).to_html
 # => "<tbody><tr><td>foo</td></tr></tbody>"
 ```
-This is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
-_intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
-the `<td>` tag must be wrapped in `<tbody><tr>` tags.
+This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
-We can narrow down the result set with an XPath query to get back only the intended tags:
+We can fix this to only return the tags we provided by using the `<template>` tag as the context node, which the HTML5 spec provides exactly for this purpose:
 ``` ruby
 Nokogiri::HTML5::DocumentFragment.new(
   Nokogiri::HTML5::Document.new,
   "<td>foo</td>",
-  "table"  # this is the context node
-).xpath("tbody/tr/*").to_html
+  "template"  # <--- this is the context node
+).to_html
 # => "<td>foo</td>"
 ```
-Hurrah! This is precisely what Nokogiri::HTML5::Inference.parse does: make reasonable inferences
-that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a
-web developer might need in a view library.
+Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
+``` ruby
+Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
+# => "<td>foo</td>"
+```
 ## Usage
 Given an input String containing HTML5, infer the best way to parse it by calling `Nokogiri::HTML5::Inference.parse`.
-If the input is a document, you'll get a Nokogiri::HTML5::Document back:
+If the input is a document, you'll get a `Nokogiri::HTML5::Document` back:
 ``` ruby
 html = <<~HTML
   <!doctype html>
   <html lang="en">
     <head>
-      <meta encoding="UTF-8">
+      <meta charset="utf-8">
     </head>
     <body>
       <h1>Hello, world!</h1>
@@ -95,19 +92,7 @@ Nokogiri::HTML5::Inference.parse(html)
 #      })
 ```
-If the input is a fragment that is parsed normally, you'll either get a Nokogiri::HTML5::DocumentFragment back:
-``` ruby
-Nokogiri::HTML5::Inference.parse("<div>hello,</div><div>world!</div>")
-# => #(DocumentFragment:0x34f8 {
-#      name = "#document-fragment",
-#      children = [
-#        #(Element:0x3624 { name = "div", children = [ #(Text "hello,")] }),
-#        #(Element:0x3804 { name = "div", children = [ #(Text "world!")] })]
-#      })
-```
-or, if there are intermediate parent tags that need to be removed, you'll get a Nokogiri::XML::NodeSet:
+If the input is a fragment, you'll get back a `Nokogiri::XML::NodeSet`:
 ``` ruby
 Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
@@ -120,17 +105,17 @@ Nokogiri::HTML5::Inference.parse("<tr><td>hello</td><td>world!</td></tr>")
 #    ]
 ```
-All of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal
-methods like `#children`, and serialization methods like `#to_html`.
+Both of these return types respond to the same query methods like `#css` and `#xpath`, tree-traversal methods like `#children`, and serialization methods like `#to_html`.
 ## Caveats
-The implementation is currently pretty hacky and only looks at the first tag in the input to make
-decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
+The implementation is currently pretty hacky and only looks at the first tag in the input to make decisions. Nonetheless, it is a step forward from what Nokogiri and libgumbo do out-of-the-box.
 The implementation also is almost certainly incomplete, meaning there are HTML5 tags that aren't handled by this library as you might expect.
+This implementation is probably OK for handling untrusted content, but it's still new and I haven't really thought very hard about it yet. If you want to use it on untrusted content, open an issue and talk with us about your use case so we can help keep you secure!
 We would welcome bug reports and pull requests improving this library!
@@ -138,12 +123,15 @@ We would welcome bug reports and pull requests improving this library!
 Install the gem and add to the application's Gemfile by executing:
-    $ bundle add nokgiri-html5-inference
+```bash
+bundle add nokgiri-html5-inference
+```
 If bundler is not being used to manage dependencies, install the gem by executing:
-    $ gem install nokgiri-html5-inference
+```bash
+gem install nokgiri-html5-inference
+```
 ## Development
@@ -164,4 +152,4 @@ The gem is available as open source under the terms of the [MIT License](https:/
 ## Code of Conduct
-Everyone interacting in the Nokogiri::Html5::Inference project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/flavorjones/nokogiri-html5-inference/blob/main/CODE_OF_CONDUCT.md).
+Everyone interacting in the `Nokogiri::HTML5::Inference` project's codebases, issue trackers, chat rooms and mailing lists is expected to follow the [code of conduct](https://github.com/flavorjones/nokogiri-html5-inference/blob/main/CODE_OF_CONDUCT.md).
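
Note on the installation hunk above: the gemspec metadata in this diff names the gem `nokogiri-html5-inference`, while the README's install commands spell it `nokgiri-html5-inference`. A minimal Gemfile sketch using the name from the metadata might look like the following; the `source` line and the `~> 0.3.0` constraint are assumptions for illustration, not something the diff prescribes.

```ruby
# Gemfile (illustrative sketch; the source and version constraint are assumptions)
source "https://rubygems.org"

# Gem name taken from the gemspec metadata in this diff.
gem "nokogiri-html5-inference", "~> 0.3.0"
```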

data/lib/nokogiri/html5/inference/version.rb CHANGED

@@ -3,7 +3,7 @@
 module Nokogiri
   module HTML5
     module Inference
-      VERSION = "0.1.1"
+      VERSION = "0.3.0"
     end
   end
 end

data/lib/nokogiri/html5/inference.rb CHANGED

@@ -12,91 +12,87 @@ else
     module HTML5
       # :markup: markdown
       #
-      # The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very precise
-      # context-dependent parsing rules which can make it challenging to "just parse" a fragment of HTML
-      # without knowing the parent node -- also called the "context node" -- in which it will be inserted.
+      #  The [HTML5 Spec](https://html.spec.whatwg.org/multipage/parsing.html) defines some very
+      #  precise context-dependent parsing rules which can make it challenging to "just parse" a
+      #  fragment of HTML without knowing the parent node -- also called the "context node" -- in
+      #  which it will be inserted.
       #
-      # Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
-      # ["in body" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody),
-      # but there are some notable exceptions. Perhaps the most problematic to web developers are the
-      # table-related tags, which will not be parsed properly unless the parser is in the
-      # ["in table" insertion mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
+      #  Most content in an HTML5 document can be parsed assuming the parser's mode will be in the
+      #  ["in body" insertion
+      #  mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody), but there
+      #  are some notable exceptions. Perhaps the most problematic to web developers are the
+      #  table-related tags, which will not be parsed properly unless the parser is in the ["in
+      #  table" insertion
+      #  mode](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-intable).
       #
-      # For example:
+      #  For example:
       #
-      # ``` ruby
-      # Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
-      # # => "foo"
-      # ```
+      #  ``` ruby
+      #  Nokogiri::HTML5::DocumentFragment.parse("<td>foo</td>").to_html
+      #  # => "foo" # where did the tag go!?
+      #  ```
       #
-      # In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed here",
-      # and drop the tag. This fragment must be parsed "in the context" of a table in order to parse
-      # properly. Thankfully, libgumbo and Nokogiri allow us to do this:
+      #  In the default "in body" mode, the parser will log an error, "Start tag 'td' isn't allowed
+      #  here", and drop the tag. This particular fragment must be parsed "in the context" of a
+      #  table in order to parse properly.
       #
-      # ``` ruby
-      # Nokogiri::HTML5::DocumentFragment.new(
-      #   Nokogiri::HTML5::Document.new,
-      #   "<td>foo</td>",
-      #   "table"  # this is the context node
-      # ).to_html
-      # # => "<tbody><tr><td>foo</td></tr></tbody>"
-      # ```
+      #  Thankfully, libgumbo and Nokogiri allow us to set the context node:
       #
-      # This is _almost_ correct, but we're seeing another HTML5 parsing rule in action: there may be
-      # _intermediate parent tags_ that the HTML5 spec requires to be inserted by the parser. In this case,
-      # the `<td>` tag must be wrapped in `<tbody><tr>` tags.
+      #  ``` ruby
+      #  Nokogiri::HTML5::DocumentFragment.new(
+      #    Nokogiri::HTML5::Document.new,
+      #    "<td>foo</td>",
+      #    "table"  # <--- this is the context node
+      #  ).to_html
+      #  # => "<tbody><tr><td>foo</td></tr></tbody>"
+      #  ```
       #
-      # We can narrow down the result set with an XPath query to get back only the intended tags:
+      #  This result is _almost_ correct, but we're seeing another HTML5 parsing rule in action:
+      #  there may be _intermediate parent tags_ that the HTML5 spec requires to be inserted by the
+      #  parser. In this case, the `<td>` tag must be wrapped in `<tbody><tr>` tags.
       #
-      # ``` ruby
-      # Nokogiri::HTML5::DocumentFragment.new(
-      #   Nokogiri::HTML5::Document.new,
-      #   "<td>foo</td>",
-      #   "table"  # this is the context node
-      # ).xpath("tbody/tr/*").to_html
-      # # => "<td>foo</td>"
-      # ```
+      #  We can fix this to only return the tags we provided by using the `<template>` tag as the
+      #  context node, which the HTML5 spec provides exactly for this purpose:
       #
-      # Hurrah! This is precisely what Nokogiri::HTML5::Inference.parse does: make reasonable inferences
-      # that work for both HTML5 documents and HTML5 fragments, and for all the different HTML5 tags that a
-      # web developer might need in a view library.
+      #  ``` ruby
+      #  Nokogiri::HTML5::DocumentFragment.new(
+      #    Nokogiri::HTML5::Document.new,
+      #    "<td>foo</td>",
+      #    "template"  # <--- this is the context node
+      #  ).to_html
+      #  # => "<td>foo</td>"
+      #  ```
+      #
+      #  Huzzah! That works. And it's precisely what `Nokogiri::HTML5::Inference.parse` does:
+      #
+      #  ``` ruby
+      #  Nokogiri::HTML5::Inference.parse("<td>foo</td>").to_html
+      #  # => "<td>foo</td>"
+      #  ```
       #
       module Inference
         # Tags that must be parsed in a specific HTML5 insertion mode, for which we must use a
         # context node.
         module ContextTags # :nodoc:
-          TABLE = %w[thead tbody tfoot tr td th col colgroup caption].freeze
           HTML = %w[head body].freeze
         end
         # Regular expressions used to determine if we need to use a context node.
         module ContextRegexp # :nodoc:
           DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
-          TABLE = /\A\s*<(#{ContextTags::TABLE.join("|")})\b/i
           HTML = /\A\s*<(#{ContextTags::HTML.join("|")})\b/i
         end
-        # Tags that get an intermediate parent created for them according to the HTML5 spec.
-        module PluckTags # :nodoc:
-          TBODY = %w[tr].freeze
-          TBODY_TR = %w[td th].freeze
-          COLGROUP = %w[col].freeze
-        end
         # Regular expressions used to determine if we will need to skip an intermediate parent or
         # otherwise narrow the fragment DOM that is returned.
         module PluckRegexp # :nodoc:
-          TBODY = /\A\s*<(#{PluckTags::TBODY.join("|")})\b/i
-          TBODY_TR = /\A\s*<(#{PluckTags::TBODY_TR.join("|")})\b/i
-          COLGROUP = /\A\s*<(#{PluckTags::COLGROUP.join("|")})\b/i
-          HEAD_OUTER = /\A\s*<(head)\b/i
           BODY_OUTER = /\A\s*<(body)\b/i
         end
         class << self
           #
           #  call-seq:
-          #    parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::HTML5::DocumentFragment | Nokogiri::XML::NodeSet)
+          #    parse(input, pluck: true) => (Nokogiri::HTML5::Document | Nokogiri::XML::NodeSet)
           #
           #  Based on the start of the input HTML5 string, guess whether it's a full document or a
           #  fragment and, using the fragment context node if necessary, parse it properly and
@@ -114,14 +110,13 @@ else
           #
           #  [Keyword Parameters]
           #  - +pluck+ (Boolean) Default: +true+. Set to +false+ if you want the method to always
-          #    return <tt>DocumentFragment</tt>s as-parsed, without attempting to remove
-          #    intermediate parent nodes. This shouldn't be necessary if the library is working
-          #    properly, but may be useful to allow user to work around a bad guess.
+          #    return what Nokogiri parsed, without attempting to remove any sibling or intermediate
+          #    parent nodes. This shouldn't be necessary if the library is working properly, but may
+          #    be useful to allow user to work around a bad guess.
           #
           #  [Returns]
           #  - A +Nokogiri::HTML5::Document+ if the input appears to represent a full document.
-          #  - A +Nokogiri::HTML5::DocumentFragment+ or a +Nokogiri::XML::NodeSet+ if the input
-          #    appears to be a fragment.
+          #  - A +Nokogiri::XML::NodeSet+ if the input appears to be a fragment.
           #
           def parse(input, pluck: true)
             context = Nokogiri::HTML5::Inference.context(input)
@@ -132,7 +127,7 @@ else
               if pluck && (path = pluck_path(input))
                 fragment.xpath(path)
               else
-                fragment
+                fragment.children
               end
             end
           end
@@ -154,9 +149,8 @@ else
           def context(input) # :nodoc:
             case input
             when ContextRegexp::DOCUMENT then nil
-            when ContextRegexp::TABLE then "table"
             when ContextRegexp::HTML then "html"
-            else "body"
+            else "template"
             end
           end
@@ -177,10 +171,6 @@ else
           #
           def pluck_path(input) # :nodoc:
             case input
-            when PluckRegexp::TBODY then "tbody/*"
-            when PluckRegexp::TBODY_TR then "tbody/tr/*"
-            when PluckRegexp::COLGROUP then "colgroup/*"
-            when PluckRegexp::HEAD_OUTER then "head"
             when PluckRegexp::BODY_OUTER then "body"
             end
           end
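
To make the effect of the `context` and `pluck_path` hunks easier to see, here is a standalone sketch that puts the removed and the remaining branches side by side. The regular expressions and return values are copied from the diff. The module name `InferenceSketch` and the `_v0_1_1` / `_v0_3_0` method names are hypothetical and exist only for this comparison. The sketch does not load Nokogiri and does no parsing; it only shows which context node and which XPath each side of the diff would select.

```ruby
# Standalone comparison sketch. Regexes are copied from the diff above;
# InferenceSketch and the *_v0_1_1 / *_v0_3_0 names are hypothetical.
module InferenceSketch
  DOCUMENT = /\A\s*(<!doctype\s+html\b|<html\b)/i
  TABLE    = /\A\s*<(thead|tbody|tfoot|tr|td|th|col|colgroup|caption)\b/i
  HTML     = /\A\s*<(head|body)\b/i

  module_function

  # Removed side of the diff: table tags get a "table" context,
  # everything else falls back to "body".
  def context_v0_1_1(input)
    case input
    when DOCUMENT then nil
    when TABLE then "table"
    when HTML then "html"
    else "body"
    end
  end

  # Added side of the diff: "template" is the default context node.
  def context_v0_3_0(input)
    case input
    when DOCUMENT then nil
    when HTML then "html"
    else "template"
    end
  end

  # Removed side: XPath used to skip parser-inserted intermediate parents.
  def pluck_path_v0_1_1(input)
    case input
    when /\A\s*<(tr)\b/i then "tbody/*"
    when /\A\s*<(td|th)\b/i then "tbody/tr/*"
    when /\A\s*<(col)\b/i then "colgroup/*"
    when /\A\s*<(head)\b/i then "head"
    when /\A\s*<(body)\b/i then "body"
    end
  end

  # Added side: only the <body> case remains.
  def pluck_path_v0_3_0(input)
    case input
    when /\A\s*<(body)\b/i then "body"
    end
  end
end

InferenceSketch.context_v0_1_1("<td>foo</td>")     # => "table"
InferenceSketch.context_v0_3_0("<td>foo</td>")     # => "template"
InferenceSketch.pluck_path_v0_1_1("<td>foo</td>")  # => "tbody/tr/*"
InferenceSketch.pluck_path_v0_3_0("<td>foo</td>")  # => nil
```

With `template` as the context node, the table-specific pluck paths are no longer needed, which is consistent with the changelog entry saying the change "greatly simplifies this gem".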

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: nokogiri-html5-inference
 version: !ruby/object:Gem::Version
-  version: 0.1.1
+  version: 0.3.0
 platform: ruby
 authors:
 - Mike Dalessio
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2024-04-24 00:00:00.000000000 Z
+date: 2024-05-05 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: nokogiri