RubyGems - sanitize - Versions diffs - 4.6.6 → 5.0.0 - Mend

sanitize 4.6.6 → 5.0.0

Potentially problematic release.

Files changed (17)

checksums.yaml +4 -4
data/HISTORY.md +56 -0
data/README.md +4 -4
data/lib/sanitize.rb +2 -44
data/lib/sanitize/config/default.rb +6 -4
data/lib/sanitize/transformers/clean_element.rb +44 -3
data/lib/sanitize/version.rb +1 -1
data/test/test_clean_comment.rb +1 -5
data/test/test_clean_css.rb +1 -1
data/test/test_clean_doctype.rb +8 -8
data/test/test_clean_element.rb +108 -23
data/test/test_malicious_html.rb +15 -6
data/test/test_parser.rb +2 -31
data/test/test_sanitize.rb +4 -4
data/test/test_transformers.rb +4 -4
data/test/test_unicode.rb +12 -12
metadata +12 -12

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: c5672f967be01303dd78eba5c0a1ab45729d15b604e2f2cbb6108c69864ad5f6
-  data.tar.gz: 8ff91d1efafb67205b6ba07697d2c9f920e34df5e59f357433e54a6f9f0cca76
+  metadata.gz: c88243234986bc11c6e1da92e05f9ea153d6016f5e5c3c8e8ad6602b7225e07f
+  data.tar.gz: abf83048949361fbcaf7fdb1d03066c9787303ceee39c42d69a245d300bc4453
 SHA512:
-  metadata.gz: '0981c67f49e789e6ccb6becb2a5407ac3db48b96823f48bef3a284fcc8b2fe539545ec0db8f0449dc5db5039d35a7a193970e3d72c076f99152c001d87be8659'
-  data.tar.gz: 25af08d3f6524b70aaee67cab17a5e8568697be138954f1ff1e1bc8da591df5ccb405bc1212c24daeb81a3b2c4e659e56e60bdc76938b9a10e096449ba38b657
+  metadata.gz: f72364a3ec7939a07d30f681c58f4bd4bafa804dff0ecef69a8fb31b16d2e77439c4b1e18c756e370b756067e1bacd7bd8ea8943d447ad144396068da57798a2
+  data.tar.gz: 1ac997e7ae3f0ffc65d002e439b63bf755acda220bb295a7d648d474333e9d9747259f4cca2af715da8df9f425c17eb8a8148ba5cf12c91cbfee71a74da15eda
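
The digests above cover the `metadata.gz` and `data.tar.gz` members of the `.gem` archive. A minimal sketch of checking one of them with Ruby's standard `Digest` library follows; the helper name `sha256_matches?` and the file path in the usage comment are assumptions for illustration, not part of RubyGems or Sanitize.

```ruby
require 'digest'

# Hypothetical helper: true when the SHA256 of `bytes` equals the expected
# hex digest. The expected value is downcased so either case is accepted.
def sha256_matches?(bytes, expected_hex)
  Digest::SHA256.hexdigest(bytes) == expected_hex.downcase
end

# Usage sketch (assumes data.tar.gz was already extracted from the .gem file):
#   sha256_matches?(File.binread('data.tar.gz'),
#     'abf83048949361fbcaf7fdb1d03066c9787303ceee39c42d69a245d300bc4453')
```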

data/HISTORY.md CHANGED

@@ -1,5 +1,41 @@
 # Sanitize History
+## 5.0.0 (2018-10-14)
+For most users, upgrading from 4.x shouldn't require any changes. However, the
+minimum required Ruby version has changed, and Sanitize 5.x's HTML output may
+differ in some small ways from 4.x's output. If this matters to you, please
+review the changes below carefully.
+### Potentially Breaking Changes
+* Ruby 2.3.0 is now the oldest officially supported Ruby version. Sanitize may
+  work in older 2.x Rubies, but they aren't actively tested. Sanitize definitely
+  no longer works in Ruby 1.9.x.
+* Upgraded to Nokogumbo 2.x, which fixes various bugs and adds
+  standard-compliant HTML serialization. [@stevecheckoway - #189][189]
+* Children of the following elements are now removed by default when these
+  elements are removed, rather than being preserved and escaped:
+  - `iframe`
+  - `noembed`
+  - `noframes`
+  - `noscript`
+  - `script`
+  - `style`
+* Children of whitelisted `iframe` elements are now always removed. In modern
+  HTML, `iframe` elements should never have children. In HTML 4 and earlier
+  `iframe` elements were allowed to contain fallback content for legacy
+  browsers, but it's been almost two decades since that was useful.
+* Fixed a bug that caused `:remove_contents` to behave as if it were set to
+  `true` when it was actually an Array.
+[189]:https://github.com/rgrove/sanitize/pull/189
 ## 4.6.6 (2018-07-23)
 * Improved performance and memory usage by optimizing `Sanitize#transform_node!`
@@ -324,6 +360,26 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 [n1008]:https://github.com/sparklemotion/nokogiri/issues/1008
+## 2.1.1 (2018-09-30)
+* [CVE-2018-3740][176]: Fixed an HTML injection vulnerability that could allow
+  XSS (backported from Sanitize 4.6.3). [@dometto - #188][188]
+  When Sanitize <= 2.1.0 is used in combination with libxml2 >= 2.9.2, a
+  specially crafted HTML fragment can cause libxml2 to generate improperly
+  escaped output, allowing non-whitelisted attributes to be used on whitelisted
+  elements.
+  Sanitize now performs additional escaping on affected attributes to prevent
+  this.
+  Many thanks to the Shopify Application Security Team for responsibly reporting
+  this issue.
+[176]:https://github.com/rgrove/sanitize/issues/176
+[188]:https://github.com/rgrove/sanitize/pull/188
 ## 2.1.0 (2014-01-13)
 * Added support for whitelisting arbitrary HTML5 `data-*` attributes. Use the

data/README.md CHANGED

@@ -441,13 +441,13 @@ include the symbol `:relative` in the protocol array:
 #### :remove_contents (boolean or Array or Set)
-If set to `true`, Sanitize will remove the contents of any non-whitelisted
+If this is `true`, Sanitize will remove the contents of any non-whitelisted
 elements in addition to the elements themselves. By default, Sanitize leaves the
 safe parts of an element's contents behind when the element is removed.
-If set to an array of element names, then only the contents of the specified
-elements (when filtered) will be removed, and the contents of all other filtered
-elements will be left behind.
+If this is an Array or Set of element names, then only the contents of the
+specified elements (when filtered) will be removed, and the contents of all
+other filtered elements will be left behind.
 The default value is `false`.

data/lib/sanitize.rb CHANGED

@@ -121,19 +121,7 @@ class Sanitize
     return '' unless html
     html = preprocess(html)
-    doc  = Nokogiri::HTML5.parse("<html><body>#{html}")
-    # Hack to allow fragments containing <body>. Borrowed from
-    # Nokogiri::HTML::DocumentFragment.
-    if html =~ /\A<body(?:\s|>)/i
-      path = '/html/body'
-    else
-      path = '/html/body/node()'
-    end
-    frag = doc.fragment
-    frag << doc.xpath(path)
+    frag  = Nokogiri::HTML5.fragment(html)
     node!(frag)
     to_html(frag)
   end
@@ -184,37 +172,7 @@ class Sanitize
   end
   def to_html(node)
-    replace_meta = false
-    # Hacky workaround for a libxml2 bug that adds an undesired Content-Type
-    # meta tag to all serialized HTML documents.
-    #
-    # https://github.com/sparklemotion/nokogiri/issues/1008
-    if node.type == Nokogiri::XML::Node::DOCUMENT_NODE ||
-        node.type == Nokogiri::XML::Node::HTML_DOCUMENT_NODE
-      regex_meta   = %r|(<html[^>]*>\s*<head[^>]*>\s*)<meta http-equiv="Content-Type" content="text/html; charset=utf-8">|i
-      # Only replace the content-type meta tag if <meta> isn't whitelisted or
-      # the original document didn't actually include a content-type meta tag.
-      replace_meta = !@config[:elements].include?('meta') ||
-        node.xpath('/html/head/meta[@http-equiv]').none? do |meta|
-          meta['http-equiv'].casecmp('content-type').zero?
-        end
-    end
-    so = Nokogiri::XML::Node::SaveOptions
-    # Serialize to HTML without any formatting to prevent Nokogiri from adding
-    # newlines after certain tags.
-    html = node.to_html(
-      :encoding  => 'utf-8',
-      :indent    => 0,
-      :save_with => so::NO_DECLARATION | so::NO_EMPTY_TAGS | so::AS_HTML
-    )
-    html.gsub!(regex_meta, '\1') if replace_meta
-    html
+    node.to_html(preserve_newline: true)
   end
   def transform_node!(node, node_whitelist)

data/lib/sanitize/config/default.rb CHANGED

@@ -66,10 +66,12 @@ class Sanitize
       # leaves the safe parts of an element's contents behind when the element
       # is removed.
       #
-      # If this is an Array of element names, then only the contents of the
-      # specified elements (when filtered) will be removed, and the contents of
-      # all other filtered elements will be left behind.
-      :remove_contents => false,
+      # If this is an Array or Set of element names, then only the contents of
+      # the specified elements (when filtered) will be removed, and the contents
+      # of all other filtered elements will be left behind.
+      :remove_contents => %w[
+        iframe noembed noframes noscript script style
+      ],
       # Transformers allow you to filter or alter nodes using custom logic. See
       # README.md for details and examples.

data/lib/sanitize/transformers/clean_element.rb CHANGED

@@ -67,7 +67,7 @@ class Sanitize; module Transformers; class CleanElement
       @whitespace_elements = config[:whitespace_elements]
     end
-    if config[:remove_contents].is_a?(Set)
+    if config[:remove_contents].is_a?(Enumerable)
       @remove_element_contents.merge(config[:remove_contents].map(&:to_s))
     else
       @remove_all_contents = !!config[:remove_contents]
@@ -97,8 +97,10 @@ class Sanitize; module Transformers; class CleanElement
         end
       end
-      unless @remove_all_contents || @remove_element_contents.include?(name)
-        node.add_previous_sibling(node.children)
+      unless node.children.empty?
+        unless @remove_all_contents || @remove_element_contents.include?(name)
+          node.add_previous_sibling(node.children)
+        end
       end
       node.unlink
@@ -166,6 +168,11 @@ class Sanitize; module Transformers; class CleanElement
         # affected attributes, some of which can exist on any element and some
         # of which can only exist on `<a>` elements.
         #
+        # This fix is technically no longer necessary with Nokogumbo >= 2.0
+        # since it no longer uses libxml2's serializer, but it's retained to
+        # avoid breaking use cases where people might be sanitizing individual
+        # Nokogiri nodes and then serializing them manually without Nokogumbo.
+        #
         # The relevant libxml2 code is here:
         # <https://github.com/GNOME/libxml2/commit/960f0e275616cadc29671a218d7fb9b69eb35588>
         if UNSAFE_LIBXML_ATTRS_GLOBAL.include?(attr_name) ||
@@ -180,6 +187,40 @@ class Sanitize; module Transformers; class CleanElement
     if @add_attributes.include?(name)
       @add_attributes[name].each {|key, val| node[key] = val }
     end
+    # Element-specific special cases.
+    case name
+    # If this is a whitelisted iframe that has children, remove all its
+    # children. The HTML standard says iframes shouldn't have content, but when
+    # they do, this content is parsed as text and is serialized verbatim without
+    # being escaped, which is unsafe because legacy browsers may still render it
+    # and execute `<script>` content. So the safe and correct thing to do is to
+    # always remove iframe content.
+    when 'iframe'
+      if !node.children.empty?
+        node.children.each do |child|
+          child.unlink
+        end
+      end
+    # Prevent the use of `<meta>` elements that set a charset other than UTF-8,
+    # since Sanitize's output is always UTF-8.
+    when 'meta'
+      if node.has_attribute?('charset') &&
+          node['charset'].downcase != 'utf-8'
+        node['charset'] = 'utf-8'
+      end
+      if node.has_attribute?('http-equiv') &&
+          node.has_attribute?('content') &&
+          node['http-equiv'].downcase == 'content-type' &&
+          node['content'].downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
+        node['content'] = node['content'].gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
+      end
+    end
   end
 end; end; end
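
The `content` rewrite in the `meta` branch above can be exercised without Nokogiri. This is a minimal sketch that reuses the two regular expressions from the diff on a plain string; `force_utf8_content` is a hypothetical name, and the sample inputs are taken from the new tests in test_clean_element.rb.

```ruby
# Hypothetical helper: rewrites a Content-Type value so that any declared
# charset other than utf-8 becomes utf-8, using the regexes from the diff.
def force_utf8_content(content)
  if content.downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
    # Replace everything from the charset parameter to the end of the string.
    content.gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
  else
    content
  end
end

rewritten = force_utf8_content(' text/html; charset=us-ascii')
untouched = force_utf8_content('text/html;charset=utf-8')
```

Note that the replacement drops any whitespace around the `charset` parameter, which is why the expected test output reads `text/html;charset=utf-8` with no space after the semicolon.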

data/lib/sanitize/version.rb CHANGED

@@ -1,5 +1,5 @@
 # encoding: utf-8
 class Sanitize
-  VERSION = '4.6.6'
+  VERSION = '5.0.0'
 end

data/test/test_clean_comment.rb CHANGED

@@ -20,7 +20,7 @@ describe 'Sanitize::Transformers::CleanComment' do
       # Special case: the comment markup is inside a <script>, which makes it
       # text content and not an actual HTML comment.
-      @s.fragment("<script><!-- comment --></script>").must_equal '&lt;!-- comment --&gt;'
+      @s.fragment("<script><!-- comment --></script>").must_equal ''
       Sanitize.fragment("<script><!-- comment --></script>", :allow_comments => false, :elements => ['script'])
         .must_equal '<script><!-- comment --></script>'
@@ -40,10 +40,6 @@ describe 'Sanitize::Transformers::CleanComment' do
       @s.fragment("foo <!-- <!-- <!-- --> --> -->bar").must_equal 'foo <!-- <!-- <!-- --> --&gt; --&gt;bar'
       @s.fragment("foo <div <!-- comment -->>bar</div>").must_equal 'foo <div>&gt;bar</div>'
-      # Special case: the comment markup is inside a <script>, which makes it
-      # text content and not an actual HTML comment.
-      @s.fragment("<script><!-- comment --></script>").must_equal '&lt;!-- comment --&gt;'
       Sanitize.fragment("<script><!-- comment --></script>", :allow_comments => true, :elements => ['script'])
         .must_equal '<script><!-- comment --></script>'
     end

data/test/test_clean_css.rb CHANGED

@@ -13,7 +13,7 @@ describe 'Sanitize::Transformers::CSS::CleanAttribute' do
     @s.fragment(%[
       <div style="color: #fff; width: expression(alert(1)); /* <-- evil! */"></div>
     ].strip).must_equal %[
-      <div style="color: #fff;  /* &lt;-- evil! */"></div>
+      <div style="color: #fff;  /* <-- evil! */"></div>
     ].strip
   end

data/test/test_clean_doctype.rb CHANGED

@@ -11,7 +11,7 @@ describe 'Sanitize::Transformers::CleanDoctype' do
     end
     it 'should remove doctype declarations' do
-      @s.document('<!DOCTYPE html><html>foo</html>').must_equal "<html>foo</html>\n"
+      @s.document('<!DOCTYPE html><html>foo</html>').must_equal "<html>foo</html>"
       @s.fragment('<!DOCTYPE html>foo').must_equal 'foo'
     end
@@ -34,27 +34,27 @@ describe 'Sanitize::Transformers::CleanDoctype' do
     it 'should allow doctype declarations in documents' do
       @s.document('<!DOCTYPE html><html>foo</html>')
-        .must_equal "<!DOCTYPE html>\n<html>foo</html>\n"
+        .must_equal "<!DOCTYPE html><html>foo</html>"
       @s.document('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"><html>foo</html>')
-        .must_equal "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\">\n<html>foo</html>\n"
+        .must_equal "<!DOCTYPE html><html>foo</html>"
       @s.document("<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"><html>foo</html>")
-        .must_equal "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html>foo</html>\n"
+        .must_equal "<!DOCTYPE html><html>foo</html>"
     end
     it 'should not allow obviously invalid doctype declarations in documents' do
       @s.document('<!DOCTYPE blah blah blah><html>foo</html>')
-        .must_equal "<!DOCTYPE html>\n<html>foo</html>\n"
+        .must_equal "<!DOCTYPE html><html>foo</html>"
       @s.document('<!DOCTYPE blah><html>foo</html>')
-        .must_equal "<!DOCTYPE html>\n<html>foo</html>\n"
+        .must_equal "<!DOCTYPE html><html>foo</html>"
       @s.document('<!DOCTYPE html BLAH "-//W3C//DTD HTML 4.01//EN"><html>foo</html>')
-        .must_equal "<!DOCTYPE html>\n<html>foo</html>\n"
+        .must_equal "<!DOCTYPE html><html>foo</html>"
       @s.document('<!whatever><html>foo</html>')
-        .must_equal "<html>foo</html>\n"
+        .must_equal "<html>foo</html>"
     end
     it 'should not allow doctype definitions in fragments' do

data/test/test_clean_element.rb CHANGED

@@ -8,25 +8,22 @@ describe 'Sanitize::Transformers::CleanElement' do
   strings = {
     :basic => {
       :html       => '<b>Lo<!-- comment -->rem</b> <a href="pants" title="foo" style="text-decoration: underline;">ipsum</a> <a href="http://foo.com/"><strong>dolor</strong></a> sit<br/>amet <style>.foo { color: #fff; }</style> <script>alert("hello world");</script>',
-      :default    => 'Lorem ipsum dolor sit amet .foo { color: #fff; } alert("hello world");',
-      :restricted => '<b>Lorem</b> ipsum <strong>dolor</strong> sit amet .foo { color: #fff; } alert("hello world");',
-      :basic      => '<b>Lorem</b> <a href="pants" rel="nofollow">ipsum</a> <a href="http://foo.com/" rel="nofollow"><strong>dolor</strong></a> sit<br>amet .foo { color: #fff; } alert("hello world");',
-      :relaxed    => '<b>Lorem</b> <a href="pants" title="foo" style="text-decoration: underline;">ipsum</a> <a href="http://foo.com/"><strong>dolor</strong></a> sit<br>amet <style>.foo { color: #fff; }</style> alert("hello world");'
+      :default    => 'Lorem ipsum dolor sit amet  ',
+      :restricted => '<b>Lorem</b> ipsum <strong>dolor</strong> sit amet  ',
+      :basic      => '<b>Lorem</b> <a href="pants" rel="nofollow">ipsum</a> <a href="http://foo.com/" rel="nofollow"><strong>dolor</strong></a> sit<br>amet  ',
+      :relaxed    => '<b>Lorem</b> <a href="pants" title="foo" style="text-decoration: underline;">ipsum</a> <a href="http://foo.com/"><strong>dolor</strong></a> sit<br>amet <style>.foo { color: #fff; }</style> '
     },
     :malformed => {
       :html       => 'Lo<!-- comment -->rem</b> <a href=pants title="foo>ipsum <a href="http://foo.com/"><strong>dolor</a></strong> sit<br/>amet <script>alert("hello world");',
-      :default    => 'Lorem dolor sit amet alert("hello world");',
-      :restricted => 'Lorem <strong>dolor</strong> sit amet alert("hello world");',
-      :basic      => 'Lorem <a href="pants" rel="nofollow"><strong>dolor</strong></a> sit<br>amet alert("hello world");',
-      :relaxed    => 'Lorem <a href="pants" title="foo&gt;ipsum &lt;a href="><strong>dolor</strong></a> sit<br>amet alert("hello world");',
+      :default    => 'Lorem dolor sit amet ',
+      :restricted => 'Lorem <strong>dolor</strong> sit amet ',
+      :basic      => 'Lorem <a href="pants" rel="nofollow"><strong>dolor</strong></a> sit<br>amet ',
+      :relaxed    => 'Lorem <a href="pants" title="foo>ipsum <a href="><strong>dolor</strong></a> sit<br>amet ',
     },
     :unclosed => {
       :html       => '<p>a</p><blockquote>b',
       :default    => ' a  b ',
       :restricted => ' a  b ',
       :basic      => '<p>a</p><blockquote>b</blockquote>',
@@ -35,7 +32,6 @@ describe 'Sanitize::Transformers::CleanElement' do
     :malicious => {
       :html       => '<b>Lo<!-- comment -->rem</b> <a href="javascript:pants" title="foo">ipsum</a> <a href="http://foo.com/"><strong>dolor</strong></a> sit<br/>amet <<foo>script>alert("hello world");</script>',
       :default    => 'Lorem ipsum dolor sit amet &lt;script&gt;alert("hello world");',
       :restricted => '<b>Lorem</b> ipsum <strong>dolor</strong> sit amet &lt;script&gt;alert("hello world");',
       :basic      => '<b>Lorem</b> <a rel="nofollow">ipsum</a> <a href="http://foo.com/" rel="nofollow"><strong>dolor</strong></a> sit<br>amet &lt;script&gt;alert("hello world");',
@@ -171,10 +167,10 @@ describe 'Sanitize::Transformers::CleanElement' do
         .must_equal 'foo bar baz quux'
       Sanitize.fragment('<script>alert("<xss>");</script>')
-        .must_equal 'alert("&lt;xss&gt;");'
+        .must_equal ''
       Sanitize.fragment('<<script>script>alert("<xss>");</<script>>')
-        .must_equal '&lt;script&gt;alert("&lt;xss&gt;");&lt;/&lt;script&gt;&gt;'
+        .must_equal '&lt;'
       Sanitize.fragment('< script <>> alert("<xss>");</script>')
         .must_equal '&lt; script &lt;&gt;&gt; alert("");'
@@ -196,6 +192,46 @@ describe 'Sanitize::Transformers::CleanElement' do
         .must_equal ''
     end
+    it 'should escape the content of removed `plaintext` elements' do
+      Sanitize.fragment('<plaintext>hello! <script>alert(0)</script>')
+        .must_equal 'hello! &lt;script&gt;alert(0)&lt;/script&gt;'
+    end
+    it 'should escape the content of removed `xmp` elements' do
+      Sanitize.fragment('<xmp>hello! <script>alert(0)</script></xmp>')
+        .must_equal 'hello! &lt;script&gt;alert(0)&lt;/script&gt;'
+    end
+    it 'should not preserve the content of removed `iframe` elements' do
+      Sanitize.fragment('<iframe>hello! <script>alert(0)</script></iframe>')
+        .must_equal ''
+    end
+    it 'should not preserve the content of removed `noembed` elements' do
+      Sanitize.fragment('<noembed>hello! <script>alert(0)</script></noembed>')
+        .must_equal ''
+    end
+    it 'should not preserve the content of removed `noframes` elements' do
+      Sanitize.fragment('<noframes>hello! <script>alert(0)</script></noframes>')
+        .must_equal ''
+    end
+    it 'should not preserve the content of removed `noscript` elements' do
+      Sanitize.fragment('<noscript>hello! <script>alert(0)</script></noscript>')
+        .must_equal ''
+    end
+    it 'should not preserve the content of removed `script` elements' do
+      Sanitize.fragment('<script>hello! <script>alert(0)</script></script>')
+        .must_equal ''
+    end
+    it 'should not preserve the content of removed `style` elements' do
+      Sanitize.fragment('<style>hello! <script>alert(0)</script></style>')
+        .must_equal ''
+    end
     strings.each do |name, data|
       it "should clean #{name} HTML" do
         Sanitize.fragment(data[:html]).must_equal(data[:default])
@@ -234,7 +270,7 @@ describe 'Sanitize::Transformers::CleanElement' do
     it 'should not choke on valueless attributes' do
       @s.fragment('foo <a href>foo</a> bar')
-        .must_equal 'foo <a href rel="nofollow">foo</a> bar'
+        .must_equal 'foo <a href="" rel="nofollow">foo</a> bar'
     end
     it 'should downcase attribute names' do
@@ -262,7 +298,7 @@ describe 'Sanitize::Transformers::CleanElement' do
     it 'should encode special chars in attribute values' do
       @s.fragment('<a href="http://example.com" title="<b>&eacute;xamples</b> & things">foo</a>')
-        .must_equal '<a href="http://example.com" title="&lt;b&gt;éxamples&lt;/b&gt; &amp; things">foo</a>'
+        .must_equal '<a href="http://example.com" title="<b>éxamples</b> &amp; things">foo</a>'
     end
     strings.each do |name, data|
@@ -344,16 +380,30 @@ describe 'Sanitize::Transformers::CleanElement' do
       ).must_equal 'foo bar   '
     end
-    it 'should remove the contents of specified nodes when :remove_contents is an Array of element names as strings' do
-      Sanitize.fragment('foo bar <div>baz<span>quux</span><script>alert("hello!");</script></div>',
+    it 'should remove the contents of specified nodes when :remove_contents is an Array or Set of element names as strings' do
+      Sanitize.fragment('foo bar <div>baz<span>quux</span> <b>hi</b><script>alert("hello!");</script></div>',
         :remove_contents => ['script', 'span']
-      ).must_equal 'foo bar  baz '
+      ).must_equal 'foo bar  baz hi '
+      Sanitize.fragment('foo bar <div>baz<span>quux</span> <b>hi</b><script>alert("hello!");</script></div>',
+        :remove_contents => Set.new(['script', 'span'])
+      ).must_equal 'foo bar  baz hi '
     end
-    it 'should remove the contents of specified nodes when :remove_contents is an Array of element names as symbols' do
-      Sanitize.fragment('foo bar <div>baz<span>quux</span><script>alert("hello!");</script></div>',
+    it 'should remove the contents of specified nodes when :remove_contents is an Array or Set of element names as symbols' do
+      Sanitize.fragment('foo bar <div>baz<span>quux</span> <b>hi</b><script>alert("hello!");</script></div>',
         :remove_contents => [:script, :span]
-      ).must_equal 'foo bar  baz '
+      ).must_equal 'foo bar  baz hi '
+      Sanitize.fragment('foo bar <div>baz<span>quux</span> <b>hi</b><script>alert("hello!");</script></div>',
+        :remove_contents => Set.new([:script, :span])
+      ).must_equal 'foo bar  baz hi '
+    end
+    it 'should remove the contents of whitelisted iframes' do
+      Sanitize.fragment('<iframe>hi <script>hello</script></iframe>',
+        :elements => ['iframe']
+      ).must_equal '<iframe></iframe>'
     end
     it 'should not allow arbitrary HTML5 data attributes by default' do
@@ -413,7 +463,7 @@ describe 'Sanitize::Transformers::CleanElement' do
       s.fragment('foo<br>bar<br>baz').must_equal "foo\nbar\nbaz"
     end
-    it 'handles protocols correctly regardless of case' do
+    it 'should handle protocols correctly regardless of case' do
       input = '<a href="hTTpS://foo.com/">Text</a>'
       Sanitize.fragment(input, {
@@ -430,5 +480,40 @@ describe 'Sanitize::Transformers::CleanElement' do
         :protocols  => {'a' => {'href' => ['https']}}
       }).must_equal "<a>Text</a>"
     end
+    it 'should prevent `<meta>` tags from being used to set a non-UTF-8 charset' do
+      Sanitize.document('<html><head><meta charset="utf-8"></head><body>Howdy!</body></html>',
+        :elements   => %w[html head meta body],
+        :attributes => {'meta' => ['charset']}
+      ).must_equal "<html><head><meta charset=\"utf-8\"></head><body>Howdy!</body></html>"
+      Sanitize.document('<html><meta charset="utf-8">Howdy!</html>',
+        :elements   => %w[html meta],
+        :attributes => {'meta' => ['charset']}
+      ).must_equal "<html><meta charset=\"utf-8\">Howdy!</html>"
+      Sanitize.document('<html><meta charset="us-ascii">Howdy!</html>',
+        :elements   => %w[html meta],
+        :attributes => {'meta' => ['charset']}
+      ).must_equal "<html><meta charset=\"utf-8\">Howdy!</html>"
+      Sanitize.document('<html><meta http-equiv="content-type" content=" text/html; charset=us-ascii">Howdy!</html>',
+        :elements   => %w[html meta],
+        :attributes => {'meta' => %w[content http-equiv]}
+      ).must_equal "<html><meta http-equiv=\"content-type\" content=\" text/html;charset=utf-8\">Howdy!</html>"
+      Sanitize.document('<html><meta http-equiv="Content-Type" content="text/plain;charset = us-ascii">Howdy!</html>',
+        :elements   => %w[html meta],
+        :attributes => {'meta' => %w[content http-equiv]}
+      ).must_equal "<html><meta http-equiv=\"Content-Type\" content=\"text/plain;charset=utf-8\">Howdy!</html>"
+    end
+    it 'should not modify `<meta>` tags that already set a UTF-8 charset' do
+      Sanitize.document('<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"></head><body>Howdy!</body></html>',
+        :elements   => %w[html head meta body],
+        :attributes => {'meta' => %w[content http-equiv]}
+      ).must_equal "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\"></head><body>Howdy!</body></html>"
+    end
   end
 end

data/test/test_malicious_html.rb CHANGED

@@ -43,7 +43,7 @@ describe 'Malicious HTML' do
   describe '<body>' do
     it 'should not be possible to inject JS via a malformed event attribute' do
       @s.document('<html><head></head><body onload!#$%&()*~+-_.,:;?@[/|\\]^`=alert("XSS")></body></html>').
-        must_equal "<html><head></head><body></body></html>\n"
+        must_equal "<html><head></head><body></body></html>"
     end
   end
@@ -65,7 +65,7 @@ describe 'Malicious HTML' do
     it 'should not be possible to inject <script> via a malformed <img> tag' do
       @s.fragment('<img """><script>alert("XSS")</script>">').
-        must_equal '<img>alert("XSS")"&gt;'
+        must_equal '<img>"&gt;'
     end
     it 'should not be possible to inject protocol-based JS' do
@@ -117,12 +117,12 @@ describe 'Malicious HTML' do
   describe '<script>' do
     it 'should not be possible to inject <script> using a malformed non-alphanumeric tag name' do
       @s.fragment(%[<script/xss src="http://ha.ckers.org/xss.js">alert(1)</script>]).
-        must_equal 'alert(1)'
+        must_equal ''
     end
     it 'should not be possible to inject <script> via extraneous open brackets' do
       @s.fragment(%[<<script>alert("XSS");//<</script>]).
-        must_equal '&lt;alert("XSS");//&lt;'
+        must_equal '&lt;'
     end
   end
@@ -166,7 +166,12 @@ describe 'Malicious HTML' do
         input = %[<#{tag_name} #{attr_name}='examp<!--" onmouseover=alert(1)>-->le.com'>foo</#{tag_name}>]
         it 'should escape unsafe characters in attributes' do
-          @s.fragment(input).must_equal(%[<#{tag_name} #{attr_name}="examp<!--%22%20onmouseover=alert(1)>-->le.com">foo</#{tag_name}>])
+          output = %[<#{tag_name} #{attr_name}="examp<!--%22%20onmouseover=alert(1)>-->le.com">foo</#{tag_name}>]
+          @s.fragment(input).must_equal(output)
+          fragment = Nokogiri::HTML.fragment(input)
+          @s.node!(fragment)
+          fragment.to_html.must_equal(output)
         end
         it 'should round-trip to the same output' do
@@ -179,7 +184,11 @@ describe 'Malicious HTML' do
         input = %[<#{tag_name} #{attr_name}='examp<!--" onmouseover=alert(1)>-->le.com'>foo</#{tag_name}>]
         it 'should not escape characters unnecessarily' do
-          @s.fragment(input).must_equal(input)
+          @s.fragment(input).must_equal(%[<#{tag_name} #{attr_name}="examp<!--&quot; onmouseover=alert(1)>-->le.com">foo</#{tag_name}>])
+          fragment = Nokogiri::HTML.fragment(input)
+          @s.node!(fragment)
+          fragment.to_html.must_equal(%[<#{tag_name} #{attr_name}='examp<!--" onmouseover=alert(1)>-->le.com'>foo</#{tag_name}>])
         end
         it 'should round-trip to the same output' do

data/test/test_parser.rb CHANGED

@@ -19,8 +19,8 @@ describe 'Parser' do
   end
   it 'should not have the Nokogiri 1.4.2+ unterminated script/style element bug' do
-    Sanitize.fragment('foo <script>bar').must_equal 'foo bar'
-    Sanitize.fragment('foo <style>bar').must_equal 'foo bar'
+    Sanitize.fragment('foo <script>bar').must_equal 'foo '
+    Sanitize.fragment('foo <style>bar').must_equal 'foo '
   end
   it 'ambiguous non-tag brackets like "1 > 2 and 2 < 1" should be parsed correctly' do
@@ -28,35 +28,6 @@ describe 'Parser' do
     Sanitize.fragment('OMG HAPPY BIRTHDAY! *<:-D').must_equal 'OMG HAPPY BIRTHDAY! *&lt;:-D'
   end
-  # https://github.com/sparklemotion/nokogiri/issues/1008
-  it 'should work around the libxml2 content-type meta tag bug' do
-    Sanitize.document('<html><head></head><body>Howdy!</body></html>',
-      :elements => %w[html head body]
-    ).must_equal "<html><head></head><body>Howdy!</body></html>\n"
-    Sanitize.document('<html><head></head><body>Howdy!</body></html>',
-      :elements => %w[html head meta body]
-    ).must_equal "<html><head></head><body>Howdy!</body></html>\n"
-    Sanitize.document('<html><head><meta charset="utf-8"></head><body>Howdy!</body></html>',
-      :elements   => %w[html head meta body],
-      :attributes => {'meta' => ['charset']}
-    ).must_equal "<html><head><meta charset=\"utf-8\"></head><body>Howdy!</body></html>\n"
-    Sanitize.document('<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8"></head><body>Howdy!</body></html>',
-      :elements   => %w[html head meta body],
-      :attributes => {'meta' => %w[charset content http-equiv]}
-    ).must_equal "<html><head><meta http-equiv=\"Content-Type\" content=\"text/html;charset=utf-8\"></head><body>Howdy!</body></html>\n"
-    # Edge case: an existing content-type meta tag with a non-UTF-8 content type
-    # will be converted to UTF-8, since that's the only output encoding we
-    # support.
-    Sanitize.document('<html><head><meta http-equiv="content-type" content="text/html;charset=us-ascii"></head><body>Howdy!</body></html>',
-      :elements   => %w[html head meta body],
-      :attributes => {'meta' => %w[charset content http-equiv]}
-    ).must_equal "<html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"></head><body>Howdy!</body></html>\n"
-  end
   describe 'when siblings are added after a node during traversal' do
     it 'the added siblings should be traversed' do
       html = %[

data/test/test_sanitize.rb CHANGED

@@ -25,7 +25,7 @@ describe 'Sanitize' do
       it 'should sanitize an HTML document' do
         @s.document('<!doctype html><html><b>Lo<!-- comment -->rem</b> <a href="pants" title="foo">ipsum</a> <a href="http://foo.com/"><strong>dolor</strong></a> sit<br/>amet <script>alert("hello world");</script></html>')
-          .must_equal "<html>Lorem ipsum dolor sit amet alert(\"hello world\");</html>\n"
+          .must_equal "<html>Lorem ipsum dolor sit amet </html>"
       end
       it 'should not modify the input string' do
@@ -35,14 +35,14 @@ describe 'Sanitize' do
       end
       it 'should not choke on frozen documents' do
-        @s.document('<!doctype html><html><b>foo</b>'.freeze).must_equal "<html>foo</html>\n"
+        @s.document('<!doctype html><html><b>foo</b>'.freeze).must_equal "<html>foo</html>"
       end
     end
     describe '#fragment' do
       it 'should sanitize an HTML fragment' do
         @s.fragment('<b>Lo<!-- comment -->rem</b> <a href="pants" title="foo">ipsum</a> <a href="http://foo.com/"><strong>dolor</strong></a> sit<br/>amet <script>alert("hello world");</script>')
-          .must_equal 'Lorem ipsum dolor sit amet alert("hello world");'
+          .must_equal 'Lorem ipsum dolor sit amet '
       end
       it 'should not modify the input string' do
@@ -71,7 +71,7 @@ describe 'Sanitize' do
         doc.xpath('/html/body/node()').each {|node| frag << node }
         @s.node!(frag)
-        frag.to_html.must_equal 'Lorem ipsum dolor sit amet alert("hello world");'
+        frag.to_html.must_equal 'Lorem ipsum dolor sit amet '
       end
       describe "when the given node is a document and <html> isn't whitelisted" do

data/test/test_transformers.rb CHANGED

@@ -172,28 +172,28 @@ describe 'Transformers' do
       input = '<iframe width="420" height="315" src="http://www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen bogus="bogus"><script>alert()</script></iframe>'
       Sanitize.fragment(input, :transformers => youtube_transformer)
-        .must_equal '<iframe width="420" height="315" src="http://www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen="">&lt;script&gt;alert()&lt;/script&gt;</iframe>'
+        .must_equal '<iframe width="420" height="315" src="http://www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen=""></iframe>'
     end
     it 'should allow HTTPS YouTube video embeds' do
       input = '<iframe width="420" height="315" src="https://www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen bogus="bogus"><script>alert()</script></iframe>'
       Sanitize.fragment(input, :transformers => youtube_transformer)
-        .must_equal '<iframe width="420" height="315" src="https://www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen="">&lt;script&gt;alert()&lt;/script&gt;</iframe>'
+        .must_equal '<iframe width="420" height="315" src="https://www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen=""></iframe>'
     end
     it 'should allow protocol-relative YouTube video embeds' do
       input = '<iframe width="420" height="315" src="//www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen bogus="bogus"><script>alert()</script></iframe>'
       Sanitize.fragment(input, :transformers => youtube_transformer)
-        .must_equal '<iframe width="420" height="315" src="//www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen="">&lt;script&gt;alert()&lt;/script&gt;</iframe>'
+        .must_equal '<iframe width="420" height="315" src="//www.youtube.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen=""></iframe>'
     end
     it 'should allow privacy-enhanced YouTube video embeds' do
       input = '<iframe width="420" height="315" src="https://www.youtube-nocookie.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen bogus="bogus"><script>alert()</script></iframe>'
       Sanitize.fragment(input, :transformers => youtube_transformer)
-        .must_equal '<iframe width="420" height="315" src="https://www.youtube-nocookie.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen="">&lt;script&gt;alert()&lt;/script&gt;</iframe>'
+        .must_equal '<iframe width="420" height="315" src="https://www.youtube-nocookie.com/embed/QH2-TGUlwu4" frameborder="0" allowfullscreen=""></iframe>'
     end
     it 'should not allow non-YouTube video embeds' do

data/test/test_unicode.rb CHANGED

@@ -23,61 +23,61 @@ describe 'Unicode' do
     end
     it 'should strip deprecated grave and acute clones' do
-      @s.document("a\u0340b\u0341c").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\u0340b\u0341c").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\u0340b\u0341c").must_equal 'abc'
     end
     it 'should strip deprecated Khmer characters' do
-      @s.document("a\u17a3b\u17d3c").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\u17a3b\u17d3c").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\u17a3b\u17d3c").must_equal 'abc'
     end
     it 'should strip line and paragraph separator punctuation' do
-      @s.document("a\u2028b\u2029c").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\u2028b\u2029c").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\u2028b\u2029c").must_equal 'abc'
     end
     it 'should strip bidi embedding control characters' do
       @s.document("a\u202ab\u202bc\u202cd\u202de\u202e")
-        .must_equal "<html><head></head><body>abcde</body></html>\n"
+        .must_equal "<html><head></head><body>abcde</body></html>"
       @s.fragment("a\u202ab\u202bc\u202cd\u202de\u202e")
         .must_equal 'abcde'
     end
     it 'should strip deprecated symmetric swapping characters' do
-      @s.document("a\u206ab\u206bc").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\u206ab\u206bc").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\u206ab\u206bc").must_equal 'abc'
     end
     it 'should strip deprecated Arabic form shaping characters' do
-      @s.document("a\u206cb\u206dc").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\u206cb\u206dc").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\u206cb\u206dc").must_equal 'abc'
     end
     it 'should strip deprecated National digit shape characters' do
-      @s.document("a\u206eb\u206fc").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\u206eb\u206fc").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\u206eb\u206fc").must_equal 'abc'
     end
     it 'should strip interlinear annotation characters' do
-      @s.document("a\ufff9b\ufffac\ufffb").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\ufff9b\ufffac\ufffb").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\ufff9b\ufffac\ufffb").must_equal 'abc'
     end
     it 'should strip BOM/zero-width non-breaking space characters' do
-      @s.document("a\ufeffbc").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\ufeffbc").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\ufeffbc").must_equal 'abc'
     end
     it 'should strip object replacement characters' do
-      @s.document("a\ufffcbc").must_equal "<html><head></head><body>abc</body></html>\n"
+      @s.document("a\ufffcbc").must_equal "<html><head></head><body>abc</body></html>"
       @s.fragment("a\ufffcbc").must_equal 'abc'
     end
     it 'should strip musical notation scoping characters' do
       @s.document("a\u{1d173}b\u{1d174}c\u{1d175}d\u{1d176}e\u{1d177}f\u{1d178}g\u{1d179}h\u{1d17a}")
-        .must_equal "<html><head></head><body>abcdefgh</body></html>\n"
+        .must_equal "<html><head></head><body>abcdefgh</body></html>"
       @s.fragment("a\u{1d173}b\u{1d174}c\u{1d175}d\u{1d176}e\u{1d177}f\u{1d178}g\u{1d179}h\u{1d17a}")
         .must_equal 'abcdefgh'
@@ -88,7 +88,7 @@ describe 'Unicode' do
       (0xE0000..0xE007F).each {|n| str << [n].pack('U') }
       str << 'b'
-      @s.document(str).must_equal "<html><head></head><body>ab</body></html>\n"
+      @s.document(str).must_equal "<html><head></head><body>ab</body></html>"
       @s.fragment(str).must_equal 'ab'
     end
   end

metadata CHANGED

@@ -1,14 +1,14 @@
 --- !ruby/object:Gem::Specification
 name: sanitize
 version: !ruby/object:Gem::Version
-  version: 4.6.6
+  version: 5.0.0
 platform: ruby
 authors:
 - Ryan Grove
 autorequire:
 bindir: bin
 cert_chain: []
-date: 2018-07-24 00:00:00.000000000 Z
+date: 2018-10-15 00:00:00.000000000 Z
 dependencies:
 - !ruby/object:Gem::Dependency
   name: crass
@@ -30,56 +30,56 @@ dependencies:
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.4.4
+        version: 1.8.0
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - ">="
       - !ruby/object:Gem::Version
-        version: 1.4.4
+        version: 1.8.0
 - !ruby/object:Gem::Dependency
   name: nokogumbo
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.4'
+        version: '2.0'
   type: :runtime
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: '1.4'
+        version: '2.0'
 - !ruby/object:Gem::Dependency
   name: minitest
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 5.10.2
+        version: 5.11.3
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 5.10.2
+        version: 5.11.3
 - !ruby/object:Gem::Dependency
   name: rake
   requirement: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 12.0.0
+        version: 12.3.1
   type: :development
   prerelease: false
   version_requirements: !ruby/object:Gem::Requirement
     requirements:
     - - "~>"
       - !ruby/object:Gem::Version
-        version: 12.0.0
+        version: 12.3.1
 description: Sanitize is a whitelist-based HTML and CSS sanitizer. Given a list of
   acceptable elements, attributes, and CSS properties, Sanitize will remove all unacceptable
   HTML and/or CSS from a string.
@@ -129,7 +129,7 @@ required_ruby_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
     - !ruby/object:Gem::Version
-      version: 1.9.2
+      version: 2.1.0
 required_rubygems_version: !ruby/object:Gem::Requirement
   requirements:
   - - ">="
@@ -137,7 +137,7 @@ required_rubygems_version: !ruby/object:Gem::Requirement
       version: 1.2.0
 requirements: []
 rubyforge_project:
-rubygems_version: 2.7.3
+rubygems_version: 2.7.6
 signing_key:
 specification_version: 4
 summary: Whitelist-based HTML and CSS sanitizer.
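
The metadata diff above raises the `nokogumbo` constraint from `~> 1.4` to `~> 2.0` and the required Ruby version from `>= 1.9.2` to `>= 2.1.0`. As a rough sketch of what those constraints accept, the snippet below evaluates them with RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The `satisfies?` helper and the sample version strings are hypothetical and used only for illustration; they are not part of sanitize.

```ruby
require 'rubygems'

# Constraints copied from the 5.0.0 metadata in the diff above.
NOKOGUMBO_REQUIREMENT = Gem::Requirement.new('~> 2.0')
RUBY_REQUIREMENT      = Gem::Requirement.new('>= 2.1.0')

# Hypothetical helper: true when the version string satisfies the requirement.
def satisfies?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

# '~> 2.0' is the pessimistic operator: >= 2.0 and < 3.0.
puts satisfies?(NOKOGUMBO_REQUIREMENT, '1.5.0') # false
puts satisfies?(NOKOGUMBO_REQUIREMENT, '2.0.1') # true
puts satisfies?(NOKOGUMBO_REQUIREMENT, '3.0.0') # false

# '>= 2.1.0' rejects the Ruby 1.9.x line that 4.6.6 still allowed.
puts satisfies?(RUBY_REQUIREMENT, '1.9.3') # false
puts satisfies?(RUBY_REQUIREMENT, '2.1.0') # true

# Check the interpreter running this script against the new minimum.
puts satisfies?(RUBY_REQUIREMENT, RUBY_VERSION)
```

Running a check like this before upgrading can show whether an environment pinned to an older Ruby or to nokogumbo 1.x would be excluded by the 5.0.0 gemspec.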