RubyGems - sanitize - Versions diffs - 4.6.3 → 5.1.0 - Mend

sanitize 4.6.3 → 5.1.0

This diff represents the content of publicly available package versions that have been released to one of the supported registries. The information contained in this diff is provided for informational purposes only and reflects changes between package versions as they appear in their respective public registries.

Potentially problematic release.

This version of sanitize might be problematic.

Files changed (19)

checksums.yaml +4 -4
data/HISTORY.md +101 -1
data/README.md +24 -4
data/lib/sanitize.rb +41 -63
data/lib/sanitize/config/default.rb +10 -4
data/lib/sanitize/transformers/clean_element.rb +44 -3
data/lib/sanitize/version.rb +1 -1
data/test/common.rb +0 -31
data/test/test_clean_comment.rb +1 -5
data/test/test_clean_css.rb +1 -1
data/test/test_clean_doctype.rb +8 -8
data/test/test_clean_element.rb +108 -23
data/test/test_malicious_html.rb +30 -6
data/test/test_parser.rb +2 -31
data/test/test_sanitize.rb +102 -17
data/test/test_sanitize_css.rb +39 -12
data/test/test_transformers.rb +22 -4
metadata +12 -14
data/test/test_unicode.rb +0 -95

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: 04fe170a57bfd67e2e2f40e19e6add8cd777a9d812f24b66a4350d0cefe9f803
-  data.tar.gz: fb848fbc8cf1878905378f2795c9ad012d4247a1a5491ec4735994902544840d
+  metadata.gz: 8cf7bac25cea64ed464d106bdc57019388598ca9f1a4e7d8eddf3a98bab12267
+  data.tar.gz: e8b1f402b0d67a825b0ad4aad83829816fd9c78cd8445879636cba0a282e8ee5
 SHA512:
-  metadata.gz: dde1af17f562062ea7136d8033df17ed2aeaf39fdc1d037e75118c1ed9718d6ae50f29f9bb1165b0057810fed7a8bcac303e9e0687c3f98dbe514f6cb768cae5
-  data.tar.gz: 408533cd205ec1570041a6c029bb639978aa1785824a2e69df9d695331f274e7cda4e024934300ba35b8792fb39f4a94bd780dc6365cc2eaade06cdc32d3299e
+  metadata.gz: 956edaca6569a5933223da0aa7dcac4880b5164aa59e37256ac896c9fefb271da71425defe7e09e241b1333b441f5a2629893abed6d5a2a47d0726bf03597614
+  data.tar.gz: e45a018b904bcf8cb996f8ed08427e80b8ce058c4fe414782460c5496e88bb6c2a4055304118057621a630e514b4f96bac11bdc686181a6f0097dc7bf912ab04

data/HISTORY.md CHANGED

@@ -1,8 +1,86 @@
 # Sanitize History
+## 5.1.0 (2019-09-07)
+### Features
+* Added a `:parser_options` config hash, which makes it possible to pass custom
+  parsing options to Nokogumbo. [@austin-wang - #194][194]
+### Bug Fixes
+* Non-characters and non-whitespace control characters are now stripped from
+  HTML input before parsing to comply with the HTML Standard's [preprocessing
+  guidelines][html-preprocessing]. Prior to this Sanitize had adhered to [older
+  W3C guidelines][unicode-xml] that have since been withdrawn. [#179][179]
+[179]:https://github.com/rgrove/sanitize/issues/179
+[194]:https://github.com/rgrove/sanitize/pull/194
+[html-preprocessing]:https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+[unicode-xml]:https://www.w3.org/TR/unicode-xml/
+## 5.0.0 (2018-10-14)
+For most users, upgrading from 4.x shouldn't require any changes. However, the
+minimum required Ruby version has changed, and Sanitize 5.x's HTML output may
+differ in some small ways from 4.x's output. If this matters to you, please
+review the changes below carefully.
+### Potentially Breaking Changes
+* Ruby 2.3.0 is now the oldest officially supported Ruby version. Sanitize may
+  work in older 2.x Rubies, but they aren't actively tested. Sanitize definitely
+  no longer works in Ruby 1.9.x.
+* Upgraded to Nokogumbo 2.x, which fixes various bugs and adds
+  standard-compliant HTML serialization. [@stevecheckoway - #189][189]
+* Children of the following elements are now removed by default when these
+  elements are removed, rather than being preserved and escaped:
+  - `iframe`
+  - `noembed`
+  - `noframes`
+  - `noscript`
+  - `script`
+  - `style`
+* Children of whitelisted `iframe` elements are now always removed. In modern
+  HTML, `iframe` elements should never have children. In HTML 4 and earlier
+  `iframe` elements were allowed to contain fallback content for legacy
+  browsers, but it's been almost two decades since that was useful.
+* Fixed a bug that caused `:remove_contents` to behave as if it were set to
+  `true` when it was actually an Array.
+[189]:https://github.com/rgrove/sanitize/pull/189
+## 4.6.6 (2018-07-23)
+* Improved performance and memory usage by optimizing `Sanitize#transform_node!`
+  [@stanhu - #183][183]
+[183]:https://github.com/rgrove/sanitize/pull/183
+## 4.6.5 (2018-05-16)
+* Improved performance slightly by tweaking the order of built-in transformers.
+  [@rafbm - #180][180]
+[180]:https://github.com/rgrove/sanitize/pull/180
+## 4.6.4 (2018-03-20)
+* Fixed: A change introduced in 4.6.2 broke certain transformers that relied on
+  being able to mutate the name of an HTML node. That change has been reverted
+  and a test has been added to cover this case. [@zetter - #177][177]
+[177]:https://github.com/rgrove/sanitize/issues/177
 ## 4.6.3 (2018-03-19)
-* Fixed an HTML injection vulnerability that could allow XSS.
+* [CVE-2018-3740][176]: Fixed an HTML injection vulnerability that could allow
+  XSS.
   When Sanitize <= 4.6.2 is used in combination with libxml2 >= 2.9.2, a
   specially crafted HTML fragment can cause libxml2 to generate improperly
@@ -15,6 +93,8 @@
   Many thanks to the Shopify Application Security Team for responsibly reporting
   this issue.
+[176]:https://github.com/rgrove/sanitize/issues/176
 ## 4.6.2 (2018-03-19)
 * Reduced string allocations to optimize memory usage. [@janklimo - #175][175]
@@ -299,6 +379,26 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 [n1008]:https://github.com/sparklemotion/nokogiri/issues/1008
+## 2.1.1 (2018-09-30)
+* [CVE-2018-3740][176]: Fixed an HTML injection vulnerability that could allow
+  XSS (backported from Sanitize 4.6.3). [@dometto - #188][188]
+  When Sanitize <= 2.1.0 is used in combination with libxml2 >= 2.9.2, a
+  specially crafted HTML fragment can cause libxml2 to generate improperly
+  escaped output, allowing non-whitelisted attributes to be used on whitelisted
+  elements.
+  Sanitize now performs additional escaping on affected attributes to prevent
+  this.
+  Many thanks to the Shopify Application Security Team for responsibly reporting
+  this issue.
+[176]:https://github.com/rgrove/sanitize/issues/176
+[188]:https://github.com/rgrove/sanitize/pull/188
 ## 2.1.0 (2014-01-13)
 * Added support for whitelisting arbitrary HTML5 `data-*` attributes. Use the

data/README.md CHANGED

@@ -417,6 +417,17 @@ elements not in this array will be removed.
 ]
 ```
+#### :parser_options (Hash)
+[Parsing options](https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options) supplied to `nokogumbo`.
+```ruby
+:parser_options => {
+  max_errors: -1,
+  max_tree_depth: -1
+}
+```
 #### :protocols (Hash)
 URL protocols to allow in specific attributes. If an attribute is listed here
@@ -441,13 +452,13 @@ include the symbol `:relative` in the protocol array:
 #### :remove_contents (boolean or Array or Set)
-If set to `true`, Sanitize will remove the contents of any non-whitelisted
+If this is `true`, Sanitize will remove the contents of any non-whitelisted
 elements in addition to the elements themselves. By default, Sanitize leaves the
 safe parts of an element's contents behind when the element is removed.
-If set to an array of element names, then only the contents of the specified
-elements (when filtered) will be removed, and the contents of all other filtered
-elements will be left behind.
+If this is an Array or Set of element names, then only the contents of the
+specified elements (when filtered) will be removed, and the contents of all
+other filtered elements will be left behind.
 The default value is `false`.
@@ -474,6 +485,15 @@ children, in which case it will be inserted after those children.
 }
 ```
+The default elements with whitespace added before and after are:
+```
+address article aside blockquote br dd div dl dt
+footer h1 h2 h3 h4 h5 h6 header hgroup hr li nav
+ol p pre section ul
+```
 ## Transformers
 Transformers allow you to filter and modify HTML nodes using your own custom

data/lib/sanitize.rb CHANGED

@@ -19,6 +19,20 @@ require_relative 'sanitize/transformers/clean_element'
 class Sanitize
   attr_reader :config
+  # Matches one or more control characters that should be removed from HTML
+  # before parsing, as defined by the HTML living standard.
+  #
+  # -   https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+  # -   https://infra.spec.whatwg.org/#control
+  REGEX_HTML_CONTROL_CHARACTERS = /[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u
+  # Matches one or more non-characters that should be removed from HTML before
+  # parsing, as defined by the HTML living standard.
+  #
+  # -   https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+  # -   https://infra.spec.whatwg.org/#noncharacter
+  REGEX_HTML_NON_CHARACTERS = /[\ufdd0-\ufdef\ufffe\uffff\u{1fffe}\u{1ffff}\u{2fffe}\u{2ffff}\u{3fffe}\u{3ffff}\u{4fffe}\u{4ffff}\u{5fffe}\u{5ffff}\u{6fffe}\u{6ffff}\u{7fffe}\u{7ffff}\u{8fffe}\u{8ffff}\u{9fffe}\u{9ffff}\u{afffe}\u{affff}\u{bfffe}\u{bffff}\u{cfffe}\u{cffff}\u{dfffe}\u{dffff}\u{efffe}\u{effff}\u{ffffe}\u{fffff}\u{10fffe}\u{10ffff}]+/u
   # Matches an attribute value that could be treated by a browser as a URL
   # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
   # or more characters followed by a colon is considered a match, even if the
@@ -26,11 +40,12 @@ class Sanitize
   # IE6 and Opera will still parse).
   REGEX_PROTOCOL = /\A\s*([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i
-  # Matches Unicode characters that should be stripped from HTML before passing
-  # it to the parser.
+  # Matches one or more characters that should be stripped from HTML before
+  # parsing. This is a combination of `REGEX_HTML_CONTROL_CHARACTERS` and
+  # `REGEX_HTML_NON_CHARACTERS`.
   #
-  # http://www.w3.org/TR/unicode-xml/#Charlist
-  REGEX_UNSUITABLE_CHARS = /[\u0000\u0340\u0341\u17a3\u17d3\u2028\u2029\u202a-\u202e\u206a-\u206f\ufff9-\ufffb\ufeff\ufffc\u{1d173}-\u{1d17a}\u{e0000}-\u{e007f}]/u
+  # https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+  REGEX_UNSUITABLE_CHARS = /(?:#{REGEX_HTML_CONTROL_CHARACTERS}|#{REGEX_HTML_NON_CHARACTERS})/u
   #--
   # Class Methods
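
Review note: the rewritten `REGEX_UNSUITABLE_CHARS` strips a different set of characters than the 4.6.3 version. The sketch below is a minimal, standalone illustration of that change. The control-character class is copied from the diff; the non-character class is shortened to its BMP entries for brevity, and `strip_unsuitable` is a hypothetical helper, not a method of the gem.

```ruby
# Control-character class copied from the 5.1.0 diff.
CONTROL_CHARS = /[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u

# Shortened for illustration: the real constant also lists the U+xFFFE and
# U+xFFFF code points of every supplementary plane.
NON_CHARS = /[\ufdd0-\ufdef\ufffe\uffff]+/u

# Combined the same way as REGEX_UNSUITABLE_CHARS in the diff.
UNSUITABLE = /(?:#{CONTROL_CHARS}|#{NON_CHARS})/u

# Hypothetical helper: removes every match from the input string.
def strip_unsuitable(html)
  html.gsub(UNSUITABLE, '')
end

# U+0001, U+000B and U+FDD0 are removed. Tab (U+0009) stays because it is
# whitespace, and U+2028 stays because the new class no longer lists it
# (the 4.6.3 regex did).
cleaned = strip_unsuitable("a\u0001b\u000bc\td\ufdd0e\u2028f")
puts cleaned.inspect
```

Tab, line feed, form feed and carriage return fall in the gaps between the ranges, so ordinary whitespace in the input survives this step.
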
@@ -81,6 +96,7 @@ class Sanitize
     # Default transformers always run at the end of the chain, after any custom
     # transformers.
+    @transformers << Transformers::CleanElement.new(@config)
     @transformers << Transformers::CleanComment unless @config[:allow_comments]
     if @config[:elements].include?('style')
@@ -93,10 +109,10 @@ class Sanitize
       @transformers << Transformers::CSS::CleanAttribute.new(scss)
     end
-    @transformers <<
-        Transformers::CleanDoctype <<
-        Transformers::CleanCDATA <<
-        Transformers::CleanElement.new(@config)
+    @transformers << Transformers::CleanDoctype
+    @transformers << Transformers::CleanCDATA
+    @transformer_config = { config: @config }
   end
   # Returns a sanitized copy of the given _html_ document.
@@ -107,7 +123,7 @@ class Sanitize
   def document(html)
     return '' unless html
-    doc = Nokogiri::HTML5.parse(preprocess(html))
+    doc = Nokogiri::HTML5.parse(preprocess(html), **@config[:parser_options])
     node!(doc)
     to_html(doc)
   end
@@ -119,20 +135,7 @@ class Sanitize
   def fragment(html)
     return '' unless html
-    html = preprocess(html)
-    doc  = Nokogiri::HTML5.parse("<html><body>#{html}")
-    # Hack to allow fragments containing <body>. Borrowed from
-    # Nokogiri::HTML::DocumentFragment.
-    if html =~ /\A<body(?:\s|>)/i
-      path = '/html/body'
-    else
-      path = '/html/body/node()'
-    end
-    frag = doc.fragment
-    frag << doc.xpath(path)
+    frag = Nokogiri::HTML5.fragment(preprocess(html), **@config[:parser_options])
     node!(frag)
     to_html(frag)
   end
@@ -183,50 +186,25 @@ class Sanitize
   end
   def to_html(node)
-    replace_meta = false
-    # Hacky workaround for a libxml2 bug that adds an undesired Content-Type
-    # meta tag to all serialized HTML documents.
-    #
-    # https://github.com/sparklemotion/nokogiri/issues/1008
-    if node.type == Nokogiri::XML::Node::DOCUMENT_NODE ||
-        node.type == Nokogiri::XML::Node::HTML_DOCUMENT_NODE
-      regex_meta   = %r|(<html[^>]*>\s*<head[^>]*>\s*)<meta http-equiv="Content-Type" content="text/html; charset=utf-8">|i
-      # Only replace the content-type meta tag if <meta> isn't whitelisted or
-      # the original document didn't actually include a content-type meta tag.
-      replace_meta = !@config[:elements].include?('meta') ||
-        node.xpath('/html/head/meta[@http-equiv]').none? do |meta|
-          meta['http-equiv'].casecmp('content-type').zero?
-        end
-    end
-    so = Nokogiri::XML::Node::SaveOptions
-    # Serialize to HTML without any formatting to prevent Nokogiri from adding
-    # newlines after certain tags.
-    html = node.to_html(
-      :encoding  => 'utf-8',
-      :indent    => 0,
-      :save_with => so::NO_DECLARATION | so::NO_EMPTY_TAGS | so::AS_HTML
-    )
-    html.gsub!(regex_meta, '\1') if replace_meta
-    html
+    node.to_html(preserve_newline: true)
   end
   def transform_node!(node, node_whitelist)
-    node_name = node.name.downcase
     @transformers.each do |transformer|
-      result = transformer.call(
-        :config         => @config,
-        :is_whitelisted => node_whitelist.include?(node),
-        :node           => node,
-        :node_name      => node_name,
-        :node_whitelist => node_whitelist
-      )
+      # Since transform_node! may be called in a tight loop to process thousands
+      # of items, we can optimize both memory and CPU performance by:
+      #
+      # 1. Reusing the same config hash for each transformer
+      # 2. Directly assigning values to hash instead of using merge!. Not only
+      # does merge! create a new hash, it is also 2.6x slower:
+      # https://github.com/JuanitoFatas/fast-ruby#hashmerge-vs-hashmerge-code
+      config = @transformer_config
+      config[:is_whitelisted] = node_whitelist.include?(node)
+      config[:node] = node
+      config[:node_name] = node.name.downcase
+      config[:node_whitelist] = node_whitelist
+      result = transformer.call(config)
       if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
         node_whitelist.merge(result[:node_whitelist])
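
Review note: because `transform_node!` now hands the same hash object to every transformer and overwrites its keys for each node, a custom transformer that keeps a reference to that hash would presumably see later values rather than the ones it was called with. The sketch below is a standalone illustration in plain Ruby; `shared_env`, `remember` and `snapshot` are hypothetical names, and no part of the gem is loaded.

```ruby
seen = []

# Two hypothetical transformers: anything that responds to #call.
remember = ->(env) { seen << env }      # stores a reference to the shared hash
snapshot = ->(env) { seen << env.dup }  # stores a private shallow copy

# Plays the role of @transformer_config in the diff.
shared_env = { config: {} }

%w[P DIV].each do |name|
  # Keys are assigned in place for each node, as in the new transform_node!.
  shared_env[:node_name] = name.downcase
  remember.call(shared_env)
  snapshot.call(shared_env)
end

# The stored reference reflects the last node processed; the copies do not.
puts seen.map { |env| env[:node_name] }.inspect
```

A transformer that needs to retain per-node values after returning would be safer copying what it needs, as `snapshot` does.
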

data/lib/sanitize/config/default.rb CHANGED

@@ -56,6 +56,10 @@ class Sanitize
       # that all HTML will be stripped).
       :elements => [],
+      # HTML parsing options to pass to Nokogumbo.
+      # https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options
+      :parser_options => {},
       # URL handling protocols to allow in specific attributes. By default, no
       # protocols are allowed. Use :relative in place of a protocol if you want
       # to allow relative URLs sans protocol.
@@ -66,10 +70,12 @@ class Sanitize
       # leaves the safe parts of an element's contents behind when the element
       # is removed.
       #
-      # If this is an Array of element names, then only the contents of the
-      # specified elements (when filtered) will be removed, and the contents of
-      # all other filtered elements will be left behind.
-      :remove_contents => false,
+      # If this is an Array or Set of element names, then only the contents of
+      # the specified elements (when filtered) will be removed, and the contents
+      # of all other filtered elements will be left behind.
+      :remove_contents => %w[
+        iframe noembed noframes noscript script style
+      ],
       # Transformers allow you to filter or alter nodes using custom logic. See
       # README.md for details and examples.
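
Review note: the new Array default for `:remove_contents` relies on the `is_a?(Set)` to `is_a?(Enumerable)` change in `clean_element.rb`, which is the fix HISTORY.md describes for Arrays being treated as `true`. The sketch below reproduces only that branch in plain Ruby so the difference is visible; `classify` is a hypothetical helper and the return values are illustrative, not the gem's internal state.

```ruby
require 'set'

# Hypothetical helper mirroring the branch in CleanElement#initialize.
# `check` is the class or module used in the is_a? test.
def classify(remove_contents, check)
  if remove_contents.is_a?(check)
    # Only the listed elements lose their contents.
    [:per_element, Set.new(remove_contents.map(&:to_s))]
  else
    # Any other value is coerced to a boolean covering all elements.
    [:all_elements, !!remove_contents]
  end
end

list = %w[script style]

# Old check: an Array is not a Set, so it falls through and is coerced to true.
p classify(list, Set)

# New check: Arrays and Sets are both Enumerable, so the list is honored.
p classify(list, Enumerable)
p classify(Set.new(list), Enumerable)
```
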

data/lib/sanitize/transformers/clean_element.rb CHANGED

@@ -67,7 +67,7 @@ class Sanitize; module Transformers; class CleanElement
       @whitespace_elements = config[:whitespace_elements]
     end
-    if config[:remove_contents].is_a?(Set)
+    if config[:remove_contents].is_a?(Enumerable)
       @remove_element_contents.merge(config[:remove_contents].map(&:to_s))
     else
       @remove_all_contents = !!config[:remove_contents]
@@ -97,8 +97,10 @@ class Sanitize; module Transformers; class CleanElement
         end
       end
-      unless @remove_all_contents || @remove_element_contents.include?(name)
-        node.add_previous_sibling(node.children)
+      unless node.children.empty?
+        unless @remove_all_contents || @remove_element_contents.include?(name)
+          node.add_previous_sibling(node.children)
+        end
       end
       node.unlink
@@ -166,6 +168,11 @@ class Sanitize; module Transformers; class CleanElement
         # affected attributes, some of which can exist on any element and some
         # of which can only exist on `<a>` elements.
         #
+        # This fix is technically no longer necessary with Nokogumbo >= 2.0
+        # since it no longer uses libxml2's serializer, but it's retained to
+        # avoid breaking use cases where people might be sanitizing individual
+        # Nokogiri nodes and then serializing them manually without Nokogumbo.
+        #
         # The relevant libxml2 code is here:
         # <https://github.com/GNOME/libxml2/commit/960f0e275616cadc29671a218d7fb9b69eb35588>
         if UNSAFE_LIBXML_ATTRS_GLOBAL.include?(attr_name) ||
@@ -180,6 +187,40 @@ class Sanitize; module Transformers; class CleanElement
     if @add_attributes.include?(name)
       @add_attributes[name].each {|key, val| node[key] = val }
     end
+    # Element-specific special cases.
+    case name
+    # If this is a whitelisted iframe that has children, remove all its
+    # children. The HTML standard says iframes shouldn't have content, but when
+    # they do, this content is parsed as text and is serialized verbatim without
+    # being escaped, which is unsafe because legacy browsers may still render it
+    # and execute `<script>` content. So the safe and correct thing to do is to
+    # always remove iframe content.
+    when 'iframe'
+      if !node.children.empty?
+        node.children.each do |child|
+          child.unlink
+        end
+      end
+    # Prevent the use of `<meta>` elements that set a charset other than UTF-8,
+    # since Sanitize's output is always UTF-8.
+    when 'meta'
+      if node.has_attribute?('charset') &&
+          node['charset'].downcase != 'utf-8'
+        node['charset'] = 'utf-8'
+      end
+      if node.has_attribute?('http-equiv') &&
+          node.has_attribute?('content') &&
+          node['http-equiv'].downcase == 'content-type' &&
+          node['content'].downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
+        node['content'] = node['content'].gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
+      end
+    end
   end
 end; end; end
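
Review note: the `http-equiv` branch of the new `when 'meta'` case uses two regexes, one to detect a non-UTF-8 charset and one to rewrite it. The sketch below applies just those two regexes to plain strings. `normalize_content` is a hypothetical helper, and Nokogiri attribute handling is not reproduced, so this shows only what the patterns themselves do.

```ruby
# Hypothetical helper wrapping the two regexes from the `when 'meta'` branch.
def normalize_content(content)
  # Detection runs against a downcased copy of the value.
  if content.downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
    # Replacement runs against the original value, without the /i flag.
    content.gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
  else
    content
  end
end

# A non-UTF-8 charset is rewritten.
puts normalize_content('text/html; charset=iso-8859-1')

# A UTF-8 charset is left untouched.
puts normalize_content('text/html; charset=utf-8')

# With an uppercase parameter name the detection matches, but the
# case-sensitive replacement pattern does not, so the value is unchanged.
puts normalize_content('text/html; CHARSET=iso-8859-1')
```

The last case suggests that, taken on their own, the patterns would not rewrite a `content` value whose `charset` parameter name is not lowercase; whether that matters in practice depends on how the value reaches this code.
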

data/lib/sanitize/version.rb CHANGED

@@ -1,5 +1,5 @@
 # encoding: utf-8
 class Sanitize
-  VERSION = '4.6.3'
+  VERSION = '5.1.0'
 end