sanitize 4.6.5 → 5.2.1

Potentially problematic release.

Files changed (23)

checksums.yaml +4 -4
data/HISTORY.md +154 -16
data/README.md +70 -41
data/lib/sanitize.rb +52 -67
data/lib/sanitize/config/default.rb +10 -4
data/lib/sanitize/css.rb +2 -2
data/lib/sanitize/transformers/clean_comment.rb +1 -1
data/lib/sanitize/transformers/clean_css.rb +3 -3
data/lib/sanitize/transformers/clean_doctype.rb +1 -1
data/lib/sanitize/transformers/clean_element.rb +54 -13
data/lib/sanitize/version.rb +1 -1
data/test/common.rb +0 -31
data/test/test_clean_comment.rb +1 -5
data/test/test_clean_css.rb +1 -1
data/test/test_clean_doctype.rb +8 -8
data/test/test_clean_element.rb +121 -26
data/test/test_malicious_html.rb +50 -7
data/test/test_parser.rb +3 -32
data/test/test_sanitize.rb +103 -18
data/test/test_sanitize_css.rb +43 -16
data/test/test_transformers.rb +29 -23
metadata +16 -18
data/test/test_unicode.rb +0 -95

checksums.yaml CHANGED

@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz: eab36cec7ac13bd15bd00b1141990e9efc35332c95391cb405128ddfe891e242
-  data.tar.gz: f69f77cf6febfa74b1bdc5103d245543f38ddcfe223d474dffab5913846525ec
+  metadata.gz: 3d1290690a9d32db9e06b8fb19c7e285c94a1d91ed51a4eb7e96389e427348d9
+  data.tar.gz: 5131063daf1763c83978954bed9ee3a783099e40aa71e50de26d06b8ae0c1054
 SHA512:
-  metadata.gz: 3358c2574bcdd0e3a8c08460f2dd31ecd3ade8e04ed60380f8037c46dd0f67321ac2c4ccace6e1f82080acecb2dc71054630c2c1fec54b2a99cb50c0476dd0b2
-  data.tar.gz: c10686ec8aacadf3268eafd407e4e2259e88deae363829d5ebe1b2877ed8e15c658bb86fe97b786d116c00b10e6eaff14baf5bb71a7a737ec507f3ab65f61187
+  metadata.gz: bfcb7cda6aa70590f642583b41936bc09d8929210046cebdd0d0ff452ccb3213844b4c40d4e205e79c0cd64a2a0d56e16790e38f4c8f247b8abfa32dbec22297
+  data.tar.gz: 0ea5a6d6848f9a125f17e4e23145adff4d3c4ccfe30a3407466fae074ed33cbd4b1869eb5a9f0a72b808449b8cf166a3695c2a6d63b16a83b047fd260bfe50bd
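
The new SHA256 values can be checked against a locally downloaded copy of the gem. The following is a minimal sketch, not part of the gem: it assumes the `.gem` archive (a plain tar file) has already been unpacked so that `metadata.gz` and `data.tar.gz` sit in the current directory, and the helper name `checksum_matches?` is hypothetical.

```ruby
require 'digest'

# Expected SHA256 digests for sanitize 5.2.1, copied from the "+" lines of
# the checksums.yaml diff.
EXPECTED_SHA256 = {
  'metadata.gz' => '3d1290690a9d32db9e06b8fb19c7e285c94a1d91ed51a4eb7e96389e427348d9',
  'data.tar.gz' => '5131063daf1763c83978954bed9ee3a783099e40aa71e50de26d06b8ae0c1054'
}.freeze

# Hypothetical helper: true when the file at `path` hashes to `expected_hex`.
def checksum_matches?(path, expected_hex)
  Digest::SHA256.file(path).hexdigest == expected_hex
end

# Assumed layout: both files are in the working directory. Files that are
# missing are skipped rather than reported as mismatches.
EXPECTED_SHA256.each do |name, hex|
  next unless File.exist?(name)
  puts "#{name}: #{checksum_matches?(name, hex) ? 'ok' : 'MISMATCH'}"
end
```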

data/HISTORY.md CHANGED

@@ -1,5 +1,123 @@
 # Sanitize History
+## 5.2.1 (2020-06-16)
+### Bug Fixes
+* Fixed an HTML sanitization bypass that could allow XSS. This issue affects
+  Sanitize versions 3.0.0 through 5.2.0.
+  When HTML was sanitized using the "relaxed" config or a custom config that
+  allows certain elements, some content in a `<math>` or `<svg>` element may not
+  have been sanitized correctly even if `math` and `svg` were not in the
+  allowlist. This could allow carefully crafted input to sneak arbitrary HTML
+  through Sanitize, potentially enabling an XSS (cross-site scripting) attack.
+  You are likely to be vulnerable to this issue if you use Sanitize's relaxed
+  config or a custom config that allows one or more of the following HTML
+  elements:
+    -   `iframe`
+    -   `math`
+    -   `noembed`
+    -   `noframes`
+    -   `noscript`
+    -   `plaintext`
+    -   `script`
+    -   `style`
+    -   `svg`
+    -   `xmp`
+  See the security advisory for more details, including a workaround if you're
+  not able to upgrade: [GHSA-p4x4-rw2p-8j8m]
+  Many thanks to Michał Bentkowski of Securitum for reporting this issue and
+  helping to verify the fix.
+[GHSA-p4x4-rw2p-8j8m]:https://github.com/rgrove/sanitize/security/advisories/GHSA-p4x4-rw2p-8j8m
+## 5.2.0 (2020-06-06)
+### Changes
+* The term "whitelist" has been replaced with "allowlist" throughout Sanitize's
+  source and documentation.
+  While the etymology of "whitelist" may not be explicitly racist in origin or
+  intent, there are inherent racial connotations in the implication that white
+  is good and black (as in "blacklist") is not.
+  This is a change I should have made long ago, and I apologize for not making
+  it sooner.
+* In transformer input, the `:is_whitelisted` and `:node_whitelist` keys are now
+  deprecated. New `:is_allowlisted` and `:node_allowlist` keys have been added.
+  The old keys will continue to work in order to avoid breaking existing code,
+  but they are no longer documented and may be removed in a future semver major
+  release.
+## 5.1.0 (2019-09-07)
+### Features
+* Added a `:parser_options` config hash, which makes it possible to pass custom
+  parsing options to Nokogumbo. [@austin-wang - #194][194]
+### Bug Fixes
+* Non-characters and non-whitespace control characters are now stripped from
+  HTML input before parsing to comply with the HTML Standard's [preprocessing
+  guidelines][html-preprocessing]. Prior to this Sanitize had adhered to [older
+  W3C guidelines][unicode-xml] that have since been withdrawn. [#179][179]
+[179]:https://github.com/rgrove/sanitize/issues/179
+[194]:https://github.com/rgrove/sanitize/pull/194
+[html-preprocessing]:https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+[unicode-xml]:https://www.w3.org/TR/unicode-xml/
+## 5.0.0 (2018-10-14)
+For most users, upgrading from 4.x shouldn't require any changes. However, the
+minimum required Ruby version has changed, and Sanitize 5.x's HTML output may
+differ in some small ways from 4.x's output. If this matters to you, please
+review the changes below carefully.
+### Potentially Breaking Changes
+* Ruby 2.3.0 is now the oldest officially supported Ruby version. Sanitize may
+  work in older 2.x Rubies, but they aren't actively tested. Sanitize definitely
+  no longer works in Ruby 1.9.x.
+* Upgraded to Nokogumbo 2.x, which fixes various bugs and adds
+  standard-compliant HTML serialization. [@stevecheckoway - #189][189]
+* Children of the following elements are now removed by default when these
+  elements are removed, rather than being preserved and escaped:
+  - `iframe`
+  - `noembed`
+  - `noframes`
+  - `noscript`
+  - `script`
+  - `style`
+* Children of allowlisted `iframe` elements are now always removed. In modern
+  HTML, `iframe` elements should never have children. In HTML 4 and earlier
+  `iframe` elements were allowed to contain fallback content for legacy
+  browsers, but it's been almost two decades since that was useful.
+* Fixed a bug that caused `:remove_contents` to behave as if it were set to
+  `true` when it was actually an Array.
+[189]:https://github.com/rgrove/sanitize/pull/189
+## 4.6.6 (2018-07-23)
+* Improved performance and memory usage by optimizing `Sanitize#transform_node!`
+  [@stanhu - #183][183]
+[183]:https://github.com/rgrove/sanitize/pull/183
 ## 4.6.5 (2018-05-16)
 * Improved performance slightly by tweaking the order of built-in transformers.
@@ -22,7 +140,7 @@
   When Sanitize <= 4.6.2 is used in combination with libxml2 >= 2.9.2, a
   specially crafted HTML fragment can cause libxml2 to generate improperly
-  escaped output, allowing non-whitelisted attributes to be used on whitelisted
+  escaped output, allowing non-allowlisted attributes to be used on allowlisted
   elements.
   Sanitize now performs additional escaping on affected attributes to prevent
@@ -66,7 +184,7 @@
 ## 4.4.0 (2016-09-29)
-* Added `srcset` to the attribute whitelist for `img` elements in the relaxed
+* Added `srcset` to the attribute allowlist for `img` elements in the relaxed
   config. [@ejtttje - #156][156]
 [156]:https://github.com/rgrove/sanitize/pull/156
@@ -187,7 +305,7 @@
 ## 3.0.4 (2014-12-12)
 * Fixed: Harmless whitespace preceding a URL protocol (such as " http://")
-  caused the URL to be removed even when the protocol was whitelisted.
+  caused the URL to be removed even when the protocol was allowlisted.
   [@benubois - #126][126]
 [126]:https://github.com/rgrove/sanitize/pull/126
@@ -196,7 +314,7 @@
 ## 3.0.3 (2014-10-29)
 * Fixed: Some CSS selectors weren't parsed correctly inside the body of a
-  `@media` block, causing them to be removed even when whitelist rules should
+  `@media` block, causing them to be removed even when allowlist rules should
   have allowed them to remain. [#121][121]
 [121]:https://github.com/rgrove/sanitize/issues/121
@@ -261,7 +379,7 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 * The `clean_node!` method was renamed to `node!`.
 * The `document` method now raises a `Sanitize::Error` if the `<html>` element
-  isn't whitelisted, rather than a `RuntimeError`. This error is also now raised
+  isn't allowlisted, rather than a `RuntimeError`. This error is also now raised
   regardless of the `:remove_contents` config setting.
 * The `:output` config has been removed. Output is now always HTML, not XHTML.
@@ -272,7 +390,7 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 * Added advanced CSS sanitization support using [Crass][crass], which is fully
   compliant with the CSS Syntax Module Level 3 parsing spec. The contents of
-  whitelisted `<style>` elements and `style` attributes in HTML will be
+  allowlisted `<style>` elements and `style` attributes in HTML will be
   sanitized as CSS, or you can use the `Sanitize::CSS` class to manually
   sanitize CSS stylesheets or properties.
@@ -317,9 +435,29 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 [n1008]:https://github.com/sparklemotion/nokogiri/issues/1008
+## 2.1.1 (2018-09-30)
+* [CVE-2018-3740][176]: Fixed an HTML injection vulnerability that could allow
+  XSS (backported from Sanitize 4.6.3). [@dometto - #188][188]
+  When Sanitize <= 2.1.0 is used in combination with libxml2 >= 2.9.2, a
+  specially crafted HTML fragment can cause libxml2 to generate improperly
+  escaped output, allowing non-allowlisted attributes to be used on allowlisted
+  elements.
+  Sanitize now performs additional escaping on affected attributes to prevent
+  this.
+  Many thanks to the Shopify Application Security Team for responsibly reporting
+  this issue.
+[176]:https://github.com/rgrove/sanitize/issues/176
+[188]:https://github.com/rgrove/sanitize/pull/188
 ## 2.1.0 (2014-01-13)
-* Added support for whitelisting arbitrary HTML5 `data-*` attributes. Use the
+* Added support for allowlisting arbitrary HTML5 `data-*` attributes. Use the
   symbol `:data` instead of an attribute name in the `:attributes` config to
   indicate that arbitrary data attributes should be allowed on an element.
@@ -400,12 +538,12 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
   the default depth-first mode.
 * Added the `abbr`, `dfn`, `kbd`, `mark`, `s`, `samp`, `time`, and `var`
-  elements to the whitelists for the basic and relaxed configs.
+  elements to the allowlists for the basic and relaxed configs.
 * Added the `bdo`, `del`, `figcaption`, `figure`, `hgroup`, `ins`, `rp`, `rt`,
-  `ruby`, and `wbr` elements to the whitelist for the relaxed config.
+  `ruby`, and `wbr` elements to the allowlist for the relaxed config.
-* The `dir`, `lang`, and `title` attributes are now whitelisted for all
+* The `dir`, `lang`, and `title` attributes are now allowlisted for all
   elements in the relaxed config.
 * Bumped minimum Nokogiri version to 1.4.4 to avoid a bug in 1.4.2+
@@ -416,7 +554,7 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 ## 1.2.1 (2010-04-20)
 * Added a `:remove_contents` config setting. If set to `true`, Sanitize will
-  remove the contents of all non-whitelisted elements in addition to the
+  remove the contents of all non-allowlisted elements in addition to the
   elements themselves. If set to an array of element names, Sanitize will
   remove the contents of only those elements (when filtered), and leave the
   contents of other filtered elements. [Thanks to Rafael Souza for the array
@@ -444,7 +582,7 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 * Added `Sanitize.clean_node!`, which sanitizes a `Nokogiri::XML::Node` and
   all its children.
-* Added elements `<h1>` through `<h6>` to the Relaxed whitelist. [Suggested by
+* Added elements `<h1>` through `<h6>` to the Relaxed allowlist. [Suggested by
   David Reese]
@@ -464,7 +602,7 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 * Added a workaround for an Hpricot bug that prevents attribute names from
   being downcased in recent versions of Hpricot. This was exploitable to
-  prevent non-whitelisted protocols from being cleaned. [Reported by Ben
+  prevent non-allowlisted protocols from being cleaned. [Reported by Ben
   Wanicur]
@@ -494,7 +632,7 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 ## 1.0.5 (2009-02-05)
-* Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
+* Fixed a bug introduced in version 1.0.3 that prevented non-allowlisted
   protocols from being cleaned when relative URLs were allowed. [Reported by
   Dev Purkayastha]
@@ -504,7 +642,7 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 ## 1.0.4 (2009-01-16)
-* Fixed a bug that made it possible to sneak a non-whitelisted element through
+* Fixed a bug that made it possible to sneak a non-allowlisted element through
   by repeating it several times in a row. All versions of Sanitize prior to
   1.0.4 are vulnerable. [Reported by Cristobal]
@@ -512,7 +650,7 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 ## 1.0.3 (2009-01-15)
 * Fixed a bug whereby incomplete Unicode or hex entities could be used to
-  prevent non-whitelisted protocols from being cleaned. Since IE6 and Opera
+  prevent non-allowlisted protocols from being cleaned. Since IE6 and Opera
   still decode the incomplete entities, users of those browsers may be
   vulnerable to malicious script injection on websites using versions of
   Sanitize prior to 1.0.3.
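
The 5.2.1 entry lists ten elements whose presence in an allowlist makes a config likely to be affected by the bypass. A quick audit of a custom config could look like the sketch below. The constant `RISKY_ELEMENTS` and the method `risky_allowed_elements` are hypothetical names introduced here for illustration; only the element list itself comes from the changelog.

```ruby
require 'set'

# Elements named in the 5.2.1 changelog entry. Allowing any of them in
# Sanitize 3.0.0 through 5.2.0 is described there as a likely sign of exposure.
RISKY_ELEMENTS = Set.new(
  %w[iframe math noembed noframes noscript plaintext script style svg xmp]
).freeze

# Hypothetical helper: returns the allowlisted element names from `config`
# that appear in RISKY_ELEMENTS. Accepts an Array or Set under :elements and
# tolerates a missing key.
def risky_allowed_elements(config)
  Array(config[:elements])
    .map { |name| name.to_s.downcase }
    .select { |name| RISKY_ELEMENTS.include?(name) }
end

custom_config = { :elements => %w[b em style svg] }
p risky_allowed_elements(custom_config) # => ["style", "svg"]
```

With the gem loaded, the same helper could be pointed at a built-in config such as `Sanitize::Config::RELAXED`. An empty result does not prove a config is safe; upgrading to 5.2.1 remains the fix described in the advisory.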

data/README.md CHANGED

@@ -1,20 +1,19 @@
 Sanitize
 ========
-Sanitize is a whitelist-based HTML and CSS sanitizer. Given a list of acceptable
-elements, attributes, and CSS properties, Sanitize will remove all unacceptable
-HTML and/or CSS from a string.
+Sanitize is an allowlist-based HTML and CSS sanitizer. It removes all HTML
+and/or CSS from a string except the elements, attributes, and properties you
+choose to allow.
 Using a simple configuration syntax, you can tell Sanitize to allow certain HTML
 elements, certain attributes within those elements, and even certain URL
-protocols within attributes that contain URLs. You can also whitelist CSS
-properties, @ rules, and URL protocols you wish to allow in elements or
-attributes containing CSS. Any HTML or CSS that you don't explicitly allow will
-be removed.
+protocols within attributes that contain URLs. You can also allow specific CSS
+properties, @ rules, and URL protocols in elements or attributes containing CSS.
+Any HTML or CSS that you don't explicitly allow will be removed.
 Sanitize is based on [Google's Gumbo HTML5 parser][gumbo], which parses HTML
 exactly the same way modern browsers do, and [Crass][crass], which parses CSS
-exactly the same way modern browsers do. As long as your whitelist config only
+exactly the same way modern browsers do. As long as your allowlist config only
 allows safe markup and CSS, even the most malformed or malicious input will be
 transformed into safe output.
@@ -73,6 +72,11 @@ Sanitize can sanitize the following types of input:
 * Standalone CSS stylesheets
 * Standalone CSS properties
+However, please note that Sanitize _cannot_ fully sanitize the contents of
+`<math>` or `<svg>` elements, since these elements don't follow the same parsing
+rules as the rest of HTML. If this is something you need, you may want to look
+for another solution.
 ### HTML Fragments
 A fragment is a snippet of HTML that doesn't contain a root-level `<html>`
@@ -88,7 +92,7 @@ Sanitize.fragment(html)
 # => 'foo'
 ```
-To keep certain elements, add them to the element whitelist.
+To keep certain elements, add them to the element allowlist.
 ```ruby
 Sanitize.fragment(html, :elements => ['b'])
@@ -97,7 +101,7 @@ Sanitize.fragment(html, :elements => ['b'])
 ### HTML Documents
-When sanitizing a document, the `<html>` element must be whitelisted. You can
+When sanitizing a document, the `<html>` element must be allowlisted. You can
 also set `:allow_doctype` to `true` to allow well-formed document type
 definitions.
@@ -123,8 +127,8 @@ Sanitize.document(html,
 ### CSS in HTML
-To sanitize CSS in an HTML fragment or document, first whitelist the `<style>`
-element and/or the `style` attribute. Then whitelist the CSS properties,
+To sanitize CSS in an HTML fragment or document, first allowlist the `<style>`
+element and/or the `style` attribute. Then allowlist the CSS properties,
 @ rules, and URL protocols you wish to allow. You can also choose whether to
 allow CSS comments or browser compatibility hacks.
@@ -267,7 +271,7 @@ new copy using `Sanitize::Config.merge()`, like so:
 ```ruby
 # Create a customized copy of the Basic config, adding <div> and <table> to the
-# existing whitelisted elements.
+# existing allowlisted elements.
 Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
   :elements        => Sanitize::Config::BASIC[:elements] + ['div', 'table'],
   :remove_contents => true
@@ -395,8 +399,7 @@ Proc.new { |url| url.start_with?("https://fonts.googleapis.com") }
 ##### :css => :properties (Array or Set)
-Whitelist of CSS property names to allow. Names should be specified in
-lowercase.
+List of CSS property names to allow. Names should be specified in lowercase.
 ##### :css => :protocols (Array or Set)
@@ -417,6 +420,23 @@ elements not in this array will be removed.
 ]
 ```
+**Warning:** Sanitize cannot fully sanitize the contents of `<math>` or `<svg>`
+elements, since these elements don't follow the same parsing rules as the rest
+of HTML. If you add `math` or `svg` to the allowlist, you must assume that any
+content inside them will be allowed, even if that content would otherwise be
+removed by Sanitize.
+#### :parser_options (Hash)
+[Parsing options](https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options) supplied to `nokogumbo`.
+```ruby
+:parser_options => {
+  max_errors: -1,
+  max_tree_depth: -1
+}
+```
 #### :protocols (Hash)
 URL protocols to allow in specific attributes. If an attribute is listed here
@@ -441,13 +461,13 @@ include the symbol `:relative` in the protocol array:
 #### :remove_contents (boolean or Array or Set)
-If set to `true`, Sanitize will remove the contents of any non-whitelisted
+If this is `true`, Sanitize will remove the contents of any non-allowlisted
 elements in addition to the elements themselves. By default, Sanitize leaves the
 safe parts of an element's contents behind when the element is removed.
-If set to an array of element names, then only the contents of the specified
-elements (when filtered) will be removed, and the contents of all other filtered
-elements will be left behind.
+If this is an Array or Set of element names, then only the contents of the
+specified elements (when filtered) will be removed, and the contents of all
+other filtered elements will be left behind.
 The default value is `false`.
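
As an illustration of the two forms described in this hunk, the config fragments below contrast `true` with a list of element names. The variable names and the choice of `b`, `script`, and `style` are assumptions made for the example, not values taken from the gem.

```ruby
require 'set'

# Remove the contents of every element that gets filtered out.
remove_all_contents = {
  :elements        => ['b'],
  :remove_contents => true
}

# Remove the contents of filtered <script> and <style> elements only. The
# contents of other filtered elements are left behind. A Set works here too.
remove_some_contents = {
  :elements        => ['b'],
  :remove_contents => Set.new(['script', 'style'])
}
```

Either Hash would be passed as the second argument to `Sanitize.fragment`.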
@@ -474,6 +494,15 @@ children, in which case it will be inserted after those children.
 }
 ```
+The default elements with whitespace added before and after are:
+```
+address article aside blockquote br dd div dl dt
+footer h1 h2 h3 h4 h5 h6 header hgroup hr li nav
+ol p pre section ul
+```
 ## Transformers
 Transformers allow you to filter and modify HTML nodes using your own custom
@@ -498,33 +527,33 @@ argument a Hash that contains the following items:
   * **:config** - The current Sanitize configuration Hash.
-  * **:is_whitelisted** - `true` if the current node has been whitelisted by a
+  * **:is_allowlisted** - `true` if the current node has been allowlisted by a
     previous transformer, `false` otherwise. It's generally bad form to remove
-    a node that a previous transformer has whitelisted.
+    a node that a previous transformer has allowlisted.
   * **:node** - A `Nokogiri::XML::Node` object representing an HTML node. The
     node may be an element, a text node, a comment, a CDATA node, or a document
     fragment. Use Nokogiri's inspection methods (`element?`, `text?`, etc.) to
     selectively ignore node types you aren't interested in.
+  * **:node_allowlist** - Set of `Nokogiri::XML::Node` objects in the current
+    document that have been allowlisted by previous transformers, if any. It's
+    generally bad form to remove a node that a previous transformer has
+    allowlisted.
   * **:node_name** - The name of the current HTML node, always lowercase (e.g.
     "div" or "span"). For non-element nodes, the name will be something like
     "text", "comment", "#cdata-section", "#document-fragment", etc.
-  * **:node_whitelist** - Set of `Nokogiri::XML::Node` objects in the current
-    document that have been whitelisted by previous transformers, if any. It's
-    generally bad form to remove a node that a previous transformer has
-    whitelisted.
 ### Output
 A transformer doesn't have to return anything, but may optionally return a Hash,
 which may contain the following items:
-  * **:node_whitelist** -  Array or Set of specific Nokogiri::XML::Node objects
-    to add to the document's whitelist, bypassing the current Sanitize config.
-    These specific nodes and all their attributes will be whitelisted, but
-    their children will not be.
+  * **:node_allowlist** -  Array or Set of specific `Nokogiri::XML::Node`
+    objects to add to the document's allowlist, bypassing the current Sanitize
+    config. These specific nodes and all their attributes will be allowlisted,
+    but their children will not be.
 If a transformer returns anything other than a Hash, the return value will be
 ignored.
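
Because 5.2.0 renames the transformer input keys but keeps the old ones working, a transformer that has to run under both 4.x and 5.2.x could read whichever key is present. The sketch below shows one way to do that. `already_allowlisted?` and `skip_allowlisted` are hypothetical helpers, not part of Sanitize's API, and the code assumes only what the changelog states: newer versions supply `:is_allowlisted`, older ones supply `:is_whitelisted`.

```ruby
# True if an earlier transformer allowlisted the current node. Prefers the
# new key and falls back to the deprecated one only when the new key is absent.
def already_allowlisted?(env)
  flag = env.key?(:is_allowlisted) ? env[:is_allowlisted] : env[:is_whitelisted]
  flag ? true : false
end

# Wraps a transformer so that it is skipped for nodes that are already
# allowlisted. The wrapper returns nil in that case, which Sanitize ignores.
def skip_allowlisted(transformer)
  lambda do |env|
    next if already_allowlisted?(env)
    transformer.call(env)
  end
end
```

A wrapped transformer is passed in the `:transformers` config array like any other lambda.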
@@ -567,16 +596,16 @@ Transformers have a tremendous amount of power, including the power to
 completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
 your own hands.
-### Example: Transformer to whitelist image URLs by domain
+### Example: Transformer to allow image URLs by domain
 The following example demonstrates how to remove image elements unless they use
 a relative URL or are hosted on a specific domain. It assumes that the `<img>`
-element and its `src` attribute are already whitelisted.
+element and its `src` attribute are already allowlisted.
 ```ruby
 require 'uri'
-image_whitelist_transformer = lambda do |env|
+image_allowlist_transformer = lambda do |env|
   # Ignore everything except <img> elements.
   return unless env[:node_name] == 'img'
@@ -592,20 +621,20 @@ image_whitelist_transformer = lambda do |env|
 end
 ```
-### Example: Transformer to whitelist YouTube video embeds
+### Example: Transformer to allow YouTube video embeds
 The following example demonstrates how to create a transformer that will safely
-whitelist valid YouTube video embeds without having to blindly allow other kinds
-of embedded content, which would be the case if you tried to do this by just
-whitelisting all `<iframe>` elements:
+allow valid YouTube video embeds without having to allow other kinds of embedded
+content, which would be the case if you tried to do this by just allowing all
+`<iframe>` elements:
 ```ruby
 youtube_transformer = lambda do |env|
   node      = env[:node]
   node_name = env[:node_name]
-  # Don't continue if this node is already whitelisted or is not an element.
-  return if env[:is_whitelisted] || !node.element?
+  # Don't continue if this node is already allowlisted or is not an element.
+  return if env[:is_allowlisted] || !node.element?
   # Don't continue unless the node is an iframe.
   return unless node_name == 'iframe'
@@ -626,8 +655,8 @@ youtube_transformer = lambda do |env|
   # Now that we're sure that this is a valid YouTube embed and that there are
   # no unwanted elements or attributes hidden inside it, we can tell Sanitize
-  # to whitelist the current node.
-  {:node_whitelist => [node]}
+  # to allowlist the current node.
+  {:node_allowlist => [node]}
 end
 html = %[