sanitize 4.6.4 → 6.0.2

data/README.md CHANGED
@@ -1,38 +1,36 @@
  Sanitize
  ========

- Sanitize is a whitelist-based HTML and CSS sanitizer. Given a list of acceptable
- elements, attributes, and CSS properties, Sanitize will remove all unacceptable
- HTML and/or CSS from a string.
+ Sanitize is an allowlist-based HTML and CSS sanitizer. It removes all HTML
+ and/or CSS from a string except the elements, attributes, and properties you
+ choose to allow.

  Using a simple configuration syntax, you can tell Sanitize to allow certain HTML
  elements, certain attributes within those elements, and even certain URL
- protocols within attributes that contain URLs. You can also whitelist CSS
- properties, @ rules, and URL protocols you wish to allow in elements or
- attributes containing CSS. Any HTML or CSS that you don't explicitly allow will
- be removed.
-
- Sanitize is based on [Google's Gumbo HTML5 parser][gumbo], which parses HTML
- exactly the same way modern browsers do, and [Crass][crass], which parses CSS
- exactly the same way modern browsers do. As long as your whitelist config only
- allows safe markup and CSS, even the most malformed or malicious input will be
- transformed into safe output.
-
- [![Build Status](https://travis-ci.org/rgrove/sanitize.svg?branch=master)](https://travis-ci.org/rgrove/sanitize)
+ protocols within attributes that contain URLs. You can also allow specific CSS
+ properties, @ rules, and URL protocols in elements or attributes containing CSS.
+ Any HTML or CSS that you don't explicitly allow will be removed.
+
+ Sanitize is based on the [Nokogiri HTML5 parser][nokogiri], which parses HTML
+ the same way modern browsers do, and [Crass][crass], which parses CSS the same
+ way modern browsers do. As long as your allowlist config only allows safe markup
+ and CSS, even the most malformed or malicious input will be transformed into
+ safe output.
+
  [![Gem Version](https://badge.fury.io/rb/sanitize.svg)](http://badge.fury.io/rb/sanitize)
+ [![Tests](https://github.com/rgrove/sanitize/workflows/Tests/badge.svg)](https://github.com/rgrove/sanitize/actions?query=workflow%3ATests)

  [crass]:https://github.com/rgrove/crass
- [gumbo]:https://github.com/google/gumbo-parser
+ [nokogiri]:https://github.com/sparklemotion/nokogiri

  Links
  -----

  * [Home](https://github.com/rgrove/sanitize/)
- * [API Docs](http://rubydoc.info/github/rgrove/sanitize/master)
+ * [API Docs](https://rubydoc.info/github/rgrove/sanitize/Sanitize)
  * [Issues](https://github.com/rgrove/sanitize/issues)
- * [Release History](https://github.com/rgrove/sanitize/blob/master/HISTORY.md#sanitize-history)
- * [Online Demo](https://sanitize.herokuapp.com/)
- * [Biased comparison of Ruby HTML sanitization libraries](https://github.com/rgrove/sanitize/blob/master/COMPARISON.md)
+ * [Release History](https://github.com/rgrove/sanitize/releases)
+ * [Online Demo](https://sanitize-web.fly.dev/)

  Installation
  -------------
@@ -73,6 +71,12 @@ Sanitize can sanitize the following types of input:
  * Standalone CSS stylesheets
  * Standalone CSS properties

+ > **Warning**
+ >
+ > Sanitize cannot fully sanitize the contents of `<math>` or `<svg>` elements. MathML and SVG elements are [foreign elements](https://html.spec.whatwg.org/multipage/syntax.html#foreign-elements) that don't follow normal HTML parsing rules.
+ >
+ > By default, Sanitize will remove all MathML and SVG elements. If you add MathML or SVG elements to a custom element allowlist, you may create a security vulnerability in your application.
+
  ### HTML Fragments

  A fragment is a snippet of HTML that doesn't contain a root-level `<html>`
@@ -88,7 +92,7 @@ Sanitize.fragment(html)
  # => 'foo'
  ```

- To keep certain elements, add them to the element whitelist.
+ To keep certain elements, add them to the element allowlist.

  ```ruby
  Sanitize.fragment(html, :elements => ['b'])
@@ -97,7 +101,7 @@ Sanitize.fragment(html, :elements => ['b'])

  ### HTML Documents

- When sanitizing a document, the `<html>` element must be whitelisted. You can
+ When sanitizing a document, the `<html>` element must be allowlisted. You can
  also set `:allow_doctype` to `true` to allow well-formed document type
  definitions.

@@ -123,8 +127,8 @@ Sanitize.document(html,

  ### CSS in HTML

- To sanitize CSS in an HTML fragment or document, first whitelist the `<style>`
- element and/or the `style` attribute. Then whitelist the CSS properties,
+ To sanitize CSS in an HTML fragment or document, first allowlist the `<style>`
+ element and/or the `style` attribute. Then allowlist the CSS properties,
  @ rules, and URL protocols you wish to allow. You can also choose whether to
  allow CSS comments or browser compatibility hacks.

@@ -267,7 +271,7 @@ new copy using `Sanitize::Config.merge()`, like so:

  ```ruby
  # Create a customized copy of the Basic config, adding <div> and <table> to the
- # existing whitelisted elements.
+ # existing allowlisted elements.
  Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
  :elements => Sanitize::Config::BASIC[:elements] + ['div', 'table'],
  :remove_contents => true
@@ -395,8 +399,7 @@ Proc.new { |url| url.start_with?("https://fonts.googleapis.com") }

  ##### :css => :properties (Array or Set)

- Whitelist of CSS property names to allow. Names should be specified in
- lowercase.
+ List of CSS property names to allow. Names should be specified in lowercase.

  ##### :css => :protocols (Array or Set)

@@ -417,6 +420,29 @@ elements not in this array will be removed.
  ]
  ```

+ > **Warning**
+ >
+ > Sanitize cannot fully sanitize the contents of `<math>` or `<svg>` elements. MathML and SVG elements are [foreign elements](https://html.spec.whatwg.org/multipage/syntax.html#foreign-elements) that don't follow normal HTML parsing rules.
+ >
+ > By default, Sanitize will remove all MathML and SVG elements. If you add MathML or SVG elements to a custom element allowlist, you must assume that any content inside them will be allowed, even if that content would otherwise be removed or escaped by Sanitize. This may create a security vulnerability in your application.
+
+ > **Note**
+ >
+ > Sanitize always removes `<noscript>` elements and their contents, even if `noscript` is in the allowlist.
+ >
+ > This is because a `<noscript>` element's content is parsed differently in browsers depending on whether or not scripting is enabled. Since Nokogiri doesn't support scripting, it always parses `<noscript>` elements as if scripting is disabled. This results in edge cases where it's not possible to reliably sanitize the contents of a `<noscript>` element because Nokogiri can't fully replicate the parsing behavior of a scripting-enabled browser.
+
+ #### :parser_options (Hash)
+
+ [Parsing options](https://github.com/rubys/nokogumbo/tree/master#parsing-options) to be supplied to `nokogumbo`.
+
+ ```ruby
+ :parser_options => {
+ max_errors: -1,
+ max_tree_depth: -1
+ }
+ ```
+
  #### :protocols (Hash)

  URL protocols to allow in specific attributes. If an attribute is listed here
@@ -441,15 +467,15 @@ include the symbol `:relative` in the protocol array:

  #### :remove_contents (boolean or Array or Set)

- If set to `true`, Sanitize will remove the contents of any non-whitelisted
+ If this is `true`, Sanitize will remove the contents of any non-allowlisted
  elements in addition to the elements themselves. By default, Sanitize leaves the
  safe parts of an element's contents behind when the element is removed.

- If set to an array of element names, then only the contents of the specified
- elements (when filtered) will be removed, and the contents of all other filtered
- elements will be left behind.
+ If this is an Array or Set of element names, then only the contents of the
+ specified elements (when filtered) will be removed, and the contents of all
+ other filtered elements will be left behind.

- The default value is `false`.
+ The default value is `%w[iframe math noembed noframes noscript plaintext script style svg xmp]`.

  #### :transformers (Array or callable)

@@ -474,6 +500,15 @@ children, in which case it will be inserted after those children.
  }
  ```

+ The default elements with whitespace added before and after are:
+
+ ```
+ address article aside blockquote br dd div dl dt
+ footer h1 h2 h3 h4 h5 h6 header hgroup hr li nav
+ ol p pre section ul
+
+ ```
+
  ## Transformers

  Transformers allow you to filter and modify HTML nodes using your own custom
@@ -498,33 +533,33 @@ argument a Hash that contains the following items:

  * **:config** - The current Sanitize configuration Hash.

- * **:is_whitelisted** - `true` if the current node has been whitelisted by a
+ * **:is_allowlisted** - `true` if the current node has been allowlisted by a
  previous transformer, `false` otherwise. It's generally bad form to remove
- a node that a previous transformer has whitelisted.
+ a node that a previous transformer has allowlisted.

  * **:node** - A `Nokogiri::XML::Node` object representing an HTML node. The
  node may be an element, a text node, a comment, a CDATA node, or a document
  fragment. Use Nokogiri's inspection methods (`element?`, `text?`, etc.) to
  selectively ignore node types you aren't interested in.

+ * **:node_allowlist** - Set of `Nokogiri::XML::Node` objects in the current
+ document that have been allowlisted by previous transformers, if any. It's
+ generally bad form to remove a node that a previous transformer has
+ allowlisted.
+
  * **:node_name** - The name of the current HTML node, always lowercase (e.g.
  "div" or "span"). For non-element nodes, the name will be something like
  "text", "comment", "#cdata-section", "#document-fragment", etc.

- * **:node_whitelist** - Set of `Nokogiri::XML::Node` objects in the current
- document that have been whitelisted by previous transformers, if any. It's
- generally bad form to remove a node that a previous transformer has
- whitelisted.
-
  ### Output

  A transformer doesn't have to return anything, but may optionally return a Hash,
  which may contain the following items:

- * **:node_whitelist** - Array or Set of specific Nokogiri::XML::Node objects
- to add to the document's whitelist, bypassing the current Sanitize config.
- These specific nodes and all their attributes will be whitelisted, but
- their children will not be.
+ * **:node_allowlist** - Array or Set of specific `Nokogiri::XML::Node`
+ objects to add to the document's allowlist, bypassing the current Sanitize
+ config. These specific nodes and all their attributes will be allowlisted,
+ but their children will not be.

  If a transformer returns anything other than a Hash, the return value will be
  ignored.
@@ -567,16 +602,16 @@ Transformers have a tremendous amount of power, including the power to
  completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
  your own hands.

- ### Example: Transformer to whitelist image URLs by domain
+ ### Example: Transformer to allow image URLs by domain

  The following example demonstrates how to remove image elements unless they use
  a relative URL or are hosted on a specific domain. It assumes that the `<img>`
- element and its `src` attribute are already whitelisted.
+ element and its `src` attribute are already allowlisted.

  ```ruby
  require 'uri'

- image_whitelist_transformer = lambda do |env|
+ image_allowlist_transformer = lambda do |env|
  # Ignore everything except <img> elements.
  return unless env[:node_name] == 'img'

@@ -592,20 +627,20 @@ image_whitelist_transformer = lambda do |env|
  end
  ```

- ### Example: Transformer to whitelist YouTube video embeds
+ ### Example: Transformer to allow YouTube video embeds

  The following example demonstrates how to create a transformer that will safely
- whitelist valid YouTube video embeds without having to blindly allow other kinds
- of embedded content, which would be the case if you tried to do this by just
- whitelisting all `<iframe>` elements:
+ allow valid YouTube video embeds without having to allow other kinds of embedded
+ content, which would be the case if you tried to do this by just allowing all
+ `<iframe>` elements:

  ```ruby
  youtube_transformer = lambda do |env|
  node = env[:node]
  node_name = env[:node_name]

- # Don't continue if this node is already whitelisted or is not an element.
- return if env[:is_whitelisted] || !node.element?
+ # Don't continue if this node is already allowlisted or is not an element.
+ return if env[:is_allowlisted] || !node.element?

  # Don't continue unless the node is an iframe.
  return unless node_name == 'iframe'
@@ -626,8 +661,8 @@ youtube_transformer = lambda do |env|

  # Now that we're sure that this is a valid YouTube embed and that there are
  # no unwanted elements or attributes hidden inside it, we can tell Sanitize
- # to whitelist the current node.
- {:node_whitelist => [node]}
+ # to allowlist the current node.
+ {:node_allowlist => [node]}
  end

  html = %[
@@ -638,25 +673,3 @@ html = %[
  Sanitize.fragment(html, :transformers => youtube_transformer)
  # => '<iframe width="420" height="315" src="//www.youtube.com/embed/dQw4w9WgXcQ" frameborder="0" allowfullscreen=""></iframe>'
  ```
-
- License
- -------
-
- Copyright (c) 2015 Ryan Grove (ryan@wonko.com)
-
- Permission is hereby granted, free of charge, to any person obtaining a copy of
- this software and associated documentation files (the 'Software'), to deal in
- the Software without restriction, including without limitation the rights to
- use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
- the Software, and to permit persons to whom the Software is furnished to do so,
- subject to the following conditions:
-
- The above copyright notice and this permission notice shall be included in all
- copies or substantial portions of the Software.
-
- THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
- FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
- COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
- IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
- CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -54,8 +54,17 @@ class Sanitize

  # HTML elements to allow. By default, no elements are allowed (which means
  # that all HTML will be stripped).
+ #
+ # Warning: Sanitize cannot safely sanitize the contents of foreign
+ # elements (elements in the MathML or SVG namespaces). Do not add `math`
+ # or `svg` to this list! If you do, you may create a security
+ # vulnerability in your application.
  :elements => [],

+ # HTML parsing options to pass to Nokogumbo.
+ # https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options
+ :parser_options => {},
+
  # URL handling protocols to allow in specific attributes. By default, no
  # protocols are allowed. Use :relative in place of a protocol if you want
  # to allow relative URLs sans protocol.
@@ -66,10 +75,12 @@ class Sanitize
  # leaves the safe parts of an element's contents behind when the element
  # is removed.
  #
- # If this is an Array of element names, then only the contents of the
- # specified elements (when filtered) will be removed, and the contents of
- # all other filtered elements will be left behind.
- :remove_contents => false,
+ # If this is an Array or Set of element names, then only the contents of
+ # the specified elements (when filtered) will be removed, and the contents
+ # of all other filtered elements will be left behind.
+ :remove_contents => %w[
+ iframe math noembed noframes noscript plaintext script style svg xmp
+ ],

  # Transformers allow you to filter or alter nodes using custom logic. See
  # README.md for details and examples.
@@ -6,7 +6,7 @@ class Sanitize
  :elements => BASIC[:elements] + %w[
  address article aside bdi bdo body caption col colgroup data del div
  figcaption figure footer h1 h2 h3 h4 h5 h6 head header hgroup hr html
- img ins main nav rp rt ruby section span style summary sup table tbody
+ img ins main nav rp rt ruby section span style summary table tbody
  td tfoot th thead title tr wbr
  ],

data/lib/sanitize/css.rb CHANGED
@@ -175,7 +175,7 @@ class Sanitize; class CSS
  next prop

  when :semicolon
- # Only preserve the semicolon if it was preceded by a whitelisted
+ # Only preserve the semicolon if it was preceded by an allowlisted
  # property. Otherwise, omit it in order to prevent redundant semicolons.
  if preceded_by_property
  preceded_by_property = false
@@ -296,7 +296,7 @@ class Sanitize; class CSS
  end

  # Returns `true` if the given node (which may be of type `:url` or
- # `:function`, since the CSS syntax can produce both) uses a whitelisted
+ # `:function`, since the CSS syntax can produce both) uses an allowlisted
  # protocol.
  def valid_url?(node)
  type = node[:node]
@@ -6,7 +6,7 @@ class Sanitize; module Transformers
  node = env[:node]

  if node.type == Nokogiri::XML::Node::COMMENT_NODE
- node.unlink unless env[:is_whitelisted]
+ node.unlink unless env[:is_allowlisted]
  end
  end

@@ -1,6 +1,6 @@
  class Sanitize; module Transformers; module CSS

- # Enforces a CSS whitelist on the contents of `style` attributes.
+ # Enforces a CSS allowlist on the contents of `style` attributes.
  class CleanAttribute
  def initialize(sanitizer_or_config)
  if Sanitize::CSS === sanitizer_or_config
@@ -14,7 +14,7 @@ class CleanAttribute
  node = env[:node]

  return unless node.type == Nokogiri::XML::Node::ELEMENT_NODE &&
- node.key?('style') && !env[:is_whitelisted]
+ node.key?('style') && !env[:is_allowlisted]

  attr = node.attribute('style')
  css = @scss.properties(attr.value)
@@ -27,7 +27,7 @@ class CleanAttribute
  end
  end

- # Enforces a CSS whitelist on the contents of `<style>` elements.
+ # Enforces a CSS allowlist on the contents of `<style>` elements.
  class CleanElement
  def initialize(sanitizer_or_config)
  if Sanitize::CSS === sanitizer_or_config
@@ -48,6 +48,7 @@ class CleanElement
  if css.strip.empty?
  node.unlink
  else
+ css.gsub!('</', '<\/')
  node.children.unlink
  node << Nokogiri::XML::Text.new(css, node.document)
  end
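The `css.gsub!('</', '<\/')` line added in this hunk can be tried in isolation. What follows is a minimal sketch, assuming the intent is to keep a sanitized stylesheet from containing a literal `</style>` sequence when it is written back into the `<style>` element. The helper name `escape_style_close` is hypothetical and is not part of Sanitize.

```ruby
# Hypothetical helper (not part of Sanitize): mirrors the gsub! call added to
# CleanElement, but returns a new string instead of mutating its argument.
def escape_style_close(css)
  # In CSS, a backslash followed by `/` is an escaped `/`, so the stylesheet
  # keeps its meaning while the literal sequence `</` disappears from the text.
  css.gsub('</', '<\/')
end

css = 'a { content: "</style><script>alert(1)</script>" }'
puts escape_style_close(css)
```

The non-mutating `gsub` is used here only so the sketch is safe to call on frozen string literals; the diff itself uses `gsub!` on a local string.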
@@ -3,7 +3,7 @@
  class Sanitize; module Transformers

  CleanDoctype = lambda do |env|
- return if env[:is_whitelisted]
+ return if env[:is_allowlisted]

  node = env[:node]

@@ -1,5 +1,6 @@
  # encoding: utf-8

+ require 'cgi'
  require 'set'

  class Sanitize; module Transformers; class CleanElement
@@ -18,6 +19,18 @@ class Sanitize; module Transformers; class CleanElement
  # http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#embedding-custom-non-visible-data-with-the-data-*-attributes
  REGEX_DATA_ATTR = /\Adata-(?!xml)[a-z_][\w.\u00E0-\u00F6\u00F8-\u017F\u01DD-\u02AF-]*\z/u

+ # Elements whose content is treated as unescaped text by HTML parsers.
+ UNESCAPED_TEXT_ELEMENTS = Set.new(%w[
+ iframe
+ noembed
+ noframes
+ noscript
+ plaintext
+ script
+ style
+ xmp
+ ])
+
  # Attributes that need additional escaping on `<a>` elements due to unsafe
  # libxml2 behavior.
  UNSAFE_LIBXML_ATTRS_A = Set.new(%w[
@@ -67,7 +80,7 @@ class Sanitize; module Transformers; class CleanElement
  @whitespace_elements = config[:whitespace_elements]
  end

- if config[:remove_contents].is_a?(Set)
+ if config[:remove_contents].is_a?(Enumerable)
  @remove_element_contents.merge(config[:remove_contents].map(&:to_s))
  else
  @remove_all_contents = !!config[:remove_contents]
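The switch from `is_a?(Set)` to `is_a?(Enumerable)` matters because the new default for `:remove_contents` is a plain Array, and an Array is Enumerable but not a Set. The branch can be sketched outside the class as follows. This is an illustration only: the function name `interpret_remove_contents` and its return shape are hypothetical, and the initial values (`false` and an empty Set) are assumed rather than shown in the hunk.

```ruby
require 'set'

# Hypothetical stand-in for the branch in CleanElement#initialize. Returns
# [remove_all_contents, remove_element_contents] for a :remove_contents value.
def interpret_remove_contents(value)
  remove_all = false   # assumed starting value
  names = Set.new      # assumed starting value

  if value.is_a?(Enumerable)
    # Arrays and Sets both include Enumerable, so either form is accepted.
    names.merge(value.map(&:to_s))
  else
    # Anything else (true, false, nil) is coerced to a boolean.
    remove_all = !!value
  end

  [remove_all, names]
end

p interpret_remove_contents(%w[script style])
p interpret_remove_contents(Set.new([:iframe]))
p interpret_remove_contents(true)
```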
@@ -76,11 +89,11 @@ class Sanitize; module Transformers; class CleanElement

  def call(env)
  node = env[:node]
- return if node.type != Nokogiri::XML::Node::ELEMENT_NODE || env[:is_whitelisted]
+ return if node.type != Nokogiri::XML::Node::ELEMENT_NODE || env[:is_allowlisted]

  name = env[:node_name]

- # Delete any element that isn't in the config whitelist, unless the node has
+ # Delete any element that isn't in the config allowlist, unless the node has
  # already been deleted from the document.
  #
  # It's important that we not try to reparent the children of a node that has
@@ -97,42 +110,41 @@ class Sanitize; module Transformers; class CleanElement
  end
  end

- unless @remove_all_contents || @remove_element_contents.include?(name)
- node.add_previous_sibling(node.children)
+ unless node.children.empty?
+ unless @remove_all_contents || @remove_element_contents.include?(name)
+ node.add_previous_sibling(node.children)
+ end
  end

  node.unlink
  return
  end

- attr_whitelist = @attributes[name] || @attributes[:all]
+ attr_allowlist = @attributes[name] || @attributes[:all]

- if attr_whitelist.nil?
- # Delete all attributes from elements with no whitelisted attributes.
+ if attr_allowlist.nil?
+ # Delete all attributes from elements with no allowlisted attributes.
  node.attribute_nodes.each {|attr| attr.unlink }
  else
- allow_data_attributes = attr_whitelist.include?(:data)
+ allow_data_attributes = attr_allowlist.include?(:data)

  # Delete any attribute that isn't allowed on this element.
  node.attribute_nodes.each do |attr|
  attr_name = attr.name.downcase

- unless attr_whitelist.include?(attr_name)
- # The attribute isn't whitelisted.
+ unless attr_allowlist.include?(attr_name)
+ # The attribute isn't in the allowlist, but may still be allowed if
+ # it's a data attribute.

- if allow_data_attributes && attr_name.start_with?('data-')
- # Arbitrary data attributes are allowed. If this is a data
- # attribute, continue.
- next if attr_name =~ REGEX_DATA_ATTR
+ unless allow_data_attributes && attr_name.start_with?('data-') && attr_name =~ REGEX_DATA_ATTR
+ # Either the attribute isn't a data attribute or arbitrary data
+ # attributes aren't allowed. Remove the attribute.
+ attr.unlink
+ next
  end
-
- # Either the attribute isn't a data attribute or arbitrary data
- # attributes aren't allowed. Remove the attribute.
- attr.unlink
- next
  end

- # The attribute is whitelisted.
+ # The attribute is allowed.

  # Remove any attributes that use unacceptable protocols.
  if @protocols.include?(name) && @protocols[name].include?(attr_name)
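The restructured data-attribute check in this hunk folds three conditions into one `unless`. A small standalone sketch of that predicate is given below, using the `REGEX_DATA_ATTR` constant as it appears elsewhere in this diff. The helper name `remove_unlisted_attr?` is hypothetical, and the sketch covers only attributes that are already known not to be in the allowlist.

```ruby
# Copied from the CleanElement source shown in this diff.
REGEX_DATA_ATTR = /\Adata-(?!xml)[a-z_][\w.\u00E0-\u00F6\u00F8-\u017F\u01DD-\u02AF-]*\z/u

# Hypothetical helper: true when an attribute that is NOT in the allowlist
# should be removed, following the combined `unless` condition in the hunk.
def remove_unlisted_attr?(attr_name, allow_data_attributes)
  !(allow_data_attributes &&
    attr_name.start_with?('data-') &&
    attr_name =~ REGEX_DATA_ATTR)
end

%w[data-id data-xml-id data- onclick].each do |name|
  puts "#{name}: remove=#{remove_unlisted_attr?(name, true)}"
end
```

With `allow_data_attributes` set to `false`, every unlisted attribute is removed regardless of its name.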
@@ -160,12 +172,17 @@ class Sanitize; module Transformers; class CleanElement
  # libxml2 >= 2.9.2 doesn't escape comments within some attributes, in an
  # attempt to preserve server-side includes. This can result in XSS since
  # an unescaped double quote can allow an attacker to inject a
- # non-whitelisted attribute.
+ # non-allowlisted attribute.
  #
  # Sanitize works around this by implementing its own escaping for
  # affected attributes, some of which can exist on any element and some
  # of which can only exist on `<a>` elements.
  #
+ # This fix is technically no longer necessary with Nokogumbo >= 2.0
+ # since it no longer uses libxml2's serializer, but it's retained to
+ # avoid breaking use cases where people might be sanitizing individual
+ # Nokogiri nodes and then serializing them manually without Nokogumbo.
+ #
  # The relevant libxml2 code is here:
  # <https://github.com/GNOME/libxml2/commit/960f0e275616cadc29671a218d7fb9b69eb35588>
  if UNSAFE_LIBXML_ATTRS_GLOBAL.include?(attr_name) ||
@@ -180,6 +197,72 @@ class Sanitize; module Transformers; class CleanElement
  if @add_attributes.include?(name)
  @add_attributes[name].each {|key, val| node[key] = val }
  end
+
+ # Make a best effort to ensure that text nodes in invalid "unescaped text"
+ # elements that are inside a math or svg namespace are properly escaped so
+ # that they don't get parsed as HTML.
+ #
+ # Sanitize is explicitly documented as not supporting MathML or SVG, but
+ # people sometimes allow `<math>` and `<svg>` elements in their custom
+ # configs without realizing that it's not safe. This workaround makes it
+ # slightly less unsafe, but you still shouldn't allow `<math>` or `<svg>`
+ # because Nokogiri doesn't parse them the same way browsers do and Sanitize
+ # can't guarantee that their contents are safe.
+ unless node.namespace.nil?
+ prefix = node.namespace.prefix
+
+ if (prefix == 'math' || prefix == 'svg') && UNESCAPED_TEXT_ELEMENTS.include?(name)
+ node.children.each do |child|
+ if child.type == Nokogiri::XML::Node::TEXT_NODE
+ child.content = CGI.escapeHTML(child.content)
+ end
+ end
+ end
+ end
+
+ # Element-specific special cases.
+ case name
+
+ # If this is an allowlisted iframe that has children, remove all its
+ # children. The HTML standard says iframes shouldn't have content, but when
+ # they do, this content is parsed as text and is serialized verbatim without
+ # being escaped, which is unsafe because legacy browsers may still render it
+ # and execute `<script>` content. So the safe and correct thing to do is to
+ # always remove iframe content.
+ when 'iframe'
+ if !node.children.empty?
+ node.children.each do |child|
+ child.unlink
+ end
+ end
+
+ # Prevent the use of `<meta>` elements that set a charset other than UTF-8,
+ # since Sanitize's output is always UTF-8.
+ when 'meta'
+ if node.has_attribute?('charset') &&
+ node['charset'].downcase != 'utf-8'
+
+ node['charset'] = 'utf-8'
+ end
+
+ if node.has_attribute?('http-equiv') &&
+ node.has_attribute?('content') &&
+ node['http-equiv'].downcase == 'content-type' &&
+ node['content'].downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
+
+ node['content'] = node['content'].gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
+ end
+
+ # A `<noscript>` element's content is parsed differently in browsers
+ # depending on whether or not scripting is enabled. Since Nokogiri doesn't
+ # support scripting, it always parses `<noscript>` elements as if scripting
+ # is disabled. This results in edge cases where it's not possible to
+ # reliably sanitize the contents of a `<noscript>` element because Nokogiri
+ # can't fully replicate the parsing behavior of a scripting-enabled browser.
+ # The safest thing to do is to simply remove all `<noscript>` elements.
+ when 'noscript'
+ node.unlink
+ end
  end

  end; end; end
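The `when 'meta'` branch in the hunk above rewrites a `content` attribute with two regular expressions: one to detect a non-UTF-8 charset and one to replace it. Their combined effect can be sketched on plain strings, without Nokogiri. The helper name `normalize_content_type` is hypothetical; it assumes the caller has already confirmed that `http-equiv` is `content-type`.

```ruby
# Hypothetical helper: applies the same two regular expressions as the
# `when 'meta'` branch to a plain `content` attribute value.
def normalize_content_type(content)
  if content.downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
    # A charset other than utf-8 was declared: replace everything from the
    # `;charset=` part to the end of the string.
    content.gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
  else
    # No charset, or the charset is already utf-8: leave the value alone.
    content
  end
end

puts normalize_content_type('text/html; charset=iso-8859-1')
puts normalize_content_type('text/html; charset=utf-8')
```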
@@ -1,5 +1,3 @@
- # encoding: utf-8
-
  class Sanitize
- VERSION = '4.6.4'
+ VERSION = '6.0.2'
  end