sanitize 4.6.4 → 6.0.2
- checksums.yaml +4 -4
- data/HISTORY.md +259 -16
- data/LICENSE +1 -1
- data/README.md +89 -76
- data/lib/sanitize/config/default.rb +15 -4
- data/lib/sanitize/config/relaxed.rb +1 -1
- data/lib/sanitize/css.rb +2 -2
- data/lib/sanitize/transformers/clean_comment.rb +1 -1
- data/lib/sanitize/transformers/clean_css.rb +4 -3
- data/lib/sanitize/transformers/clean_doctype.rb +1 -1
- data/lib/sanitize/transformers/clean_element.rb +105 -22
- data/lib/sanitize/version.rb +1 -3
- data/lib/sanitize.rb +56 -72
- data/test/common.rb +0 -31
- data/test/test_clean_comment.rb +16 -20
- data/test/test_clean_css.rb +6 -6
- data/test/test_clean_doctype.rb +22 -22
- data/test/test_clean_element.rb +200 -82
- data/test/test_config.rb +9 -9
- data/test/test_malicious_css.rb +20 -7
- data/test/test_malicious_html.rb +179 -32
- data/test/test_parser.rb +9 -38
- data/test/test_sanitize.rb +114 -29
- data/test/test_sanitize_css.rb +88 -61
- data/test/test_transformers.rb +52 -46
- metadata +17 -33
- data/test/test_unicode.rb +0 -95
data/README.md
CHANGED
@@ -1,38 +1,36 @@
|
|
1
1
|
Sanitize
|
2
2
|
========
|
3
3
|
|
4
|
-
Sanitize is
|
5
|
-
elements, attributes, and
|
6
|
-
|
4
|
+
Sanitize is an allowlist-based HTML and CSS sanitizer. It removes all HTML
|
5
|
+
and/or CSS from a string except the elements, attributes, and properties you
|
6
|
+
choose to allow.
|
7
7
|
|
8
8
|
Using a simple configuration syntax, you can tell Sanitize to allow certain HTML
|
9
9
|
elements, certain attributes within those elements, and even certain URL
|
10
|
-
protocols within attributes that contain URLs. You can also
|
11
|
-
properties, @ rules, and URL protocols
|
12
|
-
|
13
|
-
|
14
|
-
|
15
|
-
|
16
|
-
|
17
|
-
|
18
|
-
|
19
|
-
|
20
|
-
|
21
|
-
[![Build Status](https://travis-ci.org/rgrove/sanitize.svg?branch=master)](https://travis-ci.org/rgrove/sanitize)
|
10
|
+
protocols within attributes that contain URLs. You can also allow specific CSS
|
11
|
+
properties, @ rules, and URL protocols in elements or attributes containing CSS.
|
12
|
+
Any HTML or CSS that you don't explicitly allow will be removed.
|
13
|
+
|
14
|
+
Sanitize is based on the [Nokogiri HTML5 parser][nokogiri], which parses HTML
|
15
|
+
the same way modern browsers do, and [Crass][crass], which parses CSS the same
|
16
|
+
way modern browsers do. As long as your allowlist config only allows safe markup
|
17
|
+
and CSS, even the most malformed or malicious input will be transformed into
|
18
|
+
safe output.
|
19
|
+
|
22
20
|
[![Gem Version](https://badge.fury.io/rb/sanitize.svg)](http://badge.fury.io/rb/sanitize)
|
21
|
+
[![Tests](https://github.com/rgrove/sanitize/workflows/Tests/badge.svg)](https://github.com/rgrove/sanitize/actions?query=workflow%3ATests)
|
23
22
|
|
24
23
|
[crass]:https://github.com/rgrove/crass
|
25
|
-
[
|
24
|
+
[nokogiri]:https://github.com/sparklemotion/nokogiri
|
26
25
|
|
27
26
|
Links
|
28
27
|
-----
|
29
28
|
|
30
29
|
* [Home](https://github.com/rgrove/sanitize/)
|
31
|
-
* [API Docs](
|
30
|
+
* [API Docs](https://rubydoc.info/github/rgrove/sanitize/Sanitize)
|
32
31
|
* [Issues](https://github.com/rgrove/sanitize/issues)
|
33
|
-
* [Release History](https://github.com/rgrove/sanitize/
|
34
|
-
* [Online Demo](https://sanitize.
|
35
|
-
* [Biased comparison of Ruby HTML sanitization libraries](https://github.com/rgrove/sanitize/blob/master/COMPARISON.md)
|
32
|
+
* [Release History](https://github.com/rgrove/sanitize/releases)
|
33
|
+
* [Online Demo](https://sanitize-web.fly.dev/)
|
36
34
|
|
37
35
|
Installation
|
38
36
|
-------------
|
@@ -73,6 +71,12 @@ Sanitize can sanitize the following types of input:
|
|
73
71
|
* Standalone CSS stylesheets
|
74
72
|
* Standalone CSS properties
|
75
73
|
|
74
|
+
> **Warning**
|
75
|
+
>
|
76
|
+
> Sanitize cannot fully sanitize the contents of `<math>` or `<svg>` elements. MathML and SVG elements are [foreign elements](https://html.spec.whatwg.org/multipage/syntax.html#foreign-elements) that don't follow normal HTML parsing rules.
|
77
|
+
>
|
78
|
+
> By default, Sanitize will remove all MathML and SVG elements. If you add MathML or SVG elements to a custom element allowlist, you may create a security vulnerability in your application.
|
79
|
+
|
76
80
|
### HTML Fragments
|
77
81
|
|
78
82
|
A fragment is a snippet of HTML that doesn't contain a root-level `<html>`
|
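The warning added in this hunk implies that a custom config should be checked for `math` and `svg` before it is used. A minimal sketch of such a check, assuming a hypothetical helper named `foreign_elements_in` (not part of Sanitize) and a plain Hash config:

```ruby
require 'set'

# Element names the README warning says Sanitize cannot fully sanitize.
FOREIGN_ROOTS = Set.new(%w[math svg]).freeze

# Hypothetical helper (not a Sanitize API): returns the foreign element
# names found in a config's :elements list, or an empty Array.
def foreign_elements_in(config)
  Array(config[:elements])
    .map { |name| name.to_s.downcase }
    .select { |name| FOREIGN_ROOTS.include?(name) }
end

custom = { :elements => %w[b em svg] }
risky  = foreign_elements_in(custom)

# Report the problem on stderr instead of silently accepting the config.
warn "Config allows foreign elements: #{risky.join(', ')}" unless risky.empty?
```

A check like this could run once at boot, so a risky allowlist is noticed before any input is sanitized.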
@@ -88,7 +92,7 @@ Sanitize.fragment(html)
|
|
88
92
|
# => 'foo'
|
89
93
|
```
|
90
94
|
|
91
|
-
To keep certain elements, add them to the element
|
95
|
+
To keep certain elements, add them to the element allowlist.
|
92
96
|
|
93
97
|
```ruby
|
94
98
|
Sanitize.fragment(html, :elements => ['b'])
|
@@ -97,7 +101,7 @@ Sanitize.fragment(html, :elements => ['b'])
|
|
97
101
|
|
98
102
|
### HTML Documents
|
99
103
|
|
100
|
-
When sanitizing a document, the `<html>` element must be
|
104
|
+
When sanitizing a document, the `<html>` element must be allowlisted. You can
|
101
105
|
also set `:allow_doctype` to `true` to allow well-formed document type
|
102
106
|
definitions.
|
103
107
|
|
@@ -123,8 +127,8 @@ Sanitize.document(html,
|
|
123
127
|
|
124
128
|
### CSS in HTML
|
125
129
|
|
126
|
-
To sanitize CSS in an HTML fragment or document, first
|
127
|
-
element and/or the `style` attribute. Then
|
130
|
+
To sanitize CSS in an HTML fragment or document, first allowlist the `<style>`
|
131
|
+
element and/or the `style` attribute. Then allowlist the CSS properties,
|
128
132
|
@ rules, and URL protocols you wish to allow. You can also choose whether to
|
129
133
|
allow CSS comments or browser compatibility hacks.
|
130
134
|
|
@@ -267,7 +271,7 @@ new copy using `Sanitize::Config.merge()`, like so:
|
|
267
271
|
|
268
272
|
```ruby
|
269
273
|
# Create a customized copy of the Basic config, adding <div> and <table> to the
|
270
|
-
# existing
|
274
|
+
# existing allowlisted elements.
|
271
275
|
Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
|
272
276
|
:elements => Sanitize::Config::BASIC[:elements] + ['div', 'table'],
|
273
277
|
:remove_contents => true
|
@@ -395,8 +399,7 @@ Proc.new { |url| url.start_with?("https://fonts.googleapis.com") }
|
|
395
399
|
|
396
400
|
##### :css => :properties (Array or Set)
|
397
401
|
|
398
|
-
|
399
|
-
lowercase.
|
402
|
+
List of CSS property names to allow. Names should be specified in lowercase.
|
400
403
|
|
401
404
|
##### :css => :protocols (Array or Set)
|
402
405
|
|
@@ -417,6 +420,29 @@ elements not in this array will be removed.
|
|
417
420
|
]
|
418
421
|
```
|
419
422
|
|
423
|
+
> **Warning**
|
424
|
+
>
|
425
|
+
> Sanitize cannot fully sanitize the contents of `<math>` or `<svg>` elements. MathML and SVG elements are [foreign elements](https://html.spec.whatwg.org/multipage/syntax.html#foreign-elements) that don't follow normal HTML parsing rules.
|
426
|
+
>
|
427
|
+
> By default, Sanitize will remove all MathML and SVG elements. If you add MathML or SVG elements to a custom element allowlist, you must assume that any content inside them will be allowed, even if that content would otherwise be removed or escaped by Sanitize. This may create a security vulnerability in your application.
|
428
|
+
|
429
|
+
> **Note**
|
430
|
+
>
|
431
|
+
> Sanitize always removes `<noscript>` elements and their contents, even if `noscript` is in the allowlist.
|
432
|
+
>
|
433
|
+
> This is because a `<noscript>` element's content is parsed differently in browsers depending on whether or not scripting is enabled. Since Nokogiri doesn't support scripting, it always parses `<noscript>` elements as if scripting is disabled. This results in edge cases where it's not possible to reliably sanitize the contents of a `<noscript>` element because Nokogiri can't fully replicate the parsing behavior of a scripting-enabled browser.
|
434
|
+
|
435
|
+
#### :parser_options (Hash)
|
436
|
+
|
437
|
+
[Parsing options](https://github.com/rubys/nokogumbo/tree/master#parsing-options) to be supplied to `nokogumbo`.
|
438
|
+
|
439
|
+
```ruby
|
440
|
+
:parser_options => {
|
441
|
+
max_errors: -1,
|
442
|
+
max_tree_depth: -1
|
443
|
+
}
|
444
|
+
```
|
445
|
+
|
420
446
|
#### :protocols (Hash)
|
421
447
|
|
422
448
|
URL protocols to allow in specific attributes. If an attribute is listed here
|
@@ -441,15 +467,15 @@ include the symbol `:relative` in the protocol array:
|
|
441
467
|
|
442
468
|
#### :remove_contents (boolean or Array or Set)
|
443
469
|
|
444
|
-
If
|
470
|
+
If this is `true`, Sanitize will remove the contents of any non-allowlisted
|
445
471
|
elements in addition to the elements themselves. By default, Sanitize leaves the
|
446
472
|
safe parts of an element's contents behind when the element is removed.
|
447
473
|
|
448
|
-
If
|
449
|
-
elements (when filtered) will be removed, and the contents of all
|
450
|
-
elements will be left behind.
|
474
|
+
If this is an Array or Set of element names, then only the contents of the
|
475
|
+
specified elements (when filtered) will be removed, and the contents of all
|
476
|
+
other filtered elements will be left behind.
|
451
477
|
|
452
|
-
The default value is
|
478
|
+
The default value is `%w[iframe math noembed noframes noscript plaintext script style svg xmp]`.
|
453
479
|
|
454
480
|
#### :transformers (Array or callable)
|
455
481
|
|
@@ -474,6 +500,15 @@ children, in which case it will be inserted after those children.
|
|
474
500
|
}
|
475
501
|
```
|
476
502
|
|
503
|
+
The default elements with whitespace added before and after are:
|
504
|
+
|
505
|
+
```
|
506
|
+
address article aside blockquote br dd div dl dt
|
507
|
+
footer h1 h2 h3 h4 h5 h6 header hgroup hr li nav
|
508
|
+
ol p pre section ul
|
509
|
+
|
510
|
+
```
|
511
|
+
|
477
512
|
## Transformers
|
478
513
|
|
479
514
|
Transformers allow you to filter and modify HTML nodes using your own custom
|
@@ -498,33 +533,33 @@ argument a Hash that contains the following items:
|
|
498
533
|
|
499
534
|
* **:config** - The current Sanitize configuration Hash.
|
500
535
|
|
501
|
-
* **:
|
536
|
+
* **:is_allowlisted** - `true` if the current node has been allowlisted by a
|
502
537
|
previous transformer, `false` otherwise. It's generally bad form to remove
|
503
|
-
a node that a previous transformer has
|
538
|
+
a node that a previous transformer has allowlisted.
|
504
539
|
|
505
540
|
* **:node** - A `Nokogiri::XML::Node` object representing an HTML node. The
|
506
541
|
node may be an element, a text node, a comment, a CDATA node, or a document
|
507
542
|
fragment. Use Nokogiri's inspection methods (`element?`, `text?`, etc.) to
|
508
543
|
selectively ignore node types you aren't interested in.
|
509
544
|
|
545
|
+
* **:node_allowlist** - Set of `Nokogiri::XML::Node` objects in the current
|
546
|
+
document that have been allowlisted by previous transformers, if any. It's
|
547
|
+
generally bad form to remove a node that a previous transformer has
|
548
|
+
allowlisted.
|
549
|
+
|
510
550
|
* **:node_name** - The name of the current HTML node, always lowercase (e.g.
|
511
551
|
"div" or "span"). For non-element nodes, the name will be something like
|
512
552
|
"text", "comment", "#cdata-section", "#document-fragment", etc.
|
513
553
|
|
514
|
-
* **:node_whitelist** - Set of `Nokogiri::XML::Node` objects in the current
|
515
|
-
document that have been whitelisted by previous transformers, if any. It's
|
516
|
-
generally bad form to remove a node that a previous transformer has
|
517
|
-
whitelisted.
|
518
|
-
|
519
554
|
### Output
|
520
555
|
|
521
556
|
A transformer doesn't have to return anything, but may optionally return a Hash,
|
522
557
|
which may contain the following items:
|
523
558
|
|
524
|
-
* **:
|
525
|
-
to add to the document's
|
526
|
-
These specific nodes and all their attributes will be
|
527
|
-
their children will not be.
|
559
|
+
* **:node_allowlist** - Array or Set of specific `Nokogiri::XML::Node`
|
560
|
+
objects to add to the document's allowlist, bypassing the current Sanitize
|
561
|
+
config. These specific nodes and all their attributes will be allowlisted,
|
562
|
+
but their children will not be.
|
528
563
|
|
529
564
|
If a transformer returns anything other than a Hash, the return value will be
|
530
565
|
ignored.
|
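The input keys and the `:node_allowlist` return value described in this hunk can be exercised without Nokogiri by calling a transformer directly. The sketch below is illustrative only: the `<mark>` element choice, the `allow_mark` name, and the `StubNode` stand-in are assumptions, not part of Sanitize.

```ruby
# Hypothetical transformer: allowlists <mark> elements regardless of the
# :elements config. The element name is chosen only for illustration.
allow_mark = lambda do |env|
  node = env[:node]

  # Leave nodes alone if an earlier transformer already allowlisted them,
  # and skip text, comment, and other non-element nodes.
  return if env[:is_allowlisted] || !node.element?
  return unless env[:node_name] == 'mark'

  # Returning a Hash with :node_allowlist adds this node and all of its
  # attributes (but not its children) to the document's allowlist.
  {:node_allowlist => [node]}
end

# Stand-in for Nokogiri::XML::Node so the lambda can be called here without
# Nokogiri. Real transformers receive real nodes from Sanitize.
StubNode = Struct.new(:is_element) do
  def element?
    is_element
  end
end

mark = StubNode.new(true)

allowed = allow_mark.call({ :node => mark, :node_name => 'mark', :is_allowlisted => false })
skipped = allow_mark.call({ :node => StubNode.new(false), :node_name => 'text', :is_allowlisted => false })
```

Here `allowed` is `{:node_allowlist => [mark]}` and `skipped` is `nil`. Because allowlisting a node also allowlists all of its attributes, a real transformer would normally inspect or strip attributes first, as the YouTube example in the README does.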
@@ -567,16 +602,16 @@ Transformers have a tremendous amount of power, including the power to
|
|
567
602
|
completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
|
568
603
|
your own hands.
|
569
604
|
|
570
|
-
### Example: Transformer to
|
605
|
+
### Example: Transformer to allow image URLs by domain
|
571
606
|
|
572
607
|
The following example demonstrates how to remove image elements unless they use
|
573
608
|
a relative URL or are hosted on a specific domain. It assumes that the `<img>`
|
574
|
-
element and its `src` attribute are already
|
609
|
+
element and its `src` attribute are already allowlisted.
|
575
610
|
|
576
611
|
```ruby
|
577
612
|
require 'uri'
|
578
613
|
|
579
|
-
|
614
|
+
image_allowlist_transformer = lambda do |env|
|
580
615
|
# Ignore everything except <img> elements.
|
581
616
|
return unless env[:node_name] == 'img'
|
582
617
|
|
@@ -592,20 +627,20 @@ image_whitelist_transformer = lambda do |env|
|
|
592
627
|
end
|
593
628
|
```
|
594
629
|
|
595
|
-
### Example: Transformer to
|
630
|
+
### Example: Transformer to allow YouTube video embeds
|
596
631
|
|
597
632
|
The following example demonstrates how to create a transformer that will safely
|
598
|
-
|
599
|
-
|
600
|
-
|
633
|
+
allow valid YouTube video embeds without having to allow other kinds of embedded
|
634
|
+
content, which would be the case if you tried to do this by just allowing all
|
635
|
+
`<iframe>` elements:
|
601
636
|
|
602
637
|
```ruby
|
603
638
|
youtube_transformer = lambda do |env|
|
604
639
|
node = env[:node]
|
605
640
|
node_name = env[:node_name]
|
606
641
|
|
607
|
-
# Don't continue if this node is already
|
608
|
-
return if env[:
|
642
|
+
# Don't continue if this node is already allowlisted or is not an element.
|
643
|
+
return if env[:is_allowlisted] || !node.element?
|
609
644
|
|
610
645
|
# Don't continue unless the node is an iframe.
|
611
646
|
return unless node_name == 'iframe'
|
@@ -626,8 +661,8 @@ youtube_transformer = lambda do |env|
|
|
626
661
|
|
627
662
|
# Now that we're sure that this is a valid YouTube embed and that there are
|
628
663
|
# no unwanted elements or attributes hidden inside it, we can tell Sanitize
|
629
|
-
# to
|
630
|
-
{:
|
664
|
+
# to allowlist the current node.
|
665
|
+
{:node_allowlist => [node]}
|
631
666
|
end
|
632
667
|
|
633
668
|
html = %[
|
@@ -638,25 +673,3 @@ html = %[
|
|
638
673
|
Sanitize.fragment(html, :transformers => youtube_transformer)
|
639
674
|
# => '<iframe width="420" height="315" src="//www.youtube.com/embed/dQw4w9WgXcQ" frameborder="0" allowfullscreen=""></iframe>'
|
640
675
|
```
|
641
|
-
|
642
|
-
License
|
643
|
-
-------
|
644
|
-
|
645
|
-
Copyright (c) 2015 Ryan Grove (ryan@wonko.com)
|
646
|
-
|
647
|
-
Permission is hereby granted, free of charge, to any person obtaining a copy of
|
648
|
-
this software and associated documentation files (the 'Software'), to deal in
|
649
|
-
the Software without restriction, including without limitation the rights to
|
650
|
-
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
|
651
|
-
the Software, and to permit persons to whom the Software is furnished to do so,
|
652
|
-
subject to the following conditions:
|
653
|
-
|
654
|
-
The above copyright notice and this permission notice shall be included in all
|
655
|
-
copies or substantial portions of the Software.
|
656
|
-
|
657
|
-
THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
658
|
-
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
|
659
|
-
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
|
660
|
-
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
|
661
|
-
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
|
662
|
-
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
data/lib/sanitize/config/default.rb
CHANGED
@@ -54,8 +54,17 @@ class Sanitize
|
|
54
54
|
|
55
55
|
# HTML elements to allow. By default, no elements are allowed (which means
|
56
56
|
# that all HTML will be stripped).
|
57
|
+
#
|
58
|
+
# Warning: Sanitize cannot safely sanitize the contents of foreign
|
59
|
+
# elements (elements in the MathML or SVG namespaces). Do not add `math`
|
60
|
+
# or `svg` to this list! If you do, you may create a security
|
61
|
+
# vulnerability in your application.
|
57
62
|
:elements => [],
|
58
63
|
|
64
|
+
# HTML parsing options to pass to Nokogumbo.
|
65
|
+
# https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options
|
66
|
+
:parser_options => {},
|
67
|
+
|
59
68
|
# URL handling protocols to allow in specific attributes. By default, no
|
60
69
|
# protocols are allowed. Use :relative in place of a protocol if you want
|
61
70
|
# to allow relative URLs sans protocol.
|
@@ -66,10 +75,12 @@ class Sanitize
|
|
66
75
|
# leaves the safe parts of an element's contents behind when the element
|
67
76
|
# is removed.
|
68
77
|
#
|
69
|
-
# If this is an Array of element names, then only the contents of
|
70
|
-
# specified elements (when filtered) will be removed, and the contents
|
71
|
-
# all other filtered elements will be left behind.
|
72
|
-
:remove_contents =>
|
78
|
+
# If this is an Array or Set of element names, then only the contents of
|
79
|
+
# the specified elements (when filtered) will be removed, and the contents
|
80
|
+
# of all other filtered elements will be left behind.
|
81
|
+
:remove_contents => %w[
|
82
|
+
iframe math noembed noframes noscript plaintext script style svg xmp
|
83
|
+
],
|
73
84
|
|
74
85
|
# Transformers allow you to filter or alter nodes using custom logic. See
|
75
86
|
# README.md for details and examples.
|
data/lib/sanitize/config/relaxed.rb
CHANGED
@@ -6,7 +6,7 @@ class Sanitize
|
|
6
6
|
:elements => BASIC[:elements] + %w[
|
7
7
|
address article aside bdi bdo body caption col colgroup data del div
|
8
8
|
figcaption figure footer h1 h2 h3 h4 h5 h6 head header hgroup hr html
|
9
|
-
img ins main nav rp rt ruby section span style summary
|
9
|
+
img ins main nav rp rt ruby section span style summary table tbody
|
10
10
|
td tfoot th thead title tr wbr
|
11
11
|
],
|
12
12
|
|
data/lib/sanitize/css.rb
CHANGED
@@ -175,7 +175,7 @@ class Sanitize; class CSS
|
|
175
175
|
next prop
|
176
176
|
|
177
177
|
when :semicolon
|
178
|
-
# Only preserve the semicolon if it was preceded by
|
178
|
+
# Only preserve the semicolon if it was preceded by an allowlisted
|
179
179
|
# property. Otherwise, omit it in order to prevent redundant semicolons.
|
180
180
|
if preceded_by_property
|
181
181
|
preceded_by_property = false
|
@@ -296,7 +296,7 @@ class Sanitize; class CSS
|
|
296
296
|
end
|
297
297
|
|
298
298
|
# Returns `true` if the given node (which may be of type `:url` or
|
299
|
-
# `:function`, since the CSS syntax can produce both) uses
|
299
|
+
# `:function`, since the CSS syntax can produce both) uses an allowlisted
|
300
300
|
# protocol.
|
301
301
|
def valid_url?(node)
|
302
302
|
type = node[:node]
|
data/lib/sanitize/transformers/clean_css.rb
CHANGED
@@ -1,6 +1,6 @@
|
|
1
1
|
class Sanitize; module Transformers; module CSS
|
2
2
|
|
3
|
-
# Enforces a CSS
|
3
|
+
# Enforces a CSS allowlist on the contents of `style` attributes.
|
4
4
|
class CleanAttribute
|
5
5
|
def initialize(sanitizer_or_config)
|
6
6
|
if Sanitize::CSS === sanitizer_or_config
|
@@ -14,7 +14,7 @@ class CleanAttribute
|
|
14
14
|
node = env[:node]
|
15
15
|
|
16
16
|
return unless node.type == Nokogiri::XML::Node::ELEMENT_NODE &&
|
17
|
-
node.key?('style') && !env[:
|
17
|
+
node.key?('style') && !env[:is_allowlisted]
|
18
18
|
|
19
19
|
attr = node.attribute('style')
|
20
20
|
css = @scss.properties(attr.value)
|
@@ -27,7 +27,7 @@ class CleanAttribute
|
|
27
27
|
end
|
28
28
|
end
|
29
29
|
|
30
|
-
# Enforces a CSS
|
30
|
+
# Enforces a CSS allowlist on the contents of `<style>` elements.
|
31
31
|
class CleanElement
|
32
32
|
def initialize(sanitizer_or_config)
|
33
33
|
if Sanitize::CSS === sanitizer_or_config
|
@@ -48,6 +48,7 @@ class CleanElement
|
|
48
48
|
if css.strip.empty?
|
49
49
|
node.unlink
|
50
50
|
else
|
51
|
+
css.gsub!('</', '<\/')
|
51
52
|
node.children.unlink
|
52
53
|
node << Nokogiri::XML::Text.new(css, node.document)
|
53
54
|
end
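The line added in this hunk rewrites every `</` in the sanitized CSS before it is put back into the `<style>` element, presumably because an HTML parser treats `<style>` content as raw text up to the next `</style`, while `\/` in CSS should still mean a plain `/`. A standalone sketch of that step, using a hypothetical helper name and the block form of `gsub` so the replacement is taken literally:

```ruby
# Hypothetical helper name; the escaping step mirrors the added line above.
def escape_style_end_tags(css)
  # With the block form, the block's return value is inserted as-is, so the
  # backslash is never interpreted as a gsub back-reference.
  css.gsub('</') { '<\/' }
end

css = 'a { content: "</style><script>alert(1)</script>" }'
puts escape_style_end_tags(css)
# prints: a { content: "<\/style><script>alert(1)<\/script>" }
```

The library itself uses the in-place `gsub!` with a string replacement. The non-mutating variant here is only to keep the example free of side effects.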
|
data/lib/sanitize/transformers/clean_element.rb
CHANGED
@@ -1,5 +1,6 @@
|
|
1
1
|
# encoding: utf-8
|
2
2
|
|
3
|
+
require 'cgi'
|
3
4
|
require 'set'
|
4
5
|
|
5
6
|
class Sanitize; module Transformers; class CleanElement
|
@@ -18,6 +19,18 @@ class Sanitize; module Transformers; class CleanElement
|
|
18
19
|
# http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#embedding-custom-non-visible-data-with-the-data-*-attributes
|
19
20
|
REGEX_DATA_ATTR = /\Adata-(?!xml)[a-z_][\w.\u00E0-\u00F6\u00F8-\u017F\u01DD-\u02AF-]*\z/u
|
20
21
|
|
22
|
+
# Elements whose content is treated as unescaped text by HTML parsers.
|
23
|
+
UNESCAPED_TEXT_ELEMENTS = Set.new(%w[
|
24
|
+
iframe
|
25
|
+
noembed
|
26
|
+
noframes
|
27
|
+
noscript
|
28
|
+
plaintext
|
29
|
+
script
|
30
|
+
style
|
31
|
+
xmp
|
32
|
+
])
|
33
|
+
|
21
34
|
# Attributes that need additional escaping on `<a>` elements due to unsafe
|
22
35
|
# libxml2 behavior.
|
23
36
|
UNSAFE_LIBXML_ATTRS_A = Set.new(%w[
|
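The `UNESCAPED_TEXT_ELEMENTS` set added in this hunk is used elsewhere in this file's diff to decide when text inside a `math` or `svg` namespace should be passed through `CGI.escapeHTML`. A simplified sketch of that decision, working on plain strings instead of Nokogiri nodes (the helper name and its arguments are assumptions made for illustration):

```ruby
require 'cgi'
require 'set'

# Same element names as the constant added in this hunk.
UNESCAPED_TEXT_ELEMENTS = Set.new(%w[
  iframe noembed noframes noscript plaintext script style xmp
])

# Hypothetical helper: escapes text only when the element sits in a math or
# svg namespace and is one of the "unescaped text" elements.
def escape_foreign_text(namespace_prefix, element_name, text)
  foreign = namespace_prefix == 'math' || namespace_prefix == 'svg'
  return text unless foreign && UNESCAPED_TEXT_ELEMENTS.include?(element_name)

  CGI.escapeHTML(text)
end

puts escape_foreign_text('svg', 'style', '<img src=x onerror=alert(1)>')
# prints: &lt;img src=x onerror=alert(1)&gt;
puts escape_foreign_text(nil, 'p', 'a < b')
# prints: a < b
```

As the code comments in the diff stress, this is a best-effort mitigation only. Allowing `<math>` or `<svg>` remains unsafe.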
@@ -67,7 +80,7 @@ class Sanitize; module Transformers; class CleanElement
|
|
67
80
|
@whitespace_elements = config[:whitespace_elements]
|
68
81
|
end
|
69
82
|
|
70
|
-
if config[:remove_contents].is_a?(
|
83
|
+
if config[:remove_contents].is_a?(Enumerable)
|
71
84
|
@remove_element_contents.merge(config[:remove_contents].map(&:to_s))
|
72
85
|
else
|
73
86
|
@remove_all_contents = !!config[:remove_contents]
|
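Switching the type check to `Enumerable` is what lets `:remove_contents` accept a Set as well as an Array, as the updated README states. A self-contained sketch of the same branching, with a hypothetical function standing in for the instance variables used in the real class:

```ruby
require 'set'

# Hypothetical stand-in for the initializer logic above. Returns a pair:
# [remove_all_contents, set_of_element_names].
def interpret_remove_contents(value)
  names      = Set.new
  remove_all = false

  if value.is_a?(Enumerable)
    # Arrays and Sets both land here; names are normalized to Strings.
    names.merge(value.map(&:to_s))
  else
    # true, false, or nil: coerce to a strict boolean.
    remove_all = !!value
  end

  [remove_all, names]
end

p interpret_remove_contents(true)
p interpret_remove_contents([:script, 'style'])
p interpret_remove_contents(Set.new(%w[script style]))
```

The Array and Set calls both yield `false` plus a Set containing `"script"` and `"style"`, while `true` yields `true` plus an empty Set.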
@@ -76,11 +89,11 @@ class Sanitize; module Transformers; class CleanElement
|
|
76
89
|
|
77
90
|
def call(env)
|
78
91
|
node = env[:node]
|
79
|
-
return if node.type != Nokogiri::XML::Node::ELEMENT_NODE || env[:
|
92
|
+
return if node.type != Nokogiri::XML::Node::ELEMENT_NODE || env[:is_allowlisted]
|
80
93
|
|
81
94
|
name = env[:node_name]
|
82
95
|
|
83
|
-
# Delete any element that isn't in the config
|
96
|
+
# Delete any element that isn't in the config allowlist, unless the node has
|
84
97
|
# already been deleted from the document.
|
85
98
|
#
|
86
99
|
# It's important that we not try to reparent the children of a node that has
|
@@ -97,42 +110,41 @@ class Sanitize; module Transformers; class CleanElement
|
|
97
110
|
end
|
98
111
|
end
|
99
112
|
|
100
|
-
unless
|
101
|
-
|
113
|
+
unless node.children.empty?
|
114
|
+
unless @remove_all_contents || @remove_element_contents.include?(name)
|
115
|
+
node.add_previous_sibling(node.children)
|
116
|
+
end
|
102
117
|
end
|
103
118
|
|
104
119
|
node.unlink
|
105
120
|
return
|
106
121
|
end
|
107
122
|
|
108
|
-
|
123
|
+
attr_allowlist = @attributes[name] || @attributes[:all]
|
109
124
|
|
110
|
-
if
|
111
|
-
# Delete all attributes from elements with no
|
125
|
+
if attr_allowlist.nil?
|
126
|
+
# Delete all attributes from elements with no allowlisted attributes.
|
112
127
|
node.attribute_nodes.each {|attr| attr.unlink }
|
113
128
|
else
|
114
|
-
allow_data_attributes =
|
129
|
+
allow_data_attributes = attr_allowlist.include?(:data)
|
115
130
|
|
116
131
|
# Delete any attribute that isn't allowed on this element.
|
117
132
|
node.attribute_nodes.each do |attr|
|
118
133
|
attr_name = attr.name.downcase
|
119
134
|
|
120
|
-
unless
|
121
|
-
# The attribute isn't
|
135
|
+
unless attr_allowlist.include?(attr_name)
|
136
|
+
# The attribute isn't in the allowlist, but may still be allowed if
|
137
|
+
# it's a data attribute.
|
122
138
|
|
123
|
-
|
124
|
-
#
|
125
|
-
#
|
126
|
-
|
139
|
+
unless allow_data_attributes && attr_name.start_with?('data-') && attr_name =~ REGEX_DATA_ATTR
|
140
|
+
# Either the attribute isn't a data attribute or arbitrary data
|
141
|
+
# attributes aren't allowed. Remove the attribute.
|
142
|
+
attr.unlink
|
143
|
+
next
|
127
144
|
end
|
128
|
-
|
129
|
-
# Either the attribute isn't a data attribute or arbitrary data
|
130
|
-
# attributes aren't allowed. Remove the attribute.
|
131
|
-
attr.unlink
|
132
|
-
next
|
133
145
|
end
|
134
146
|
|
135
|
-
# The attribute is
|
147
|
+
# The attribute is allowed.
|
136
148
|
|
137
149
|
# Remove any attributes that use unacceptable protocols.
|
138
150
|
if @protocols.include?(name) && @protocols[name].include?(attr_name)
|
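The attribute check in this hunk combines two rules: an attribute is kept if its lowercased name is in the allowlist, or if the allowlist contains `:data` and the name is a well-formed `data-*` attribute. A sketch of that decision in isolation, assuming a hypothetical predicate name and copying `REGEX_DATA_ATTR` from earlier in this file's diff:

```ruby
require 'set'

# Copied from the REGEX_DATA_ATTR constant shown earlier in this diff.
REGEX_DATA_ATTR = /\Adata-(?!xml)[a-z_][\w.\u00E0-\u00F6\u00F8-\u017F\u01DD-\u02AF-]*\z/u

# Hypothetical predicate: true if the attribute would survive the allowlist
# check. Protocol checks on the attribute value are a separate, later step.
def attribute_allowed?(attr_allowlist, attr_name)
  attr_name = attr_name.downcase
  return true if attr_allowlist.include?(attr_name)

  attr_allowlist.include?(:data) &&
    attr_name.start_with?('data-') &&
    attr_name.match?(REGEX_DATA_ATTR)
end

allowlist = Set.new(['href', :data])

%w[href data-user-id data-xml-thing onclick].each do |attr|
  puts "#{attr}: #{attribute_allowed?(allowlist, attr)}"
end
# prints:
#   href: true
#   data-user-id: true
#   data-xml-thing: false
#   onclick: false
```

`data-xml-thing` is rejected by the `(?!xml)` lookahead, matching the HTML rule that custom data attribute names must not start with `xml`.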
@@ -160,12 +172,17 @@ class Sanitize; module Transformers; class CleanElement
|
|
160
172
|
# libxml2 >= 2.9.2 doesn't escape comments within some attributes, in an
|
161
173
|
# attempt to preserve server-side includes. This can result in XSS since
|
162
174
|
# an unescaped double quote can allow an attacker to inject a
|
163
|
-
# non-
|
175
|
+
# non-allowlisted attribute.
|
164
176
|
#
|
165
177
|
# Sanitize works around this by implementing its own escaping for
|
166
178
|
# affected attributes, some of which can exist on any element and some
|
167
179
|
# of which can only exist on `<a>` elements.
|
168
180
|
#
|
181
|
+
# This fix is technically no longer necessary with Nokogumbo >= 2.0
|
182
|
+
# since it no longer uses libxml2's serializer, but it's retained to
|
183
|
+
# avoid breaking use cases where people might be sanitizing individual
|
184
|
+
# Nokogiri nodes and then serializing them manually without Nokogumbo.
|
185
|
+
#
|
169
186
|
# The relevant libxml2 code is here:
|
170
187
|
# <https://github.com/GNOME/libxml2/commit/960f0e275616cadc29671a218d7fb9b69eb35588>
|
171
188
|
if UNSAFE_LIBXML_ATTRS_GLOBAL.include?(attr_name) ||
|
@@ -180,6 +197,72 @@ class Sanitize; module Transformers; class CleanElement
|
|
180
197
|
if @add_attributes.include?(name)
|
181
198
|
@add_attributes[name].each {|key, val| node[key] = val }
|
182
199
|
end
|
200
|
+
|
201
|
+
# Make a best effort to ensure that text nodes in invalid "unescaped text"
|
202
|
+
# elements that are inside a math or svg namespace are properly escaped so
|
203
|
+
# that they don't get parsed as HTML.
|
204
|
+
#
|
205
|
+
# Sanitize is explicitly documented as not supporting MathML or SVG, but
|
206
|
+
# people sometimes allow `<math>` and `<svg>` elements in their custom
|
207
|
+
# configs without realizing that it's not safe. This workaround makes it
|
208
|
+
# slightly less unsafe, but you still shouldn't allow `<math>` or `<svg>`
|
209
|
+
# because Nokogiri doesn't parse them the same way browsers do and Sanitize
|
210
|
+
# can't guarantee that their contents are safe.
|
211
|
+
unless node.namespace.nil?
|
212
|
+
prefix = node.namespace.prefix
|
213
|
+
|
214
|
+
if (prefix == 'math' || prefix == 'svg') && UNESCAPED_TEXT_ELEMENTS.include?(name)
|
215
|
+
node.children.each do |child|
|
216
|
+
if child.type == Nokogiri::XML::Node::TEXT_NODE
|
217
|
+
child.content = CGI.escapeHTML(child.content)
|
218
|
+
end
|
219
|
+
end
|
220
|
+
end
|
221
|
+
end
|
222
|
+
|
223
|
+
# Element-specific special cases.
|
224
|
+
case name
|
225
|
+
|
226
|
+
# If this is an allowlisted iframe that has children, remove all its
|
227
|
+
# children. The HTML standard says iframes shouldn't have content, but when
|
228
|
+
# they do, this content is parsed as text and is serialized verbatim without
|
229
|
+
# being escaped, which is unsafe because legacy browsers may still render it
|
230
|
+
# and execute `<script>` content. So the safe and correct thing to do is to
|
231
|
+
# always remove iframe content.
|
232
|
+
when 'iframe'
|
233
|
+
if !node.children.empty?
|
234
|
+
node.children.each do |child|
|
235
|
+
child.unlink
|
236
|
+
end
|
237
|
+
end
|
238
|
+
|
239
|
+
# Prevent the use of `<meta>` elements that set a charset other than UTF-8,
|
240
|
+
# since Sanitize's output is always UTF-8.
|
241
|
+
when 'meta'
|
242
|
+
if node.has_attribute?('charset') &&
|
243
|
+
node['charset'].downcase != 'utf-8'
|
244
|
+
|
245
|
+
node['charset'] = 'utf-8'
|
246
|
+
end
|
247
|
+
|
248
|
+
if node.has_attribute?('http-equiv') &&
|
249
|
+
node.has_attribute?('content') &&
|
250
|
+
node['http-equiv'].downcase == 'content-type' &&
|
251
|
+
node['content'].downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
|
252
|
+
|
253
|
+
node['content'] = node['content'].gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
|
254
|
+
end
|
255
|
+
|
256
|
+
# A `<noscript>` element's content is parsed differently in browsers
|
257
|
+
# depending on whether or not scripting is enabled. Since Nokogiri doesn't
|
258
|
+
# support scripting, it always parses `<noscript>` elements as if scripting
|
259
|
+
# is disabled. This results in edge cases where it's not possible to
|
260
|
+
# reliably sanitize the contents of a `<noscript>` element because Nokogiri
|
261
|
+
# can't fully replicate the parsing behavior of a scripting-enabled browser.
|
262
|
+
# The safest thing to do is to simply remove all `<noscript>` elements.
|
263
|
+
when 'noscript'
|
264
|
+
node.unlink
|
265
|
+
end
|
183
266
|
end
|
184
267
|
|
185
268
|
end; end; end
|
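The `<meta>` special case above rewrites a `content` value whose charset is not UTF-8. The two regular expressions can be tried on plain strings. The sketch below wraps them in a hypothetical helper and assumes a lowercase `charset` keyword in the input:

```ruby
# Hypothetical helper mirroring the http-equiv branch above.
def force_utf8_content_type(content)
  # The check is case-insensitive because it runs on a downcased copy.
  if content.downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
    # Replace everything from ";charset=" to the end of the string.
    content.gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
  else
    content
  end
end

puts force_utf8_content_type('text/html; charset=ISO-8859-1')
# prints: text/html;charset=utf-8
puts force_utf8_content_type('text/html; charset=utf-8')
# prints: text/html; charset=utf-8
puts force_utf8_content_type('text/html')
# prints: text/html
```

Values that already declare UTF-8, or that declare no charset at all, pass through unchanged.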
data/lib/sanitize/version.rb
CHANGED