sanitize 4.6.3 → 5.1.0

Potentially problematic release.


checksums.yaml CHANGED
@@ -1,7 +1,7 @@
  ---
  SHA256:
-   metadata.gz: 04fe170a57bfd67e2e2f40e19e6add8cd777a9d812f24b66a4350d0cefe9f803
-   data.tar.gz: fb848fbc8cf1878905378f2795c9ad012d4247a1a5491ec4735994902544840d
+   metadata.gz: 8cf7bac25cea64ed464d106bdc57019388598ca9f1a4e7d8eddf3a98bab12267
+   data.tar.gz: e8b1f402b0d67a825b0ad4aad83829816fd9c78cd8445879636cba0a282e8ee5
  SHA512:
-   metadata.gz: dde1af17f562062ea7136d8033df17ed2aeaf39fdc1d037e75118c1ed9718d6ae50f29f9bb1165b0057810fed7a8bcac303e9e0687c3f98dbe514f6cb768cae5
-   data.tar.gz: 408533cd205ec1570041a6c029bb639978aa1785824a2e69df9d695331f274e7cda4e024934300ba35b8792fb39f4a94bd780dc6365cc2eaade06cdc32d3299e
+   metadata.gz: 956edaca6569a5933223da0aa7dcac4880b5164aa59e37256ac896c9fefb271da71425defe7e09e241b1333b441f5a2629893abed6d5a2a47d0726bf03597614
+   data.tar.gz: e45a018b904bcf8cb996f8ed08427e80b8ce058c4fe414782460c5496e88bb6c2a4055304118057621a630e514b4f96bac11bdc686181a6f0097dc7bf912ab04
data/HISTORY.md CHANGED
@@ -1,8 +1,86 @@
  # Sanitize History
 
+ ## 5.1.0 (2019-09-07)
+
+ ### Features
+
+ * Added a `:parser_options` config hash, which makes it possible to pass custom
+   parsing options to Nokogumbo. [@austin-wang - #194][194]
+
+ ### Bug Fixes
+
+ * Non-characters and non-whitespace control characters are now stripped from
+   HTML input before parsing to comply with the HTML Standard's [preprocessing
+   guidelines][html-preprocessing]. Prior to this Sanitize had adhered to [older
+   W3C guidelines][unicode-xml] that have since been withdrawn. [#179][179]
+
+ [179]:https://github.com/rgrove/sanitize/issues/179
+ [194]:https://github.com/rgrove/sanitize/pull/194
+ [html-preprocessing]:https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+ [unicode-xml]:https://www.w3.org/TR/unicode-xml/
+
+ ## 5.0.0 (2018-10-14)
+
+ For most users, upgrading from 4.x shouldn't require any changes. However, the
+ minimum required Ruby version has changed, and Sanitize 5.x's HTML output may
+ differ in some small ways from 4.x's output. If this matters to you, please
+ review the changes below carefully.
+
+ ### Potentially Breaking Changes
+
+ * Ruby 2.3.0 is now the oldest officially supported Ruby version. Sanitize may
+   work in older 2.x Rubies, but they aren't actively tested. Sanitize definitely
+   no longer works in Ruby 1.9.x.
+
+ * Upgraded to Nokogumbo 2.x, which fixes various bugs and adds
+   standard-compliant HTML serialization. [@stevecheckoway - #189][189]
+
+ * Children of the following elements are now removed by default when these
+   elements are removed, rather than being preserved and escaped:
+
+   - `iframe`
+   - `noembed`
+   - `noframes`
+   - `noscript`
+   - `script`
+   - `style`
+
+ * Children of whitelisted `iframe` elements are now always removed. In modern
+   HTML, `iframe` elements should never have children. In HTML 4 and earlier
+   `iframe` elements were allowed to contain fallback content for legacy
+   browsers, but it's been almost two decades since that was useful.
+
+ * Fixed a bug that caused `:remove_contents` to behave as if it were set to
+   `true` when it was actually an Array.
+
+ [189]:https://github.com/rgrove/sanitize/pull/189
+
+ ## 4.6.6 (2018-07-23)
+
+ * Improved performance and memory usage by optimizing `Sanitize#transform_node!`
+   [@stanhu - #183][183]
+
+ [183]:https://github.com/rgrove/sanitize/pull/183
+
+ ## 4.6.5 (2018-05-16)
+
+ * Improved performance slightly by tweaking the order of built-in transformers.
+   [@rafbm - #180][180]
+
+ [180]:https://github.com/rgrove/sanitize/pull/180
+
+ ## 4.6.4 (2018-03-20)
+
+ * Fixed: A change introduced in 4.6.2 broke certain transformers that relied on
+   being able to mutate the name of an HTML node. That change has been reverted
+   and a test has been added to cover this case. [@zetter - #177][177]
+
+ [177]:https://github.com/rgrove/sanitize/issues/177
+
  ## 4.6.3 (2018-03-19)
 
- * Fixed an HTML injection vulnerability that could allow XSS.
+ * [CVE-2018-3740][176]: Fixed an HTML injection vulnerability that could allow
+   XSS.
 
    When Sanitize <= 4.6.2 is used in combination with libxml2 >= 2.9.2, a
    specially crafted HTML fragment can cause libxml2 to generate improperly
@@ -15,6 +93,8 @@
    Many thanks to the Shopify Application Security Team for responsibly reporting
    this issue.
 
+ [176]:https://github.com/rgrove/sanitize/issues/176
+
  ## 4.6.2 (2018-03-19)
 
  * Reduced string allocations to optimize memory usage. [@janklimo - #175][175]
@@ -299,6 +379,26 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
  [n1008]:https://github.com/sparklemotion/nokogiri/issues/1008
 
 
+ ## 2.1.1 (2018-09-30)
+
+ * [CVE-2018-3740][176]: Fixed an HTML injection vulnerability that could allow
+   XSS (backported from Sanitize 4.6.3). [@dometto - #188][188]
+
+   When Sanitize <= 2.1.0 is used in combination with libxml2 >= 2.9.2, a
+   specially crafted HTML fragment can cause libxml2 to generate improperly
+   escaped output, allowing non-whitelisted attributes to be used on whitelisted
+   elements.
+
+   Sanitize now performs additional escaping on affected attributes to prevent
+   this.
+
+   Many thanks to the Shopify Application Security Team for responsibly reporting
+   this issue.
+
+ [176]:https://github.com/rgrove/sanitize/issues/176
+ [188]:https://github.com/rgrove/sanitize/pull/188
+
+
  ## 2.1.0 (2014-01-13)
 
  * Added support for whitelisting arbitrary HTML5 `data-*` attributes. Use the
data/README.md CHANGED
@@ -417,6 +417,17 @@ elements not in this array will be removed.
  ]
  ```
 
+ #### :parser_options (Hash)
+
+ [Parsing options](https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options) supplied to `nokogumbo`.
+
+ ```ruby
+ :parser_options => {
+   max_errors: -1,
+   max_tree_depth: -1
+ }
+ ```
+
  #### :protocols (Hash)
 
  URL protocols to allow in specific attributes. If an attribute is listed here
@@ -441,13 +452,13 @@ include the symbol `:relative` in the protocol array:
 
  #### :remove_contents (boolean or Array or Set)
 
- If set to `true`, Sanitize will remove the contents of any non-whitelisted
+ If this is `true`, Sanitize will remove the contents of any non-whitelisted
  elements in addition to the elements themselves. By default, Sanitize leaves the
  safe parts of an element's contents behind when the element is removed.
 
- If set to an array of element names, then only the contents of the specified
- elements (when filtered) will be removed, and the contents of all other filtered
- elements will be left behind.
+ If this is an Array or Set of element names, then only the contents of the
+ specified elements (when filtered) will be removed, and the contents of all
+ other filtered elements will be left behind.
 
  The default value is `false`.
 
@@ -474,6 +485,15 @@ children, in which case it will be inserted after those children.
  }
  ```
 
+ The default elements with whitespace added before and after are:
+
+ ```
+ address article aside blockquote br dd div dl dt
+ footer h1 h2 h3 h4 h5 h6 header hgroup hr li nav
+ ol p pre section ul
+
+ ```
+
  ## Transformers
 
  Transformers allow you to filter and modify HTML nodes using your own custom
data/lib/sanitize.rb CHANGED
@@ -19,6 +19,20 @@ require_relative 'sanitize/transformers/clean_element'
  class Sanitize
    attr_reader :config
 
+   # Matches one or more control characters that should be removed from HTML
+   # before parsing, as defined by the HTML living standard.
+   #
+   # - https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+   # - https://infra.spec.whatwg.org/#control
+   REGEX_HTML_CONTROL_CHARACTERS = /[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u
+
+   # Matches one or more non-characters that should be removed from HTML before
+   # parsing, as defined by the HTML living standard.
+   #
+   # - https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+   # - https://infra.spec.whatwg.org/#noncharacter
+   REGEX_HTML_NON_CHARACTERS = /[\ufdd0-\ufdef\ufffe\uffff\u{1fffe}\u{1ffff}\u{2fffe}\u{2ffff}\u{3fffe}\u{3ffff}\u{4fffe}\u{4ffff}\u{5fffe}\u{5ffff}\u{6fffe}\u{6ffff}\u{7fffe}\u{7ffff}\u{8fffe}\u{8ffff}\u{9fffe}\u{9ffff}\u{afffe}\u{affff}\u{bfffe}\u{bffff}\u{cfffe}\u{cffff}\u{dfffe}\u{dffff}\u{efffe}\u{effff}\u{ffffe}\u{fffff}\u{10fffe}\u{10ffff}]+/u
+
    # Matches an attribute value that could be treated by a browser as a URL
    # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
    # or more characters followed by a colon is considered a match, even if the
@@ -26,11 +40,12 @@ class Sanitize
    # IE6 and Opera will still parse).
    REGEX_PROTOCOL = /\A\s*([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i
 
-   # Matches Unicode characters that should be stripped from HTML before passing
-   # it to the parser.
+   # Matches one or more characters that should be stripped from HTML before
+   # parsing. This is a combination of `REGEX_HTML_CONTROL_CHARACTERS` and
+   # `REGEX_HTML_NON_CHARACTERS`.
    #
-   # http://www.w3.org/TR/unicode-xml/#Charlist
-   REGEX_UNSUITABLE_CHARS = /[\u0000\u0340\u0341\u17a3\u17d3\u2028\u2029\u202a-\u202e\u206a-\u206f\ufff9-\ufffb\ufeff\ufffc\u{1d173}-\u{1d17a}\u{e0000}-\u{e007f}]/u
+   # https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+   REGEX_UNSUITABLE_CHARS = /(?:#{REGEX_HTML_CONTROL_CHARACTERS}|#{REGEX_HTML_NON_CHARACTERS})/u
 
    #--
    # Class Methods
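
As a rough illustration of what the new preprocessing constants do, the sketch below copies `REGEX_HTML_CONTROL_CHARACTERS` from the hunk above, substitutes a shortened stand-in for the non-character list (only the entries in the Basic Multilingual Plane), and strips matches with `gsub`. The names `NON_CHARACTERS_SUBSET`, `UNSUITABLE` and `strip_unsuitable` are hypothetical and are not part of Sanitize.

```ruby
# Copied from the diff above. Note that tab (U+0009), LF (U+000A),
# FF (U+000C) and CR (U+000D) fall outside these ranges and are kept.
REGEX_HTML_CONTROL_CHARACTERS = /[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u

# Hypothetical, shortened stand-in for REGEX_HTML_NON_CHARACTERS: only the
# BMP non-characters are listed here to keep the example short.
NON_CHARACTERS_SUBSET = /[\ufdd0-\ufdef\ufffe\uffff]+/u

# Interpolating a Regexp into another Regexp embeds it as a sub-pattern,
# which is how the combined constant in the diff is built.
UNSUITABLE = /(?:#{REGEX_HTML_CONTROL_CHARACTERS}|#{NON_CHARACTERS_SUBSET})/u

# Hypothetical helper: returns a copy of +html+ without unsuitable characters.
def strip_unsuitable(html)
  html.gsub(UNSUITABLE, '')
end

input = "Hello\u0001 \u007fWorld\ufffe!\tok\n"
puts strip_unsuitable(input).inspect
```

Whitespace such as the tab and the trailing newline survives because the control-character class deliberately skips those code points.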
@@ -81,6 +96,7 @@ class Sanitize
 
      # Default transformers always run at the end of the chain, after any custom
      # transformers.
+     @transformers << Transformers::CleanElement.new(@config)
      @transformers << Transformers::CleanComment unless @config[:allow_comments]
 
      if @config[:elements].include?('style')
@@ -93,10 +109,10 @@ class Sanitize
        @transformers << Transformers::CSS::CleanAttribute.new(scss)
      end
 
-     @transformers <<
-       Transformers::CleanDoctype <<
-       Transformers::CleanCDATA <<
-       Transformers::CleanElement.new(@config)
+     @transformers << Transformers::CleanDoctype
+     @transformers << Transformers::CleanCDATA
+
+     @transformer_config = { config: @config }
    end
 
    # Returns a sanitized copy of the given _html_ document.
@@ -107,7 +123,7 @@ class Sanitize
    def document(html)
      return '' unless html
 
-     doc = Nokogiri::HTML5.parse(preprocess(html))
+     doc = Nokogiri::HTML5.parse(preprocess(html), **@config[:parser_options])
      node!(doc)
      to_html(doc)
    end
@@ -119,20 +135,7 @@ class Sanitize
    def fragment(html)
      return '' unless html
 
-     html = preprocess(html)
-     doc = Nokogiri::HTML5.parse("<html><body>#{html}")
-
-     # Hack to allow fragments containing <body>. Borrowed from
-     # Nokogiri::HTML::DocumentFragment.
-     if html =~ /\A<body(?:\s|>)/i
-       path = '/html/body'
-     else
-       path = '/html/body/node()'
-     end
-
-     frag = doc.fragment
-     frag << doc.xpath(path)
-
+     frag = Nokogiri::HTML5.fragment(preprocess(html), **@config[:parser_options])
      node!(frag)
      to_html(frag)
    end
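
The `**@config[:parser_options]` calls in `document` and `fragment` expand a Hash into keyword arguments. The sketch below shows that mechanism in isolation with a hypothetical `fake_parse` method. It is not Nokogumbo's API: only the keyword names `max_errors` and `max_tree_depth` come from the README hunk, and the default values are placeholders chosen for illustration.

```ruby
# Hypothetical stand-in for a parser entry point; NOT Nokogumbo's real API.
# The default values below are illustrative assumptions.
def fake_parse(html, max_errors: 0, max_tree_depth: 400)
  { html: html, max_errors: max_errors, max_tree_depth: max_tree_depth }
end

# Hypothetical wrapper mirroring the shape of Sanitize#fragment: the
# :parser_options Hash is splatted into keyword arguments.
def parse_with_config(html, config)
  fake_parse(html, **config[:parser_options])
end

custom   = parse_with_config('<p>hi</p>', parser_options: { max_errors: -1, max_tree_depth: -1 })
defaults = parse_with_config('<p>hi</p>', parser_options: {})

puts custom.inspect
puts defaults.inspect
```

With an empty Hash, which is the default value the diff adds for `:parser_options`, the splat contributes nothing and the callee's own defaults apply. In this sketch an unrecognised key would raise `ArgumentError`, because `fake_parse` declares its keywords explicitly.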
@@ -183,50 +186,25 @@ class Sanitize
    end
 
    def to_html(node)
-     replace_meta = false
-
-     # Hacky workaround for a libxml2 bug that adds an undesired Content-Type
-     # meta tag to all serialized HTML documents.
-     #
-     # https://github.com/sparklemotion/nokogiri/issues/1008
-     if node.type == Nokogiri::XML::Node::DOCUMENT_NODE ||
-         node.type == Nokogiri::XML::Node::HTML_DOCUMENT_NODE
-
-       regex_meta = %r|(<html[^>]*>\s*<head[^>]*>\s*)<meta http-equiv="Content-Type" content="text/html; charset=utf-8">|i
-
-       # Only replace the content-type meta tag if <meta> isn't whitelisted or
-       # the original document didn't actually include a content-type meta tag.
-       replace_meta = !@config[:elements].include?('meta') ||
-         node.xpath('/html/head/meta[@http-equiv]').none? do |meta|
-           meta['http-equiv'].casecmp('content-type').zero?
-         end
-     end
-
-     so = Nokogiri::XML::Node::SaveOptions
-
-     # Serialize to HTML without any formatting to prevent Nokogiri from adding
-     # newlines after certain tags.
-     html = node.to_html(
-       :encoding => 'utf-8',
-       :indent => 0,
-       :save_with => so::NO_DECLARATION | so::NO_EMPTY_TAGS | so::AS_HTML
-     )
-
-     html.gsub!(regex_meta, '\1') if replace_meta
-     html
+     node.to_html(preserve_newline: true)
    end
 
    def transform_node!(node, node_whitelist)
-     node_name = node.name.downcase
-
      @transformers.each do |transformer|
-       result = transformer.call(
-         :config => @config,
-         :is_whitelisted => node_whitelist.include?(node),
-         :node => node,
-         :node_name => node_name,
-         :node_whitelist => node_whitelist
-       )
+       # Since transform_node! may be called in a tight loop to process thousands
+       # of items, we can optimize both memory and CPU performance by:
+       #
+       # 1. Reusing the same config hash for each transformer
+       # 2. Directly assigning values to hash instead of using merge!. Not only
+       #    does merge! create a new hash, it is also 2.6x slower:
+       #    https://github.com/JuanitoFatas/fast-ruby#hashmerge-vs-hashmerge-code
+       config = @transformer_config
+       config[:is_whitelisted] = node_whitelist.include?(node)
+       config[:node] = node
+       config[:node_name] = node.name.downcase
+       config[:node_whitelist] = node_whitelist
+
+       result = transformer.call(config)
 
        if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
          node_whitelist.merge(result[:node_whitelist])
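
The change to `transform_node!` swaps a freshly built Hash per call for one shared Hash whose per-node keys are overwritten. The sketch below contrasts the two styles outside of Sanitize. `RecordingTransformer`, `transform_with_fresh_hashes` and `transform_with_shared_hash` are hypothetical names used only for illustration.

```ruby
# Hypothetical transformer that keeps a reference to every env Hash it is
# handed, so the aliasing behaviour of the two styles becomes visible.
class RecordingTransformer
  attr_reader :envs

  def initialize
    @envs = []
  end

  def call(env)
    @envs << env
    nil
  end
end

# Style of the removed lines: allocate a new Hash for every call.
def transform_with_fresh_hashes(transformer, config, node_names)
  node_names.each do |name|
    transformer.call({ config: config, node_name: name })
  end
end

# Style of the added lines: create one Hash up front and overwrite its
# per-node keys before each call.
def transform_with_shared_hash(transformer, config, node_names)
  shared = { config: config }
  node_names.each do |name|
    shared[:node_name] = name
    transformer.call(shared)
  end
end

fresh = RecordingTransformer.new
transform_with_fresh_hashes(fresh, {}, %w[p b i])
puts fresh.envs.map { |env| env[:node_name] }.inspect   # ["p", "b", "i"]

shared = RecordingTransformer.new
transform_with_shared_hash(shared, {}, %w[p b i])
puts shared.envs.map { |env| env[:node_name] }.inspect  # ["i", "i", "i"]
```

The second output shows the trade-off: the shared Hash avoids an allocation per call, but a callable that retains the Hash it receives will observe later overwrites. A transformer that needs to keep values around would have to copy them out during the call.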
@@ -56,6 +56,10 @@ class Sanitize
        # that all HTML will be stripped).
        :elements => [],
 
+       # HTML parsing options to pass to Nokogumbo.
+       # https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options
+       :parser_options => {},
+
        # URL handling protocols to allow in specific attributes. By default, no
        # protocols are allowed. Use :relative in place of a protocol if you want
        # to allow relative URLs sans protocol.
@@ -66,10 +70,12 @@ class Sanitize
        # leaves the safe parts of an element's contents behind when the element
        # is removed.
        #
-       # If this is an Array of element names, then only the contents of the
-       # specified elements (when filtered) will be removed, and the contents of
-       # all other filtered elements will be left behind.
-       :remove_contents => false,
+       # If this is an Array or Set of element names, then only the contents of
+       # the specified elements (when filtered) will be removed, and the contents
+       # of all other filtered elements will be left behind.
+       :remove_contents => %w[
+         iframe noembed noframes noscript script style
+       ],
 
        # Transformers allow you to filter or alter nodes using custom logic. See
        # README.md for details and examples.
@@ -67,7 +67,7 @@ class Sanitize; module Transformers; class CleanElement
        @whitespace_elements = config[:whitespace_elements]
      end
 
-     if config[:remove_contents].is_a?(Set)
+     if config[:remove_contents].is_a?(Enumerable)
        @remove_element_contents.merge(config[:remove_contents].map(&:to_s))
      else
        @remove_all_contents = !!config[:remove_contents]
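
The one-word change from `Set` to `Enumerable` is what fixes the `:remove_contents` Array bug listed under 5.0.0 in HISTORY.md. The sketch below models the branch with a hypothetical `interpret_remove_contents` helper that takes the class to test against as a parameter, so the old and new behaviour can be compared side by side. It is a simplified model, not the actual `CleanElement` code.

```ruby
require 'set'

# Hypothetical model of the branch in the hunk above. +check+ is the class
# used in the is_a? test: Set reproduces the old code, Enumerable the new.
def interpret_remove_contents(value, check)
  element_names = Set.new
  remove_all = false

  if value.is_a?(check)
    element_names.merge(value.map(&:to_s))
  else
    # Any truthy non-collection value means "remove all contents".
    remove_all = !!value
  end

  { remove_all: remove_all, element_names: element_names }
end

old_result = interpret_remove_contents(%w[script style], Set)
new_result = interpret_remove_contents(%w[script style], Enumerable)

puts old_result.inspect
puts new_result.inspect
```

With the `Set` check, an Array falls through to the `else` branch and, being truthy, is treated like `true`. With the `Enumerable` check, both Arrays and Sets populate the element list, while `true` and `false` still take the boolean path.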
@@ -97,8 +97,10 @@ class Sanitize; module Transformers; class CleanElement
          end
        end
 
-       unless @remove_all_contents || @remove_element_contents.include?(name)
-         node.add_previous_sibling(node.children)
+       unless node.children.empty?
+         unless @remove_all_contents || @remove_element_contents.include?(name)
+           node.add_previous_sibling(node.children)
+         end
        end
 
        node.unlink
@@ -166,6 +168,11 @@ class Sanitize; module Transformers; class CleanElement
          # affected attributes, some of which can exist on any element and some
          # of which can only exist on `<a>` elements.
          #
+         # This fix is technically no longer necessary with Nokogumbo >= 2.0
+         # since it no longer uses libxml2's serializer, but it's retained to
+         # avoid breaking use cases where people might be sanitizing individual
+         # Nokogiri nodes and then serializing them manually without Nokogumbo.
+         #
          # The relevant libxml2 code is here:
          # <https://github.com/GNOME/libxml2/commit/960f0e275616cadc29671a218d7fb9b69eb35588>
          if UNSAFE_LIBXML_ATTRS_GLOBAL.include?(attr_name) ||
@@ -180,6 +187,40 @@ class Sanitize; module Transformers; class CleanElement
      if @add_attributes.include?(name)
        @add_attributes[name].each {|key, val| node[key] = val }
      end
+
+     # Element-specific special cases.
+     case name
+
+     # If this is a whitelisted iframe that has children, remove all its
+     # children. The HTML standard says iframes shouldn't have content, but when
+     # they do, this content is parsed as text and is serialized verbatim without
+     # being escaped, which is unsafe because legacy browsers may still render it
+     # and execute `<script>` content. So the safe and correct thing to do is to
+     # always remove iframe content.
+     when 'iframe'
+       if !node.children.empty?
+         node.children.each do |child|
+           child.unlink
+         end
+       end
+
+     # Prevent the use of `<meta>` elements that set a charset other than UTF-8,
+     # since Sanitize's output is always UTF-8.
+     when 'meta'
+       if node.has_attribute?('charset') &&
+           node['charset'].downcase != 'utf-8'
+
+         node['charset'] = 'utf-8'
+       end
+
+       if node.has_attribute?('http-equiv') &&
+           node.has_attribute?('content') &&
+           node['http-equiv'].downcase == 'content-type' &&
+           node['content'].downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
+
+         node['content'] = node['content'].gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
+       end
+     end
    end
 
  end; end; end
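
To make the `content` rewriting in the `meta` branch easier to follow, the sketch below applies the same two regular expressions to a plain String rather than a Nokogiri node attribute. The `force_utf8_content_type` helper is hypothetical and not part of Sanitize.

```ruby
# Hypothetical helper: takes the value of a <meta http-equiv="content-type">
# element's content attribute and forces its charset parameter to utf-8,
# using the same patterns as the hunk above.
def force_utf8_content_type(content)
  # The negative lookahead only lets the match succeed when the charset
  # value does not start with "utf-8".
  if content.downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
    # Replace everything from the charset parameter to the end of the value.
    content.gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
  else
    content
  end
end

puts force_utf8_content_type('text/html; charset=iso-8859-1')
puts force_utf8_content_type('text/html; charset=utf-8')
```

One detail worth noting in this sketch: the check runs against a downcased copy while the substitution is case-sensitive and runs on the original string, so a value such as `text/html; Charset=latin1` passes the check but is returned unchanged.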
@@ -1,5 +1,5 @@
  # encoding: utf-8
 
  class Sanitize
-   VERSION = '4.6.3'
+   VERSION = '5.1.0'
  end