sanitize 4.6.3 → 5.1.0
- checksums.yaml +4 -4
- data/HISTORY.md +101 -1
- data/README.md +24 -4
- data/lib/sanitize.rb +41 -63
- data/lib/sanitize/config/default.rb +10 -4
- data/lib/sanitize/transformers/clean_element.rb +44 -3
- data/lib/sanitize/version.rb +1 -1
- data/test/common.rb +0 -31
- data/test/test_clean_comment.rb +1 -5
- data/test/test_clean_css.rb +1 -1
- data/test/test_clean_doctype.rb +8 -8
- data/test/test_clean_element.rb +108 -23
- data/test/test_malicious_html.rb +30 -6
- data/test/test_parser.rb +2 -31
- data/test/test_sanitize.rb +102 -17
- data/test/test_sanitize_css.rb +39 -12
- data/test/test_transformers.rb +22 -4
- metadata +12 -14
- data/test/test_unicode.rb +0 -95
checksums.yaml
CHANGED
@@ -1,7 +1,7 @@
 ---
 SHA256:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 8cf7bac25cea64ed464d106bdc57019388598ca9f1a4e7d8eddf3a98bab12267
+  data.tar.gz: e8b1f402b0d67a825b0ad4aad83829816fd9c78cd8445879636cba0a282e8ee5
 SHA512:
-  metadata.gz:
-  data.tar.gz:
+  metadata.gz: 956edaca6569a5933223da0aa7dcac4880b5164aa59e37256ac896c9fefb271da71425defe7e09e241b1333b441f5a2629893abed6d5a2a47d0726bf03597614
+  data.tar.gz: e45a018b904bcf8cb996f8ed08427e80b8ce058c4fe414782460c5496e88bb6c2a4055304118057621a630e514b4f96bac11bdc686181a6f0097dc7bf912ab04
data/HISTORY.md
CHANGED
@@ -1,8 +1,86 @@
 # Sanitize History
 
+## 5.1.0 (2019-09-07)
+
+### Features
+
+* Added a `:parser_options` config hash, which makes it possible to pass custom
+  parsing options to Nokogumbo. [@austin-wang - #194][194]
+
+### Bug Fixes
+
+* Non-characters and non-whitespace control characters are now stripped from
+  HTML input before parsing to comply with the HTML Standard's [preprocessing
+  guidelines][html-preprocessing]. Prior to this Sanitize had adhered to [older
+  W3C guidelines][unicode-xml] that have since been withdrawn. [#179][179]
+
+[179]:https://github.com/rgrove/sanitize/issues/179
+[194]:https://github.com/rgrove/sanitize/pull/194
+[html-preprocessing]:https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+[unicode-xml]:https://www.w3.org/TR/unicode-xml/
+
+## 5.0.0 (2018-10-14)
+
+For most users, upgrading from 4.x shouldn't require any changes. However, the
+minimum required Ruby version has changed, and Sanitize 5.x's HTML output may
+differ in some small ways from 4.x's output. If this matters to you, please
+review the changes below carefully.
+
+### Potentially Breaking Changes
+
+* Ruby 2.3.0 is now the oldest officially supported Ruby version. Sanitize may
+  work in older 2.x Rubies, but they aren't actively tested. Sanitize definitely
+  no longer works in Ruby 1.9.x.
+
+* Upgraded to Nokogumbo 2.x, which fixes various bugs and adds
+  standard-compliant HTML serialization. [@stevecheckoway - #189][189]
+
+* Children of the following elements are now removed by default when these
+  elements are removed, rather than being preserved and escaped:
+
+  - `iframe`
+  - `noembed`
+  - `noframes`
+  - `noscript`
+  - `script`
+  - `style`
+
+* Children of whitelisted `iframe` elements are now always removed. In modern
+  HTML, `iframe` elements should never have children. In HTML 4 and earlier
+  `iframe` elements were allowed to contain fallback content for legacy
+  browsers, but it's been almost two decades since that was useful.
+
+* Fixed a bug that caused `:remove_contents` to behave as if it were set to
+  `true` when it was actually an Array.
+
+[189]:https://github.com/rgrove/sanitize/pull/189
+
+## 4.6.6 (2018-07-23)
+
+* Improved performance and memory usage by optimizing `Sanitize#transform_node!`
+  [@stanhu - #183][183]
+
+[183]:https://github.com/rgrove/sanitize/pull/183
+
+## 4.6.5 (2018-05-16)
+
+* Improved performance slightly by tweaking the order of built-in transformers.
+  [@rafbm - #180][180]
+
+[180]:https://github.com/rgrove/sanitize/pull/180
+
+## 4.6.4 (2018-03-20)
+
+* Fixed: A change introduced in 4.6.2 broke certain transformers that relied on
+  being able to mutate the name of an HTML node. That change has been reverted
+  and a test has been added to cover this case. [@zetter - #177][177]
+
+[177]:https://github.com/rgrove/sanitize/issues/177
+
 ## 4.6.3 (2018-03-19)
 
-* Fixed an HTML injection vulnerability that could allow
+* [CVE-2018-3740][176]: Fixed an HTML injection vulnerability that could allow
+  XSS.
 
   When Sanitize <= 4.6.2 is used in combination with libxml2 >= 2.9.2, a
   specially crafted HTML fragment can cause libxml2 to generate improperly
@@ -15,6 +93,8 @@
   Many thanks to the Shopify Application Security Team for responsibly reporting
   this issue.
 
+[176]:https://github.com/rgrove/sanitize/issues/176
+
 ## 4.6.2 (2018-03-19)
 
 * Reduced string allocations to optimize memory usage. [@janklimo - #175][175]
@@ -299,6 +379,26 @@ Sanitize.fragment(html, Sanitize::Config.merge(Sanitize::Config::BASIC,
 [n1008]:https://github.com/sparklemotion/nokogiri/issues/1008
 
 
+## 2.1.1 (2018-09-30)
+
+* [CVE-2018-3740][176]: Fixed an HTML injection vulnerability that could allow
+  XSS (backported from Sanitize 4.6.3). [@dometto - #188][188]
+
+  When Sanitize <= 2.1.0 is used in combination with libxml2 >= 2.9.2, a
+  specially crafted HTML fragment can cause libxml2 to generate improperly
+  escaped output, allowing non-whitelisted attributes to be used on whitelisted
+  elements.
+
+  Sanitize now performs additional escaping on affected attributes to prevent
+  this.
+
+  Many thanks to the Shopify Application Security Team for responsibly reporting
+  this issue.
+
+[176]:https://github.com/rgrove/sanitize/issues/176
+[188]:https://github.com/rgrove/sanitize/pull/188
+
+
 ## 2.1.0 (2014-01-13)
 
 * Added support for whitelisting arbitrary HTML5 `data-*` attributes. Use the
data/README.md
CHANGED
@@ -417,6 +417,17 @@ elements not in this array will be removed.
 ]
 ```
 
+#### :parser_options (Hash)
+
+[Parsing options](https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options) supplied to `nokogumbo`.
+
+```ruby
+:parser_options => {
+  max_errors: -1,
+  max_tree_depth: -1
+}
+```
+
 #### :protocols (Hash)
 
 URL protocols to allow in specific attributes. If an attribute is listed here
@@ -441,13 +452,13 @@ include the symbol `:relative` in the protocol array:
 
 #### :remove_contents (boolean or Array or Set)
 
-If
+If this is `true`, Sanitize will remove the contents of any non-whitelisted
 elements in addition to the elements themselves. By default, Sanitize leaves the
 safe parts of an element's contents behind when the element is removed.
 
-If
-elements (when filtered) will be removed, and the contents of all
-elements will be left behind.
+If this is an Array or Set of element names, then only the contents of the
+specified elements (when filtered) will be removed, and the contents of all
+other filtered elements will be left behind.
 
 The default value is `false`.
 
@@ -474,6 +485,15 @@ children, in which case it will be inserted after those children.
 }
 ```
 
+The default elements with whitespace added before and after are:
+
+```
+address article aside blockquote br dd div dl dt
+footer h1 h2 h3 h4 h5 h6 header hgroup hr li nav
+ol p pre section ul
+
+```
+
 ## Transformers
 
 Transformers allow you to filter and modify HTML nodes using your own custom
data/lib/sanitize.rb
CHANGED
@@ -19,6 +19,20 @@ require_relative 'sanitize/transformers/clean_element'
 class Sanitize
   attr_reader :config
 
+  # Matches one or more control characters that should be removed from HTML
+  # before parsing, as defined by the HTML living standard.
+  #
+  # - https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+  # - https://infra.spec.whatwg.org/#control
+  REGEX_HTML_CONTROL_CHARACTERS = /[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u
+
+  # Matches one or more non-characters that should be removed from HTML before
+  # parsing, as defined by the HTML living standard.
+  #
+  # - https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+  # - https://infra.spec.whatwg.org/#noncharacter
+  REGEX_HTML_NON_CHARACTERS = /[\ufdd0-\ufdef\ufffe\uffff\u{1fffe}\u{1ffff}\u{2fffe}\u{2ffff}\u{3fffe}\u{3ffff}\u{4fffe}\u{4ffff}\u{5fffe}\u{5ffff}\u{6fffe}\u{6ffff}\u{7fffe}\u{7ffff}\u{8fffe}\u{8ffff}\u{9fffe}\u{9ffff}\u{afffe}\u{affff}\u{bfffe}\u{bffff}\u{cfffe}\u{cffff}\u{dfffe}\u{dffff}\u{efffe}\u{effff}\u{ffffe}\u{fffff}\u{10fffe}\u{10ffff}]+/u
+
   # Matches an attribute value that could be treated by a browser as a URL
   # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
   # or more characters followed by a colon is considered a match, even if the
@@ -26,11 +40,12 @@ class Sanitize
   # IE6 and Opera will still parse).
   REGEX_PROTOCOL = /\A\s*([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i
 
-  # Matches
-  #
+  # Matches one or more characters that should be stripped from HTML before
+  # parsing. This is a combination of `REGEX_HTML_CONTROL_CHARACTERS` and
+  # `REGEX_HTML_NON_CHARACTERS`.
   #
-  #
-  REGEX_UNSUITABLE_CHARS = /
+  # https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
+  REGEX_UNSUITABLE_CHARS = /(?:#{REGEX_HTML_CONTROL_CHARACTERS}|#{REGEX_HTML_NON_CHARACTERS})/u
 
   #--
   # Class Methods
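The stripping step described by these constants can be sketched with the standard library alone. This is a minimal illustration, not Sanitize's own preprocessing code: `strip_unsuitable` is a hypothetical helper, the use of `gsub` is an assumption about how the combined pattern is applied, and `NON_CHARS_BMP` is a deliberately abbreviated stand-in for `REGEX_HTML_NON_CHARACTERS` that lists only the non-characters in the Basic Multilingual Plane.

```ruby
# Control characters other than tab, LF, FF, and CR (same ranges as the hunk above).
CONTROL_CHARS = /[\u0001-\u0008\u000b\u000e-\u001f\u007f-\u009f]+/u

# Abbreviated for illustration: only the BMP non-characters are listed here.
# The constant in the diff also covers U+nFFFE and U+nFFFF in every other plane.
NON_CHARS_BMP = /[\ufdd0-\ufdef\ufffe\uffff]+/u

# Interpolating a Regexp into another Regexp embeds it as a group, so the
# combined pattern matches a run from either character class.
UNSUITABLE = /(?:#{CONTROL_CHARS}|#{NON_CHARS_BMP})/u

# Hypothetical helper: returns a copy of +html+ with unsuitable characters removed.
def strip_unsuitable(html)
  html.gsub(UNSUITABLE, '')
end

cleaned   = strip_unsuitable("foo\u0001bar\ufdd0baz")
untouched = strip_unsuitable("a\tb\n")
```

Whitespace controls such as tab and newline fall outside the listed ranges, so ordinary formatting in the input is left alone.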
@@ -81,6 +96,7 @@ class Sanitize
 
     # Default transformers always run at the end of the chain, after any custom
     # transformers.
+    @transformers << Transformers::CleanElement.new(@config)
     @transformers << Transformers::CleanComment unless @config[:allow_comments]
 
     if @config[:elements].include?('style')
@@ -93,10 +109,10 @@ class Sanitize
       @transformers << Transformers::CSS::CleanAttribute.new(scss)
     end
 
-    @transformers <<
-
-
-
+    @transformers << Transformers::CleanDoctype
+    @transformers << Transformers::CleanCDATA
+
+    @transformer_config = { config: @config }
   end
 
   # Returns a sanitized copy of the given _html_ document.
@@ -107,7 +123,7 @@ class Sanitize
   def document(html)
     return '' unless html
 
-    doc = Nokogiri::HTML5.parse(preprocess(html))
+    doc = Nokogiri::HTML5.parse(preprocess(html), **@config[:parser_options])
     node!(doc)
     to_html(doc)
   end
@@ -119,20 +135,7 @@ class Sanitize
   def fragment(html)
     return '' unless html
 
-
-    doc = Nokogiri::HTML5.parse("<html><body>#{html}")
-
-    # Hack to allow fragments containing <body>. Borrowed from
-    # Nokogiri::HTML::DocumentFragment.
-    if html =~ /\A<body(?:\s|>)/i
-      path = '/html/body'
-    else
-      path = '/html/body/node()'
-    end
-
-    frag = doc.fragment
-    frag << doc.xpath(path)
-
+    frag = Nokogiri::HTML5.fragment(preprocess(html), **@config[:parser_options])
     node!(frag)
     to_html(frag)
   end
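Both `document` and `fragment` now forward `@config[:parser_options]` with a double splat, which turns the hash's symbol keys into keyword arguments. The sketch below shows that mechanism in isolation. `fake_parse` is a hypothetical stand-in for the parser entry point, and its default values are placeholders rather than Nokogumbo's documented defaults.

```ruby
# Hypothetical stand-in for a parser method that accepts keyword options.
# Option names follow the README example; the defaults are placeholders.
def fake_parse(html, max_errors: 0, max_tree_depth: 400)
  { html: html, max_errors: max_errors, max_tree_depth: max_tree_depth }
end

custom_config  = { parser_options: { max_errors: -1, max_tree_depth: -1 } }
default_config = { parser_options: {} }

# `**hash` expands each key/value pair into a keyword argument.
custom = fake_parse('<p>hi</p>', **custom_config[:parser_options])

# Splatting an empty hash passes no keywords, so the method's defaults apply.
plain = fake_parse('<p>hi</p>', **default_config[:parser_options])
```

This is why the default config value of `{}` is safe to splat: with no keys present, the parser is called exactly as it was before the option existed.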
@@ -183,50 +186,25 @@ class Sanitize
   end
 
   def to_html(node)
-
-
-    # Hacky workaround for a libxml2 bug that adds an undesired Content-Type
-    # meta tag to all serialized HTML documents.
-    #
-    # https://github.com/sparklemotion/nokogiri/issues/1008
-    if node.type == Nokogiri::XML::Node::DOCUMENT_NODE ||
-        node.type == Nokogiri::XML::Node::HTML_DOCUMENT_NODE
-
-      regex_meta = %r|(<html[^>]*>\s*<head[^>]*>\s*)<meta http-equiv="Content-Type" content="text/html; charset=utf-8">|i
-
-      # Only replace the content-type meta tag if <meta> isn't whitelisted or
-      # the original document didn't actually include a content-type meta tag.
-      replace_meta = !@config[:elements].include?('meta') ||
-        node.xpath('/html/head/meta[@http-equiv]').none? do |meta|
-          meta['http-equiv'].casecmp('content-type').zero?
-        end
-    end
-
-    so = Nokogiri::XML::Node::SaveOptions
-
-    # Serialize to HTML without any formatting to prevent Nokogiri from adding
-    # newlines after certain tags.
-    html = node.to_html(
-      :encoding => 'utf-8',
-      :indent => 0,
-      :save_with => so::NO_DECLARATION | so::NO_EMPTY_TAGS | so::AS_HTML
-    )
-
-    html.gsub!(regex_meta, '\1') if replace_meta
-    html
+    node.to_html(preserve_newline: true)
   end
 
   def transform_node!(node, node_whitelist)
-    node_name = node.name.downcase
-
     @transformers.each do |transformer|
-
-
-
-
-
-
-
+      # Since transform_node! may be called in a tight loop to process thousands
+      # of items, we can optimize both memory and CPU performance by:
+      #
+      # 1. Reusing the same config hash for each transformer
+      # 2. Directly assigning values to hash instead of using merge!. Not only
+      #    does merge! create a new hash, it is also 2.6x slower:
+      #    https://github.com/JuanitoFatas/fast-ruby#hashmerge-vs-hashmerge-code
+      config = @transformer_config
+      config[:is_whitelisted] = node_whitelist.include?(node)
+      config[:node] = node
+      config[:node_name] = node.name.downcase
+      config[:node_whitelist] = node_whitelist
+
+      result = transformer.call(config)
 
       if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
         node_whitelist.merge(result[:node_whitelist])
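The hash-reuse pattern in the new `transform_node!` can be illustrated without Nokogiri. The following is a simplified sketch under stated assumptions: `Runner` and `recorder` are hypothetical names, and a plain string stands in for the node. It shows one context hash being allocated once and then updated by direct key assignment on every call.

```ruby
# Hypothetical runner that reuses a single context hash across calls instead
# of building a new hash for every node.
class Runner
  def initialize(transformers, config)
    @transformers = transformers
    @context = { config: config } # allocated once, like @transformer_config
  end

  # Calls every transformer with the shared context and collects the results.
  def run(node_name)
    @transformers.map do |transformer|
      context = @context
      context[:node_name] = node_name.downcase # direct assignment, no new hash
      transformer.call(context)
    end
  end
end

seen_ids = []
# Records which hash object it was handed and returns the node name it saw.
recorder = lambda do |ctx|
  seen_ids << ctx.object_id
  ctx[:node_name]
end

runner = Runner.new([recorder], { elements: [] })
first  = runner.run('DIV')
second = runner.run('P')
```

One consequence of this design, at least in the sketch, is that a transformer should not hold on to the hash it receives, because the same object is overwritten on the next call.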
data/lib/sanitize/config/default.rb
CHANGED
@@ -56,6 +56,10 @@ class Sanitize
       # that all HTML will be stripped).
       :elements => [],
 
+      # HTML parsing options to pass to Nokogumbo.
+      # https://github.com/rubys/nokogumbo/tree/v2.0.1#parsing-options
+      :parser_options => {},
+
       # URL handling protocols to allow in specific attributes. By default, no
       # protocols are allowed. Use :relative in place of a protocol if you want
       # to allow relative URLs sans protocol.
@@ -66,10 +70,12 @@ class Sanitize
       # leaves the safe parts of an element's contents behind when the element
       # is removed.
       #
-      # If this is an Array of element names, then only the contents of
-      # specified elements (when filtered) will be removed, and the contents
-      # all other filtered elements will be left behind.
-      :remove_contents =>
+      # If this is an Array or Set of element names, then only the contents of
+      # the specified elements (when filtered) will be removed, and the contents
+      # of all other filtered elements will be left behind.
+      :remove_contents => %w[
+        iframe noembed noframes noscript script style
+      ],
 
       # Transformers allow you to filter or alter nodes using custom logic. See
       # README.md for details and examples.
data/lib/sanitize/transformers/clean_element.rb
CHANGED
@@ -67,7 +67,7 @@ class Sanitize; module Transformers; class CleanElement
       @whitespace_elements = config[:whitespace_elements]
     end
 
-    if config[:remove_contents].is_a?(
+    if config[:remove_contents].is_a?(Enumerable)
       @remove_element_contents.merge(config[:remove_contents].map(&:to_s))
     else
       @remove_all_contents = !!config[:remove_contents]
@@ -97,8 +97,10 @@ class Sanitize; module Transformers; class CleanElement
         end
       end
 
-      unless
-
+      unless node.children.empty?
+        unless @remove_all_contents || @remove_element_contents.include?(name)
+          node.add_previous_sibling(node.children)
+        end
       end
 
       node.unlink
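Taken together, the `is_a?(Enumerable)` check and the nested `unless` above decide whether a removed element's children are kept. A simplified model of that decision, assuming nothing beyond the standard library, might look like the sketch below. `ContentsPolicy` and `keep_children?` are hypothetical names introduced for illustration and are not part of Sanitize.

```ruby
require 'set'

# Hypothetical, simplified model of how :remove_contents is interpreted.
class ContentsPolicy
  def initialize(remove_contents)
    @remove_all_contents = false
    @remove_element_contents = Set.new

    if remove_contents.is_a?(Enumerable)
      # Array or Set of element names: drop contents only for those elements.
      @remove_element_contents.merge(remove_contents.map(&:to_s))
    else
      # true, false, or nil: an all-or-nothing switch.
      @remove_all_contents = !!remove_contents
    end
  end

  # True when the children of a removed element should be left in its place.
  def keep_children?(name)
    !(@remove_all_contents || @remove_element_contents.include?(name))
  end
end

listed   = ContentsPolicy.new(%w[script style])
all      = ContentsPolicy.new(true)
none     = ContentsPolicy.new(false)
from_set = ContentsPolicy.new(Set[:iframe])
```

Checking for `Enumerable` means Arrays and Sets take the same branch, and symbol names are normalized to strings by `map(&:to_s)` before lookup.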
@@ -166,6 +168,11 @@ class Sanitize; module Transformers; class CleanElement
         # affected attributes, some of which can exist on any element and some
         # of which can only exist on `<a>` elements.
         #
+        # This fix is technically no longer necessary with Nokogumbo >= 2.0
+        # since it no longer uses libxml2's serializer, but it's retained to
+        # avoid breaking use cases where people might be sanitizing individual
+        # Nokogiri nodes and then serializing them manually without Nokogumbo.
+        #
         # The relevant libxml2 code is here:
         # <https://github.com/GNOME/libxml2/commit/960f0e275616cadc29671a218d7fb9b69eb35588>
         if UNSAFE_LIBXML_ATTRS_GLOBAL.include?(attr_name) ||
@@ -180,6 +187,40 @@ class Sanitize; module Transformers; class CleanElement
     if @add_attributes.include?(name)
       @add_attributes[name].each {|key, val| node[key] = val }
     end
+
+    # Element-specific special cases.
+    case name
+
+    # If this is a whitelisted iframe that has children, remove all its
+    # children. The HTML standard says iframes shouldn't have content, but when
+    # they do, this content is parsed as text and is serialized verbatim without
+    # being escaped, which is unsafe because legacy browsers may still render it
+    # and execute `<script>` content. So the safe and correct thing to do is to
+    # always remove iframe content.
+    when 'iframe'
+      if !node.children.empty?
+        node.children.each do |child|
+          child.unlink
+        end
+      end
+
+    # Prevent the use of `<meta>` elements that set a charset other than UTF-8,
+    # since Sanitize's output is always UTF-8.
+    when 'meta'
+      if node.has_attribute?('charset') &&
+          node['charset'].downcase != 'utf-8'
+
+        node['charset'] = 'utf-8'
+      end
+
+      if node.has_attribute?('http-equiv') &&
+          node.has_attribute?('content') &&
+          node['http-equiv'].downcase == 'content-type' &&
+          node['content'].downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
+
+        node['content'] = node['content'].gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
+      end
+    end
   end
 
 end; end; end
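The `content` rewrite in the `meta` branch can be tried on plain strings. This sketch reuses the two regular expressions from the hunk above, and `force_utf8_content` is a hypothetical helper that takes a string instead of a Nokogiri node attribute.

```ruby
# Hypothetical helper: rewrites a Content-Type value so that any non-UTF-8
# charset parameter becomes utf-8, using the same patterns as the hunk above.
def force_utf8_content(content)
  # Detection runs on a downcased copy, so the charset value is compared
  # case-insensitively. The negative lookahead skips values that are utf-8.
  if content.downcase =~ /;\s*charset\s*=\s*(?!utf-8)/
    # Replace everything from the charset parameter to the end of the string.
    content.gsub(/;\s*charset\s*=.+\z/, ';charset=utf-8')
  else
    content
  end
end

rewritten = force_utf8_content('text/html; charset=ISO-8859-1')
unchanged = force_utf8_content('text/html; charset=utf-8')
no_param  = force_utf8_content('text/html')
```

In this sketch the replacement pattern is case-sensitive while the detection is not, so a value written as `CHARSET=...` is detected but left as it was.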
data/lib/sanitize/version.rb
CHANGED