sanitize 1.3.0.dev.20101210 → 2.0.0.dev.20101211
- data/HISTORY +7 -10
- data/README.rdoc +82 -83
- data/lib/sanitize.rb +33 -143
- data/lib/sanitize/config.rb +0 -4
- data/lib/sanitize/transformers/clean_cdata.rb +13 -0
- data/lib/sanitize/transformers/clean_comment.rb +10 -0
- data/lib/sanitize/transformers/clean_element.rb +87 -0
- data/lib/sanitize/version.rb +1 -1
- metadata +10 -7
data/HISTORY
CHANGED
@@ -1,7 +1,13 @@
 Sanitize History
 ================================================================================
 
-Version
+Version 2.0.0 (git)
+  * The environment data passed into transformers and the return values expected
+    from transformers have changed. Old transformers will need to be updated.
+    See the README for details.
+  * Transformers now receive nodes of all types, not just element nodes.
+  * Sanitize's own core filtering logic is now implemented as a set of always-on
+    transformers.
   * The default value for the :output config is now :html. Previously it was
     :xhtml.
   * Added a :whitespace_elements config, which specifies elements (such as <br>
@@ -15,15 +21,6 @@ Version 1.3.0 (git)
     `ruby`, and `wbr` elements to the whitelist for `Sanitize::Config::RELAXED`.
   * The `dir`, `lang`, and `title` attributes are now whitelisted for all
     elements in `Sanitize::Config::RELAXED`.
-  * The environment hash passed into transformers now includes an
-    :allowed_elements Hash to facilitate faster lookups when attempting to
-    determine whether an element is in the whitelist. [Suggested by Nicholas
-    Evans]
-  * The environment hash passed into transformers now includes a
-    :whitelist_nodes Array, so transformers now have insight into what nodes
-    have been whitelisted by other transformers. [Suggested by Nicholas Evans]
-  * Added a :process_text_nodes config setting. If set to true, Sanitize will
-    pass text nodes to transformers. The default is false. [Ardie Saeidi]
   * Bumped minimum Nokogiri version to 1.4.4 to avoid a bug in 1.4.2+ (issue
     #315) that caused "</body></html>" to be appended to the CDATA inside
     unterminated script and style elements.
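As a rough illustration of the 2.0.0 entries above, a transformer written against the new contract might look like the sketch below. `FakeNode` and `env_for` are hypothetical stand-ins so the snippet runs without Nokogiri or the gem; only the environment keys (`:config`, `:is_whitelisted`, `:node`, `:node_name`, `:node_whitelist`) and the `:node_whitelist` return key are taken from this diff.

```ruby
require 'set'

# Hypothetical stand-in for Nokogiri::XML::Node (assumption, illustration only).
FakeNode = Struct.new(:name, :kind) do
  def element?
    kind == :element
  end
end

# Transformer in the 2.0.0 style: it is handed every node type, so it filters
# for elements itself, and it returns a Hash only when it has nodes to add.
allow_cite = lambda do |env|
  node = env[:node]
  next if env[:is_whitelisted] || !node.element?
  next unless env[:node_name] == 'cite'

  {:node_whitelist => [node]}
end

# Builds an environment Hash shaped like the one described in the README diff.
def env_for(node, whitelist)
  {
    :config         => {},
    :is_whitelisted => whitelist.include?(node),
    :node           => node,
    :node_name      => node.name.downcase,
    :node_whitelist => whitelist
  }
end

cite = FakeNode.new('cite', :element)
text = FakeNode.new('text', :text)

whitelist   = Set.new
cite_result = allow_cite.call(env_for(cite, whitelist))
text_result = allow_cite.call(env_for(text, whitelist))
```

Because 2.0.0 ignores any return value that is not a Hash, returning `nil` for uninteresting nodes (as the `next` lines do) is enough to opt out.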
data/README.rdoc
CHANGED
@@ -14,7 +14,7 @@ of fragile regular expressions, Sanitize has no trouble dealing with malformed
 or maliciously-formed HTML, and will always output valid HTML or XHTML.
 
 *Author*:: Ryan Grove (mailto:ryan@wonko.com)
-*Version*::
+*Version*:: 2.0.0 (git)
 *Copyright*:: Copyright (c) 2010 Ryan Grove. All rights reserved.
 *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
 *Website*:: http://github.com/rgrove/sanitize
@@ -43,7 +43,7 @@ behind.
   require 'rubygems'
   require 'sanitize'
 
-  html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg"
+  html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'
 
   Sanitize.clean(html) # => 'foo'
 
@@ -77,7 +77,7 @@ are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
 added to links.
 
   Sanitize.clean(html, Sanitize::Config::RELAXED)
-  # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg"
+  # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'
 
 === Custom Configuration
 
@@ -127,10 +127,9 @@ default value is <code>false</code>.
 
 Array of element names to allow. Specify all names in lowercase.
 
-  :elements => [
-
-
-    'sup', 'u', 'ul'
+  :elements => %w[
+    a abbr b blockquote br cite code dd dfn dl dt em i kbd li mark ol p pre
+    q s samp small strike strong sub sup time u ul var
   ]
 
 ==== :output (Symbol)
@@ -140,12 +139,7 @@ defaulting to <code>:html</code>.
 
 ==== :output_encoding (String)
 
-Character encoding to use for HTML output. Default is <code>
-
-==== :process_text_nodes (Boolean)
-
-Whether or not to process text nodes. Enabling this will allow text nodes to be
-processed by transformers. The default is <code>false</code>.
+Character encoding to use for HTML output. Default is <code>utf-8</code>.
 
 ==== :protocols (Hash)
 
@@ -171,7 +165,7 @@ If set to +true+, Sanitize will remove the contents of any non-whitelisted
 elements in addition to the elements themselves. By default, Sanitize leaves the
 safe parts of an element's contents behind when the element is removed.
 
-If set to an
+If set to an array of element names, then only the contents of the specified
 elements (when filtered) will be removed, and the contents of all other filtered
 elements will be left behind.
 
@@ -179,7 +173,8 @@ The default value is <code>false</code>.
 
 ==== :transformers
 
-See
+Custom transformer or array of custom transformers. See the Transformers section
+below for details.
 
 ==== :whitespace_elements (Array)
 
@@ -196,81 +191,80 @@ By default, the following elements are included in the
 
 === Transformers
 
-Transformers allow you to filter and
-top of (or instead of) Sanitize's core filter. A transformer is any object
-responds to <code>call()</code> (such as a lambda or proc)
-<code>nil</code> or a Hash containing certain optional response values.
+Transformers allow you to filter and modify nodes using your own custom logic,
+on top of (or instead of) Sanitize's core filter. A transformer is any object
+that responds to <code>call()</code> (such as a lambda or proc).
 
 To use one or more transformers, pass them to the <code>:transformers</code>
-config setting
+config setting. You may pass a single transformer or an array of transformers.
 
   Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
 
 ==== Input
 
 Each registered transformer's <code>call()</code> method will be called once for
-each
-Hash that contains the following items:
-
-[<code>:allowed_elements</code>]
-  Hash with whitelisted element names as keys, to facilitate fast lookups of
-  whitelisted elements.
+each node in the HTML (including elements, text nodes, comments, etc.), and will
+receive as an argument an environment Hash that contains the following items:
 
 [<code>:config</code>]
   The current Sanitize configuration Hash.
 
+[<code>:is_whitelisted</code>]
+  <code>true</code> if the current node has been whitelisted by a previous
+  transformer, <code>false</code> otherwise. It's generally bad form to remove a
+  node that a previous transformer has whitelisted.
+
 [<code>:node</code>]
-  A Nokogiri::XML::Node object representing an HTML
+  A Nokogiri::XML::Node object representing an HTML node. The node may be an
+  element, a text node, a comment, a CDATA node, or a document fragment. Use
+  Nokogiri's inspection methods (<code>element?</code>, <code>text?</code>,
+  etc.) to selectively ignore node types you aren't interested in.
 
 [<code>:node_name</code>]
   The name of the current HTML node, always lowercase (e.g. "div" or "span").
+  For non-element nodes, the name will be something like "text", "comment",
+  "#cdata-section", "#document-fragment", etc.
+
+[<code>:node_whitelist</code>]
+  Set of Nokogiri::XML::Node objects in the current document that have been
+  whitelisted by previous transformers, if any. It's generally bad form to
+  remove a node that a previous transformer has whitelisted.
 
-
-
-
+==== Output
+
+A transformer doesn't have to return anything, but may optionally return a Hash,
+which may contain the following items:
+
+[<code>:node_whitelist</code>]
+  Array or Set of specific Nokogiri::XML::Node objects to add to the document's
+  whitelist, bypassing the current Sanitize config. These specific nodes and all
+  their attributes will be whitelisted, but their children will not be.
+
+If a transformer returns anything other than a Hash, the return value will be
+ignored.
 
 ==== Processing
 
 Each transformer has full access to the Nokogiri::XML::Node that's passed into
 it and to the rest of the document via the node's <code>document()</code>
-method. Any changes
-
-
+method. Any changes made to the current node or to the document will be
+reflected instantly in the document and passed on to subsequently-called
+transformers and to Sanitize itself. A transformer may even call Sanitize
+internally to perform custom sanitization if needed.
 
 Nodes are passed into transformers in the order in which they're traversed. It's
 important to note that Nokogiri traverses markup from the deepest node upward,
 not from the first node to the last node:
 
   html = '<div><span>foo</span></div>'
-  transformer = lambda{|env| puts env[:
+  transformer = lambda{|env| puts env[:node_name] }
 
-  # Prints "span",
+  # Prints "text", "span", "div", "#document-fragment".
   Sanitize.clean(html, :transformers => transformer)
 
 Transformers have a tremendous amount of power, including the power to
-completely bypass Sanitize's built-in filtering. Be careful!
-
-==== Output
-
-A transformer may return either +nil+ or a Hash. A return value of +nil+
-indicates that the transformer does not wish to act on the current node in any
-way. A returned Hash may contain the following items, all of which are optional:
-
-[<code>:attr_whitelist</code>]
-  Array of attribute names to add to the whitelist for the current node, in
-  addition to any whitelisted attributes already defined in the current config.
-
-[<code>:node</code>]
-  A Nokogiri::XML::Node object that should replace the current node. All
-  subsequent transformers and Sanitize itself will receive this new node.
-
-[<code>:whitelist</code>]
-  If _true_, the current node (and only the current node) will be whitelisted,
-  regardless of the current Sanitize config.
-
-[<code>:whitelist_nodes</code>]
-  Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
-  document, regardless of the current Sanitize config.
+completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
+your own hands.
 
 ==== Example: Transformer to whitelist YouTube video embeds
 
@@ -283,16 +277,20 @@ by just whitelisting all <code><object></code>, <code><embed></code>, and
   lambda do |env|
     node = env[:node]
     node_name = env[:node_name]
-
+
+    # Don't continue if this node is already whitelisted or is not an element.
+    return if env[:is_whitelisted] || !node.element?
+
+    parent = node.parent
 
     # Since the transformer receives the deepest nodes first, we look for a
     # <param> element or an <embed> element whose parent is an <object>.
-    return
+    return unless (node_name == 'param' || node_name == 'embed') &&
         parent.name.to_s.downcase == 'object'
 
     if node_name == 'param'
       # Quick XPath search to find the <param> node that contains the video URL.
-      return
+      return unless movie_node = parent.search('param[@name="movie"]')[0]
       url = movie_node['value']
     else
       # Since this is an <embed>, the video URL is in the "src" attribute. No
@@ -301,17 +299,18 @@ by just whitelisting all <code><object></code>, <code><embed></code>, and
     end
 
     # Verify that the video URL is actually a valid YouTube video URL.
-    return
+    return unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//
 
     # We're now certain that this is a YouTube embed, but we still need to run
     # it through a special Sanitize step to ensure that no unwanted elements or
     # attributes that don't belong in a YouTube embed can sneak in.
     Sanitize.clean_node!(parent, {
-      :elements
+      :elements => %w[embed object param],
+
       :attributes => {
-        'embed' => [
-        'object' => [
-        'param' => [
+        'embed' => %w[allowfullscreen allowscriptaccess height src type width],
+        'object' => %w[height width],
+        'param' => %w[name value]
       }
     })
 
@@ -319,30 +318,30 @@ by just whitelisting all <code><object></code>, <code><embed></code>, and
     # no unwanted elements or attributes hidden inside it, we can tell Sanitize
     # to whitelist the current node (<param> or <embed>) and its parent
     # (<object>).
-    {:
+    {:node_whitelist => [node, parent]}
   end
 
 == Contributors
 
-
-
-
-
-* Wilson Bilkovich
-* Peter Cooper
-* Gabe da Silveira
-* Nicholas Evans
-* Adam Hooper
-* Mutwin Kraus
-* Dev Purkayastha
-* David Reese
-* Ardie Saeidi
-* Rafael Souza
-* Ben Wanicur
+Sanitize was created and is currently maintained by Ryan Grove (ryan@wonko.com).
+
+The following lovely people have also contributed to Sanitize:
+
+* Wilson Bilkovich (wilson@supremetyrant.com)
+* Peter Cooper (git@peterc.org)
+* Gabe da Silveira (gabe@websaviour.com)
+* Nicholas Evans (owlmanatt@gmail.com)
+* Adam Hooper (adam@adamhooper.com)
+* Mutwin Kraus (mutle@blogage.de)
+* Dev Purkayastha (dev.purkayastha@gmail.com)
+* David Reese (work@whatcould.com)
+* Ardie Saeidi (ardalan.saeidi@gmail.com)
+* Rafael Souza (me@rafaelss.com)
+* Ben Wanicur (bwanicur@verticalresponse.com)
 
 == License
 
-Copyright (c) 2010 Ryan Grove
+Copyright (c) 2010 Ryan Grove (ryan@wonko.com)
 
 Permission is hereby granted, free of charge, to any person obtaining a copy of
 this software and associated documentation files (the 'Software'), to deal in
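The README's Processing section says nodes reach transformers from the deepest node upward. A minimal sketch of that visiting order, assuming a hypothetical `TreeNode` struct in place of Nokogiri (so it models the order only and is not the library's implementation):

```ruby
# Hypothetical minimal tree node (assumption); Nokogiri is not required here.
TreeNode = Struct.new(:name, :children) do
  # Yields every descendant before the node itself (post-order traversal).
  def traverse(&block)
    children.each {|child| child.traverse(&block) }
    block.call(self)
  end
end

# Model of '<div><span>foo</span></div>' parsed as a document fragment.
text     = TreeNode.new('text', [])
span     = TreeNode.new('span', [text])
div      = TreeNode.new('div', [span])
fragment = TreeNode.new('#document-fragment', [div])

visited = []
fragment.traverse {|node| visited << node.name }
```

Under this model a transformer sees a node only after all of its children, which is why the YouTube example inspects `<param>` and `<embed>` and reaches back up to the parent `<object>`.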
data/lib/sanitize.rb
CHANGED
@@ -21,12 +21,17 @@
 # SOFTWARE.
 #++
 
+require 'set'
+
 require 'nokogiri'
 require 'sanitize/version'
 require 'sanitize/config'
 require 'sanitize/config/restricted'
 require 'sanitize/config/basic'
 require 'sanitize/config/relaxed'
+require 'sanitize/transformers/clean_cdata'
+require 'sanitize/transformers/clean_comment'
+require 'sanitize/transformers/clean_element'
 
 class Sanitize
   attr_reader :config
@@ -45,21 +50,18 @@ class Sanitize
   # Returns a sanitized copy of _html_, using the settings in _config_ if
   # specified.
   def self.clean(html, config = {})
-
-    sanitize.clean(html)
+    Sanitize.new(config).clean(html)
   end
 
   # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
   # were made.
   def self.clean!(html, config = {})
-
-    sanitize.clean!(html)
+    Sanitize.new(config).clean!(html)
   end
 
   # Sanitizes the specified Nokogiri::XML::Node and all its children.
   def self.clean_node!(node, config = {})
-
-    sanitize.clean_node!(node)
+    Sanitize.new(config).clean_node!(node)
   end
 
   #--
@@ -68,31 +70,15 @@ class Sanitize
 
   # Returns a new Sanitize object initialized with the settings in _config_.
   def initialize(config = {})
-
-    @
-
-
-    #
-    @
-
-
-
-    @config[:whitespace_elements].each {|el| @whitespace_elements[el] = true }
-
-    # Convert the list of :remove_contents elements to a Hash for faster lookup.
-    @remove_all_contents = false
-    @remove_element_contents = {}
-
-    if @config[:remove_contents].is_a?(Array)
-      @config[:remove_contents].each {|el| @remove_element_contents[el] = true }
-    else
-      @remove_all_contents = !!@config[:remove_contents]
-    end
-
-    # Specific nodes to whitelist (along with all their attributes). This array
-    # is generated at runtime by transformers, and is cleared before and after
-    # a fragment is cleaned (so it applies only to a specific fragment).
-    @whitelist_nodes = []
+    @config = Config::DEFAULT.merge(config)
+    @transformers = Array(@config[:transformers].dup)
+
+    # Default transformers. These always run at the end of the transformer
+    # chain, after any custom transformers.
+    @transformers <<
+        Transformers::CleanComment <<
+        Transformers::CleanCDATA <<
+        Transformers::CleanElement.new(@config)
   end
 
   # Returns a sanitized copy of _html_.
@@ -129,130 +115,34 @@ class Sanitize
   def clean_node!(node)
     raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
 
-
-
-    node.traverse do |child|
-      if child.element? || (child.text? && @config[:process_text_nodes])
-        clean_element!(child)
-      elsif child.comment?
-        child.unlink unless @config[:allow_comments]
-      elsif child.cdata?
-        child.replace(Nokogiri::XML::Text.new(child.text, child.document))
-      end
-    end
-
-    @whitelist_nodes = []
+    node_whitelist = Set.new
+    node.traverse {|child| transform_node!(child, node_whitelist) }
 
     node
   end
 
   private
 
-  def
-
-
-
-
-
-
-
-    name = node.name.to_s.downcase
-
-    # Delete any element that isn't in the whitelist.
-    unless transform[:whitelist] || @allowed_elements[name]
-      # Elements like br, div, p, etc. need to be replaced with whitespace in
-      # order to preserve readability.
-      if @whitespace_elements[name]
-        node.add_previous_sibling(' ')
-        node.add_next_sibling(' ') unless node.children.empty?
-      end
-
-      unless @remove_all_contents || @remove_element_contents[name]
-        node.children.each { |n| node.add_previous_sibling(n) }
-      end
-
-      node.unlink
-
-      return
-    end
-
-    attr_whitelist = (transform[:attr_whitelist] +
-        (@config[:attributes][name] || []) +
-        (@config[:attributes][:all] || [])).uniq
-
-    if attr_whitelist.empty?
-      # Delete all attributes from elements with no whitelisted attributes.
-      node.attribute_nodes.each {|attr| attr.remove }
-    else
-      # Delete any attribute that isn't in the whitelist for this element.
-      node.attribute_nodes.each do |attr|
-        attr.unlink unless attr_whitelist.include?(attr.name.downcase)
-      end
-
-      # Delete remaining attributes that use unacceptable protocols.
-      if @config[:protocols].has_key?(name)
-        protocol = @config[:protocols][name]
-
-        node.attribute_nodes.each do |attr|
-          attr_name = attr.name.downcase
-          next false unless protocol.has_key?(attr_name)
-
-          del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
-            !protocol[attr_name].include?($1.downcase)
-          else
-            !protocol[attr_name].include?(:relative)
-          end
-
-          attr.unlink if del
-        end
-      end
-    end
-
-    # Add required attributes.
-    if @config[:add_attributes].has_key?(name)
-      @config[:add_attributes][name].each do |key, val|
-        node[key] = val
-      end
-    end
-
-    transform
-  end
-
-  def transform_element!(node)
-    output = {
-      :attr_whitelist => [],
-      :node => node,
-      :whitelist => false
-    }
-
-    @config[:transformers].inject(node) do |transformer_node, transformer|
-      transform = transformer.call({
-        :allowed_elements => @allowed_elements,
-        :config => @config,
-        :node => transformer_node,
-        :node_name => transformer_node.name.downcase,
-        :whitelist_nodes => @whitelist_nodes
+  def transform_node!(node, node_whitelist)
+    @transformers.each do |transformer|
+      result = transformer.call({
+        :config => @config,
+        :is_whitelisted => node_whitelist.include?(node),
+        :node => node,
+        :node_name => node.name.downcase,
+        :node_whitelist => node_whitelist
       })
 
-
-
-
-        if transform[:whitelist_nodes].is_a?(Array)
-          @whitelist_nodes += transform[:whitelist_nodes]
-          @whitelist_nodes.uniq!
-        end
+      # If the node has been unlinked, there's no point running subsequent
+      # transformers.
+      break if node.parent.nil? && !node.fragment?
 
-
-
-        output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
-      else
-        raise Error, "transformer output must be a Hash or nil"
+      if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
+        node_whitelist.merge(result[:node_whitelist])
       end
     end
 
-    node
-
-    return output
+    node
   end
 
   class Error < StandardError; end
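The new `transform_node!` loop above can be modelled without Nokogiri to show how the whitelist Set and the early `break` interact. This is only a sketch: `StubNode`, `run_transformers`, and the three lambdas are hypothetical names, and an `unlinked` flag stands in for the real `node.parent.nil? && !node.fragment?` check.

```ruby
require 'set'

# Hypothetical node stand-in (assumption). "unlinked" models a node that a
# transformer has removed from the document.
class StubNode
  attr_reader :name
  attr_accessor :unlinked

  def initialize(name)
    @name     = name
    @unlinked = false
  end
end

# Runs each transformer in order, stops once the node has been removed, and
# merges any :node_whitelist a transformer returns into the shared Set.
def run_transformers(node, transformers, node_whitelist)
  transformers.each do |transformer|
    result = transformer.call({
      :is_whitelisted => node_whitelist.include?(node),
      :node           => node,
      :node_name      => node.name.downcase,
      :node_whitelist => node_whitelist
    })

    break if node.unlinked

    if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
      node_whitelist.merge(result[:node_whitelist])
    end
  end

  node
end

calls = []

whitelist_embed = lambda do |env|
  calls << :whitelist_embed
  {:node_whitelist => [env[:node]]} if env[:node_name] == 'embed'
end

strip_unlisted = lambda do |env|
  calls << :strip_unlisted
  env[:node].unlinked = true unless env[:is_whitelisted]
  nil
end

observer = lambda {|env| calls << :observer }

chain = [whitelist_embed, strip_unlisted, observer]

embed = StubNode.new('embed')
kept  = Set.new
run_transformers(embed, chain, kept)
embed_calls = calls.dup

calls.clear
script = StubNode.new('script')
run_transformers(script, chain, Set.new)
script_calls = calls.dup
```

In this model the whitelisted `embed` node passes through all three transformers, while the `script` node is removed by the second one and the third is never called for it.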
data/lib/sanitize/config.rb
CHANGED
@@ -47,10 +47,6 @@ class Sanitize
       # Character encoding to use for HTML output. Default is 'utf-8'.
       :output_encoding => 'utf-8',
 
-      # Whether or not to process text nodes. Enabling this will allow text
-      # nodes to be processed by transformers.
-      :process_text_nodes => false,
-
       # URL handling protocols to allow in specific attributes. By default, no
       # protocols are allowed. Use :relative in place of a protocol if you want
       # to allow relative URLs sans protocol.
data/lib/sanitize/transformers/clean_element.rb
ADDED
@@ -0,0 +1,87 @@
+class Sanitize; module Transformers
+
+  class CleanElement
+    def initialize(config)
+      @config = config
+
+      # For faster lookups.
+      @add_attributes = config[:add_attributes]
+      @allowed_elements = {}
+      @attributes = config[:attributes]
+      @protocols = config[:protocols]
+      @remove_all_contents = false
+      @remove_element_contents = {}
+      @whitespace_elements = {}
+
+      config[:elements].each {|el| @allowed_elements[el] = true }
+      config[:whitespace_elements].each {|el| @whitespace_elements[el] = true }
+
+      if config[:remove_contents].is_a?(Array)
+        config[:remove_contents].each {|el| @remove_element_contents[el] = true }
+      else
+        @remove_all_contents = !!config[:remove_contents]
+      end
+    end
+
+    def call(env)
+      name = env[:node_name]
+      node = env[:node]
+
+      return if env[:is_whitelisted] || !node.element?
+
+      # Delete any element that isn't in the config whitelist.
+      unless @allowed_elements[name]
+        # Elements like br, div, p, etc. need to be replaced with whitespace in
+        # order to preserve readability.
+        if @whitespace_elements[name]
+          node.add_previous_sibling(' ')
+          node.add_next_sibling(' ') unless node.children.empty?
+        end
+
+        unless @remove_all_contents || @remove_element_contents[name]
+          node.children.each {|n| node.add_previous_sibling(n) }
+        end
+
+        node.unlink
+        return
+      end
+
+      attr_whitelist = Set.new((@attributes[name] || []) +
+          (@attributes[:all] || []))
+
+      if attr_whitelist.empty?
+        # Delete all attributes from elements with no whitelisted attributes.
+        node.attribute_nodes.each {|attr| attr.unlink }
+      else
+        # Delete any attribute that isn't in the whitelist for this element.
+        node.attribute_nodes.each do |attr|
+          attr.unlink unless attr_whitelist.include?(attr.name.downcase)
+        end
+
+        # Delete remaining attributes that use unacceptable protocols.
+        if @protocols.has_key?(name)
+          protocol = @protocols[name]
+
+          node.attribute_nodes.each do |attr|
+            attr_name = attr.name.downcase
+            next false unless protocol.has_key?(attr_name)
+
+            del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
+              !protocol[attr_name].include?($1.downcase)
+            else
+              !protocol[attr_name].include?(:relative)
+            end
+
+            attr.unlink if del
+          end
+        end
+      end
+
+      # Add required attributes.
+      if @add_attributes.has_key?(name)
+        @add_attributes[name].each {|key, val| node[key] = val }
+      end
+    end
+  end
+
+end; end
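The protocol check inside `CleanElement#call` can be tried in isolation. The sketch below is a simplified model: `PROTOCOL_PATTERN` is a hypothetical stand-in for `REGEX_PROTOCOL` (which is defined outside this diff and may well be stricter), and `disallowed_protocol?` is an illustrative helper name, not part of the gem.

```ruby
# Simplified, hypothetical protocol pattern (assumption). The real
# REGEX_PROTOCOL constant is not shown in this diff and may differ.
PROTOCOL_PATTERN = /\A\s*([a-z0-9+.\-]+):/

# Returns true when an attribute value should be dropped: either it names a
# protocol that is not in the allowed list, or it has no protocol (a relative
# URL) and :relative is not in the allowed list.
def disallowed_protocol?(value, allowed)
  if value.to_s.downcase =~ PROTOCOL_PATTERN
    !allowed.include?($1)
  else
    !allowed.include?(:relative)
  end
end

href_protocols = ['http', 'https', :relative]

absolute_ok     = disallowed_protocol?('http://foo.com/', href_protocols)
script_blocked  = disallowed_protocol?('JavaScript:alert(1)', href_protocols)
relative_ok     = disallowed_protocol?('/about', href_protocols)
relative_denied = disallowed_protocol?('/about', ['http'])
```

Lowercasing the value before matching mirrors the `attr.value.to_s.downcase` call in the transformer, so mixed-case schemes such as `JavaScript:` are compared against the lowercase whitelist.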
data/lib/sanitize/version.rb
CHANGED
metadata
CHANGED
@@ -3,12 +3,12 @@ name: sanitize
 version: !ruby/object:Gem::Version
   prerelease: true
   segments:
-  -
-  -
+  - 2
+  - 0
   - 0
   - dev
-  -
-  version:
+  - 20101211
+  version: 2.0.0.dev.20101211
 platform: ruby
 authors:
 - Ryan Grove
@@ -16,7 +16,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2010-12-
+date: 2010-12-11 00:00:00 -08:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency
@@ -80,6 +80,9 @@ files:
 - lib/sanitize/config/relaxed.rb
 - lib/sanitize/config/restricted.rb
 - lib/sanitize/config.rb
+- lib/sanitize/transformers/clean_cdata.rb
+- lib/sanitize/transformers/clean_comment.rb
+- lib/sanitize/transformers/clean_element.rb
 - lib/sanitize/version.rb
 - lib/sanitize.rb
 has_rdoc: true
@@ -99,8 +102,8 @@ required_ruby_version: !ruby/object:Gem::Requirement
       segments:
       - 1
       - 8
-      -
-      version: 1.8.
+      - 7
+      version: 1.8.7
 required_rubygems_version: !ruby/object:Gem::Requirement
   none: false
   requirements: