sanitize 1.3.0.dev.20101210 → 2.0.0.dev.20101211

data/HISTORY CHANGED
@@ -1,7 +1,13 @@
  Sanitize History
  ================================================================================

- Version 1.3.0 (git)
+ Version 2.0.0 (git)
+ * The environment data passed into transformers and the return values expected
+ from transformers have changed. Old transformers will need to be updated.
+ See the README for details.
+ * Transformers now receive nodes of all types, not just element nodes.
+ * Sanitize's own core filtering logic is now implemented as a set of always-on
+ transformers.
  * The default value for the :output config is now :html. Previously it was
  :xhtml.
  * Added a :whitespace_elements config, which specifies elements (such as <br>
@@ -15,15 +21,6 @@ Version 1.3.0 (git)
  `ruby`, and `wbr` elements to the whitelist for `Sanitize::Config::RELAXED`.
  * The `dir`, `lang`, and `title` attributes are now whitelisted for all
  elements in `Sanitize::Config::RELAXED`.
- * The environment hash passed into transformers now includes an
- :allowed_elements Hash to facilitate faster lookups when attempting to
- determine whether an element is in the whitelist. [Suggested by Nicholas
- Evans]
- * The environment hash passed into transformers now includes a
- :whitelist_nodes Array, so transformers now have insight into what nodes
- have been whitelisted by other transformers. [Suggested by Nicholas Evans]
- * Added a :process_text_nodes config setting. If set to true, Sanitize will
- pass text nodes to transformers. The default is false. [Ardie Saeidi]
  * Bumped minimum Nokogiri version to 1.4.4 to avoid a bug in 1.4.2+ (issue
  #315) that caused "</body></html>" to be appended to the CDATA inside
  unterminated script and style elements.
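The changelog entry about updating old transformers can be illustrated with a small before/after sketch. It is only a sketch: `FakeNode` and `env_for` are hypothetical stand-ins so the code runs without the sanitize or nokogiri gems, and the `video` element is an arbitrary example. The return keys (`:whitelist` in 1.x, `:node_whitelist` in 2.0) and the `:is_whitelisted` environment key are taken from the README changes in this diff.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node, used only so this sketch runs
# without the sanitize or nokogiri gems installed.
FakeNode = Struct.new(:name, :kind) do
  def element?
    kind == :element
  end
end

# 1.x style: only element nodes were passed in, and a transformer returned nil
# to skip a node or {:whitelist => true} to keep it.
old_style = lambda do |env|
  return nil unless env[:node_name] == 'video'
  {:whitelist => true}
end

# 2.0 style: every node type is passed in, so the transformer skips
# non-elements and already-whitelisted nodes itself, and names the nodes to
# keep explicitly under :node_whitelist.
new_style = lambda do |env|
  node = env[:node]
  return if env[:is_whitelisted] || !node.element?
  return unless env[:node_name] == 'video'
  {:node_whitelist => [node]}
end

# Hypothetical helper that builds a minimal environment hash by hand.
env_for = lambda do |node|
  {:node => node, :node_name => node.name.downcase, :is_whitelisted => false}
end

video = FakeNode.new('video', :element)
text  = FakeNode.new('text', :text)

video_result = new_style.call(env_for.call(video))
text_result  = new_style.call(env_for.call(text))
```

Under these assumptions the new-style transformer whitelists the element and ignores the text node, which a 1.x transformer never had to handle unless `:process_text_nodes` was enabled.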
data/README.rdoc CHANGED
@@ -14,7 +14,7 @@ of fragile regular expressions, Sanitize has no trouble dealing with malformed
  or maliciously-formed HTML, and will always output valid HTML or XHTML.

  *Author*:: Ryan Grove (mailto:ryan@wonko.com)
- *Version*:: 1.3.0 (git)
+ *Version*:: 2.0.0 (git)
  *Copyright*:: Copyright (c) 2010 Ryan Grove. All rights reserved.
  *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
  *Website*:: http://github.com/rgrove/sanitize
@@ -43,7 +43,7 @@ behind.
  require 'rubygems'
  require 'sanitize'

- html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
+ html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'

  Sanitize.clean(html) # => 'foo'

@@ -77,7 +77,7 @@ are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
  added to links.

  Sanitize.clean(html, Sanitize::Config::RELAXED)
- # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
+ # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'

  === Custom Configuration

@@ -127,10 +127,9 @@ default value is <code>false</code>.

  Array of element names to allow. Specify all names in lowercase.

- :elements => [
-   'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
-   'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
-   'sup', 'u', 'ul'
+ :elements => %w[
+   a abbr b blockquote br cite code dd dfn dl dt em i kbd li mark ol p pre
+   q s samp small strike strong sub sup time u ul var
  ]

  ==== :output (Symbol)
@@ -140,12 +139,7 @@ defaulting to <code>:html</code>.

  ==== :output_encoding (String)

- Character encoding to use for HTML output. Default is <code>'utf-8'</code>.
-
- ==== :process_text_nodes (Boolean)
-
- Whether or not to process text nodes. Enabling this will allow text nodes to be
- processed by transformers. The default is <code>false</code>.
+ Character encoding to use for HTML output. Default is <code>utf-8</code>.

  ==== :protocols (Hash)

@@ -171,7 +165,7 @@ If set to +true+, Sanitize will remove the contents of any non-whitelisted
  elements in addition to the elements themselves. By default, Sanitize leaves the
  safe parts of an element's contents behind when the element is removed.

- If set to an Array of element names, then only the contents of the specified
+ If set to an array of element names, then only the contents of the specified
  elements (when filtered) will be removed, and the contents of all other filtered
  elements will be left behind.

@@ -179,7 +173,8 @@ The default value is <code>false</code>.

  ==== :transformers

- See below.
+ Custom transformer or array of custom transformers. See the Transformers section
+ below for details.

  ==== :whitespace_elements (Array)

@@ -196,81 +191,80 @@ By default, the following elements are included in the

  === Transformers

- Transformers allow you to filter and alter nodes using your own custom logic, on
- top of (or instead of) Sanitize's core filter. A transformer is any object that
- responds to <code>call()</code> (such as a lambda or proc) and returns either
- <code>nil</code> or a Hash containing certain optional response values.
+ Transformers allow you to filter and modify nodes using your own custom logic,
+ on top of (or instead of) Sanitize's core filter. A transformer is any object
+ that responds to <code>call()</code> (such as a lambda or proc).

  To use one or more transformers, pass them to the <code>:transformers</code>
- config setting:
+ config setting. You may pass a single transformer or an array of transformers.

  Sanitize.clean(html, :transformers => [transformer_one, transformer_two])

  ==== Input

  Each registered transformer's <code>call()</code> method will be called once for
- each element node in the HTML, and will receive as an argument an environment
- Hash that contains the following items:
-
- [<code>:allowed_elements</code>]
- Hash with whitelisted element names as keys, to facilitate fast lookups of
- whitelisted elements.
+ each node in the HTML (including elements, text nodes, comments, etc.), and will
+ receive as an argument an environment Hash that contains the following items:

  [<code>:config</code>]
  The current Sanitize configuration Hash.

+ [<code>:is_whitelisted</code>]
+ <code>true</code> if the current node has been whitelisted by a previous
+ transformer, <code>false</code> otherwise. It's generally bad form to remove a
+ node that a previous transformer has whitelisted.
+
  [<code>:node</code>]
- A Nokogiri::XML::Node object representing an HTML element.
+ A Nokogiri::XML::Node object representing an HTML node. The node may be an
+ element, a text node, a comment, a CDATA node, or a document fragment. Use
+ Nokogiri's inspection methods (<code>element?</code>, <code>text?</code>,
+ etc.) to selectively ignore node types you aren't interested in.

  [<code>:node_name</code>]
  The name of the current HTML node, always lowercase (e.g. "div" or "span").
+ For non-element nodes, the name will be something like "text", "comment",
+ "#cdata-section", "#document-fragment", etc.
+
+ [<code>:node_whitelist</code>]
+ Set of Nokogiri::XML::Node objects in the current document that have been
+ whitelisted by previous transformers, if any. It's generally bad form to
+ remove a node that a previous transformer has whitelisted.

- [<code>:whitelist_nodes</code>]
- Array of Nokogiri::XML::Node instances that have already been whitelisted by
- previous transformers, if any.
+ ==== Output
+
+ A transformer doesn't have to return anything, but may optionally return a Hash,
+ which may contain the following items:
+
+ [<code>:node_whitelist</code>]
+ Array or Set of specific Nokogiri::XML::Node objects to add to the document's
+ whitelist, bypassing the current Sanitize config. These specific nodes and all
+ their attributes will be whitelisted, but their children will not be.
+
+ If a transformer returns anything other than a Hash, the return value will be
+ ignored.

  ==== Processing

  Each transformer has full access to the Nokogiri::XML::Node that's passed into
  it and to the rest of the document via the node's <code>document()</code>
- method. Any changes will be reflected instantly in the document and passed on to
- subsequently-called transformers and to Sanitize itself. A transformer may even
- call Sanitize internally to perform custom sanitization if needed.
+ method. Any changes made to the current node or to the document will be
+ reflected instantly in the document and passed on to subsequently-called
+ transformers and to Sanitize itself. A transformer may even call Sanitize
+ internally to perform custom sanitization if needed.

  Nodes are passed into transformers in the order in which they're traversed. It's
  important to note that Nokogiri traverses markup from the deepest node upward,
  not from the first node to the last node:

  html = '<div><span>foo</span></div>'
- transformer = lambda{|env| puts env[:node].name }
+ transformer = lambda{|env| puts env[:node_name] }

- # Prints "span", then "div".
+ # Prints "text", "span", "div", "#document-fragment".
  Sanitize.clean(html, :transformers => transformer)

  Transformers have a tremendous amount of power, including the power to
- completely bypass Sanitize's built-in filtering. Be careful!
-
- ==== Output
-
- A transformer may return either +nil+ or a Hash. A return value of +nil+
- indicates that the transformer does not wish to act on the current node in any
- way. A returned Hash may contain the following items, all of which are optional:
-
- [<code>:attr_whitelist</code>]
- Array of attribute names to add to the whitelist for the current node, in
- addition to any whitelisted attributes already defined in the current config.
-
- [<code>:node</code>]
- A Nokogiri::XML::Node object that should replace the current node. All
- subsequent transformers and Sanitize itself will receive this new node.
-
- [<code>:whitelist</code>]
- If _true_, the current node (and only the current node) will be whitelisted,
- regardless of the current Sanitize config.
-
- [<code>:whitelist_nodes</code>]
- Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
- document, regardless of the current Sanitize config.
+ completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
+ your own hands.

  ==== Example: Transformer to whitelist YouTube video embeds

@@ -283,16 +277,20 @@ by just whitelisting all <code><object></code>, <code><embed></code>, and
  lambda do |env|
    node = env[:node]
    node_name = env[:node_name]
-   parent = node.parent
+
+   # Don't continue if this node is already whitelisted or is not an element.
+   return if env[:is_whitelisted] || !node.element?
+
+   parent = node.parent

    # Since the transformer receives the deepest nodes first, we look for a
    # <param> element or an <embed> element whose parent is an <object>.
-   return nil unless (node_name == 'param' || node_name == 'embed') &&
+   return unless (node_name == 'param' || node_name == 'embed') &&
        parent.name.to_s.downcase == 'object'

    if node_name == 'param'
      # Quick XPath search to find the <param> node that contains the video URL.
-     return nil unless movie_node = parent.search('param[@name="movie"]')[0]
+     return unless movie_node = parent.search('param[@name="movie"]')[0]
      url = movie_node['value']
    else
      # Since this is an <embed>, the video URL is in the "src" attribute. No
@@ -301,17 +299,18 @@ by just whitelisting all <code><object></code>, <code><embed></code>, and
    end

    # Verify that the video URL is actually a valid YouTube video URL.
-   return nil unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//
+   return unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//

    # We're now certain that this is a YouTube embed, but we still need to run
    # it through a special Sanitize step to ensure that no unwanted elements or
    # attributes that don't belong in a YouTube embed can sneak in.
    Sanitize.clean_node!(parent, {
-     :elements => ['embed', 'object', 'param'],
+     :elements => %w[embed object param],
+
      :attributes => {
-       'embed' => ['allowfullscreen', 'allowscriptaccess', 'height', 'src', 'type', 'width'],
-       'object' => ['height', 'width'],
-       'param' => ['name', 'value']
+       'embed' => %w[allowfullscreen allowscriptaccess height src type width],
+       'object' => %w[height width],
+       'param' => %w[name value]
      }
    })

@@ -319,30 +318,30 @@ by just whitelisting all <code><object></code>, <code><embed></code>, and
    # no unwanted elements or attributes hidden inside it, we can tell Sanitize
    # to whitelist the current node (<param> or <embed>) and its parent
    # (<object>).
-   {:whitelist_nodes => [node, parent]}
+   {:node_whitelist => [node, parent]}
  end

  == Contributors

- The following lovely people have contributed to Sanitize in the form of patches
- or ideas that later became code:
-
- * Ryan Grove <ryan@wonko.com>
- * Wilson Bilkovich <wilson@supremetyrant.com>
- * Peter Cooper <git@peterc.org>
- * Gabe da Silveira <gabe@websaviour.com>
- * Nicholas Evans <owlmanatt@gmail.com>
- * Adam Hooper <adam@adamhooper.com>
- * Mutwin Kraus <mutle@blogage.de>
- * Dev Purkayastha <dev.purkayastha@gmail.com>
- * David Reese <work@whatcould.com>
- * Ardie Saeidi <ardalan.saeidi@gmail.com>
- * Rafael Souza <me@rafaelss.com>
- * Ben Wanicur <bwanicur@verticalresponse.com>
+ Sanitize was created and is currently maintained by Ryan Grove (ryan@wonko.com).
+
+ The following lovely people have also contributed to Sanitize:
+
+ * Wilson Bilkovich (wilson@supremetyrant.com)
+ * Peter Cooper (git@peterc.org)
+ * Gabe da Silveira (gabe@websaviour.com)
+ * Nicholas Evans (owlmanatt@gmail.com)
+ * Adam Hooper (adam@adamhooper.com)
+ * Mutwin Kraus (mutle@blogage.de)
+ * Dev Purkayastha (dev.purkayastha@gmail.com)
+ * David Reese (work@whatcould.com)
+ * Ardie Saeidi (ardalan.saeidi@gmail.com)
+ * Rafael Souza (me@rafaelss.com)
+ * Ben Wanicur (bwanicur@verticalresponse.com)

  == License

- Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
+ Copyright (c) 2010 Ryan Grove (ryan@wonko.com)

  Permission is hereby granted, free of charge, to any person obtaining a copy of
  this software and associated documentation files (the 'Software'), to deal in
data/lib/sanitize.rb CHANGED
@@ -21,12 +21,17 @@
  # SOFTWARE.
  #++

+ require 'set'
+
  require 'nokogiri'
  require 'sanitize/version'
  require 'sanitize/config'
  require 'sanitize/config/restricted'
  require 'sanitize/config/basic'
  require 'sanitize/config/relaxed'
+ require 'sanitize/transformers/clean_cdata'
+ require 'sanitize/transformers/clean_comment'
+ require 'sanitize/transformers/clean_element'

  class Sanitize
    attr_reader :config
@@ -45,21 +50,18 @@ class Sanitize
    # Returns a sanitized copy of _html_, using the settings in _config_ if
    # specified.
    def self.clean(html, config = {})
-     sanitize = Sanitize.new(config)
-     sanitize.clean(html)
+     Sanitize.new(config).clean(html)
    end

    # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
    # were made.
    def self.clean!(html, config = {})
-     sanitize = Sanitize.new(config)
-     sanitize.clean!(html)
+     Sanitize.new(config).clean!(html)
    end

    # Sanitizes the specified Nokogiri::XML::Node and all its children.
    def self.clean_node!(node, config = {})
-     sanitize = Sanitize.new(config)
-     sanitize.clean_node!(node)
+     Sanitize.new(config).clean_node!(node)
    end

    #--
@@ -68,31 +70,15 @@ class Sanitize

    # Returns a new Sanitize object initialized with the settings in _config_.
    def initialize(config = {})
-     # Sanitize configuration.
-     @config = Config::DEFAULT.merge(config)
-     @config[:transformers] = Array(@config[:transformers].dup)
-
-     # Convert arrays to hashes for faster lookups.
-     @allowed_elements = {}
-     @whitespace_elements = {}
-
-     @config[:elements].each {|el| @allowed_elements[el] = true }
-     @config[:whitespace_elements].each {|el| @whitespace_elements[el] = true }
-
-     # Convert the list of :remove_contents elements to a Hash for faster lookup.
-     @remove_all_contents = false
-     @remove_element_contents = {}
-
-     if @config[:remove_contents].is_a?(Array)
-       @config[:remove_contents].each {|el| @remove_element_contents[el] = true }
-     else
-       @remove_all_contents = !!@config[:remove_contents]
-     end
-
-     # Specific nodes to whitelist (along with all their attributes). This array
-     # is generated at runtime by transformers, and is cleared before and after
-     # a fragment is cleaned (so it applies only to a specific fragment).
-     @whitelist_nodes = []
+     @config = Config::DEFAULT.merge(config)
+     @transformers = Array(@config[:transformers].dup)
+
+     # Default transformers. These always run at the end of the transformer
+     # chain, after any custom transformers.
+     @transformers <<
+       Transformers::CleanComment <<
+       Transformers::CleanCDATA <<
+       Transformers::CleanElement.new(@config)
    end

    # Returns a sanitized copy of _html_.
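The new `initialize` builds one transformer chain: custom transformers first, built-in ones last. A minimal sketch of that ordering follows, assuming placeholder symbols in place of the real `Transformers::CleanComment`, `Transformers::CleanCDATA`, and `Transformers::CleanElement` objects; `build_chain` is a hypothetical helper, not part of Sanitize.

```ruby
# Hypothetical placeholders for the three built-in transformers that the new
# initialize appends to the chain.
DEFAULTS = [:clean_comment, :clean_cdata, :clean_element]

# Array() wraps a single callable in an array and leaves an array as it is,
# which is why :transformers may be one transformer or a list of them.
def build_chain(custom)
  Array(custom.dup) + DEFAULTS
end

single = lambda {|env| }

chain_one  = build_chain(single)            # one custom transformer
chain_many = build_chain([single, single])  # several custom transformers
chain_none = build_chain([])                # only the built-in ones
```

Because the defaults are appended after the custom entries, a custom transformer always sees a node before the core filtering does.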
@@ -129,130 +115,34 @@ class Sanitize
    def clean_node!(node)
      raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)

-     @whitelist_nodes = []
-
-     node.traverse do |child|
-       if child.element? || (child.text? && @config[:process_text_nodes])
-         clean_element!(child)
-       elsif child.comment?
-         child.unlink unless @config[:allow_comments]
-       elsif child.cdata?
-         child.replace(Nokogiri::XML::Text.new(child.text, child.document))
-       end
-     end
-
-     @whitelist_nodes = []
+     node_whitelist = Set.new
+     node.traverse {|child| transform_node!(child, node_whitelist) }

      node
    end

    private

-   def clean_element!(node)
-     # Run this node through all configured transformers.
-     transform = transform_element!(node)
-
-     # If this node is in the dynamic whitelist array (built at runtime by
-     # transformers), let it live with all of its attributes intact.
-     return if @whitelist_nodes.include?(node)
-
-     name = node.name.to_s.downcase
-
-     # Delete any element that isn't in the whitelist.
-     unless transform[:whitelist] || @allowed_elements[name]
-       # Elements like br, div, p, etc. need to be replaced with whitespace in
-       # order to preserve readability.
-       if @whitespace_elements[name]
-         node.add_previous_sibling(' ')
-         node.add_next_sibling(' ') unless node.children.empty?
-       end
-
-       unless @remove_all_contents || @remove_element_contents[name]
-         node.children.each { |n| node.add_previous_sibling(n) }
-       end
-
-       node.unlink
-
-       return
-     end
-
-     attr_whitelist = (transform[:attr_whitelist] +
-         (@config[:attributes][name] || []) +
-         (@config[:attributes][:all] || [])).uniq
-
-     if attr_whitelist.empty?
-       # Delete all attributes from elements with no whitelisted attributes.
-       node.attribute_nodes.each {|attr| attr.remove }
-     else
-       # Delete any attribute that isn't in the whitelist for this element.
-       node.attribute_nodes.each do |attr|
-         attr.unlink unless attr_whitelist.include?(attr.name.downcase)
-       end
-
-       # Delete remaining attributes that use unacceptable protocols.
-       if @config[:protocols].has_key?(name)
-         protocol = @config[:protocols][name]
-
-         node.attribute_nodes.each do |attr|
-           attr_name = attr.name.downcase
-           next false unless protocol.has_key?(attr_name)
-
-           del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
-             !protocol[attr_name].include?($1.downcase)
-           else
-             !protocol[attr_name].include?(:relative)
-           end
-
-           attr.unlink if del
-         end
-       end
-     end
-
-     # Add required attributes.
-     if @config[:add_attributes].has_key?(name)
-       @config[:add_attributes][name].each do |key, val|
-         node[key] = val
-       end
-     end
-
-     transform
-   end
-
-   def transform_element!(node)
-     output = {
-       :attr_whitelist => [],
-       :node => node,
-       :whitelist => false
-     }
-
-     @config[:transformers].inject(node) do |transformer_node, transformer|
-       transform = transformer.call({
-         :allowed_elements => @allowed_elements,
-         :config => @config,
-         :node => transformer_node,
-         :node_name => transformer_node.name.downcase,
-         :whitelist_nodes => @whitelist_nodes
+   def transform_node!(node, node_whitelist)
+     @transformers.each do |transformer|
+       result = transformer.call({
+         :config => @config,
+         :is_whitelisted => node_whitelist.include?(node),
+         :node => node,
+         :node_name => node.name.downcase,
+         :node_whitelist => node_whitelist
        })

-       if transform.nil?
-         transformer_node
-       elsif transform.is_a?(Hash)
-         if transform[:whitelist_nodes].is_a?(Array)
-           @whitelist_nodes += transform[:whitelist_nodes]
-           @whitelist_nodes.uniq!
-         end
+       # If the node has been unlinked, there's no point running subsequent
+       # transformers.
+       break if node.parent.nil? && !node.fragment?

-         output[:attr_whitelist] += transform[:attr_whitelist] if transform[:attr_whitelist].is_a?(Array)
-         output[:whitelist] ||= true if transform[:whitelist]
-         output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
-       else
-         raise Error, "transformer output must be a Hash or nil"
+       if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
+         node_whitelist.merge(result[:node_whitelist])
        end
      end

-     node.replace(output[:node]) if node != output[:node]
-
-     return output
+     node
    end

    class Error < StandardError; end
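The new `transform_node!` loop can be exercised without Nokogiri. The sketch below is an approximation under stated assumptions: `FakeNode` is a hypothetical stand-in that mimics only `parent`, `fragment?`, `unlink`, and a children-first `traverse`; the four lambdas are made-up transformers; and the transformer list is passed as an argument instead of living in `@transformers`. The loop body itself follows the added lines in this hunk.

```ruby
require 'set'

# Hypothetical stand-in for Nokogiri::XML::Node with just enough behavior for
# the chain: a parent link, fragment?, unlink, and children-first traversal.
class FakeNode
  attr_reader :name, :children
  attr_accessor :parent

  def initialize(name, children = [])
    @name = name
    @children = children
    children.each {|child| child.parent = self }
  end

  def fragment?
    name == '#document-fragment'
  end

  def unlink
    parent.children.delete(self) if parent
    self.parent = nil
    self
  end

  # Visits the deepest nodes first, then the node itself.
  def traverse(&block)
    children.dup.each {|child| child.traverse(&block) }
    block.call(self)
  end
end

# Same control flow as the added transform_node!, with the chain passed in.
def transform_node!(node, transformers, node_whitelist)
  transformers.each do |transformer|
    result = transformer.call({
      :is_whitelisted => node_whitelist.include?(node),
      :node => node,
      :node_name => node.name.downcase,
      :node_whitelist => node_whitelist
    })

    # An unlinked node is skipped by the rest of the chain.
    break if node.parent.nil? && !node.fragment?

    if result.is_a?(Hash) && result[:node_whitelist].respond_to?(:each)
      node_whitelist.merge(result[:node_whitelist])
    end
  end

  node
end

order = []          # every node the chain was started for
after_removal = []  # nodes that reached the end of the chain

record = lambda {|env| order << env[:node_name] }
whitelist_span = lambda do |env|
  {:node_whitelist => [env[:node]]} if env[:node_name] == 'span'
end
remove_inline = lambda do |env|
  return if env[:is_whitelisted]
  env[:node].unlink if %w[span b].include?(env[:node_name])
end
tail = lambda {|env| after_removal << env[:node_name] }

# <div><span>text</span><b></b></div> inside a document fragment.
text = FakeNode.new('text')
span = FakeNode.new('span', [text])
bold = FakeNode.new('b')
div  = FakeNode.new('div', [span, bold])
frag = FakeNode.new('#document-fragment', [div])

node_whitelist = Set.new
chain = [record, whitelist_span, remove_inline, tail]
frag.traverse {|child| transform_node!(child, chain, node_whitelist) }
```

In this run the `span` survives because an earlier transformer whitelisted it, while the `b` element is unlinked and never reaches the last transformer. `:is_whitelisted` is recomputed before each call, so a whitelist entry added by one transformer is visible to the next one in the same pass.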
data/lib/sanitize/config.rb CHANGED
@@ -47,10 +47,6 @@ class Sanitize
        # Character encoding to use for HTML output. Default is 'utf-8'.
        :output_encoding => 'utf-8',

-       # Whether or not to process text nodes. Enabling this will allow text
-       # nodes to be processed by transformers.
-       :process_text_nodes => false,
-
        # URL handling protocols to allow in specific attributes. By default, no
        # protocols are allowed. Use :relative in place of a protocol if you want
        # to allow relative URLs sans protocol.
data/lib/sanitize/transformers/clean_cdata.rb ADDED
@@ -0,0 +1,13 @@
+ class Sanitize; module Transformers
+
+   CleanCDATA = lambda do |env|
+     return if env[:is_whitelisted]
+
+     node = env[:node]
+
+     if node.cdata?
+       node.replace(Nokogiri::XML::Text.new(node.text, node.document))
+     end
+   end
+
+ end; end
data/lib/sanitize/transformers/clean_comment.rb ADDED
@@ -0,0 +1,10 @@
+ class Sanitize; module Transformers
+
+   CleanComment = lambda do |env|
+     return if env[:is_whitelisted]
+
+     node = env[:node]
+     node.unlink if node.comment? && !env[:config][:allow_comments]
+   end
+
+ end; end
data/lib/sanitize/transformers/clean_element.rb ADDED
@@ -0,0 +1,87 @@
+ class Sanitize; module Transformers
+
+   class CleanElement
+     def initialize(config)
+       @config = config
+
+       # For faster lookups.
+       @add_attributes = config[:add_attributes]
+       @allowed_elements = {}
+       @attributes = config[:attributes]
+       @protocols = config[:protocols]
+       @remove_all_contents = false
+       @remove_element_contents = {}
+       @whitespace_elements = {}
+
+       config[:elements].each {|el| @allowed_elements[el] = true }
+       config[:whitespace_elements].each {|el| @whitespace_elements[el] = true }
+
+       if config[:remove_contents].is_a?(Array)
+         config[:remove_contents].each {|el| @remove_element_contents[el] = true }
+       else
+         @remove_all_contents = !!config[:remove_contents]
+       end
+     end
+
+     def call(env)
+       name = env[:node_name]
+       node = env[:node]
+
+       return if env[:is_whitelisted] || !node.element?
+
+       # Delete any element that isn't in the config whitelist.
+       unless @allowed_elements[name]
+         # Elements like br, div, p, etc. need to be replaced with whitespace in
+         # order to preserve readability.
+         if @whitespace_elements[name]
+           node.add_previous_sibling(' ')
+           node.add_next_sibling(' ') unless node.children.empty?
+         end
+
+         unless @remove_all_contents || @remove_element_contents[name]
+           node.children.each {|n| node.add_previous_sibling(n) }
+         end
+
+         node.unlink
+         return
+       end
+
+       attr_whitelist = Set.new((@attributes[name] || []) +
+           (@attributes[:all] || []))
+
+       if attr_whitelist.empty?
+         # Delete all attributes from elements with no whitelisted attributes.
+         node.attribute_nodes.each {|attr| attr.unlink }
+       else
+         # Delete any attribute that isn't in the whitelist for this element.
+         node.attribute_nodes.each do |attr|
+           attr.unlink unless attr_whitelist.include?(attr.name.downcase)
+         end
+
+         # Delete remaining attributes that use unacceptable protocols.
+         if @protocols.has_key?(name)
+           protocol = @protocols[name]
+
+           node.attribute_nodes.each do |attr|
+             attr_name = attr.name.downcase
+             next false unless protocol.has_key?(attr_name)
+
+             del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
+               !protocol[attr_name].include?($1.downcase)
+             else
+               !protocol[attr_name].include?(:relative)
+             end
+
+             attr.unlink if del
+           end
+         end
+       end
+
+       # Add required attributes.
+       if @add_attributes.has_key?(name)
+         @add_attributes[name].each {|key, val| node[key] = val }
+       end
+     end
+   end
+
+ end; end
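The protocol check inside `CleanElement#call` reduces to a small predicate. The sketch below assumes a deliberately simplified pattern: `PROTOCOL_PATTERN` is a hypothetical stand-in for `REGEX_PROTOCOL`, whose definition is not part of this diff, and `protocol_allowed?` is an illustrative helper, not a Sanitize method. It returns the inverse of the `del` flag above.

```ruby
# Simplified, hypothetical stand-in for Sanitize's REGEX_PROTOCOL. It captures
# whatever precedes the first colon when that prefix looks like a URL scheme.
PROTOCOL_PATTERN = /\A([a-z0-9+\-.]*?):/

# True when an attribute value may stay: either its scheme is in the allowed
# list, or it has no scheme and :relative is allowed.
def protocol_allowed?(value, allowed)
  if value.to_s.downcase =~ PROTOCOL_PATTERN
    allowed.include?($1)
  else
    allowed.include?(:relative)
  end
end

allowed = ['http', 'https', :relative]

absolute  = protocol_allowed?('HTTPS://foo.com/', allowed)
script    = protocol_allowed?('javascript:alert(1)', allowed)
relative  = protocol_allowed?('/images/bar.jpg', allowed)
no_rel    = protocol_allowed?('/images/bar.jpg', ['http'])
```

The value is downcased before matching, as in the diff, so scheme comparison is case-insensitive. A real pattern would also need to account for encoded colons and similar obfuscation, which this sketch does not attempt.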
data/lib/sanitize/version.rb CHANGED
@@ -1,3 +1,3 @@
  class Sanitize
-   VERSION = '1.3.0.dev.20101210'
+   VERSION = '2.0.0.dev.20101211'
  end
metadata CHANGED
@@ -3,12 +3,12 @@ name: sanitize
  version: !ruby/object:Gem::Version
    prerelease: true
    segments:
-   - 1
-   - 3
+   - 2
+   - 0
    - 0
    - dev
-   - 20101210
-   version: 1.3.0.dev.20101210
+   - 20101211
+   version: 2.0.0.dev.20101211
  platform: ruby
  authors:
  - Ryan Grove
@@ -16,7 +16,7 @@ autorequire:
  bindir: bin
  cert_chain: []

- date: 2010-12-10 00:00:00 -08:00
+ date: 2010-12-11 00:00:00 -08:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
@@ -80,6 +80,9 @@ files:
  - lib/sanitize/config/relaxed.rb
  - lib/sanitize/config/restricted.rb
  - lib/sanitize/config.rb
+ - lib/sanitize/transformers/clean_cdata.rb
+ - lib/sanitize/transformers/clean_comment.rb
+ - lib/sanitize/transformers/clean_element.rb
  - lib/sanitize/version.rb
  - lib/sanitize.rb
  has_rdoc: true
@@ -99,8 +102,8 @@ required_ruby_version: !ruby/object:Gem::Requirement
        segments:
        - 1
        - 8
-       - 6
-       version: 1.8.6
+       - 7
+       version: 1.8.7
  required_rubygems_version: !ruby/object:Gem::Requirement
    none: false
    requirements: