sanitize 1.1.1.dev.20091102 → 1.2.0.dev.20091104

data/HISTORY CHANGED
@@ -1,7 +1,10 @@
  Sanitize History
  ================================================================================
 
- Version 1.1.1.dev (git)
+ Version 1.2.0.dev (git)
+   * Added support for transformers, which allow you to filter and alter nodes
+     using your own custom logic, on top of (or instead of) Sanitize's core
+     filter. See the README for details.
    * Requires Nokogiri >= 1.4.0.
    * Added elements h1 through h6 to the Relaxed whitelist. [Suggested by David
      Reese]
data/README.rdoc CHANGED
@@ -15,7 +15,7 @@ or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
  caution.
 
  *Author*:: Ryan Grove (mailto:ryan@wonko.com)
- *Version*:: 1.1.1.dev (git)
+ *Version*:: 1.2.0.dev (git)
  *Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
  *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
  *Website*:: http://github.com/rgrove/sanitize
@@ -38,7 +38,8 @@ Latest development version:
  == Usage
 
  If you don't specify any configuration options, Sanitize will use its strictest
- settings by default, which means it will strip all HTML.
+ settings by default, which means it will strip all HTML and leave only text
+ behind.
 
    require 'rubygems'
    require 'sanitize'
@@ -145,6 +146,72 @@ include the symbol <code>:relative</code> in the protocol array:
      'a' => {'href' => ['http', 'https', :relative]}
    }
 
+ === Transformers
+
+ Transformers allow you to filter and alter nodes using your own custom logic, on
+ top of (or instead of) Sanitize's core filter. A transformer is any object that
+ responds to <code>call()</code> (such as a lambda or proc) and returns either
+ <code>nil</code> or a Hash containing certain optional response values.
+
+ To use one or more transformers, pass them to the <code>:transformers</code>
+ config setting:
+
+   Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
+
+ ==== Input
+
+ Each registered transformer's <code>call()</code> method will be called once for
+ each element node in the HTML, and will receive as an argument an environment
+ Hash that contains the following items:
+
+ [<code>:config</code>]
+   The current Sanitize configuration Hash.
+
+ [<code>:node</code>]
+   A Nokogiri::XML::Node object representing an HTML element.
+
+ ==== Processing
+
+ Each transformer has full access to the Nokogiri::XML::Node that's passed into
+ it and to the rest of the document via the node's <code>document()</code>
+ method. Any changes will be reflected instantly in the document and passed on to
+ subsequently-called transformers and to Sanitize itself. A transformer may even
+ call Sanitize internally to perform custom sanitization if needed.
+
+ Nodes are passed into transformers in the order in which they're traversed. It's
+ important to note that Nokogiri traverses markup from the deepest node upward,
+ not from the first node to the last node:
+
+   html = '<div><span>foo</span></div>'
+   transformer = lambda{|env| puts env[:node].name }
+
+   # Prints "span", then "div".
+   Sanitize.clean(html, :transformers => transformer)
+
+ Transformers have a tremendous amount of power, including the power to
+ completely bypass Sanitize's built-in filtering. Be careful!
+
+ ==== Output
+
+ A transformer may return either +nil+ or a Hash. A return value of +nil+
+ indicates that the transformer does not wish to act on the current node in any
+ way. A returned Hash may contain the following items, all of which are optional:
+
+ [<code>:attr_whitelist</code>]
+   Array of attribute names to add to the whitelist for the current node, in
+   addition to any whitelisted attributes already defined in the current config.
+
+ [<code>:node</code>]
+   A Nokogiri::XML::Node object that should replace the current node. All
+   subsequent transformers and Sanitize itself will receive this new node.
+
+ [<code>:whitelist</code>]
+   If _true_, the current node (and only the current node) will be whitelisted,
+   regardless of the current Sanitize config.
+
+ [<code>:whitelist_nodes</code>]
+   Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
+   document, regardless of the current Sanitize config.
 
  == Contributors
 
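As a rough illustration of the transformer contract described in the README addition above, the sketch below writes a transformer as a lambda that returns either +nil+ or a Hash with <code>:attr_whitelist</code>. The names <code>FakeNode</code> and <code>ALLOW_CODE_LANGUAGE</code> are hypothetical and not part of Sanitize; <code>FakeNode</code> is only a stand-in for Nokogiri::XML::Node so the example runs without the nokogiri gem.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node so this sketch runs without
# the nokogiri gem. It mimics only the two calls used below: #name and #[].
class FakeNode
  attr_reader :name

  def initialize(name, attributes = {})
    @name = name
    @attributes = attributes
  end

  # Mirrors the node['attr'] style of attribute lookup.
  def [](attr_name)
    @attributes[attr_name]
  end
end

# Hypothetical transformer: permit the "class" attribute on <code> elements
# when it looks like "language-ruby", and decline to act on anything else.
ALLOW_CODE_LANGUAGE = lambda do |env|
  node = env[:node]

  # Returning nil tells Sanitize this transformer has nothing to say.
  return nil unless node.name.to_s.downcase == 'code'
  return nil unless node['class'].to_s =~ /\Alanguage-[a-z0-9]+\z/

  # Ask for "class" to be added to the attribute whitelist for this node only.
  {:attr_whitelist => ['class']}
end

code_node = FakeNode.new('code', 'class' => 'language-ruby')
div_node  = FakeNode.new('div',  'class' => 'language-ruby')

p ALLOW_CODE_LANGUAGE.call(:config => {}, :node => code_node) # a Hash
p ALLOW_CODE_LANGUAGE.call(:config => {}, :node => div_node)  # nil
```

With the gem installed, the same lambda would be registered the way the README shows, for example <code>Sanitize.clean(html, :transformers => ALLOW_CODE_LANGUAGE)</code>, and would then receive real Nokogiri nodes in <code>env[:node]</code>.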
@@ -20,6 +20,8 @@
  # SOFTWARE.
  #++
 
+
+
  class Sanitize
    module Config
      RELAXED = {
@@ -47,7 +47,9 @@ class Sanitize
        # URL handling protocols to allow in specific attributes. By default, no
        # protocols are allowed. Use :relative in place of a protocol if you want
        # to allow relative URLs sans protocol.
-       :protocols => {}
+       :protocols => {},
+
+       :transformers => []
      }
    end
  end
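The <code>:protocols</code> comment in the hunk above is easier to follow next to the check that consumes it. The following is a minimal sketch, assuming the <code>REGEX_PROTOCOL</code> pattern shown in the lib/sanitize.rb diff and condensing the per-attribute test from that file into a hypothetical helper, <code>protocol_allowed?</code>, which is not part of Sanitize's API.

```ruby
# Pattern copied from the lib/sanitize.rb diff: captures whatever precedes a
# ":" (or an HTML-entity-encoded colon) at the start of an attribute value.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper: true when _value_ may stay, given the list of allowed
# protocols for one attribute (for example ['http', 'https', :relative]).
def protocol_allowed?(value, allowed)
  if value.to_s.downcase =~ REGEX_PROTOCOL
    # The value has a protocol prefix; it must be listed explicitly.
    allowed.include?($1.downcase)
  else
    # No protocol prefix, so the value is treated as a relative URL.
    allowed.include?(:relative)
  end
end

allowed = ['http', 'https', :relative]

p protocol_allowed?('http://example.com/', allowed)      # true
p protocol_allowed?('/about', allowed)                   # true
p protocol_allowed?('javascript:alert(1)', allowed)      # false
p protocol_allowed?('&#106;avascript:alert(1)', allowed) # false
p protocol_allowed?('/about', ['http'])                  # false
```

The entity-encoded case is rejected because the captured prefix is the literal string <code>&#106;avascript</code>, which matches no whitelisted protocol; nothing is decoded before the comparison.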
@@ -1,3 +1,3 @@
  class Sanitize
-   VERSION = '1.1.1.dev.20091102'
+   VERSION = '1.2.0.dev.20091104'
  end
data/lib/sanitize.rb CHANGED
@@ -29,6 +29,7 @@ require 'sanitize/config/basic'
  require 'sanitize/config/relaxed'
 
  class Sanitize
+   attr_reader :config
 
    # Matches an attribute value that could be treated by a browser as a URL
    # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
@@ -37,13 +38,44 @@ class Sanitize
    # IE6 and Opera will still parse).
    REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
 
+   #--
+   # Class Methods
+   #++
+
+   # Returns a sanitized copy of _html_, using the settings in _config_ if
+   # specified.
+   def self.clean(html, config = {})
+     sanitize = Sanitize.new(config)
+     sanitize.clean(html)
+   end
+
+   # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
+   # were made.
+   def self.clean!(html, config = {})
+     sanitize = Sanitize.new(config)
+     sanitize.clean!(html)
+   end
+
+   # Sanitizes the specified Nokogiri::XML::Node and all its children.
+   def self.clean_node!(node, config = {})
+     sanitize = Sanitize.new(config)
+     sanitize.clean_node!(node)
+   end
+
    #--
    # Instance Methods
    #++
 
    # Returns a new Sanitize object initialized with the settings in _config_.
    def initialize(config = {})
+     # Sanitize configuration.
      @config = Config::DEFAULT.merge(config)
+     @config[:transformers] = Array(@config[:transformers])
+
+     # Specific nodes to whitelist (along with all their attributes). This array
+     # is generated at runtime by transformers, and is cleared before and after
+     # a fragment is cleaned (so it applies only to a specific fragment).
+     @whitelist_nodes = []
    end
 
    # Returns a sanitized copy of _html_.
@@ -55,71 +87,16 @@ class Sanitize
    # Performs clean in place, returning _html_, or +nil+ if no changes were
    # made.
    def clean!(html)
+     @whitelist_nodes = []
      fragment = Nokogiri::HTML::DocumentFragment.parse(html)
+     clean_node!(fragment)
+     @whitelist_nodes = []
 
-     fragment.traverse do |node|
-       if node.comment?
-         node.unlink unless @config[:allow_comments]
-       elsif node.element?
-         name = node.name.to_s.downcase
-
-         # Delete any element that isn't in the whitelist.
-         unless @config[:elements].include?(name)
-           node.children.each { |n| node.add_previous_sibling(n) }
-           node.unlink
-           next
-         end
-
-         attr_whitelist = ((@config[:attributes][name] || []) +
-             (@config[:attributes][:all] || [])).uniq
-
-         if attr_whitelist.empty?
-           # Delete all attributes from elements with no whitelisted
-           # attributes.
-           node.attribute_nodes.each { |attr| attr.remove }
-         else
-           # Delete any attribute that isn't in the whitelist for this element.
-           node.attribute_nodes.each do |attr|
-             attr.unlink unless attr_whitelist.include?(attr.name.downcase)
-           end
-
-           # Delete remaining attributes that use unacceptable protocols.
-           if @config[:protocols].has_key?(name)
-             protocol = @config[:protocols][name]
-
-             node.attribute_nodes.each do |attr|
-               attr_name = attr.name.downcase
-               next false unless protocol.has_key?(attr_name)
-
-               del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
-                 !protocol[attr_name].include?($1.downcase)
-               else
-                 !protocol[attr_name].include?(:relative)
-               end
-
-               attr.unlink if del
-             end
-           end
-         end
-
-         # Add required attributes.
-         if @config[:add_attributes].has_key?(name)
-           @config[:add_attributes][name].each do |key, val|
-             node[key] = val
-           end
-         end
-       elsif node.cdata?
-         node.replace(Nokogiri::XML::Text.new(node.text, node.document))
-       end
-     end
-
-     # Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
-     # string no matter what we ask for. This will be fixed in 1.4.0, but for
-     # now we have to hack around it to prevent errors.
      output_method_params = {:encoding => 'utf-8', :indent => 0}
+
      if @config[:output] == :xhtml
        output_method = fragment.method(:to_xhtml)
-       output_method_params.merge!(:save_with => Nokogiri::XML::Node::SaveOptions::AS_XHTML)
+       output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
      elsif @config[:output] == :html
        output_method = fragment.method(:to_html)
      else
@@ -127,29 +104,125 @@ class Sanitize
      end
 
      result = output_method.call(output_method_params)
+
+     # Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
+     # string no matter what we ask for. This will be fixed in 1.4.0, but for
+     # now we have to hack around it to prevent errors.
      result.force_encoding('utf-8') if RUBY_VERSION >= '1.9'
 
      return result == html ? nil : html[0, html.length] = result
    end
 
-   #--
-   # Class Methods
-   #++
+   # Sanitizes the specified Nokogiri::XML::Node and all its children.
+   def clean_node!(node)
+     raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
+
+     node.traverse do |traversed_node|
+       if traversed_node.element?
+         clean_element!(traversed_node)
+       elsif traversed_node.comment?
+         traversed_node.unlink unless @config[:allow_comments]
+       elsif traversed_node.cdata?
+         traversed_node.replace(Nokogiri::XML::Text.new(traversed_node.text,
+             traversed_node.document))
+       end
+     end
+
+     node
+   end
+
+   private
 
-   class << self
-     # Returns a sanitized copy of _html_, using the settings in _config_ if
-     # specified.
-     def clean(html, config = {})
-       sanitize = Sanitize.new(config)
-       sanitize.clean(html)
+   def clean_element!(node)
+     # Run this node through all configured transformers.
+     transform = transform_element!(node)
+
+     # If this node is in the dynamic whitelist array (built at runtime by
+     # transformers), let it live with all of its attributes intact.
+     return if @whitelist_nodes.include?(node)
+
+     name = node.name.to_s.downcase
+
+     # Delete any element that isn't in the whitelist.
+     unless transform[:whitelist] || @config[:elements].include?(name)
+       node.children.each { |n| node.add_previous_sibling(n) }
+       node.unlink
+       return
      end
 
-     # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
-     # were made.
-     def clean!(html, config = {})
-       sanitize = Sanitize.new(config)
-       sanitize.clean!(html)
+     attr_whitelist = (transform[:attr_whitelist] +
+         (@config[:attributes][name] || []) +
+         (@config[:attributes][:all] || [])).uniq
+
+     if attr_whitelist.empty?
+       # Delete all attributes from elements with no whitelisted attributes.
+       node.attribute_nodes.each {|attr| attr.remove }
+     else
+       # Delete any attribute that isn't in the whitelist for this element.
+       node.attribute_nodes.each do |attr|
+         attr.unlink unless attr_whitelist.include?(attr.name.downcase)
+       end
+
+       # Delete remaining attributes that use unacceptable protocols.
+       if @config[:protocols].has_key?(name)
+         protocol = @config[:protocols][name]
+
+         node.attribute_nodes.each do |attr|
+           attr_name = attr.name.downcase
+           next false unless protocol.has_key?(attr_name)
+
+           del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
+             !protocol[attr_name].include?($1.downcase)
+           else
+             !protocol[attr_name].include?(:relative)
+           end
+
+           attr.unlink if del
+         end
+       end
+     end
+
+     # Add required attributes.
+     if @config[:add_attributes].has_key?(name)
+       @config[:add_attributes][name].each do |key, val|
+         node[key] = val
+       end
      end
+
+     transform
    end
 
+   def transform_element!(node)
+     output = {
+       :attr_whitelist => [],
+       :node => node,
+       :whitelist => false
+     }
+
+     @config[:transformers].inject(node) do |transformer_node, transformer|
+       transform = transformer.call({
+         :config => @config,
+         :node => transformer_node
+       })
+
+       if transform.nil?
+         transformer_node
+       elsif transform.is_a?(Hash)
+         if transform[:whitelist_nodes].is_a?(Array)
+           @whitelist_nodes += transform[:whitelist_nodes]
+           @whitelist_nodes.uniq!
+         end
+
+         output[:attr_whitelist] += transform[:attr_whitelist] if transform[:attr_whitelist].is_a?(Array)
+         output[:whitelist] ||= true if transform[:whitelist]
+         output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
+       else
+         raise Error, "transformer output must be a Hash or nil"
+       end
+     end
+
+     node.replace(output[:node]) if node != output[:node]
+
+     return output
+   end
  end
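The way <code>transform_element!</code> folds several transformer results into one can be seen in isolation with a simplified sketch. This is not the code from the diff: <code>run_transformers</code> is a hypothetical free-standing function, it omits the <code>:whitelist_nodes</code> bookkeeping, it accepts any non-nil <code>:node</code> rather than checking for Nokogiri::XML::Node, and it raises ArgumentError where the diff raises <code>Error</code>.

```ruby
# Simplified, hypothetical model of how transformer results are combined.
# Each transformer sees the node chosen by the transformers before it.
def run_transformers(transformers, node, config = {})
  output = {:attr_whitelist => [], :node => node, :whitelist => false}

  # Array() lets callers pass one transformer or a list, as initialize does
  # for the :transformers setting.
  Array(transformers).each do |transformer|
    result = transformer.call(:config => config, :node => output[:node])

    # nil means "this transformer does not wish to act on the node".
    next if result.nil?

    unless result.is_a?(Hash)
      raise ArgumentError, 'transformer output must be a Hash or nil'
    end

    # Attribute whitelists accumulate across transformers.
    if result[:attr_whitelist].is_a?(Array)
      output[:attr_whitelist] += result[:attr_whitelist]
    end

    # Once any transformer whitelists the node, it stays whitelisted.
    output[:whitelist] = true if result[:whitelist]

    # A replacement node is handed to every later transformer.
    output[:node] = result[:node] unless result[:node].nil?
  end

  output
end

add_class = lambda { |env| {:attr_whitelist => ['class']} }
no_op     = lambda { |env| nil }
allow     = lambda { |env| {:attr_whitelist => ['id'], :whitelist => true} }

# Symbols stand in for nodes here, since no parsing is involved.
p run_transformers([add_class, no_op, allow], :original_node)
```

The accumulate-and-continue shape matters because later transformers cannot revoke what earlier ones granted; ordering in the <code>:transformers</code> array only affects which node each transformer sees.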
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: sanitize
  version: !ruby/object:Gem::Version
-   version: 1.1.1.dev.20091102
+   version: 1.2.0.dev.20091104
  platform: ruby
  authors:
  - Ryan Grove
@@ -9,7 +9,7 @@ autorequire:
  bindir: bin
  cert_chain: []
 
- date: 2009-11-02 00:00:00 -08:00
+ date: 2009-11-04 00:00:00 -08:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency