sanitize 1.1.1.dev.20091102 → 1.2.0.dev.20091104
- data/HISTORY +4 -1
- data/README.rdoc +69 -2
- data/lib/sanitize/config/relaxed.rb +2 -0
- data/lib/sanitize/config.rb +3 -1
- data/lib/sanitize/version.rb +1 -1
- data/lib/sanitize.rb +147 -74
- metadata +2 -2
data/HISTORY
CHANGED
@@ -1,7 +1,10 @@
 Sanitize History
 ================================================================================
 
-Version 1.
+Version 1.2.0.dev (git)
+* Added support for transformers, which allow you to filter and alter nodes
+using your own custom logic, on top of (or instead of) Sanitize's core
+filter. See the README for details.
 * Requires Nokogiri >= 1.4.0.
 * Added elements h1 through h6 to the Relaxed whitelist. [Suggested by David
 Reese]
data/README.rdoc
CHANGED
@@ -15,7 +15,7 @@ or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
 caution.
 
 *Author*:: Ryan Grove (mailto:ryan@wonko.com)
-*Version*:: 1.
+*Version*:: 1.2.0.dev (git)
 *Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
 *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
 *Website*:: http://github.com/rgrove/sanitize
@@ -38,7 +38,8 @@ Latest development version:
 == Usage
 
 If you don't specify any configuration options, Sanitize will use its strictest
-settings by default, which means it will strip all HTML
+settings by default, which means it will strip all HTML and leave only text
+behind.
 
   require 'rubygems'
   require 'sanitize'
@@ -145,6 +146,72 @@ include the symbol <code>:relative</code> in the protocol array:
     'a' => {'href' => ['http', 'https', :relative]}
   }
 
+=== Transformers
+
+Transformers allow you to filter and alter nodes using your own custom logic, on
+top of (or instead of) Sanitize's core filter. A transformer is any object that
+responds to <code>call()</code> (such as a lambda or proc) and returns either
+<code>nil</code> or a Hash containing certain optional response values.
+
+To use one or more transformers, pass them to the <code>:transformers</code>
+config setting:
+
+  Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
+
+==== Input
+
+Each registered transformer's <code>call()</code> method will be called once for
+each element node in the HTML, and will receive as an argument an environment
+Hash that contains the following items:
+
+[<code>:config</code>]
+  The current Sanitize configuration Hash.
+
+[<code>:node</code>]
+  A Nokogiri::XML::Node object representing an HTML element.
+
+==== Processing
+
+Each transformer has full access to the Nokogiri::XML::Node that's passed into
+it and to the rest of the document via the node's <code>document()</code>
+method. Any changes will be reflected instantly in the document and passed on to
+subsequently-called transformers and to Sanitize itself. A transformer may even
+call Sanitize internally to perform custom sanitization if needed.
+
+Nodes are passed into transformers in the order in which they're traversed. It's
+important to note that Nokogiri traverses markup from the deepest node upward,
+not from the first node to the last node:
+
+  html = '<div><span>foo</span></div>'
+  transformer = lambda{|env| puts env[:node].name }
+
+  # Prints "span", then "div".
+  Sanitize.clean(html, :transformers => transformer)
+
+Transformers have a tremendous amount of power, including the power to
+completely bypass Sanitize's built-in filtering. Be careful!
+
+==== Output
+
+A transformer may return either +nil+ or a Hash. A return value of +nil+
+indicates that the transformer does not wish to act on the current node in any
+way. A returned Hash may contain the following items, all of which are optional:
+
+[<code>:attr_whitelist</code>]
+  Array of attribute names to add to the whitelist for the current node, in
+  addition to any whitelisted attributes already defined in the current config.
+
+[<code>:node</code>]
+  A Nokogiri::XML::Node object that should replace the current node. All
+  subsequent transformers and Sanitize itself will receive this new node.
+
+[<code>:whitelist</code>]
+  If _true_, the current node (and only the current node) will be whitelisted,
+  regardless of the current Sanitize config.
+
+[<code>:whitelist_nodes</code>]
+  Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
+  document, regardless of the current Sanitize config.
 
 == Contributors
 
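The transformer contract described in the README hunk above (an environment Hash in, +nil+ or a Hash out) can be tried without Nokogiri. The following is a minimal sketch, not code from the gem: `FakeNode`, `ALLOW_VIDEO`, and the choice of the `video` element and `controls` attribute are hypothetical names picked for illustration.

```ruby
# Stand-in for Nokogiri::XML::Node so the sketch runs without any gems.
# (Assumption: only #name is needed to decide what to do with a node.)
FakeNode = Struct.new(:name)

# Hypothetical transformer. It follows the documented contract: it receives
# an environment Hash with :config and :node, and returns either nil (no
# opinion on this node) or a Hash of optional response values.
ALLOW_VIDEO = lambda do |env|
  if env[:node].name.to_s.downcase == 'video'
    # Whitelist this one node and additionally allow its "controls" attribute.
    {:whitelist => true, :attr_whitelist => ['controls']}
  else
    nil
  end
end

video_result = ALLOW_VIDEO.call({:config => {}, :node => FakeNode.new('video')})
div_result   = ALLOW_VIDEO.call({:config => {}, :node => FakeNode.new('div')})

p video_result
p div_result
```

With the gem installed, a lambda of this shape would be passed as `Sanitize.clean(html, :transformers => ALLOW_VIDEO)`, in which case `env[:node]` is a Nokogiri::XML::Node rather than a Struct.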
data/lib/sanitize/config.rb
CHANGED
@@ -47,7 +47,9 @@ class Sanitize
       # URL handling protocols to allow in specific attributes. By default, no
       # protocols are allowed. Use :relative in place of a protocol if you want
       # to allow relative URLs sans protocol.
-      :protocols => {}
+      :protocols => {},
+
+      :transformers => []
     }
   end
 end
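The new `:transformers` default is an empty Array, and `initialize` in data/lib/sanitize.rb wraps the merged setting in `Array()`, which is why the README can pass a single lambda as well as a list. A small sketch of that normalization, assuming a trimmed-down `DEFAULTS` hash and a hypothetical `build_config` helper in place of the gem's own classes:

```ruby
# Hypothetical stand-in for Sanitize::Config::DEFAULT, reduced to two keys.
DEFAULTS = {:protocols => {}, :transformers => []}

# Mirrors the two steps visible in the diff: merge the caller's settings over
# the defaults, then coerce :transformers into an Array.
def build_config(overrides = {})
  config = DEFAULTS.merge(overrides)

  # Kernel#Array leaves an Array untouched, turns nil into [], and wraps any
  # object that has neither #to_ary nor #to_a (a lambda, for instance).
  config[:transformers] = Array(config[:transformers])
  config
end

single = lambda { |env| nil }

p build_config[:transformers].length
p build_config({:transformers => single})[:transformers].length
p build_config({:transformers => [single, single]})[:transformers].length
```

One consequence of relying on `Array()`: a callable object that also defines `to_a` would be expanded rather than wrapped, so such an object is safer passed inside an explicit Array.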
data/lib/sanitize/version.rb
CHANGED
data/lib/sanitize.rb
CHANGED
@@ -29,6 +29,7 @@ require 'sanitize/config/basic'
 require 'sanitize/config/relaxed'
 
 class Sanitize
+  attr_reader :config
 
   # Matches an attribute value that could be treated by a browser as a URL
   # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
@@ -37,13 +38,44 @@ class Sanitize
   # IE6 and Opera will still parse).
   REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
 
+  #--
+  # Class Methods
+  #++
+
+  # Returns a sanitized copy of _html_, using the settings in _config_ if
+  # specified.
+  def self.clean(html, config = {})
+    sanitize = Sanitize.new(config)
+    sanitize.clean(html)
+  end
+
+  # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
+  # were made.
+  def self.clean!(html, config = {})
+    sanitize = Sanitize.new(config)
+    sanitize.clean!(html)
+  end
+
+  # Sanitizes the specified Nokogiri::XML::Node and all its children.
+  def self.clean_node!(node, config = {})
+    sanitize = Sanitize.new(config)
+    sanitize.clean_node!(node)
+  end
+
   #--
   # Instance Methods
   #++
 
   # Returns a new Sanitize object initialized with the settings in _config_.
   def initialize(config = {})
+    # Sanitize configuration.
     @config = Config::DEFAULT.merge(config)
+    @config[:transformers] = Array(@config[:transformers])
+
+    # Specific nodes to whitelist (along with all their attributes). This array
+    # is generated at runtime by transformers, and is cleared before and after
+    # a fragment is cleaned (so it applies only to a specific fragment).
+    @whitelist_nodes = []
   end
 
   # Returns a sanitized copy of _html_.
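`REGEX_PROTOCOL`, visible in the context lines of the hunk above, drives the protocol check that this diff relocates into `clean_element!`. The sketch below isolates that check so it can be run on its own. It rests on two assumptions: the two alternatives after the literal colon are taken to be the decimal and hexadecimal character references for a colon, and `allowed_protocol?` is a hypothetical helper name, not a method of the gem.

```ruby
# Copy of the pattern, with the entity alternatives assumed to be the numeric
# character references for ":" (decimal 58, hex 3a), zero-padded or not.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper. Returns true when +value+ may be kept, given the list
# of protocols allowed for the attribute. The symbol :relative in the list
# permits values that carry no protocol prefix at all.
def allowed_protocol?(value, allowed)
  if value.to_s.downcase =~ REGEX_PROTOCOL
    # $1 is whatever precedes the colon (or its entity-encoded form).
    allowed.include?($1.downcase)
  else
    allowed.include?(:relative)
  end
end

allowed = ['http', 'https', :relative]

['http://example.com/', 'javascript:alert(1)', '/images/logo.png'].each do |url|
  puts "#{url.inspect} kept? #{allowed_protocol?(url, allowed)}"
end
```

The diff itself computes the inverse (`del`, true when the attribute should be unlinked); the helper here returns the positive form only because it reads more naturally in isolation.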
@@ -55,71 +87,16 @@ class Sanitize
   # Performs clean in place, returning _html_, or +nil+ if no changes were
   # made.
   def clean!(html)
+    @whitelist_nodes = []
     fragment = Nokogiri::HTML::DocumentFragment.parse(html)
+    clean_node!(fragment)
+    @whitelist_nodes = []
 
-    fragment.traverse do |node|
-      if node.comment?
-        node.unlink unless @config[:allow_comments]
-      elsif node.element?
-        name = node.name.to_s.downcase
-
-        # Delete any element that isn't in the whitelist.
-        unless @config[:elements].include?(name)
-          node.children.each { |n| node.add_previous_sibling(n) }
-          node.unlink
-          next
-        end
-
-        attr_whitelist = ((@config[:attributes][name] || []) +
-            (@config[:attributes][:all] || [])).uniq
-
-        if attr_whitelist.empty?
-          # Delete all attributes from elements with no whitelisted
-          # attributes.
-          node.attribute_nodes.each { |attr| attr.remove }
-        else
-          # Delete any attribute that isn't in the whitelist for this element.
-          node.attribute_nodes.each do |attr|
-            attr.unlink unless attr_whitelist.include?(attr.name.downcase)
-          end
-
-          # Delete remaining attributes that use unacceptable protocols.
-          if @config[:protocols].has_key?(name)
-            protocol = @config[:protocols][name]
-
-            node.attribute_nodes.each do |attr|
-              attr_name = attr.name.downcase
-              next false unless protocol.has_key?(attr_name)
-
-              del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
-                !protocol[attr_name].include?($1.downcase)
-              else
-                !protocol[attr_name].include?(:relative)
-              end
-
-              attr.unlink if del
-            end
-          end
-        end
-
-        # Add required attributes.
-        if @config[:add_attributes].has_key?(name)
-          @config[:add_attributes][name].each do |key, val|
-            node[key] = val
-          end
-        end
-      elsif node.cdata?
-        node.replace(Nokogiri::XML::Text.new(node.text, node.document))
-      end
-    end
-
-    # Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
-    # string no matter what we ask for. This will be fixed in 1.4.0, but for
-    # now we have to hack around it to prevent errors.
     output_method_params = {:encoding => 'utf-8', :indent => 0}
+
     if @config[:output] == :xhtml
       output_method = fragment.method(:to_xhtml)
-      output_method_params
+      output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
     elsif @config[:output] == :html
       output_method = fragment.method(:to_html)
     else
@@ -127,29 +104,125 @@ class Sanitize
     end
 
     result = output_method.call(output_method_params)
+
+    # Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
+    # string no matter what we ask for. This will be fixed in 1.4.0, but for
+    # now we have to hack around it to prevent errors.
     result.force_encoding('utf-8') if RUBY_VERSION >= '1.9'
 
     return result == html ? nil : html[0, html.length] = result
   end
 
-
-
-
+  # Sanitizes the specified Nokogiri::XML::Node and all its children.
+  def clean_node!(node)
+    raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
+
+    node.traverse do |traversed_node|
+      if traversed_node.element?
+        clean_element!(traversed_node)
+      elsif traversed_node.comment?
+        traversed_node.unlink unless @config[:allow_comments]
+      elsif traversed_node.cdata?
+        traversed_node.replace(Nokogiri::XML::Text.new(traversed_node.text,
+            traversed_node.document))
+      end
+    end
+
+    node
+  end
+
+  private
 
-
-#
-
-
-
-
+  def clean_element!(node)
+    # Run this node through all configured transformers.
+    transform = transform_element!(node)
+
+    # If this node is in the dynamic whitelist array (built at runtime by
+    # transformers), let it live with all of its attributes intact.
+    return if @whitelist_nodes.include?(node)
+
+    name = node.name.to_s.downcase
+
+    # Delete any element that isn't in the whitelist.
+    unless transform[:whitelist] || @config[:elements].include?(name)
+      node.children.each { |n| node.add_previous_sibling(n) }
+      node.unlink
+      return
     end
 
-
-
-
-
-
+    attr_whitelist = (transform[:attr_whitelist] +
+        (@config[:attributes][name] || []) +
+        (@config[:attributes][:all] || [])).uniq
+
+    if attr_whitelist.empty?
+      # Delete all attributes from elements with no whitelisted attributes.
+      node.attribute_nodes.each {|attr| attr.remove }
+    else
+      # Delete any attribute that isn't in the whitelist for this element.
+      node.attribute_nodes.each do |attr|
+        attr.unlink unless attr_whitelist.include?(attr.name.downcase)
+      end
+
+      # Delete remaining attributes that use unacceptable protocols.
+      if @config[:protocols].has_key?(name)
+        protocol = @config[:protocols][name]
+
+        node.attribute_nodes.each do |attr|
+          attr_name = attr.name.downcase
+          next false unless protocol.has_key?(attr_name)
+
+          del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
+            !protocol[attr_name].include?($1.downcase)
+          else
+            !protocol[attr_name].include?(:relative)
+          end
+
+          attr.unlink if del
+        end
+      end
+    end
+
+    # Add required attributes.
+    if @config[:add_attributes].has_key?(name)
+      @config[:add_attributes][name].each do |key, val|
+        node[key] = val
+      end
     end
+
+    transform
   end
 
+  def transform_element!(node)
+    output = {
+      :attr_whitelist => [],
+      :node => node,
+      :whitelist => false
+    }
+
+    @config[:transformers].inject(node) do |transformer_node, transformer|
+      transform = transformer.call({
+        :config => @config,
+        :node => transformer_node
+      })
+
+      if transform.nil?
+        transformer_node
+      elsif transform.is_a?(Hash)
+        if transform[:whitelist_nodes].is_a?(Array)
+          @whitelist_nodes += transform[:whitelist_nodes]
+          @whitelist_nodes.uniq!
+        end
+
+        output[:attr_whitelist] += transform[:attr_whitelist] if transform[:attr_whitelist].is_a?(Array)
+        output[:whitelist] ||= true if transform[:whitelist]
+        output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
+      else
+        raise Error, "transformer output must be a Hash or nil"
+      end
+    end
+
+    node.replace(output[:node]) if node != output[:node]
+
+    return output
+  end
 end
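The way `transform_element!` folds the results of several transformers into one `output` Hash is easier to follow in isolation. Below is a plain-Ruby imitation, offered as a sketch rather than as the gem's code. `merge_transformer_output` is a hypothetical name, nodes are ordinary objects instead of Nokogiri::XML::Node instances, any non-nil `:node` is accepted as a replacement, the `:whitelist_nodes` bookkeeping is left out, and `ArgumentError` is substituted for the `Error` constant raised in the diff.

```ruby
# Runs +node+ through each transformer in order and accumulates the results.
# Each transformer sees the node as left by the transformers before it.
def merge_transformer_output(node, transformers, config = {})
  output = {:attr_whitelist => [], :node => node, :whitelist => false}

  transformers.each do |transformer|
    result = transformer.call({:config => config, :node => output[:node]})

    # nil means "no opinion": carry on with the current state.
    next if result.nil?

    unless result.is_a?(Hash)
      raise ArgumentError, 'transformer output must be a Hash or nil'
    end

    # Attribute whitelists accumulate across transformers.
    if result[:attr_whitelist].is_a?(Array)
      output[:attr_whitelist] += result[:attr_whitelist]
    end

    # Once any transformer whitelists the node, it stays whitelisted.
    output[:whitelist] = true if result[:whitelist]

    # A replacement node is handed to every later transformer.
    output[:node] = result[:node] unless result[:node].nil?
  end

  output
end

add_id    = lambda { |env| {:attr_whitelist => ['id']} }
no_op     = lambda { |env| nil }
whitelist = lambda { |env| {:whitelist => true, :attr_whitelist => ['class']} }

p merge_transformer_output('div', [add_id, no_op, whitelist])
```

An explicit `each` over a mutable `output` Hash is used here instead of the `inject` in the diff; the node each transformer receives is the same in both forms, since the diff's block value is always the most recent node.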
metadata
CHANGED
@@ -1,7 +1,7 @@
 --- !ruby/object:Gem::Specification
 name: sanitize
 version: !ruby/object:Gem::Version
-  version: 1.
+  version: 1.2.0.dev.20091104
 platform: ruby
 authors:
 - Ryan Grove
@@ -9,7 +9,7 @@ autorequire:
 bindir: bin
 cert_chain: []
 
-date: 2009-11-
+date: 2009-11-04 00:00:00 -08:00
 default_executable:
 dependencies:
 - !ruby/object:Gem::Dependency