darkhelmet-sanitize 1.2.0.dev.20091104
- data/HISTORY +73 -0
- data/LICENSE +18 -0
- data/README.rdoc +249 -0
- data/lib/sanitize/config/basic.rb +49 -0
- data/lib/sanitize/config/relaxed.rb +59 -0
- data/lib/sanitize/config/restricted.rb +29 -0
- data/lib/sanitize/config.rb +55 -0
- data/lib/sanitize/version.rb +3 -0
- data/lib/sanitize.rb +228 -0
- metadata +92 -0
data/HISTORY
ADDED
@@ -0,0 +1,73 @@
+Sanitize History
+================================================================================
+
+Version 1.2.0.dev (git)
+  * Added support for transformers, which allow you to filter and alter nodes
+    using your own custom logic, on top of (or instead of) Sanitize's core
+    filter. See the README for details.
+  * Requires Nokogiri >= 1.4.0.
+  * Added elements h1 through h6 to the Relaxed whitelist. [Suggested by David
+    Reese]
+
+Version 1.1.0 (2009-10-11)
+  * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
+  * Added an :output config setting to allow the output format to be specified.
+    Supported formats are :xhtml (the default) and :html (which outputs HTML4).
+  * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
+    path segments. [Peter Cooper]
+
+Version 1.0.8 (2009-04-23)
+  * Added a workaround for an Hpricot bug that prevents attribute names from
+    being downcased in recent versions of Hpricot. This was exploitable to
+    prevent non-whitelisted protocols from being cleaned. [Reported by Ben
+    Wanicur]
+
+Version 1.0.7 (2009-04-11)
+  * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
+  * Fixed a bug that caused named character entities containing digits (like
+    &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
+    Steinmetz]
+
+Version 1.0.6 (2009-02-23)
+  * Removed htmlentities gem dependency.
+  * Existing well-formed character entity references in the input string are now
+    preserved rather than being decoded and re-encoded.
+  * The ' character is now encoded as &#39; instead of &apos; to prevent
+    problems in IE6.
+  * You can now specify the symbol :all in place of an element name in the
+    attributes config hash to allow certain attributes on all elements. [Thanks
+    to Mutwin Kraus]
+
+Version 1.0.5 (2009-02-05)
+  * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
+    protocols from being cleaned when relative URLs were allowed. [Reported by
+    Dev Purkayastha]
+  * Fixed "undefined method `parent='" exceptions caused by parser changes in
+    edge Hpricot.
+
+Version 1.0.4 (2009-01-16)
+  * Fixed a bug that made it possible to sneak a non-whitelisted element through
+    by repeating it several times in a row. All versions of Sanitize prior to
+    1.0.4 are vulnerable. [Reported by Cristobal]
+
+Version 1.0.3 (2009-01-15)
+  * Fixed a bug whereby incomplete Unicode or hex entities could be used to
+    prevent non-whitelisted protocols from being cleaned. Since IE6 and Opera
+    still decode the incomplete entities, users of those browsers may be
+    vulnerable to malicious script injection on websites using versions of
+    Sanitize prior to 1.0.3.
+
+Version 1.0.2 (2009-01-04)
+  * Fixed a bug that caused an exception to be thrown when parsing a valueless
+    attribute that's expected to contain a URL.
+
+Version 1.0.1 (2009-01-01)
+  * You can now specify :relative in a protocol config array to allow attributes
+    containing relative URLs with no protocol. The Basic and Relaxed configs
+    have been updated to allow relative URLs.
+  * Added a workaround for an Hpricot bug that causes HTML entities for
+    non-ASCII characters to be replaced by question marks, and all other
+    entities to be destructively decoded.
+
+Version 1.0.0 (2008-12-25)
+  * First release.
data/LICENSE
ADDED
@@ -0,0 +1,18 @@
+Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the 'Software'), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
ADDED
@@ -0,0 +1,249 @@
+= Sanitize
+
+Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
+elements and attributes, Sanitize will remove all unacceptable HTML from a
+string.
+
+Using a simple configuration syntax, you can tell Sanitize to allow certain
+elements, certain attributes within those elements, and even certain URL
+protocols within attributes that contain URLs. Any HTML elements or attributes
+that you don't explicitly allow will be removed.
+
+Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
+of fragile regular expressions, Sanitize has no trouble dealing with malformed
+or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
+caution.
+
+*Author*:: Ryan Grove (mailto:ryan@wonko.com)
+*Version*:: 1.2.0.dev (git)
+*Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
+*License*:: MIT License (http://opensource.org/licenses/mit-license.php)
+*Website*:: http://github.com/rgrove/sanitize
+
+== Requires
+
+* Nokogiri >= 1.4.0
+* libxml2 >= 2.7.2
+
+== Installation
+
+Latest stable release:
+
+  gem install sanitize
+
+Latest development version:
+
+  gem install sanitize -s http://gemcutter.org --prerelease
+
+== Usage
+
+If you don't specify any configuration options, Sanitize will use its strictest
+settings by default, which means it will strip all HTML and leave only text
+behind.
+
+  require 'rubygems'
+  require 'sanitize'
+
+  html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
+
+  Sanitize.clean(html) # => 'foo'
+
+== Configuration
+
+In addition to the ultra-safe default settings, Sanitize comes with three other
+built-in modes.
+
+=== Sanitize::Config::RESTRICTED
+
+Allows only very simple inline formatting markup. No links, images, or block
+elements.
+
+  Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'
+
+=== Sanitize::Config::BASIC
+
+Allows a variety of markup including formatting tags, links, and lists. Images
+and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
+protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
+mitigate SEO spam.
+
+  Sanitize.clean(html, Sanitize::Config::BASIC)
+  # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
+
+=== Sanitize::Config::RELAXED
+
+Allows an even wider variety of markup than BASIC, including images and tables.
+Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
+are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
+added to links.
+
+  Sanitize.clean(html, Sanitize::Config::RELAXED)
+  # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
+
+=== Custom Configuration
+
+If the built-in modes don't meet your needs, you can easily specify a custom
+configuration:
+
+  Sanitize.clean(html, :elements => ['a', 'span'],
+      :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
+      :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
+
+==== :elements
+
+Array of element names to allow. Specify all names in lowercase.
+
+  :elements => [
+    'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
+    'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
+    'sup', 'u', 'ul'
+  ]
+
+==== :attributes
+
+Attributes to allow for specific elements. Specify all element names and
+attributes in lowercase.
+
+  :attributes => {
+    'a' => ['href', 'title'],
+    'blockquote' => ['cite'],
+    'img' => ['alt', 'src', 'title']
+  }
+
+If you'd like to allow certain attributes on all elements, use the symbol
+<code>:all</code> instead of an element name.
+
+  :attributes => {
+    :all => ['class'],
+    'a' => ['href', 'title']
+  }
+
+==== :add_attributes
+
+Attributes to add to specific elements. If the attribute already exists, it will
+be replaced with the value specified here. Specify all element names and
+attributes in lowercase.
+
+  :add_attributes => {
+    'a' => {'rel' => 'nofollow'}
+  }
+
+==== :protocols
+
+URL protocols to allow in specific attributes. If an attribute is listed here
+and contains a protocol other than those specified (or if it contains no
+protocol at all), it will be removed.
+
+  :protocols => {
+    'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
+    'img' => {'src' => ['http', 'https']}
+  }
+
+If you'd like to allow the use of relative URLs which don't have a protocol,
+include the symbol <code>:relative</code> in the protocol array:
+
+  :protocols => {
+    'a' => {'href' => ['http', 'https', :relative]}
+  }
+
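Reviewer note: the `:protocols` rule described above can be sketched in plain Ruby without loading Sanitize. The regular expression is the `REGEX_PROTOCOL` constant from `lib/sanitize.rb` in this diff; the helper name `protocol_allowed?` is hypothetical and is not part of the Sanitize API. Treat this as an illustration of the decision logic, not as the library's implementation.

```ruby
# Copied from lib/sanitize.rb in this release: matches a leading protocol,
# including a colon written as a (possibly incomplete) character entity.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper (not part of Sanitize): returns true when +value+ may
# stay in an attribute whose allowed protocols are listed in +allowed+.
def protocol_allowed?(value, allowed)
  if value.to_s.downcase =~ REGEX_PROTOCOL
    # A protocol prefix was found; it must be listed explicitly.
    allowed.include?($1.downcase)
  else
    # No protocol prefix, so the value is treated as a relative URL.
    allowed.include?(:relative)
  end
end

href_protocols = ['http', 'https', :relative]

puts protocol_allowed?('http://foo.com/', href_protocols)          # true
puts protocol_allowed?('javascript:alert(1)', href_protocols)      # false
puts protocol_allowed?('javascript&#58;alert(1)', href_protocols)  # false
puts protocol_allowed?('/docs/page.html', href_protocols)          # true
puts protocol_allowed?('/docs/page.html', ['http', 'https'])       # false
```

The last two calls show why `:relative` has to be listed explicitly: a URL with no protocol is rejected unless the array opts in.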
+
+=== Transformers
+
+Transformers allow you to filter and alter nodes using your own custom logic, on
+top of (or instead of) Sanitize's core filter. A transformer is any object that
+responds to <code>call()</code> (such as a lambda or proc) and returns either
+<code>nil</code> or a Hash containing certain optional response values.
+
+To use one or more transformers, pass them to the <code>:transformers</code>
+config setting:
+
+  Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
+
+==== Input
+
+Each registered transformer's <code>call()</code> method will be called once for
+each element node in the HTML, and will receive as an argument an environment
+Hash that contains the following items:
+
+[<code>:config</code>]
+  The current Sanitize configuration Hash.
+
+[<code>:node</code>]
+  A Nokogiri::XML::Node object representing an HTML element.
+
+==== Processing
+
+Each transformer has full access to the Nokogiri::XML::Node that's passed into
+it and to the rest of the document via the node's <code>document()</code>
+method. Any changes will be reflected instantly in the document and passed on to
+subsequently-called transformers and to Sanitize itself. A transformer may even
+call Sanitize internally to perform custom sanitization if needed.
+
+Nodes are passed into transformers in the order in which they're traversed. It's
+important to note that Nokogiri traverses markup from the deepest node upward,
+not from the first node to the last node:
+
+  html = '<div><span>foo</span></div>'
+  transformer = lambda{|env| puts env[:node].name }
+
+  # Prints "span", then "div".
+  Sanitize.clean(html, :transformers => transformer)
+
+Transformers have a tremendous amount of power, including the power to
+completely bypass Sanitize's built-in filtering. Be careful!
+
+==== Output
+
+A transformer may return either +nil+ or a Hash. A return value of +nil+
+indicates that the transformer does not wish to act on the current node in any
+way. A returned Hash may contain the following items, all of which are optional:
+
+[<code>:attr_whitelist</code>]
+  Array of attribute names to add to the whitelist for the current node, in
+  addition to any whitelisted attributes already defined in the current config.
+
+[<code>:node</code>]
+  A Nokogiri::XML::Node object that should replace the current node. All
+  subsequent transformers and Sanitize itself will receive this new node.
+
+[<code>:whitelist</code>]
+  If _true_, the current node (and only the current node) will be whitelisted,
+  regardless of the current Sanitize config.
+
+[<code>:whitelist_nodes</code>]
+  Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
+  document, regardless of the current Sanitize config.
+
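Reviewer note: a minimal sketch of a transformer that follows the input and output contract described above. To keep it runnable without Nokogiri, `StubNode` is a hypothetical stand-in that exposes only the `name` method of `Nokogiri::XML::Node`; the `ALLOW_HEADING_IDS` transformer and its rule (permit an `id` attribute on headings) are illustrative assumptions, not something shipped with the gem.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node; only #name is used here.
StubNode = Struct.new(:name)

# Illustrative transformer: asks Sanitize to also allow an "id" attribute on
# heading elements, and declines to act (returns nil) on everything else.
ALLOW_HEADING_IDS = lambda do |env|
  if env[:node].name =~ /\Ah[1-6]\z/
    {:attr_whitelist => ['id']}
  end
end

heading_env = {:config => {}, :node => StubNode.new('h2')}
span_env    = {:config => {}, :node => StubNode.new('span')}

p ALLOW_HEADING_IDS.call(heading_env)  # a Hash with :attr_whitelist
p ALLOW_HEADING_IDS.call(span_env)     # nil

# With the gem loaded, the same lambda could be passed through the
# :transformers setting shown in the README, for example:
#   Sanitize.clean(html,
#     Sanitize::Config::RELAXED.merge(:transformers => ALLOW_HEADING_IDS))
```

Returning `nil` for non-heading nodes keeps the transformer out of the way of Sanitize's own filtering, which is the safest default given the warning in the Processing section.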
+== Contributors
+
+The following lovely people have contributed to Sanitize in the form of patches
+or ideas that later became code:
+
+* Peter Cooper <git@peterc.org>
+* Gabe da Silveira <gabe@websaviour.com>
+* Ryan Grove <ryan@wonko.com>
+* Adam Hooper <adam@adamhooper.com>
+* Mutwin Kraus <mutle@blogage.de>
+* Dev Purkayastha <dev.purkayastha@gmail.com>
+* David Reese <work@whatcould.com>
+* Ben Wanicur <bwanicur@verticalresponse.com>
+
+== License
+
+Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the 'Software'), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+the Software, and to permit persons to whom the Software is furnished to do so,
+subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/lib/sanitize/config/basic.rb
ADDED
@@ -0,0 +1,49 @@
+#--
+# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the 'Software'), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#++
+
+class Sanitize
+  module Config
+    BASIC = {
+      :elements => [
+        'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
+        'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
+        'sup', 'u', 'ul'],
+
+      :attributes => {
+        'a' => ['href'],
+        'blockquote' => ['cite'],
+        'q' => ['cite']
+      },
+
+      :add_attributes => {
+        'a' => {'rel' => 'nofollow'}
+      },
+
+      :protocols => {
+        'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
+                           :relative]},
+        'blockquote' => {'cite' => ['http', 'https', :relative]},
+        'q' => {'cite' => ['http', 'https', :relative]}
+      }
+    }
+  end
+end
data/lib/sanitize/config/relaxed.rb
ADDED
@@ -0,0 +1,59 @@
+#--
+# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the 'Software'), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#++
+
+
+
+class Sanitize
+  module Config
+    RELAXED = {
+      :elements => [
+        'a', 'b', 'blockquote', 'br', 'caption', 'cite', 'code', 'col',
+        'colgroup', 'dd', 'dl', 'dt', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
+        'i', 'img', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong',
+        'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'u',
+        'ul'],
+
+      :attributes => {
+        'a' => ['href', 'title'],
+        'blockquote' => ['cite'],
+        'col' => ['span', 'width'],
+        'colgroup' => ['span', 'width'],
+        'img' => ['align', 'alt', 'height', 'src', 'title', 'width'],
+        'ol' => ['start', 'type'],
+        'q' => ['cite'],
+        'table' => ['summary', 'width'],
+        'td' => ['abbr', 'axis', 'colspan', 'rowspan', 'width'],
+        'th' => ['abbr', 'axis', 'colspan', 'rowspan', 'scope',
+                 'width'],
+        'ul' => ['type']
+      },
+
+      :protocols => {
+        'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
+                           :relative]},
+        'blockquote' => {'cite' => ['http', 'https', :relative]},
+        'img' => {'src' => ['http', 'https', :relative]},
+        'q' => {'cite' => ['http', 'https', :relative]}
+      }
+    }
+  end
+end
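Reviewer note: to see at a glance what RELAXED permits beyond BASIC, the two `:elements` lists from this diff can be compared with plain array subtraction. The constant names `BASIC_ELEMENTS`, `RELAXED_ELEMENTS`, and `ADDED_BY_RELAXED` are hypothetical; with the gem loaded, the same lists would presumably be read from `Sanitize::Config::BASIC[:elements]` and `Sanitize::Config::RELAXED[:elements]`.

```ruby
# Element lists copied from basic.rb and relaxed.rb in this diff.
BASIC_ELEMENTS = %w[
  a b blockquote br cite code dd dl dt em i li ol p pre q small strike
  strong sub sup u ul
]

RELAXED_ELEMENTS = %w[
  a b blockquote br caption cite code col colgroup dd dl dt em
  h1 h2 h3 h4 h5 h6 i img li ol p pre q small strike strong
  sub sup table tbody td tfoot th thead tr u ul
]

# Elements that only RELAXED allows, in RELAXED's own order.
ADDED_BY_RELAXED = RELAXED_ELEMENTS - BASIC_ELEMENTS

puts ADDED_BY_RELAXED.join(' ')
# caption col colgroup h1 h2 h3 h4 h5 h6 img table tbody td tfoot th thead tr

# BASIC is a strict subset, so nothing is lost when moving to RELAXED.
puts (BASIC_ELEMENTS - RELAXED_ELEMENTS).empty?  # true
```

The difference is exactly the table elements, `img`, and the `h1` through `h6` headings that the HISTORY entry for 1.2.0.dev mentions.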
data/lib/sanitize/config/restricted.rb
ADDED
@@ -0,0 +1,29 @@
+#--
+# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the 'Software'), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#++
+
+class Sanitize
+  module Config
+    RESTRICTED = {
+      :elements => ['b', 'em', 'i', 'strong', 'u']
+    }
+  end
+end
data/lib/sanitize/config.rb
ADDED
@@ -0,0 +1,55 @@
+#--
+# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+#
+# Permission is hereby granted, free of charge, to any person obtaining a copy
+# of this software and associated documentation files (the 'Software'), to deal
+# in the Software without restriction, including without limitation the rights
+# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+# copies of the Software, and to permit persons to whom the Software is
+# furnished to do so, subject to the following conditions:
+#
+# The above copyright notice and this permission notice shall be included in all
+# copies or substantial portions of the Software.
+#
+# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+# SOFTWARE.
+#++
+
+class Sanitize
+  module Config
+    DEFAULT = {
+      # Whether or not to allow HTML comments. Allowing comments is strongly
+      # discouraged, since IE allows script execution within conditional
+      # comments.
+      :allow_comments => false,
+
+      # HTML attributes to add to specific elements. By default, no attributes
+      # are added.
+      :add_attributes => {},
+
+      # HTML attributes to allow in specific elements. By default, no attributes
+      # are allowed.
+      :attributes => {},
+
+      # HTML elements to allow. By default, no elements are allowed (which means
+      # that all HTML will be stripped).
+      :elements => [],
+
+      # Output format. Supported formats are :html and :xhtml (which is the
+      # default).
+      :output => :xhtml,
+
+      # URL handling protocols to allow in specific attributes. By default, no
+      # protocols are allowed. Use :relative in place of a protocol if you want
+      # to allow relative URLs sans protocol.
+      :protocols => {},
+
+      :transformers => []
+    }
+  end
+end
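Reviewer note: `Sanitize#initialize` in this diff combines caller settings with `Config::DEFAULT.merge(config)` and then wraps `:transformers` in `Array(...)`. The sketch below mimics those two steps with a copy of the defaults so the consequences are easy to see; `DEFAULTS` and `build_config` are hypothetical names used only for this illustration.

```ruby
# Copy of the keys in Sanitize::Config::DEFAULT from this diff.
DEFAULTS = {
  :allow_comments => false,
  :add_attributes => {},
  :attributes     => {},
  :elements       => [],
  :output         => :xhtml,
  :protocols      => {},
  :transformers   => []
}

# Hypothetical helper mirroring the two lines at the top of
# Sanitize#initialize: a shallow merge, then Array() normalization.
def build_config(config = {})
  merged = DEFAULTS.merge(config)
  merged[:transformers] = Array(merged[:transformers])
  merged
end

noop   = lambda { |env| nil }
config = build_config(:elements => ['b'], :transformers => noop)

puts config[:elements].inspect       # ["b"]
puts config[:output].inspect         # :xhtml
puts config[:transformers].length    # 1
puts DEFAULTS[:elements].inspect     # []
```

Two points follow from this. The merge is shallow, so passing `:attributes` replaces the whole default Hash rather than being combined with it. And because of `Array(...)`, a single lambda and an Array of lambdas are both accepted for `:transformers`, which matches the two call styles shown in the README.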
data/lib/sanitize.rb
ADDED
@@ -0,0 +1,228 @@
|
|
1
|
+
# encoding: utf-8
|
2
|
+
#--
|
3
|
+
# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
|
4
|
+
#
|
5
|
+
# Permission is hereby granted, free of charge, to any person obtaining a copy
|
6
|
+
# of this software and associated documentation files (the 'Software'), to deal
|
7
|
+
# in the Software without restriction, including without limitation the rights
|
8
|
+
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
9
|
+
# copies of the Software, and to permit persons to whom the Software is
|
10
|
+
# furnished to do so, subject to the following conditions:
|
11
|
+
#
|
12
|
+
# The above copyright notice and this permission notice shall be included in all
|
13
|
+
# copies or substantial portions of the Software.
|
14
|
+
#
|
15
|
+
# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
16
|
+
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
17
|
+
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
18
|
+
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
19
|
+
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
20
|
+
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
|
21
|
+
# SOFTWARE.
|
22
|
+
#++
|
23
|
+
|
24
|
+
require 'nokogiri'
|
25
|
+
require 'sanitize/version'
|
26
|
+
require 'sanitize/config'
|
27
|
+
require 'sanitize/config/restricted'
|
28
|
+
require 'sanitize/config/basic'
|
29
|
+
require 'sanitize/config/relaxed'
|
30
|
+
|
31
|
+
class Sanitize
|
32
|
+
attr_reader :config
|
33
|
+
|
34
|
+
# Matches an attribute value that could be treated by a browser as a URL
|
35
|
+
# with a protocol prefix, such as "http:" or "javascript:". Any string of zero
|
36
|
+
# or more characters followed by a colon is considered a match, even if the
|
37
|
+
# colon is encoded as an entity and even if it's an incomplete entity (which
|
38
|
+
# IE6 and Opera will still parse).
|
39
|
+
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|�*58|�*3a)/i
|
40
|
+
|
41
|
+
#--
|
42
|
+
# Class Methods
|
43
|
+
#++
|
44
|
+
|
45
|
+
# Returns a sanitized copy of _html_, using the settings in _config_ if
|
46
|
+
# specified.
|
47
|
+
def self.clean(html, config = {})
|
48
|
+
sanitize = Sanitize.new(config)
|
49
|
+
sanitize.clean(html)
|
50
|
+
end
|
51
|
+
|
52
|
+
# Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
|
53
|
+
# were made.
|
54
|
+
def self.clean!(html, config = {})
|
55
|
+
sanitize = Sanitize.new(config)
|
56
|
+
sanitize.clean!(html)
|
57
|
+
end
|
58
|
+
|
59
|
+
# Sanitizes the specified Nokogiri::XML::Node and all its children.
|
60
|
+
def self.clean_node!(node, config = {})
|
61
|
+
sanitize = Sanitize.new(config)
|
62
|
+
sanitize.clean_node!(node)
|
63
|
+
end
|
64
|
+
|
65
|
+
#--
|
66
|
+
# Instance Methods
|
67
|
+
#++
|
68
|
+
|
69
|
+
# Returns a new Sanitize object initialized with the settings in _config_.
|
70
|
+
def initialize(config = {})
|
71
|
+
# Sanitize configuration.
|
72
|
+
@config = Config::DEFAULT.merge(config)
|
73
|
+
@config[:transformers] = Array(@config[:transformers])
|
74
|
+
|
75
|
+
# Specific nodes to whitelist (along with all their attributes). This array
|
76
|
+
# is generated at runtime by transformers, and is cleared before and after
|
77
|
+
# a fragment is cleaned (so it applies only to a specific fragment).
|
78
|
+
@whitelist_nodes = []
|
79
|
+
end
|
80
|
+
|
81
|
+
# Returns a sanitized copy of _html_.
|
82
|
+
def clean(html)
|
83
|
+
dupe = html.dup
|
84
|
+
clean!(dupe) || dupe
|
85
|
+
end
|
86
|
+
|
87
|
+
# Performs clean in place, returning _html_, or +nil+ if no changes were
|
88
|
+
# made.
|
89
|
+
def clean!(html)
|
90
|
+
@whitelist_nodes = []
|
91
|
+
fragment = Nokogiri::HTML::DocumentFragment.parse(html)
|
92
|
+
clean_node!(fragment)
|
93
|
+
@whitelist_nodes = []
|
94
|
+
|
95
|
+
output_method_params = {:encoding => 'utf-8', :indent => 0}
|
96
|
+
|
97
|
+
if @config[:output] == :xhtml
|
98
|
+
output_method = fragment.method(:to_xhtml)
|
99
|
+
output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
|
100
|
+
elsif @config[:output] == :html
|
101
|
+
output_method = fragment.method(:to_html)
|
102
|
+
else
|
103
|
+
raise Error, "unsupported output format: #{@config[:output]}"
|
104
|
+
end
|
105
|
+
|
106
|
+
result = output_method.call(output_method_params)
|
107
|
+
|
108
|
+
# Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
|
109
|
+
# string no matter what we ask for. This will be fixed in 1.4.0, but for
|
110
|
+
# now we have to hack around it to prevent errors.
|
111
|
+
result.force_encoding('utf-8') if RUBY_VERSION >= '1.9'
|
112
|
+
|
113
|
+
return result == html ? nil : html[0, html.length] = result
|
114
|
+
end
|
115
|
+
|
116
|
+
# Sanitizes the specified Nokogiri::XML::Node and all its children.
|
117
|
+
def clean_node!(node)
|
118
|
+
raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
|
119
|
+
|
120
|
+
node.traverse do |traversed_node|
|
121
|
+
if traversed_node.element?
|
122
|
+
clean_element!(traversed_node)
|
123
|
+
elsif traversed_node.comment?
|
124
|
+
traversed_node.unlink unless @config[:allow_comments]
|
125
|
+
elsif traversed_node.cdata?
|
126
|
+
traversed_node.replace(Nokogiri::XML::Text.new(traversed_node.text,
|
127
|
+
traversed_node.document))
|
128
|
+
end
|
129
|
+
end
|
130
|
+
|
131
|
+
node
|
132
|
+
end
|
133
|
+
|
134
|
+
private
|
135
|
+
|
136
|
+
def clean_element!(node)
|
137
|
+
# Run this node through all configured transformers.
|
138
|
+
transform = transform_element!(node)
|
139
|
+
|
140
|
+
# If this node is in the dynamic whitelist array (built at runtime by
|
141
|
+
# transformers), let it live with all of its attributes intact.
|
142
|
+
return if @whitelist_nodes.include?(node)
|
143
|
+
|
144
|
+
name = node.name.to_s.downcase
|
145
|
+
|
146
|
+
# Delete any element that isn't in the whitelist.
|
147
|
+
unless transform[:whitelist] || @config[:elements].include?(name)
|
148
|
+
node.children.each { |n| node.add_previous_sibling(n) }
|
149
|
+
node.unlink
|
150
|
+
return
|
151
|
+
end
|
152
|
+
|
153
|
+
attr_whitelist = (transform[:attr_whitelist] +
|
154
|
+
(@config[:attributes][name] || []) +
|
155
|
+
(@config[:attributes][:all] || [])).uniq
|
156
|
+
|
157
|
+
if attr_whitelist.empty?
|
158
|
+
# Delete all attributes from elements with no whitelisted attributes.
|
159
|
+
node.attribute_nodes.each {|attr| attr.remove }
|
160
|
+
else
|
161
|
+
# Delete any attribute that isn't in the whitelist for this element.
|
162
|
+
node.attribute_nodes.each do |attr|
|
163
|
+
attr.unlink unless attr_whitelist.include?(attr.name.downcase)
|
164
|
+
end
|
165
|
+
|
166
|
+
# Delete remaining attributes that use unacceptable protocols.
|
167
|
+
if @config[:protocols].has_key?(name)
|
168
|
+
protocol = @config[:protocols][name]
|
169
|
+
|
170
|
+
node.attribute_nodes.each do |attr|
|
171
|
+
attr_name = attr.name.downcase
|
172
|
+
next unless protocol.has_key?(attr_name)
|
173
|
+
|
174
|
+
del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
|
175
|
+
!protocol[attr_name].include?($1.downcase)
|
176
|
+
else
|
177
|
+
!protocol[attr_name].include?(:relative)
|
178
|
+
end
|
179
|
+
|
180
|
+
attr.unlink if del
|
181
|
+
end
|
182
|
+
end
|
183
|
+
end
|
184
|
+
|
185
|
+
# Add required attributes.
|
186
|
+
if @config[:add_attributes].has_key?(name)
|
187
|
+
@config[:add_attributes][name].each do |key, val|
|
188
|
+
node[key] = val
|
189
|
+
end
|
190
|
+
end
|
191
|
+
|
192
|
+
transform
|
193
|
+
end
|
194
|
+
|
195
|
+
def transform_element!(node)
|
196
|
+
output = {
|
197
|
+
:attr_whitelist => [],
|
198
|
+
:node => node,
|
199
|
+
:whitelist => false
|
200
|
+
}
|
201
|
+
|
202
|
+
@config[:transformers].inject(node) do |transformer_node, transformer|
|
203
|
+
transform = transformer.call({
|
204
|
+
:config => @config,
|
205
|
+
:node => transformer_node
|
206
|
+
})
|
207
|
+
|
208
|
+
if transform.nil?
|
209
|
+
transformer_node
|
210
|
+
elsif transform.is_a?(Hash)
|
211
|
+
if transform[:whitelist_nodes].is_a?(Array)
|
212
|
+
@whitelist_nodes += transform[:whitelist_nodes]
|
213
|
+
@whitelist_nodes.uniq!
|
214
|
+
end
|
215
|
+
|
216
|
+
output[:attr_whitelist] += transform[:attr_whitelist] if transform[:attr_whitelist].is_a?(Array)
|
217
|
+
output[:whitelist] ||= true if transform[:whitelist]
|
218
|
+
output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
|
219
|
+
else
|
220
|
+
raise Error, "transformer output must be a Hash or nil"
|
221
|
+
end
|
222
|
+
end
|
223
|
+
|
224
|
+
node.replace(output[:node]) if node != output[:node]
|
225
|
+
|
226
|
+
return output
|
227
|
+
end
|
228
|
+
end
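The protocol check inside `clean_element!` can be read in isolation as a small predicate. The following is a minimal sketch, not the gem's code: `PROTOCOL_SKETCH` is a simplified stand-in for `REGEX_PROTOCOL` (whose definition does not appear in this part of the diff), and `protocol_allowed?` is a hypothetical helper name introduced only for illustration.

```ruby
# Simplified stand-in for REGEX_PROTOCOL (assumption): captures the text
# before the first colon, provided no "/", "?" or "#" precedes that colon.
PROTOCOL_SKETCH = /\A([^:\/?#]*):/

# Hypothetical helper mirroring the branch in clean_element!: returns true
# when the attribute may stay and false when it would be unlinked.
def protocol_allowed?(value, allowed)
  if value.to_s.downcase =~ PROTOCOL_SKETCH
    # A scheme was found; it must be listed explicitly.
    allowed.include?($1)
  else
    # No scheme: the value is relative and needs the :relative marker.
    allowed.include?(:relative)
  end
end

ALLOWED_SKETCH = ['http', 'https', :relative]

protocol_allowed?('HTTP://example.com/', ALLOWED_SKETCH)  # scheme is listed
protocol_allowed?('javascript:alert(1)', ALLOWED_SKETCH)  # scheme is not listed
protocol_allowed?('/foo/bar:baz', ALLOWED_SKETCH)         # colon in path, treated as relative
```

The third call illustrates why a colon inside a path segment should not be mistaken for a scheme separator, which is the case the 1.1.0 HISTORY entry mentions.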
|
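Judging from `transform_element!`, a transformer is any object that responds to `call`. It receives a Hash with `:config` and `:node`, and returns either `nil` or a Hash that may carry `:whitelist`, `:attr_whitelist`, `:whitelist_nodes` and `:node`. The sketch below illustrates that contract under stated assumptions: `FakeNode` is a hypothetical stand-in for a Nokogiri element so the example runs without Nokogiri, `ALLOW_VIDEO` and `merge_transforms` are illustrative names, and `ArgumentError` replaces the gem's own `Error` class. Node replacement and `:whitelist_nodes` handling are deliberately left out.

```ruby
# Hypothetical stand-in for a Nokogiri element (assumption). Real
# transformers receive a Nokogiri::XML::Node in env[:node].
FakeNode = Struct.new(:name)

# Example transformer: whitelists <video> elements together with two
# attributes, and returns nil (no opinion) for every other element.
ALLOW_VIDEO = lambda do |env|
  if env[:node].name.to_s.downcase == 'video'
    { :whitelist => true, :attr_whitelist => %w[src controls] }
  end
end

# Simplified version of the merge performed in transform_element!:
# collects :attr_whitelist entries and ORs together the :whitelist flags.
def merge_transforms(transformers, node, config = {})
  output = { :attr_whitelist => [], :whitelist => false }

  transformers.each do |transformer|
    result = transformer.call({ :config => config, :node => node })
    next if result.nil?

    unless result.is_a?(Hash)
      raise ArgumentError, 'transformer output must be a Hash or nil'
    end

    if result[:attr_whitelist].is_a?(Array)
      output[:attr_whitelist] += result[:attr_whitelist]
    end
    output[:whitelist] = true if result[:whitelist]
  end

  output
end

merge_transforms([ALLOW_VIDEO], FakeNode.new('video'))  # flag set, attributes collected
merge_transforms([ALLOW_VIDEO], FakeNode.new('p'))      # defaults left untouched
```

Because every transformer contributes to the same result Hash, a single transformer returning `:whitelist => true` is enough to keep an element that the static `:elements` list would otherwise strip.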
metadata
ADDED
@@ -0,0 +1,92 @@
|
|
1
|
+
--- !ruby/object:Gem::Specification
|
2
|
+
name: darkhelmet-sanitize
|
3
|
+
version: !ruby/object:Gem::Version
|
4
|
+
version: 1.2.0.dev.20091104
|
5
|
+
platform: ruby
|
6
|
+
authors:
|
7
|
+
- Ryan Grove
|
8
|
+
autorequire:
|
9
|
+
bindir: bin
|
10
|
+
cert_chain: []
|
11
|
+
|
12
|
+
date: 2009-11-04 00:00:00 -07:00
|
13
|
+
default_executable:
|
14
|
+
dependencies:
|
15
|
+
- !ruby/object:Gem::Dependency
|
16
|
+
name: nokogiri
|
17
|
+
type: :runtime
|
18
|
+
version_requirement:
|
19
|
+
version_requirements: !ruby/object:Gem::Requirement
|
20
|
+
requirements:
|
21
|
+
- - ~>
|
22
|
+
- !ruby/object:Gem::Version
|
23
|
+
version: 1.4.0
|
24
|
+
version:
|
25
|
+
- !ruby/object:Gem::Dependency
|
26
|
+
name: bacon
|
27
|
+
type: :development
|
28
|
+
version_requirement:
|
29
|
+
version_requirements: !ruby/object:Gem::Requirement
|
30
|
+
requirements:
|
31
|
+
- - ~>
|
32
|
+
- !ruby/object:Gem::Version
|
33
|
+
version: 1.1.0
|
34
|
+
version:
|
35
|
+
- !ruby/object:Gem::Dependency
|
36
|
+
name: rake
|
37
|
+
type: :development
|
38
|
+
version_requirement:
|
39
|
+
version_requirements: !ruby/object:Gem::Requirement
|
40
|
+
requirements:
|
41
|
+
- - ~>
|
42
|
+
- !ruby/object:Gem::Version
|
43
|
+
version: 0.8.0
|
44
|
+
version:
|
45
|
+
description:
|
46
|
+
email: ryan@wonko.com
|
47
|
+
executables: []
|
48
|
+
|
49
|
+
extensions: []
|
50
|
+
|
51
|
+
extra_rdoc_files: []
|
52
|
+
|
53
|
+
files:
|
54
|
+
- HISTORY
|
55
|
+
- LICENSE
|
56
|
+
- README.rdoc
|
57
|
+
- lib/sanitize/config/basic.rb
|
58
|
+
- lib/sanitize/config/relaxed.rb
|
59
|
+
- lib/sanitize/config/restricted.rb
|
60
|
+
- lib/sanitize/config.rb
|
61
|
+
- lib/sanitize/version.rb
|
62
|
+
- lib/sanitize.rb
|
63
|
+
has_rdoc: true
|
64
|
+
homepage: http://github.com/rgrove/sanitize/
|
65
|
+
licenses: []
|
66
|
+
|
67
|
+
post_install_message:
|
68
|
+
rdoc_options: []
|
69
|
+
|
70
|
+
require_paths:
|
71
|
+
- lib
|
72
|
+
required_ruby_version: !ruby/object:Gem::Requirement
|
73
|
+
requirements:
|
74
|
+
- - ">="
|
75
|
+
- !ruby/object:Gem::Version
|
76
|
+
version: 1.8.6
|
77
|
+
version:
|
78
|
+
required_rubygems_version: !ruby/object:Gem::Requirement
|
79
|
+
requirements:
|
80
|
+
- - ">"
|
81
|
+
- !ruby/object:Gem::Version
|
82
|
+
version: 1.3.1
|
83
|
+
version:
|
84
|
+
requirements: []
|
85
|
+
|
86
|
+
rubyforge_project: riposte
|
87
|
+
rubygems_version: 1.3.5
|
88
|
+
signing_key:
|
89
|
+
specification_version: 3
|
90
|
+
summary: Whitelist-based HTML sanitizer.
|
91
|
+
test_files: []
|
92
|
+
|
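The metadata above pins nokogiri with the pessimistic operator (`~> 1.4.0`), whereas the HISTORY entry reads "Requires Nokogiri >= 1.4.0". The two constraints are not equivalent. A short sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes shows the difference; the constant names and the sample version numbers are chosen only for illustration.

```ruby
require 'rubygems'

# "~> 1.4.0" accepts 1.4.x releases only; ">= 1.4.0" has no upper bound.
PESSIMISTIC = Gem::Requirement.new('~> 1.4.0')
OPEN_ENDED  = Gem::Requirement.new('>= 1.4.0')

%w[1.3.3 1.4.0 1.4.3 1.5.0].each do |v|
  version = Gem::Version.new(v)
  puts format('%-6s ~> 1.4.0: %-5s  >= 1.4.0: %s',
              v,
              PESSIMISTIC.satisfied_by?(version),
              OPEN_ENDED.satisfied_by?(version))
end
```

Under the gemspec as published, a later 1.5 release of nokogiri would not satisfy the dependency even though it would satisfy the requirement as worded in HISTORY.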