dasil003-sanitize 1.1.0
- data/HISTORY +65 -0
- data/LICENSE +18 -0
- data/README.rdoc +212 -0
- data/lib/sanitize.rb +188 -0
- data/lib/sanitize/config.rb +75 -0
- data/lib/sanitize/config/basic.rb +49 -0
- data/lib/sanitize/config/relaxed.rb +56 -0
- data/lib/sanitize/config/restricted.rb +29 -0
- data/lib/sanitize/version.rb +3 -0
- metadata +93 -0
data/HISTORY
ADDED
@@ -0,0 +1,65 @@
Sanitize History
================================================================================

Version 1.1.0 (2009-10-11)
  * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
  * Added an :output config setting to allow the output format to be specified.
    Supported formats are :xhtml (the default) and :html (which outputs HTML4).
  * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
    path segments. [Peter Cooper]

Version 1.0.8 (2009-04-23)
  * Added a workaround for an Hpricot bug that prevents attribute names from
    being downcased in recent versions of Hpricot. This was exploitable to
    prevent non-whitelisted protocols from being cleaned. [Reported by Ben
    Wanicur]

Version 1.0.7 (2009-04-11)
  * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
  * Fixed a bug that caused named character entities containing digits (like
    &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
    Steinmetz]

Version 1.0.6 (2009-02-23)
  * Removed htmlentities gem dependency.
  * Existing well-formed character entity references in the input string are now
    preserved rather than being decoded and re-encoded.
  * The ' character is now encoded as &#39; instead of &apos; to prevent
    problems in IE6.
  * You can now specify the symbol :all in place of an element name in the
    attributes config hash to allow certain attributes on all elements. [Thanks
    to Mutwin Kraus]

Version 1.0.5 (2009-02-05)
  * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
    protocols from being cleaned when relative URLs were allowed. [Reported by
    Dev Purkayastha]
  * Fixed "undefined method `parent='" exceptions caused by parser changes in
    edge Hpricot.

Version 1.0.4 (2009-01-16)
  * Fixed a bug that made it possible to sneak a non-whitelisted element through
    by repeating it several times in a row. All versions of Sanitize prior to
    1.0.4 are vulnerable. [Reported by Cristobal]

Version 1.0.3 (2009-01-15)
  * Fixed a bug whereby incomplete Unicode or hex entities could be used to
    prevent non-whitelisted protocols from being cleaned. Since IE6 and Opera
    still decode the incomplete entities, users of those browsers may be
    vulnerable to malicious script injection on websites using versions of
    Sanitize prior to 1.0.3.

Version 1.0.2 (2009-01-04)
  * Fixed a bug that caused an exception to be thrown when parsing a valueless
    attribute that's expected to contain a URL.

Version 1.0.1 (2009-01-01)
  * You can now specify :relative in a protocol config array to allow attributes
    containing relative URLs with no protocol. The Basic and Relaxed configs
    have been updated to allow relative URLs.
  * Added a workaround for an Hpricot bug that causes HTML entities for
    non-ASCII characters to be replaced by question marks, and all other
    entities to be destructively decoded.

Version 1.0.0 (2008-12-25)
  * First release.
data/LICENSE
ADDED
@@ -0,0 +1,18 @@
Copyright (c) 2009 Ryan Grove <ryan@wonko.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the 'Software'), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc
ADDED
@@ -0,0 +1,212 @@
= Sanitize

Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
elements and attributes, Sanitize will remove all unacceptable HTML from a
string.

Using a simple configuration syntax, you can tell Sanitize to allow certain
elements, certain attributes within those elements, and even certain URL
protocols within attributes that contain URLs. Any HTML elements or attributes
that you don't explicitly allow will be removed.

Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
of fragile regular expressions, Sanitize has no trouble dealing with malformed
or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
caution.

*Author*::    Ryan Grove (mailto:ryan@wonko.com)
*Version*::   1.1.0 (2009-10-11)
*Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
*License*::   MIT License (http://opensource.org/licenses/mit-license.php)
*Website*::   http://github.com/rgrove/sanitize

== Requires

* Nokogiri
* libxml2 >= 2.7.2

== Installation

Latest stable release:

  gem install sanitize

Latest development version:

  gem install sanitize -s http://gemcutter.org --prerelease

== Usage

If you don't specify any configuration options, Sanitize will use its strictest
settings by default, which means it will strip all HTML.

  require 'rubygems'
  require 'sanitize'

  html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

  Sanitize.clean(html) # => 'foo'

== Configuration

In addition to the ultra-safe default settings, Sanitize comes with three other
built-in modes.

=== Sanitize::Config::RESTRICTED

Allows only very simple inline formatting markup. No links, images, or block
elements.

  Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'

=== Sanitize::Config::BASIC

Allows a variety of markup including formatting tags, links, and lists. Images
and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
protocols (relative URLs are also allowed), and a <code>rel="nofollow"</code>
attribute is added to all links to mitigate SEO spam.

  Sanitize.clean(html, Sanitize::Config::BASIC)
  # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'

=== Sanitize::Config::RELAXED

Allows an even wider variety of markup than BASIC, including images and tables.
Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
are limited to HTTP and HTTPS (relative URLs are allowed for both). In this
mode, <code>rel="nofollow"</code> is not added to links.

  Sanitize.clean(html, Sanitize::Config::RELAXED)
  # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'

=== Custom Configuration

If the built-in modes don't meet your needs, you can easily specify a custom
configuration:

  Sanitize.clean(html, :elements => ['a', 'span'],
      :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
      :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
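
As <code>Sanitize#initialize</code> in lib/sanitize.rb shows, a custom
configuration is combined with the defaults through
<code>Config::DEFAULT.merge(config)</code>. Because <code>Hash#merge</code> is
shallow, each top-level key you pass replaces the default for that key
wholesale, and keys you omit keep their defaults. The sketch below illustrates
this with a trimmed stand-in for the defaults; the <code>DEFAULTS</code>
constant is an assumption made for the example, not the gem's own constant.

```ruby
# Stand-in for a few keys of Sanitize::Config::DEFAULT (illustrative only).
DEFAULTS = {
  :elements   => [],
  :attributes => {},
  :protocols  => {},
  :output     => :xhtml
}.freeze

custom = {
  :elements   => ['a', 'span'],
  :attributes => {'a' => ['href', 'title'], 'span' => ['class']}
}

# Hash#merge is shallow: keys present in `custom` replace the defaults
# outright, and keys that are absent keep their default values.
effective = DEFAULTS.merge(custom)

puts effective[:elements].inspect   # the custom element list
puts effective[:output].inspect     # still the default output format
puts effective[:protocols].inspect  # still the default (empty) protocol map
```

The same shallow behavior would apply when merging onto one of the built-in
modes: passing <code>:elements</code> there replaces that mode's whole element
list rather than appending to it.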

==== :elements

Array of element names to allow. Specify all names in lowercase.

  :elements => [
    'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
    'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
    'sup', 'u', 'ul'
  ]

==== :attributes

Attributes to allow for specific elements. Specify all element names and
attributes in lowercase.

  :attributes => {
    'a'          => ['href', 'title'],
    'blockquote' => ['cite'],
    'img'        => ['alt', 'src', 'title']
  }

If you'd like to allow certain attributes on all elements, use the symbol
<code>:all</code> instead of an element name.

  :attributes => {
    :all => ['class'],
    'a'  => ['href', 'title']
  }
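
The effective whitelist for an element is the union of its own entry and the
<code>:all</code> entry, which is how <code>Sanitize#clean!</code> computes it.
A minimal sketch of that lookup follows; the helper name
<code>allowed_attributes</code> is hypothetical and exists only for this
example.

```ruby
# Hypothetical helper mirroring the whitelist expression in Sanitize#clean!.
def allowed_attributes(attributes, element)
  ((attributes[element] || []) + (attributes[:all] || [])).uniq
end

attributes = {
  :all => ['class'],
  'a'  => ['href', 'title']
}

# 'a' gets its own attributes plus the ones allowed everywhere.
puts allowed_attributes(attributes, 'a').inspect

# An element with no entry of its own still gets the :all attributes.
puts allowed_attributes(attributes, 'span').inspect
```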

==== :add_attributes

Attributes to add to specific elements. If the attribute already exists, it will
be replaced with the value specified here. Specify all element names and
attributes in lowercase.

  :add_attributes => {
    'a' => {'rel' => 'nofollow'}
  }
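
In <code>Sanitize#clean!</code> the added attributes are written after the
whitelist filter has run, so they do not need to appear in
<code>:attributes</code> (BASIC, for instance, whitelists only
<code>href</code> on links yet still adds <code>rel</code>). The sketch below
models an element's attributes as a plain Hash to show that ordering; the
helper <code>filter_then_add</code> is an assumption for illustration, not part
of the gem.

```ruby
# Hypothetical stand-in: an element's attributes modelled as a plain Hash.
def filter_then_add(attrs, whitelist, added)
  # Step 1: drop every attribute that is not whitelisted.
  kept = attrs.select { |name, _value| whitelist.include?(name.downcase) }
  # Step 2: write the added attributes, replacing any existing values.
  kept.merge(added)
end

attrs  = {'href' => 'http://foo.com/', 'rel' => 'author', 'onclick' => 'evil()'}
result = filter_then_add(attrs, ['href'], {'rel' => 'nofollow'})

puts result.inspect
```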

==== :protocols

URL protocols to allow in specific attributes. If an attribute is listed here
and contains a protocol other than those specified (or if it contains no
protocol at all), it will be removed.

  :protocols => {
    'a'   => {'href' => ['ftp', 'http', 'https', 'mailto']},
    'img' => {'src'  => ['http', 'https']}
  }

If you'd like to allow the use of relative URLs which don't have a protocol,
include the symbol <code>:relative</code> in the protocol array:

  :protocols => {
    'a' => {'href' => ['http', 'https', :relative]}
  }
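
The decision made for each protocol-restricted attribute can be sketched
without Nokogiri. The example below reuses the <code>REGEX_PROTOCOL</code>
pattern from lib/sanitize.rb, with the entity-encoded colon alternatives
written out as <code>&#0*58</code> and <code>&#x0*3a</code> (my reading of the
pattern); the helper <code>protocol_allowed?</code> is hypothetical.

```ruby
# Pattern from lib/sanitize.rb; the last two alternatives match a colon that
# is encoded as a (possibly incomplete) decimal or hex entity.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper: true when `value` may stay in an attribute whose
# allowed protocols are listed in `allowed`.
def protocol_allowed?(value, allowed)
  if value.to_s.downcase =~ REGEX_PROTOCOL
    allowed.include?($1.downcase)
  else
    # No protocol prefix was found, so this is treated as a relative URL.
    allowed.include?(:relative)
  end
end

allowed = ['http', 'https', :relative]

puts protocol_allowed?('http://example.com/', allowed)      # allowed protocol
puts protocol_allowed?('javascript:alert(1)', allowed)      # disallowed protocol
puts protocol_allowed?('javascript&#58;alert(1)', allowed)  # entity-encoded colon
puts protocol_allowed?('/wiki/Talk:Main', allowed)          # colon in a path segment
```

The last call shows why a colon later in a path does not count as a protocol:
the pattern is anchored at the start of the value and <code>/</code> is not in
its character class, so the value falls through to the <code>:relative</code>
check.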

==== :object_urls

URL prefixes for which Flash embed codes are allowed. This can be used to allow
standard video embeds such as those provided by YouTube:

  :object_urls => ['http://www.youtube.com']

*Warning*: Do not under any circumstances add 'object' or 'embed' to the
standard config. It is unnecessary and will open an XSS hole.
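
Each entry in <code>:object_urls</code> is compared as a plain string prefix
(<code>object_url.index(good) == 0</code> in lib/sanitize.rb). The sketch below
shows what that implies for a prefix without a trailing slash; the helper name
and the <code>example.net</code> host are made up for the example, and ending
the prefix with a slash is a suggestion of mine rather than something the
README prescribes.

```ruby
# Hypothetical helper mirroring the `object_url.index(good) == 0` test
# in Sanitize#clean!.
def object_url_allowed?(object_url, object_urls)
  object_urls.any? { |good| object_url.index(good) == 0 }
end

loose  = ['http://www.youtube.com']
strict = ['http://www.youtube.com/']

# A genuine YouTube embed URL matches either prefix.
puts object_url_allowed?('http://www.youtube.com/v/qVaEPx_VyXs', loose)

# A look-alike host also starts with the slash-less prefix...
puts object_url_allowed?('http://www.youtube.com.example.net/v/x', loose)

# ...but not with the prefix that ends in a slash.
puts object_url_allowed?('http://www.youtube.com.example.net/v/x', strict)
```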

Because object tags are more complex than most other tags and include many
XSS attack vectors, this functionality follows a completely different code path
from the regular filtering. A secondary configuration variable controls what is
allowed on the object tag and its descendants. By default it is:

  :object_config => {
    :elements => ['object', 'param', 'embed'],
    :attributes => {
      'object' => ['width', 'height'],
      'param'  => ['name', 'value'],
      'embed'  => ['src', 'type', 'allowscriptaccess', 'allowfullscreen',
                   'width', 'height']
    }}

This config is applied to the object tag and all of its immediate descendants
instead of the standard config. This initial configuration was crafted
specifically to allow YouTube and Vimeo embed codes in this format:

  <object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/qVaEPx_VyXs&hl=en&fs=1&color1=0xcc2550&color2=0xe87a9f"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/qVaEPx_VyXs&hl=en&fs=1&color1=0xcc2550&color2=0xe87a9f" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object>


== Contributors

The following lovely people have contributed to Sanitize in the form of patches
or ideas that later became code:

* Peter Cooper <git@peterc.org>
* Gabe da Silveira <gabe@websaviour.com>
* Ryan Grove <ryan@wonko.com>
* Adam Hooper <adam@adamhooper.com>
* Mutwin Kraus <mutle@blogage.de>
* Dev Purkayastha <dev.purkayastha@gmail.com>
* Ben Wanicur <bwanicur@verticalresponse.com>

== License

Copyright (c) 2009 Ryan Grove <ryan@wonko.com>

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the 'Software'), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/lib/sanitize.rb
ADDED
@@ -0,0 +1,188 @@
# encoding: utf-8
#--
# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the 'Software'), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#++

require 'nokogiri'
require 'sanitize/version'
require 'sanitize/config'
require 'sanitize/config/restricted'
require 'sanitize/config/basic'
require 'sanitize/config/relaxed'

class Sanitize

  # Matches an attribute value that could be treated by a browser as a URL
  # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
  # or more characters followed by a colon is considered a match, even if the
  # colon is encoded as an entity and even if it's an incomplete entity (which
  # IE6 and Opera will still parse).
  REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

  #--
  # Instance Methods
  #++

  # Returns a new Sanitize object initialized with the settings in _config_.
  def initialize(config = {})
    @config = Config::DEFAULT.merge(config)
  end

  # Returns a sanitized copy of _html_.
  def clean(html)
    dupe = html.dup
    clean!(dupe) || dupe
  end

  # Performs clean in place, returning _html_, or +nil+ if no changes were
  # made.
  def clean!(html)
    fragment = Nokogiri::HTML::DocumentFragment.parse(html)

    fragment.traverse do |node|
      if node.comment?
        node.unlink unless @config[:allow_comments]
      elsif node.element?
        name = node.name.to_s.downcase
        parent_name = node.parent ? node.parent.name.to_s.downcase : nil

        # Special handling of objects is necessary to limit by specific domains.
        if @config[:object_urls].any? &&
            [name, parent_name].include?('object')
          unless @config[:object_config][:elements].include?(name)
            node.unlink
            next
          end

          attr_whitelist = @config[:object_config][:attributes][name] || []

          # Remove non-whitelisted object interior tag attributes
          node.attribute_nodes.each do |attr|
            attr.unlink unless attr_whitelist.include?(attr.name.downcase)
          end

          # Remove non-whitelisted object URLs.
          object_url = if name == 'param' && node['name'] == 'movie'
            node['value']
          elsif name == 'embed'
            node['src']
          end

          if object_url &&
              !@config[:object_urls].any?{|good| object_url.index(good) == 0}
            node.parent.unlink
          end

          next
        end

        # Delete any element that isn't in the whitelist.
        unless @config[:elements].include?(name)
          node.children.each { |n| node.add_previous_sibling(n) }
          node.unlink
          next
        end

        attr_whitelist = ((@config[:attributes][name] || []) +
            (@config[:attributes][:all] || [])).uniq

        if attr_whitelist.empty?
          # Delete all attributes from elements with no whitelisted
          # attributes.
          node.attribute_nodes.each { |attr| attr.remove }
        else
          # Delete any attribute that isn't in the whitelist for this element.
          node.attribute_nodes.each do |attr|
            attr.unlink unless attr_whitelist.include?(attr.name.downcase)
          end

          # Delete remaining attributes that use unacceptable protocols.
          if @config[:protocols].has_key?(name)
            protocol = @config[:protocols][name]

            node.attribute_nodes.each do |attr|
              attr_name = attr.name.downcase
              next false unless protocol.has_key?(attr_name)

              del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
                !protocol[attr_name].include?($1.downcase)
              else
                !protocol[attr_name].include?(:relative)
              end

              attr.unlink if del
            end
          end
        end

        # Add required attributes.
        if @config[:add_attributes].has_key?(name)
          @config[:add_attributes][name].each do |key, val|
            node[key] = val
          end
        end
      elsif node.cdata?
        node.replace(Nokogiri::XML::Text.new(node.text, node.document))
      end
    end

    if @config[:output] == :xhtml
      output_method = fragment.method(:to_xhtml)
    elsif @config[:output] == :html
      output_method = fragment.method(:to_html)
    else
      raise ArgumentError, "unsupported output format: #{@config[:output]}"
    end

    if RUBY_VERSION >= '1.9'
      # Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
      # string no matter what we ask for. This will be fixed in 1.4.0, but for
      # now we have to hack around it to prevent errors.
      result = output_method.call(:encoding => 'utf-8', :indent => 0).force_encoding('utf-8')
      result.gsub!(">\n", '>')
    else
      result = output_method.call(:encoding => 'utf-8', :indent => 0).gsub(">\n", '>')
    end

    return result == html ? nil : html[0, html.length] = result
  end

  #--
  # Class Methods
  #++

  class << self
    # Returns a sanitized copy of _html_, using the settings in _config_ if
    # specified.
    def clean(html, config = {})
      sanitize = Sanitize.new(config)
      sanitize.clean(html)
    end

    # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
    # were made.
    def clean!(html, config = {})
      sanitize = Sanitize.new(config)
      sanitize.clean!(html)
    end
  end

end
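The `clean` / `clean!` pair above follows the same convention as Ruby's own bang methods: the in-place version returns `nil` when nothing changed, and the copying version falls back to the untouched duplicate. A minimal sketch of that shape, using `String#squeeze!` as a toy stand-in for sanitizing (the `tidy` and `tidy!` names are hypothetical and unrelated to the gem):

```ruby
# Toy stand-in for Sanitize#clean!: collapses runs of spaces in place and,
# like String#squeeze!, returns nil when nothing changed.
def tidy!(text)
  text.squeeze!(' ')
end

# Non-destructive wrapper with the same shape as Sanitize#clean.
def tidy(text)
  dupe = text.dup
  tidy!(dupe) || dupe
end

original = 'a  b'.dup
puts tidy(original).inspect        # cleaned copy
puts original.inspect              # the caller's string is untouched
puts tidy!('a b'.dup).inspect      # nil, because there was nothing to change
```

Callers of `clean!` therefore should not chain on its return value without a `nil` check; `clean` is the safer choice when the result is used directly.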
data/lib/sanitize/config.rb
ADDED
@@ -0,0 +1,75 @@
#--
# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the 'Software'), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#++

class Sanitize
  module Config
    FLASH_VIDEO_OBJECT = {
      :elements => ['object', 'param', 'embed'],
      :attributes => {
        'object' => ['width', 'height'],
        'param'  => ['name', 'value'],
        'embed'  => ['src', 'type', 'allowscriptaccess', 'allowfullscreen',
                     'width', 'height']
      }
    }

    DEFAULT = {
      # Whether or not to allow HTML comments. Allowing comments is strongly
      # discouraged, since IE allows script execution within conditional
      # comments.
      :allow_comments => false,

      # HTML attributes to add to specific elements. By default, no attributes
      # are added.
      :add_attributes => {},

      # HTML attributes to allow in specific elements. By default, no attributes
      # are allowed.
      :attributes => {},

      # HTML elements to allow. By default, no elements are allowed (which means
      # that all HTML will be stripped).
      :elements => [],

      # URL prefixes to be allowed in object embeds. Note that any kind of
      # arbitrary object embed would be insecure, therefore this is locked down
      # pretty tight to allow only YouTube-style embed codes. Under no
      # circumstances should you add 'object' to the allowed elements above;
      # objects are handled by a separate code path in the sanitizer. You must
      # include the fully qualified URL, including the protocol, since each
      # prefix is matched directly against the attribute value.
      :object_urls => [],

      # This specifies the elements and attributes on an object and its immediate
      # descendants. The default configuration is for standard flash video embeds.
      :object_config => FLASH_VIDEO_OBJECT,

      # Output format. Supported formats are :html and :xhtml (which is the
      # default).
      :output => :xhtml,

      # URL handling protocols to allow in specific attributes. By default, no
      # protocols are allowed. Use :relative in place of a protocol if you want
      # to allow relative URLs sans protocol.
      :protocols => {}
    }
  end
end
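With the defaults above, `:object_urls` is empty, so the separate object code path in `Sanitize#clean!` never runs and `object`, `param`, and `embed` are handled (and stripped) like any other non-whitelisted element. A small sketch of the guard that selects that path; `object_path?` is a hypothetical name for the condition, which the gem writes inline:

```ruby
# Hypothetical predicate mirroring the guard at the top of the object branch
# in Sanitize#clean!: the path is taken only when :object_urls is non-empty
# and the node is an <object> or a direct child of one.
def object_path?(object_urls, name, parent_name)
  object_urls.any? && [name, parent_name].include?('object')
end

urls = ['http://www.youtube.com/']

puts object_path?([], 'object', 'div')      # default config: path disabled
puts object_path?(urls, 'object', 'div')    # the <object> element itself
puts object_path?(urls, 'embed', 'object')  # an immediate child of <object>
puts object_path?(urls, 'embed', 'div')     # an <embed> outside any <object>
```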
data/lib/sanitize/config/basic.rb
ADDED
@@ -0,0 +1,49 @@
#--
# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the 'Software'), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#++

class Sanitize
  module Config
    BASIC = {
      :elements => [
        'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
        'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
        'sup', 'u', 'ul'],

      :attributes => {
        'a'          => ['href'],
        'blockquote' => ['cite'],
        'q'          => ['cite']
      },

      :add_attributes => {
        'a' => {'rel' => 'nofollow'}
      },

      :protocols => {
        'a'          => {'href' => ['ftp', 'http', 'https', 'mailto',
                                    :relative]},
        'blockquote' => {'cite' => ['http', 'https', :relative]},
        'q'          => {'cite' => ['http', 'https', :relative]}
      }
    }
  end
end
data/lib/sanitize/config/relaxed.rb
ADDED
@@ -0,0 +1,56 @@
#--
# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the 'Software'), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#++

class Sanitize
  module Config
    RELAXED = {
      :elements => [
        'a', 'b', 'blockquote', 'br', 'caption', 'cite', 'code', 'col',
        'colgroup', 'dd', 'dl', 'dt', 'em', 'i', 'img', 'li', 'ol', 'p', 'pre',
        'q', 'small', 'strike', 'strong', 'sub', 'sup', 'table', 'tbody', 'td',
        'tfoot', 'th', 'thead', 'tr', 'u', 'ul'],

      :attributes => {
        'a'          => ['href', 'title'],
        'blockquote' => ['cite'],
        'col'        => ['span', 'width'],
        'colgroup'   => ['span', 'width'],
        'img'        => ['align', 'alt', 'height', 'src', 'title', 'width'],
        'ol'         => ['start', 'type'],
        'q'          => ['cite'],
        'table'      => ['summary', 'width'],
        'td'         => ['abbr', 'axis', 'colspan', 'rowspan', 'width'],
        'th'         => ['abbr', 'axis', 'colspan', 'rowspan', 'scope',
                         'width'],
        'ul'         => ['type']
      },

      :protocols => {
        'a'          => {'href' => ['ftp', 'http', 'https', 'mailto',
                                    :relative]},
        'blockquote' => {'cite' => ['http', 'https', :relative]},
        'img'        => {'src'  => ['http', 'https', :relative]},
        'q'          => {'cite' => ['http', 'https', :relative]}
      }
    }
  end
end
data/lib/sanitize/config/restricted.rb
ADDED
@@ -0,0 +1,29 @@
#--
# Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the 'Software'), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#++

class Sanitize
  module Config
    RESTRICTED = {
      :elements => ['b', 'em', 'i', 'strong', 'u']
    }
  end
end
metadata
ADDED
@@ -0,0 +1,93 @@
--- !ruby/object:Gem::Specification
name: dasil003-sanitize
version: !ruby/object:Gem::Version
  version: 1.1.0
platform: ruby
authors:
- Ryan Grove
- Gabe da Silveira
autorequire:
bindir: bin
cert_chain: []

date: 2009-10-13 00:00:00 -07:00
default_executable:
dependencies:
- !ruby/object:Gem::Dependency
  name: nokogiri
  type: :runtime
  version_requirement:
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: 1.3.3
    version:
- !ruby/object:Gem::Dependency
  name: bacon
  type: :development
  version_requirement:
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: 1.1.0
    version:
- !ruby/object:Gem::Dependency
  name: rake
  type: :development
  version_requirement:
  version_requirements: !ruby/object:Gem::Requirement
    requirements:
    - - ~>
      - !ruby/object:Gem::Version
        version: 0.8.0
    version:
description:
email: gabe@websaviour.com
executables: []

extensions: []

extra_rdoc_files: []

files:
- HISTORY
- LICENSE
- README.rdoc
- lib/sanitize/config/basic.rb
- lib/sanitize/config/relaxed.rb
- lib/sanitize/config/restricted.rb
- lib/sanitize/config.rb
- lib/sanitize/version.rb
- lib/sanitize.rb
has_rdoc: true
homepage: http://github.com/dasil003/sanitize/
licenses: []

post_install_message:
rdoc_options: []

require_paths:
- lib
required_ruby_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: 1.8.6
  version:
required_rubygems_version: !ruby/object:Gem::Requirement
  requirements:
  - - ">="
    - !ruby/object:Gem::Version
      version: "0"
  version:
requirements: []

rubyforge_project:
rubygems_version: 1.3.5
signing_key:
specification_version: 3
summary: Whitelist-based HTML sanitizer.
test_files: []
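The `~>` operator in the dependency entries above is RubyGems' pessimistic version constraint: `~> 1.3.3` accepts 1.3.3 and later patch releases but not 1.4.0. A short sketch that checks this with the RubyGems API (the candidate version numbers are arbitrary examples, not releases this gemspec mentions):

```ruby
require 'rubygems'

# The nokogiri runtime requirement as written in the gemspec metadata.
requirement = Gem::Requirement.new('~> 1.3.3')

['1.3.2', '1.3.3', '1.3.9', '1.4.0'].each do |candidate|
  ok = requirement.satisfied_by?(Gem::Version.new(candidate))
  puts "#{candidate}: #{ok ? 'satisfies' : 'does not satisfy'} ~> 1.3.3"
end
```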