darkhelmet-sanitize 1.2.0.dev.20091104

data/HISTORY ADDED
@@ -0,0 +1,73 @@
+ Sanitize History
+ ================================================================================
+
+ Version 1.2.0.dev (git)
+   * Added support for transformers, which allow you to filter and alter nodes
+     using your own custom logic, on top of (or instead of) Sanitize's core
+     filter. See the README for details.
+   * Requires Nokogiri >= 1.4.0.
+   * Added elements h1 through h6 to the Relaxed whitelist. [Suggested by David
+     Reese]
+
+ Version 1.1.0 (2009-10-11)
+   * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
+   * Added an :output config setting to allow the output format to be specified.
+     Supported formats are :xhtml (the default) and :html (which outputs HTML4).
+   * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
+     path segments. [Peter Cooper]
+
+ Version 1.0.8 (2009-04-23)
+   * Added a workaround for an Hpricot bug that prevents attribute names from
+     being downcased in recent versions of Hpricot. This was exploitable to
+     prevent non-whitelisted protocols from being cleaned. [Reported by Ben
+     Wanicur]
+
+ Version 1.0.7 (2009-04-11)
+   * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
+   * Fixed a bug that caused named character entities containing digits (like
+     &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
+     Steinmetz]
+
+ Version 1.0.6 (2009-02-23)
+   * Removed htmlentities gem dependency.
+   * Existing well-formed character entity references in the input string are now
+     preserved rather than being decoded and re-encoded.
+   * The ' character is now encoded as &#39; instead of &apos; to prevent
+     problems in IE6.
+   * You can now specify the symbol :all in place of an element name in the
+     attributes config hash to allow certain attributes on all elements. [Thanks
+     to Mutwin Kraus]
+
+ Version 1.0.5 (2009-02-05)
+   * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
+     protocols from being cleaned when relative URLs were allowed. [Reported by
+     Dev Purkayastha]
+   * Fixed "undefined method `parent='" exceptions caused by parser changes in
+     edge Hpricot.
+
+ Version 1.0.4 (2009-01-16)
+   * Fixed a bug that made it possible to sneak a non-whitelisted element through
+     by repeating it several times in a row. All versions of Sanitize prior to
+     1.0.4 are vulnerable. [Reported by Cristobal]
+
+ Version 1.0.3 (2009-01-15)
+   * Fixed a bug whereby incomplete Unicode or hex entities could be used to
+     prevent non-whitelisted protocols from being cleaned. Since IE6 and Opera
+     still decode the incomplete entities, users of those browsers may be
+     vulnerable to malicious script injection on websites using versions of
+     Sanitize prior to 1.0.3.
+
+ Version 1.0.2 (2009-01-04)
+   * Fixed a bug that caused an exception to be thrown when parsing a valueless
+     attribute that's expected to contain a URL.
+
+ Version 1.0.1 (2009-01-01)
+   * You can now specify :relative in a protocol config array to allow attributes
+     containing relative URLs with no protocol. The Basic and Relaxed configs
+     have been updated to allow relative URLs.
+   * Added a workaround for an Hpricot bug that causes HTML entities for
+     non-ASCII characters to be replaced by question marks, and all other
+     entities to be destructively decoded.
+
+ Version 1.0.0 (2008-12-25)
+   * First release.
data/LICENSE ADDED
@@ -0,0 +1,18 @@
+ Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
+ this software and associated documentation files (the 'Software'), to deal in
+ the Software without restriction, including without limitation the rights to
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+ the Software, and to permit persons to whom the Software is furnished to do so,
+ subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,249 @@
+ = Sanitize
+
+ Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
+ elements and attributes, Sanitize will remove all unacceptable HTML from a
+ string.
+
+ Using a simple configuration syntax, you can tell Sanitize to allow certain
+ elements, certain attributes within those elements, and even certain URL
+ protocols within attributes that contain URLs. Any HTML elements or attributes
+ that you don't explicitly allow will be removed.
+
+ Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
+ of fragile regular expressions, Sanitize has no trouble dealing with malformed
+ or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
+ caution.
+
+ *Author*:: Ryan Grove (mailto:ryan@wonko.com)
+ *Version*:: 1.2.0.dev (git)
+ *Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
+ *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
+ *Website*:: http://github.com/rgrove/sanitize
+
+ == Requires
+
+ * Nokogiri >= 1.4.0
+ * libxml2 >= 2.7.2
+
+ == Installation
+
+ Latest stable release:
+
+   gem install sanitize
+
+ Latest development version:
+
+   gem install sanitize -s http://gemcutter.org --prerelease
+
+ == Usage
+
+ If you don't specify any configuration options, Sanitize will use its strictest
+ settings by default, which means it will strip all HTML and leave only text
+ behind.
+
+   require 'rubygems'
+   require 'sanitize'
+
+   html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
+
+   Sanitize.clean(html) # => 'foo'
+
+ == Configuration
+
+ In addition to the ultra-safe default settings, Sanitize comes with three other
+ built-in modes.
+
+ === Sanitize::Config::RESTRICTED
+
+ Allows only very simple inline formatting markup. No links, images, or block
+ elements.
+
+   Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'
+
+ === Sanitize::Config::BASIC
+
+ Allows a variety of markup including formatting tags, links, and lists. Images
+ and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
+ protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
+ mitigate SEO spam.
+
+   Sanitize.clean(html, Sanitize::Config::BASIC)
+   # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
+
+ === Sanitize::Config::RELAXED
+
+ Allows an even wider variety of markup than BASIC, including images and tables.
+ Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
+ are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
+ added to links.
+
+   Sanitize.clean(html, Sanitize::Config::RELAXED)
+   # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
+
+ === Custom Configuration
+
+ If the built-in modes don't meet your needs, you can easily specify a custom
+ configuration:
+
+   Sanitize.clean(html, :elements => ['a', 'span'],
+     :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
+     :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
+
+ ==== :elements
+
+ Array of element names to allow. Specify all names in lowercase.
+
+   :elements => [
+     'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
+     'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
+     'sup', 'u', 'ul'
+   ]
+
+ ==== :attributes
+
+ Attributes to allow for specific elements. Specify all element names and
+ attributes in lowercase.
+
+   :attributes => {
+     'a' => ['href', 'title'],
+     'blockquote' => ['cite'],
+     'img' => ['alt', 'src', 'title']
+   }
+
+ If you'd like to allow certain attributes on all elements, use the symbol
+ <code>:all</code> instead of an element name.
+
+   :attributes => {
+     :all => ['class'],
+     'a' => ['href', 'title']
+   }
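As a rough illustration of how the per-element list and the :all list combine, the sketch below mirrors the union that lib/sanitize.rb builds for each element in this release. The allowed_attributes helper is hypothetical and exists only for this example; it is not part of Sanitize's API.

```ruby
# Hypothetical helper (not part of Sanitize): returns the attribute names
# allowed on +element+ under an :attributes config hash, combining the
# element's own list with the :all list.
def allowed_attributes(attributes, element)
  ((attributes[element] || []) + (attributes[:all] || [])).uniq
end

attributes = {
  :all => ['class'],
  'a'  => ['href', 'title']
}

allowed_attributes(attributes, 'a')    # => ["href", "title", "class"]
allowed_attributes(attributes, 'span') # => ["class"]
```

Note that an element still has to appear in :elements before any of its attributes are considered.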
+
+ ==== :add_attributes
+
+ Attributes to add to specific elements. If the attribute already exists, it will
+ be replaced with the value specified here. Specify all element names and
+ attributes in lowercase.
+
+   :add_attributes => {
+     'a' => {'rel' => 'nofollow'}
+   }
+
+ ==== :protocols
+
+ URL protocols to allow in specific attributes. If an attribute is listed here
+ and contains a protocol other than those specified (or if it contains no
+ protocol at all), it will be removed.
+
+   :protocols => {
+     'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
+     'img' => {'src' => ['http', 'https']}
+   }
+
+ If you'd like to allow the use of relative URLs which don't have a protocol,
+ include the symbol <code>:relative</code> in the protocol array:
+
+   :protocols => {
+     'a' => {'href' => ['http', 'https', :relative]}
+   }
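To make the protocol check concrete, here is a minimal sketch built around the REGEX_PROTOCOL pattern from lib/sanitize.rb in this release. The protocol_allowed? helper is hypothetical (Sanitize performs this check inline on each attribute), so treat it as an illustration rather than the library's API.

```ruby
# Protocol-matching regex copied from lib/sanitize.rb in this release.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper (not part of Sanitize): true if +value+ would pass the
# protocol check for the given list of allowed protocols.
def protocol_allowed?(value, allowed)
  if value.to_s.downcase =~ REGEX_PROTOCOL
    # A protocol prefix was found; it must be listed explicitly.
    allowed.include?($1.downcase)
  else
    # No protocol prefix; only acceptable when :relative is listed.
    allowed.include?(:relative)
  end
end

allowed = ['http', 'https', :relative]

protocol_allowed?('http://foo.com/', allowed)         # => true
protocol_allowed?('javascript:alert(1)', allowed)     # => false
protocol_allowed?('javascript&#58;alert(1)', allowed) # => false (entity-encoded colon)
protocol_allowed?('/foo/bar:baz.html', allowed)       # => true  (colon in a path segment)
protocol_allowed?('/foo/bar.html', ['http'])          # => false (:relative not listed)
```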
+
+ === Transformers
+
+ Transformers allow you to filter and alter nodes using your own custom logic, on
+ top of (or instead of) Sanitize's core filter. A transformer is any object that
+ responds to <code>call()</code> (such as a lambda or proc) and returns either
+ <code>nil</code> or a Hash containing certain optional response values.
+
+ To use one or more transformers, pass them to the <code>:transformers</code>
+ config setting:
+
+   Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
+
+ ==== Input
+
+ Each registered transformer's <code>call()</code> method will be called once for
+ each element node in the HTML, and will receive as an argument an environment
+ Hash that contains the following items:
+
+ [<code>:config</code>]
+   The current Sanitize configuration Hash.
+
+ [<code>:node</code>]
+   A Nokogiri::XML::Node object representing an HTML element.
+
+ ==== Processing
+
+ Each transformer has full access to the Nokogiri::XML::Node that's passed into
+ it and to the rest of the document via the node's <code>document()</code>
+ method. Any changes will be reflected instantly in the document and passed on to
+ subsequently-called transformers and to Sanitize itself. A transformer may even
+ call Sanitize internally to perform custom sanitization if needed.
+
+ Nodes are passed into transformers in the order in which they're traversed. It's
+ important to note that Nokogiri traverses markup from the deepest node upward,
+ not from the first node to the last node:
+
+   html = '<div><span>foo</span></div>'
+   transformer = lambda{|env| puts env[:node].name }
+
+   # Prints "span", then "div".
+   Sanitize.clean(html, :transformers => transformer)
+
+ Transformers have a tremendous amount of power, including the power to
+ completely bypass Sanitize's built-in filtering. Be careful!
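The traversal order described above is a children-first (post-order) walk. The sketch below reproduces that order with a small stand-in tree so it runs without Nokogiri; TreeNode is a hypothetical type for this example only, and text nodes are left out for brevity.

```ruby
# Stand-in for a parsed element; a real transformer receives a
# Nokogiri::XML::Node instead.
TreeNode = Struct.new(:name, :children) do
  # Visit every child (recursively) before yielding the node itself.
  def traverse(&block)
    children.each { |child| child.traverse(&block) }
    block.call(self)
  end
end

# Models <div><span><b>foo</b></span><i>bar</i></div>
tree = TreeNode.new('div', [
  TreeNode.new('span', [TreeNode.new('b', [])]),
  TreeNode.new('i', [])
])

order = []
tree.traverse { |node| order << node.name }
order # => ["b", "span", "i", "div"]
```

A practical consequence is that a transformer sees an element's descendants before the element itself, so a decision about a container is made after its children have already been processed.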
+
+ ==== Output
+
+ A transformer may return either +nil+ or a Hash. A return value of +nil+
+ indicates that the transformer does not wish to act on the current node in any
+ way. A returned Hash may contain the following items, all of which are optional:
+
+ [<code>:attr_whitelist</code>]
+   Array of attribute names to add to the whitelist for the current node, in
+   addition to any whitelisted attributes already defined in the current config.
+
+ [<code>:node</code>]
+   A Nokogiri::XML::Node object that should replace the current node. All
+   subsequent transformers and Sanitize itself will receive this new node.
+
+ [<code>:whitelist</code>]
+   If _true_, the current node (and only the current node) will be whitelisted,
+   regardless of the current Sanitize config.
+
+ [<code>:whitelist_nodes</code>]
+   Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
+   document, regardless of the current Sanitize config.
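As one possible illustration of the input and output contract, the transformer sketched below asks Sanitize to keep a class attribute on code elements and declines to act on anything else. CODE_CLASS_TRANSFORMER and FakeNode are names invented for this example; FakeNode only stands in for Nokogiri::XML::Node so the block can run without the gem.

```ruby
# Transformer sketch: allow a "class" attribute on <code> elements only.
# Returning nil tells Sanitize to handle the node as it normally would.
CODE_CLASS_TRANSFORMER = lambda do |env|
  node = env[:node]
  return nil unless node.name.to_s.downcase == 'code'

  {:attr_whitelist => ['class']}
end

# Hypothetical stand-in for Nokogiri::XML::Node, used only for this demo.
FakeNode = Struct.new(:name)

CODE_CLASS_TRANSFORMER.call({:config => {}, :node => FakeNode.new('code')})
# => {:attr_whitelist=>["class"]}
CODE_CLASS_TRANSFORMER.call({:config => {}, :node => FakeNode.new('div')})
# => nil

# With the gem installed, the lambda would be passed in like this:
#   Sanitize.clean(html, :elements => ['code'],
#     :transformers => [CODE_CLASS_TRANSFORMER])
```

The element itself still has to be allowed, either through :elements or by returning :whitelist => true, because :attr_whitelist only extends the attribute list of a node that survives filtering.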
+
+ == Contributors
+
+ The following lovely people have contributed to Sanitize in the form of patches
+ or ideas that later became code:
+
+ * Peter Cooper <git@peterc.org>
+ * Gabe da Silveira <gabe@websaviour.com>
+ * Ryan Grove <ryan@wonko.com>
+ * Adam Hooper <adam@adamhooper.com>
+ * Mutwin Kraus <mutle@blogage.de>
+ * Dev Purkayastha <dev.purkayastha@gmail.com>
+ * David Reese <work@whatcould.com>
+ * Ben Wanicur <bwanicur@verticalresponse.com>
+
+ == License
+
+ Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
+ this software and associated documentation files (the 'Software'), to deal in
+ the Software without restriction, including without limitation the rights to
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
+ the Software, and to permit persons to whom the Software is furnished to do so,
+ subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/lib/sanitize/config/basic.rb ADDED
@@ -0,0 +1,49 @@
+ #--
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the 'Software'), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #++
+
+ class Sanitize
+   module Config
+     BASIC = {
+       :elements => [
+         'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
+         'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
+         'sup', 'u', 'ul'],
+
+       :attributes => {
+         'a' => ['href'],
+         'blockquote' => ['cite'],
+         'q' => ['cite']
+       },
+
+       :add_attributes => {
+         'a' => {'rel' => 'nofollow'}
+       },
+
+       :protocols => {
+         'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
+           :relative]},
+         'blockquote' => {'cite' => ['http', 'https', :relative]},
+         'q' => {'cite' => ['http', 'https', :relative]}
+       }
+     }
+   end
+ end
data/lib/sanitize/config/relaxed.rb ADDED
@@ -0,0 +1,59 @@
+ #--
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the 'Software'), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #++
+
+
+
+ class Sanitize
+   module Config
+     RELAXED = {
+       :elements => [
+         'a', 'b', 'blockquote', 'br', 'caption', 'cite', 'code', 'col',
+         'colgroup', 'dd', 'dl', 'dt', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
+         'i', 'img', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong',
+         'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'u',
+         'ul'],
+
+       :attributes => {
+         'a' => ['href', 'title'],
+         'blockquote' => ['cite'],
+         'col' => ['span', 'width'],
+         'colgroup' => ['span', 'width'],
+         'img' => ['align', 'alt', 'height', 'src', 'title', 'width'],
+         'ol' => ['start', 'type'],
+         'q' => ['cite'],
+         'table' => ['summary', 'width'],
+         'td' => ['abbr', 'axis', 'colspan', 'rowspan', 'width'],
+         'th' => ['abbr', 'axis', 'colspan', 'rowspan', 'scope',
+           'width'],
+         'ul' => ['type']
+       },
+
+       :protocols => {
+         'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
+           :relative]},
+         'blockquote' => {'cite' => ['http', 'https', :relative]},
+         'img' => {'src' => ['http', 'https', :relative]},
+         'q' => {'cite' => ['http', 'https', :relative]}
+       }
+     }
+   end
+ end
data/lib/sanitize/config/restricted.rb ADDED
@@ -0,0 +1,29 @@
+ #--
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the 'Software'), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #++
+
+ class Sanitize
+   module Config
+     RESTRICTED = {
+       :elements => ['b', 'em', 'i', 'strong', 'u']
+     }
+   end
+ end
data/lib/sanitize/config.rb ADDED
@@ -0,0 +1,55 @@
+ #--
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the 'Software'), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #++
+
+ class Sanitize
+   module Config
+     DEFAULT = {
+       # Whether or not to allow HTML comments. Allowing comments is strongly
+       # discouraged, since IE allows script execution within conditional
+       # comments.
+       :allow_comments => false,
+
+       # HTML attributes to add to specific elements. By default, no attributes
+       # are added.
+       :add_attributes => {},
+
+       # HTML attributes to allow in specific elements. By default, no attributes
+       # are allowed.
+       :attributes => {},
+
+       # HTML elements to allow. By default, no elements are allowed (which means
+       # that all HTML will be stripped).
+       :elements => [],
+
+       # Output format. Supported formats are :html and :xhtml (which is the
+       # default).
+       :output => :xhtml,
+
+       # URL handling protocols to allow in specific attributes. By default, no
+       # protocols are allowed. Use :relative in place of a protocol if you want
+       # to allow relative URLs sans protocol.
+       :protocols => {},
+
+       :transformers => []
+     }
+   end
+ end
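In this release, lib/sanitize.rb combines these defaults with the caller's hash through a plain Hash#merge and then wraps :transformers with Array(). The sketch below imitates that behaviour with a trimmed-down copy of the defaults; the top-level DEFAULT constant and the effective_config helper are stand-ins for this example, not Sanitize's API.

```ruby
# Trimmed-down stand-in for Sanitize::Config::DEFAULT.
DEFAULT = {
  :allow_comments => false,
  :attributes     => {},
  :elements       => [],
  :output         => :xhtml,
  :transformers   => []
}

# Hypothetical helper mirroring what Sanitize#initialize does with its
# argument: a shallow merge, then :transformers normalised to an Array.
def effective_config(config = {})
  merged = DEFAULT.merge(config)
  merged[:transformers] = Array(merged[:transformers])
  merged
end

noop   = lambda { |env| nil }
config = effective_config({:elements => ['b'], :transformers => noop})

config[:elements]            # => ["b"]
config[:output]              # => :xhtml (default kept)
config[:transformers].length # => 1 (the single lambda was wrapped in an Array)
```

Because the merge is shallow, a key supplied by the caller replaces the default value for that key outright; nested hashes such as :attributes are not merged element by element.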
data/lib/sanitize/version.rb ADDED
@@ -0,0 +1,3 @@
+ class Sanitize
+   VERSION = '1.2.0.dev.20091104'
+ end
data/lib/sanitize.rb ADDED
@@ -0,0 +1,228 @@
+ # encoding: utf-8
+ #--
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the 'Software'), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #++
+
+ require 'nokogiri'
+ require 'sanitize/version'
+ require 'sanitize/config'
+ require 'sanitize/config/restricted'
+ require 'sanitize/config/basic'
+ require 'sanitize/config/relaxed'
+
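+ class Sanitize
+   # Raised for unsupported output formats and invalid transformer output.
+   # The files shown in this diff reference Error without defining it, so it
+   # is declared here to keep those raise calls from failing with a NameError.
+   class Error < StandardError; end
+
+   attr_reader :config
+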
+   # Matches an attribute value that could be treated by a browser as a URL
+   # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
+   # or more characters followed by a colon is considered a match, even if the
+   # colon is encoded as an entity and even if it's an incomplete entity (which
+   # IE6 and Opera will still parse).
+   REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
+
+   #--
+   # Class Methods
+   #++
+
+   # Returns a sanitized copy of _html_, using the settings in _config_ if
+   # specified.
+   def self.clean(html, config = {})
+     sanitize = Sanitize.new(config)
+     sanitize.clean(html)
+   end
+
+   # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
+   # were made.
+   def self.clean!(html, config = {})
+     sanitize = Sanitize.new(config)
+     sanitize.clean!(html)
+   end
+
+   # Sanitizes the specified Nokogiri::XML::Node and all its children.
+   def self.clean_node!(node, config = {})
+     sanitize = Sanitize.new(config)
+     sanitize.clean_node!(node)
+   end
+
+   #--
+   # Instance Methods
+   #++
+
+   # Returns a new Sanitize object initialized with the settings in _config_.
+   def initialize(config = {})
+     # Sanitize configuration.
+     @config = Config::DEFAULT.merge(config)
+     @config[:transformers] = Array(@config[:transformers])
+
+     # Specific nodes to whitelist (along with all their attributes). This array
+     # is generated at runtime by transformers, and is cleared before and after
+     # a fragment is cleaned (so it applies only to a specific fragment).
+     @whitelist_nodes = []
+   end
+
+   # Returns a sanitized copy of _html_.
+   def clean(html)
+     dupe = html.dup
+     clean!(dupe) || dupe
+   end
+
+   # Performs clean in place, returning _html_, or +nil+ if no changes were
+   # made.
+   def clean!(html)
+     @whitelist_nodes = []
+     fragment = Nokogiri::HTML::DocumentFragment.parse(html)
+     clean_node!(fragment)
+     @whitelist_nodes = []
+
+     output_method_params = {:encoding => 'utf-8', :indent => 0}
+
+     if @config[:output] == :xhtml
+       output_method = fragment.method(:to_xhtml)
+       output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
+     elsif @config[:output] == :html
+       output_method = fragment.method(:to_html)
+     else
+       raise Error, "unsupported output format: #{@config[:output]}"
+     end
+
+     result = output_method.call(output_method_params)
+
+     # Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
+     # string no matter what we ask for. This will be fixed in 1.4.0, but for
+     # now we have to hack around it to prevent errors.
+     result.force_encoding('utf-8') if RUBY_VERSION >= '1.9'
+
+     return result == html ? nil : html[0, html.length] = result
+   end
+
+   # Sanitizes the specified Nokogiri::XML::Node and all its children.
+   def clean_node!(node)
+     raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
+
+     node.traverse do |traversed_node|
+       if traversed_node.element?
+         clean_element!(traversed_node)
+       elsif traversed_node.comment?
+         traversed_node.unlink unless @config[:allow_comments]
+       elsif traversed_node.cdata?
+         traversed_node.replace(Nokogiri::XML::Text.new(traversed_node.text,
+           traversed_node.document))
+       end
+     end
+
+     node
+   end
+
+   private
+
+   def clean_element!(node)
+     # Run this node through all configured transformers.
+     transform = transform_element!(node)
+
+     # If this node is in the dynamic whitelist array (built at runtime by
+     # transformers), let it live with all of its attributes intact.
+     return if @whitelist_nodes.include?(node)
+
+     name = node.name.to_s.downcase
+
+     # Delete any element that isn't in the whitelist.
+     unless transform[:whitelist] || @config[:elements].include?(name)
+       node.children.each { |n| node.add_previous_sibling(n) }
+       node.unlink
+       return
+     end
+
+     attr_whitelist = (transform[:attr_whitelist] +
+       (@config[:attributes][name] || []) +
+       (@config[:attributes][:all] || [])).uniq
+
+     if attr_whitelist.empty?
+       # Delete all attributes from elements with no whitelisted attributes.
+       node.attribute_nodes.each {|attr| attr.remove }
+     else
+       # Delete any attribute that isn't in the whitelist for this element.
+       node.attribute_nodes.each do |attr|
+         attr.unlink unless attr_whitelist.include?(attr.name.downcase)
+       end
+
+       # Delete remaining attributes that use unacceptable protocols.
+       if @config[:protocols].has_key?(name)
+         protocol = @config[:protocols][name]
+
+         node.attribute_nodes.each do |attr|
+           attr_name = attr.name.downcase
+           next false unless protocol.has_key?(attr_name)
+
+           del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
+             !protocol[attr_name].include?($1.downcase)
+           else
+             !protocol[attr_name].include?(:relative)
+           end
+
+           attr.unlink if del
+         end
+       end
+     end
+
+     # Add required attributes.
+     if @config[:add_attributes].has_key?(name)
+       @config[:add_attributes][name].each do |key, val|
+         node[key] = val
+       end
+     end
+
+     transform
+   end
+
+   def transform_element!(node)
+     output = {
+       :attr_whitelist => [],
+       :node => node,
+       :whitelist => false
+     }
+
+     @config[:transformers].inject(node) do |transformer_node, transformer|
+       transform = transformer.call({
+         :config => @config,
+         :node => transformer_node
+       })
+
+       if transform.nil?
+         transformer_node
+       elsif transform.is_a?(Hash)
+         if transform[:whitelist_nodes].is_a?(Array)
+           @whitelist_nodes += transform[:whitelist_nodes]
+           @whitelist_nodes.uniq!
+         end
+
+         output[:attr_whitelist] += transform[:attr_whitelist] if transform[:attr_whitelist].is_a?(Array)
+         output[:whitelist] ||= true if transform[:whitelist]
+         output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
+       else
+         raise Error, "transformer output must be a Hash or nil"
+       end
+     end
+
+     node.replace(output[:node]) if node != output[:node]
+
+     return output
+   end
+ end
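The clean and clean! pair above follows the usual Ruby convention for bang methods: the in-place variant returns nil when nothing changed, and the copying variant falls back to the untouched duplicate. The sketch below shows the same idiom with a deliberately naive stand-in filter; scrub and scrub! are hypothetical names, and the regex is not how Sanitize filters HTML.

```ruby
# Stand-in "cleaner" that removes anything shaped like a tag. It exists only
# to demonstrate the return values; Sanitize itself parses with Nokogiri.
def scrub!(text)
  result = text.gsub(/<[^>]*>/, '')
  return nil if result == text

  text[0, text.length] = result # overwrite the caller's string in place
  text
end

# Non-destructive variant: works on a copy and always returns a string.
def scrub(text)
  dupe = text.dup
  scrub!(dupe) || dupe
end

html = '<b>foo</b>'.dup

scrub(html)          # => "foo" (html itself is unchanged)
scrub!(html)         # => "foo" (html is now "foo")
scrub!(html)         # => nil   (nothing left to change)
```

Callers that only want the cleaned text should therefore prefer the non-bang form, since the bang form's nil return is easy to mistake for an empty result.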
metadata ADDED
@@ -0,0 +1,92 @@
+ --- !ruby/object:Gem::Specification
+ name: darkhelmet-sanitize
+ version: !ruby/object:Gem::Version
+   version: 1.2.0.dev.20091104
+ platform: ruby
+ authors:
+ - Ryan Grove
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2009-11-04 00:00:00 -07:00
+ default_executable:
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: nokogiri
+   type: :runtime
+   version_requirement:
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 1.4.0
+     version:
+ - !ruby/object:Gem::Dependency
+   name: bacon
+   type: :development
+   version_requirement:
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 1.1.0
+     version:
+ - !ruby/object:Gem::Dependency
+   name: rake
+   type: :development
+   version_requirement:
+   version_requirements: !ruby/object:Gem::Requirement
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         version: 0.8.0
+     version:
+ description:
+ email: ryan@wonko.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files: []
+
+ files:
+ - HISTORY
+ - LICENSE
+ - README.rdoc
+ - lib/sanitize/config/basic.rb
+ - lib/sanitize/config/relaxed.rb
+ - lib/sanitize/config/restricted.rb
+ - lib/sanitize/config.rb
+ - lib/sanitize/version.rb
+ - lib/sanitize.rb
+ has_rdoc: true
+ homepage: http://github.com/rgrove/sanitize/
+ licenses: []
+
+ post_install_message:
+ rdoc_options: []
+
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       version: 1.8.6
+   version:
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   requirements:
+   - - ">"
+     - !ruby/object:Gem::Version
+       version: 1.3.1
+   version:
+ requirements: []
+
+ rubyforge_project: riposte
+ rubygems_version: 1.3.5
+ signing_key:
+ specification_version: 3
+ summary: Whitelist-based HTML sanitizer.
+ test_files: []
+
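The gemspec above pins Nokogiri with the pessimistic operator (~> 1.4.0), which is narrower than the ">= 1.4.0" stated in the README. As a quick way to see what that constraint admits, the sketch below uses RubyGems' own Gem::Requirement and Gem::Version classes; the version numbers tried are arbitrary examples.

```ruby
require 'rubygems'

# Runtime dependency as declared in the gemspec.
nokogiri = Gem::Requirement.new('~> 1.4.0')

nokogiri.satisfied_by?(Gem::Version.new('1.4.0')) # => true
nokogiri.satisfied_by?(Gem::Version.new('1.4.7')) # => true
nokogiri.satisfied_by?(Gem::Version.new('1.5.0')) # => false
nokogiri.satisfied_by?(Gem::Version.new('1.3.3')) # => false

# The gem's own version contains letters, so RubyGems treats it as a
# prerelease that sorts before the final 1.2.0.
version = Gem::Version.new('1.2.0.dev.20091104')
version.prerelease?                # => true
version < Gem::Version.new('1.2.0') # => true
```

The prerelease status is presumably also why the spec asks for a RubyGems version newer than 1.3.1 and why the README installs the development build with --prerelease.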