darkhelmet-sanitize 1.2.0.dev.20091104

data/HISTORY ADDED
@@ -0,0 +1,73 @@
1
+ Sanitize History
2
+ ================================================================================
3
+
4
+ Version 1.2.0.dev (git)
5
+ * Added support for transformers, which allow you to filter and alter nodes
6
+ using your own custom logic, on top of (or instead of) Sanitize's core
7
+ filter. See the README for details.
8
+ * Requires Nokogiri >= 1.4.0.
9
+ * Added elements h1 through h6 to the Relaxed whitelist. [Suggested by David
10
+ Reese]
11
+
12
+ Version 1.1.0 (2009-10-11)
13
+ * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
14
+ * Added an :output config setting to allow the output format to be specified.
15
+ Supported formats are :xhtml (the default) and :html (which outputs HTML4).
16
+ * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
17
+ path segments. [Peter Cooper]
18
+
19
+ Version 1.0.8 (2009-04-23)
20
+ * Added a workaround for an Hpricot bug that prevents attribute names from
21
+ being downcased in recent versions of Hpricot. This was exploitable to
22
+ prevent non-whitelisted protocols from being cleaned. [Reported by Ben
23
+ Wanicur]
24
+
25
+ Version 1.0.7 (2009-04-11)
26
+ * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
27
+ * Fixed a bug that caused named character entities containing digits (like
28
+ &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
29
+ Steinmetz]
30
+
31
+ Version 1.0.6 (2009-02-23)
32
+ * Removed htmlentities gem dependency.
33
+ * Existing well-formed character entity references in the input string are now
34
+ preserved rather than being decoded and re-encoded.
35
+ The ' character is now encoded as &#39; instead of &apos; to prevent
36
+ problems in IE6.
37
+ * You can now specify the symbol :all in place of an element name in the
38
+ attributes config hash to allow certain attributes on all elements. [Thanks
39
+ to Mutwin Kraus]
40
+
41
+ Version 1.0.5 (2009-02-05)
42
+ * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
43
+ protocols from being cleaned when relative URLs were allowed. [Reported by
44
+ Dev Purkayastha]
45
+ * Fixed "undefined method `parent='" exceptions caused by parser changes in
46
+ edge Hpricot.
47
+
48
+ Version 1.0.4 (2009-01-16)
49
+ * Fixed a bug that made it possible to sneak a non-whitelisted element through
50
+ by repeating it several times in a row. All versions of Sanitize prior to
51
+ 1.0.4 are vulnerable. [Reported by Cristobal]
52
+
53
+ Version 1.0.3 (2009-01-15)
54
+ * Fixed a bug whereby incomplete Unicode or hex entities could be used to
55
+ prevent non-whitelisted protocols from being cleaned. Since IE6 and Opera
56
+ still decode the incomplete entities, users of those browsers may be
57
+ vulnerable to malicious script injection on websites using versions of
58
+ Sanitize prior to 1.0.3.
59
+
60
+ Version 1.0.2 (2009-01-04)
61
+ * Fixed a bug that caused an exception to be thrown when parsing a valueless
62
+ attribute that's expected to contain a URL.
63
+
64
+ Version 1.0.1 (2009-01-01)
65
+ * You can now specify :relative in a protocol config array to allow attributes
66
+ containing relative URLs with no protocol. The Basic and Relaxed configs
67
+ have been updated to allow relative URLs.
68
+ * Added a workaround for an Hpricot bug that causes HTML entities for
69
+ non-ASCII characters to be replaced by question marks, and all other
70
+ entities to be destructively decoded.
71
+
72
+ Version 1.0.0 (2008-12-25)
73
+ * First release.
data/LICENSE ADDED
@@ -0,0 +1,18 @@
1
+ Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
4
+ this software and associated documentation files (the 'Software'), to deal in
5
+ the Software without restriction, including without limitation the rights to
6
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7
+ the Software, and to permit persons to whom the Software is furnished to do so,
8
+ subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
15
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
16
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
17
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
18
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/README.rdoc ADDED
@@ -0,0 +1,249 @@
1
+ = Sanitize
2
+
3
+ Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
4
+ elements and attributes, Sanitize will remove all unacceptable HTML from a
5
+ string.
6
+
7
+ Using a simple configuration syntax, you can tell Sanitize to allow certain
8
+ elements, certain attributes within those elements, and even certain URL
9
+ protocols within attributes that contain URLs. Any HTML elements or attributes
10
+ that you don't explicitly allow will be removed.
11
+
12
+ Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
13
+ of fragile regular expressions, Sanitize has no trouble dealing with malformed
14
+ or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
15
+ caution.
16
+
17
+ *Author*:: Ryan Grove (mailto:ryan@wonko.com)
18
+ *Version*:: 1.2.0.dev (git)
19
+ *Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
20
+ *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
21
+ *Website*:: http://github.com/rgrove/sanitize
22
+
23
+ == Requires
24
+
25
+ * Nokogiri >= 1.4.0
26
+ * libxml2 >= 2.7.2
27
+
28
+ == Installation
29
+
30
+ Latest stable release:
31
+
32
+ gem install sanitize
33
+
34
+ Latest development version:
35
+
36
+ gem install sanitize -s http://gemcutter.org --prerelease
37
+
38
+ == Usage
39
+
40
+ If you don't specify any configuration options, Sanitize will use its strictest
41
+ settings by default, which means it will strip all HTML and leave only text
42
+ behind.
43
+
44
+ require 'rubygems'
45
+ require 'sanitize'
46
+
47
+ html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
48
+
49
+ Sanitize.clean(html) # => 'foo'
50
+
51
+ == Configuration
52
+
53
+ In addition to the ultra-safe default settings, Sanitize comes with three other
54
+ built-in modes.
55
+
56
+ === Sanitize::Config::RESTRICTED
57
+
58
+ Allows only very simple inline formatting markup. No links, images, or block
59
+ elements.
60
+
61
+ Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'
62
+
63
+ === Sanitize::Config::BASIC
64
+
65
+ Allows a variety of markup including formatting tags, links, and lists. Images
66
+ and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
67
+ protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
68
+ mitigate SEO spam.
69
+
70
+ Sanitize.clean(html, Sanitize::Config::BASIC)
71
+ # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
72
+
73
+ === Sanitize::Config::RELAXED
74
+
75
+ Allows an even wider variety of markup than BASIC, including images and tables.
76
+ Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
77
+ are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
78
+ added to links.
79
+
80
+ Sanitize.clean(html, Sanitize::Config::RELAXED)
81
+ # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
82
+
83
+ === Custom Configuration
84
+
85
+ If the built-in modes don't meet your needs, you can easily specify a custom
86
+ configuration:
87
+
88
+ Sanitize.clean(html, :elements => ['a', 'span'],
89
+ :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
90
+ :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
91
+
92
+ ==== :elements
93
+
94
+ Array of element names to allow. Specify all names in lowercase.
95
+
96
+ :elements => [
97
+ 'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
98
+ 'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
99
+ 'sup', 'u', 'ul'
100
+ ]
101
+
102
+ ==== :attributes
103
+
104
+ Attributes to allow for specific elements. Specify all element names and
105
+ attributes in lowercase.
106
+
107
+ :attributes => {
108
+ 'a' => ['href', 'title'],
109
+ 'blockquote' => ['cite'],
110
+ 'img' => ['alt', 'src', 'title']
111
+ }
112
+
113
+ If you'd like to allow certain attributes on all elements, use the symbol
114
+ <code>:all</code> instead of an element name.
115
+
116
+ :attributes => {
117
+ :all => ['class'],
118
+ 'a' => ['href', 'title']
119
+ }
120
+
121
+ ==== :add_attributes
122
+
123
+ Attributes to add to specific elements. If the attribute already exists, it will
124
+ be replaced with the value specified here. Specify all element names and
125
+ attributes in lowercase.
126
+
127
+ :add_attributes => {
128
+ 'a' => {'rel' => 'nofollow'}
129
+ }
130
+
131
+ ==== :protocols
132
+
133
+ URL protocols to allow in specific attributes. If an attribute is listed here
134
+ and contains a protocol other than those specified (or if it contains no
135
+ protocol at all), it will be removed.
136
+
137
+ :protocols => {
138
+ 'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
139
+ 'img' => {'src' => ['http', 'https']}
140
+ }
141
+
142
+ If you'd like to allow the use of relative URLs which don't have a protocol,
143
+ include the symbol <code>:relative</code> in the protocol array:
144
+
145
+ :protocols => {
146
+ 'a' => {'href' => ['http', 'https', :relative]}
147
+ }
148
+
149
+ === Transformers
150
+
151
+ Transformers allow you to filter and alter nodes using your own custom logic, on
152
+ top of (or instead of) Sanitize's core filter. A transformer is any object that
153
+ responds to <code>call()</code> (such as a lambda or proc) and returns either
154
+ <code>nil</code> or a Hash containing certain optional response values.
155
+
156
+ To use one or more transformers, pass them to the <code>:transformers</code>
157
+ config setting:
158
+
159
+ Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
160
+
161
+ ==== Input
162
+
163
+ Each registered transformer's <code>call()</code> method will be called once for
164
+ each element node in the HTML, and will receive as an argument an environment
165
+ Hash that contains the following items:
166
+
167
+ [<code>:config</code>]
168
+ The current Sanitize configuration Hash.
169
+
170
+ [<code>:node</code>]
171
+ A Nokogiri::XML::Node object representing an HTML element.
172
+
173
+ ==== Processing
174
+
175
+ Each transformer has full access to the Nokogiri::XML::Node that's passed into
176
+ it and to the rest of the document via the node's <code>document()</code>
177
+ method. Any changes will be reflected instantly in the document and passed on to
178
+ subsequently-called transformers and to Sanitize itself. A transformer may even
179
+ call Sanitize internally to perform custom sanitization if needed.
180
+
181
+ Nodes are passed into transformers in the order in which they're traversed. It's
182
+ important to note that Nokogiri traverses markup from the deepest node upward,
183
+ not from the first node to the last node:
184
+
185
+ html = '<div><span>foo</span></div>'
186
+ transformer = lambda{|env| puts env[:node].name }
187
+
188
+ # Prints "span", then "div".
189
+ Sanitize.clean(html, :transformers => transformer)
190
+
191
+ Transformers have a tremendous amount of power, including the power to
192
+ completely bypass Sanitize's built-in filtering. Be careful!
193
+
194
+ ==== Output
195
+
196
+ A transformer may return either +nil+ or a Hash. A return value of +nil+
197
+ indicates that the transformer does not wish to act on the current node in any
198
+ way. A returned Hash may contain the following items, all of which are optional:
199
+
200
+ [<code>:attr_whitelist</code>]
201
+ Array of attribute names to add to the whitelist for the current node, in
202
+ addition to any whitelisted attributes already defined in the current config.
203
+
204
+ [<code>:node</code>]
205
+ A Nokogiri::XML::Node object that should replace the current node. All
206
+ subsequent transformers and Sanitize itself will receive this new node.
207
+
208
+ [<code>:whitelist</code>]
209
+ If _true_, the current node (and only the current node) will be whitelisted,
210
+ regardless of the current Sanitize config.
211
+
212
+ [<code>:whitelist_nodes</code>]
213
+ Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
214
+ document, regardless of the current Sanitize config.
215
+
216
+ == Contributors
217
+
218
+ The following lovely people have contributed to Sanitize in the form of patches
219
+ or ideas that later became code:
220
+
221
+ * Peter Cooper <git@peterc.org>
222
+ * Gabe da Silveira <gabe@websaviour.com>
223
+ * Ryan Grove <ryan@wonko.com>
224
+ * Adam Hooper <adam@adamhooper.com>
225
+ * Mutwin Kraus <mutle@blogage.de>
226
+ * Dev Purkayastha <dev.purkayastha@gmail.com>
227
+ * David Reese <work@whatcould.com>
228
+ * Ben Wanicur <bwanicur@verticalresponse.com>
229
+
230
+ == License
231
+
232
+ Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
233
+
234
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
235
+ this software and associated documentation files (the 'Software'), to deal in
236
+ the Software without restriction, including without limitation the rights to
237
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
238
+ the Software, and to permit persons to whom the Software is furnished to do so,
239
+ subject to the following conditions:
240
+
241
+ The above copyright notice and this permission notice shall be included in all
242
+ copies or substantial portions of the Software.
243
+
244
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
245
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
246
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
247
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
248
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
249
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
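The transformer contract described in the README's Transformers section can be illustrated without Nokogiri. The following is a minimal sketch, not Sanitize's own implementation: `FakeNode` is a hypothetical stand-in for Nokogiri::XML::Node, and `combine_transformers` is an illustrative helper that loosely mirrors how the documented return values (`nil`, `:attr_whitelist`, `:whitelist`) might be accumulated across several transformers.

```ruby
# Hypothetical stand-in for Nokogiri::XML::Node so the sketch runs without the gem.
FakeNode = Struct.new(:name)

# A transformer is anything that responds to call(). This one allows the
# "class" attribute on <span> elements and declines to act on other nodes.
ALLOW_SPAN_CLASS = lambda do |env|
  env[:node].name == 'span' ? {:attr_whitelist => ['class']} : nil
end

# This one whitelists <mark> elements regardless of the :elements setting.
ALLOW_MARK = lambda do |env|
  env[:node].name == 'mark' ? {:whitelist => true} : nil
end

# Illustrative helper (not part of Sanitize): call each transformer with the
# environment Hash and accumulate the optional response values.
def combine_transformers(transformers, node, config = {})
  output = {:attr_whitelist => [], :whitelist => false}

  transformers.each do |transformer|
    result = transformer.call({:config => config, :node => node})
    next if result.nil?

    unless result.is_a?(Hash)
      raise ArgumentError, 'transformer output must be a Hash or nil'
    end

    if result[:attr_whitelist].is_a?(Array)
      output[:attr_whitelist] += result[:attr_whitelist]
    end
    output[:whitelist] = true if result[:whitelist]
  end

  output
end

TRANSFORMERS = [ALLOW_SPAN_CLASS, ALLOW_MARK]
combine_transformers(TRANSFORMERS, FakeNode.new('span'))
combine_transformers(TRANSFORMERS, FakeNode.new('mark'))
```

With the real gem, the same lambdas would be passed through the `:transformers` config setting and would receive actual Nokogiri nodes in `env[:node]`.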
data/lib/sanitize/config/basic.rb ADDED
@@ -0,0 +1,49 @@
1
+ #--
2
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+ class Sanitize
24
+ module Config
25
+ BASIC = {
26
+ :elements => [
27
+ 'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
28
+ 'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
29
+ 'sup', 'u', 'ul'],
30
+
31
+ :attributes => {
32
+ 'a' => ['href'],
33
+ 'blockquote' => ['cite'],
34
+ 'q' => ['cite']
35
+ },
36
+
37
+ :add_attributes => {
38
+ 'a' => {'rel' => 'nofollow'}
39
+ },
40
+
41
+ :protocols => {
42
+ 'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
43
+ :relative]},
44
+ 'blockquote' => {'cite' => ['http', 'https', :relative]},
45
+ 'q' => {'cite' => ['http', 'https', :relative]}
46
+ }
47
+ }
48
+ end
49
+ end
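How the `:protocols` lists in a config such as BASIC are applied to attribute values can be sketched in plain Ruby. The regular expression and the branch logic are taken from lib/sanitize.rb in this package; the helper name `protocol_allowed?`, the constant `HREF_PROTOCOLS`, and the sample URLs are illustrative assumptions.

```ruby
# Taken from lib/sanitize.rb: matches a leading protocol such as "http:" or
# "javascript:", even when the colon is written as a (possibly incomplete)
# character entity.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# The protocol list that BASIC uses for <a href>.
HREF_PROTOCOLS = ['ftp', 'http', 'https', 'mailto', :relative]

# Hypothetical helper: returns true if an attribute value would be kept.
def protocol_allowed?(value, allowed)
  match = REGEX_PROTOCOL.match(value.to_s.downcase)

  if match
    # A protocol prefix was found; it must be listed explicitly.
    allowed.include?(match[1])
  else
    # No protocol prefix, so the value is treated as a relative URL.
    allowed.include?(:relative)
  end
end

protocol_allowed?('http://foo.com/', HREF_PROTOCOLS)          # => true
protocol_allowed?('javascript:alert(1)', HREF_PROTOCOLS)      # => false
protocol_allowed?('javascript&#58;alert(1)', HREF_PROTOCOLS)  # => false
protocol_allowed?('/docs/a:b.html', HREF_PROTOCOLS)           # => true
protocol_allowed?('/docs/index.html', ['http', 'https'])      # => false
```

The fourth call shows the behavior noted in the 1.1.0 history entry: a colon inside a path segment is not mistaken for a protocol separator, because `/` is not part of the character class that precedes the colon.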
data/lib/sanitize/config/relaxed.rb ADDED
@@ -0,0 +1,59 @@
1
+ #--
2
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+
24
+
25
+ class Sanitize
26
+ module Config
27
+ RELAXED = {
28
+ :elements => [
29
+ 'a', 'b', 'blockquote', 'br', 'caption', 'cite', 'code', 'col',
30
+ 'colgroup', 'dd', 'dl', 'dt', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
31
+ 'i', 'img', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong',
32
+ 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'u',
33
+ 'ul'],
34
+
35
+ :attributes => {
36
+ 'a' => ['href', 'title'],
37
+ 'blockquote' => ['cite'],
38
+ 'col' => ['span', 'width'],
39
+ 'colgroup' => ['span', 'width'],
40
+ 'img' => ['align', 'alt', 'height', 'src', 'title', 'width'],
41
+ 'ol' => ['start', 'type'],
42
+ 'q' => ['cite'],
43
+ 'table' => ['summary', 'width'],
44
+ 'td' => ['abbr', 'axis', 'colspan', 'rowspan', 'width'],
45
+ 'th' => ['abbr', 'axis', 'colspan', 'rowspan', 'scope',
46
+ 'width'],
47
+ 'ul' => ['type']
48
+ },
49
+
50
+ :protocols => {
51
+ 'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
52
+ :relative]},
53
+ 'blockquote' => {'cite' => ['http', 'https', :relative]},
54
+ 'img' => {'src' => ['http', 'https', :relative]},
55
+ 'q' => {'cite' => ['http', 'https', :relative]}
56
+ }
57
+ }
58
+ end
59
+ end
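The per-element `:attributes` lists in a config such as RELAXED are combined at cleaning time with any `:all` entry and with attribute names supplied by transformers (see `clean_element!` in lib/sanitize.rb). A minimal plain-Ruby sketch of that union follows. `SAMPLE_ATTRIBUTES` is a small excerpt with an added `:all` entry for illustration (RELAXED itself defines none), and `attr_whitelist_for` is a hypothetical helper name.

```ruby
# Excerpt of an :attributes Hash. The :all entry is an assumption added for
# illustration; RELAXED as shipped does not define one.
SAMPLE_ATTRIBUTES = {
  :all  => ['class'],
  'a'   => ['href', 'title'],
  'img' => ['align', 'alt', 'height', 'src', 'title', 'width']
}

# Hypothetical helper: the attribute names allowed on element +name+.
# +extra+ stands for names contributed by transformers via :attr_whitelist.
def attr_whitelist_for(name, attributes, extra = [])
  (extra + (attributes[name] || []) + (attributes[:all] || [])).uniq
end

attr_whitelist_for('a', SAMPLE_ATTRIBUTES)           # => ['href', 'title', 'class']
attr_whitelist_for('p', SAMPLE_ATTRIBUTES)           # => ['class']
attr_whitelist_for('p', {'a' => ['href']})           # => []
```

When the resulting list is empty, the code in lib/sanitize.rb removes every attribute from the element.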
data/lib/sanitize/config/restricted.rb ADDED
@@ -0,0 +1,29 @@
1
+ #--
2
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+ class Sanitize
24
+ module Config
25
+ RESTRICTED = {
26
+ :elements => ['b', 'em', 'i', 'strong', 'u']
27
+ }
28
+ end
29
+ end
data/lib/sanitize/config.rb ADDED
@@ -0,0 +1,55 @@
1
+ #--
2
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+ class Sanitize
24
+ module Config
25
+ DEFAULT = {
26
+ # Whether or not to allow HTML comments. Allowing comments is strongly
27
+ # discouraged, since IE allows script execution within conditional
28
+ # comments.
29
+ :allow_comments => false,
30
+
31
+ # HTML attributes to add to specific elements. By default, no attributes
32
+ # are added.
33
+ :add_attributes => {},
34
+
35
+ # HTML attributes to allow in specific elements. By default, no attributes
36
+ # are allowed.
37
+ :attributes => {},
38
+
39
+ # HTML elements to allow. By default, no elements are allowed (which means
40
+ # that all HTML will be stripped).
41
+ :elements => [],
42
+
43
+ # Output format. Supported formats are :html and :xhtml (which is the
44
+ # default).
45
+ :output => :xhtml,
46
+
47
+ # URL handling protocols to allow in specific attributes. By default, no
48
+ # protocols are allowed. Use :relative in place of a protocol if you want
49
+ # to allow relative URLs sans protocol.
50
+ :protocols => {},
51
+
52
+ :transformers => []
53
+ }
54
+ end
55
+ end
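A custom configuration is combined with these defaults by a shallow `Hash#merge` in `Sanitize#initialize` (lib/sanitize.rb), after which `:transformers` is normalized with `Array()`. The following plain-Ruby sketch shows that behavior under stated assumptions: `DEFAULT_CONFIG` is a local copy of the Hash above, and `build_config` and `NOOP_TRANSFORMER` are hypothetical names.

```ruby
# Local copy of Sanitize::Config::DEFAULT, so the sketch runs without the gem.
DEFAULT_CONFIG = {
  :allow_comments => false,
  :add_attributes => {},
  :attributes     => {},
  :elements       => [],
  :output         => :xhtml,
  :protocols      => {},
  :transformers   => []
}

# Hypothetical helper mirroring the two steps in Sanitize#initialize.
def build_config(custom = {})
  config = DEFAULT_CONFIG.merge(custom)                  # shallow merge
  config[:transformers] = Array(config[:transformers])   # single object or Array
  config
end

NOOP_TRANSFORMER = lambda { |env| nil }

build_config({:elements => ['a']})[:output]                      # => :xhtml
build_config({:transformers => NOOP_TRANSFORMER})[:transformers] # one-element Array
```

Because the merge is shallow, a key supplied in the custom Hash replaces the default value wholesale rather than being combined with it. The `Array()` step is what lets a single lambda be passed as `:transformers` instead of an Array.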
data/lib/sanitize/version.rb ADDED
@@ -0,0 +1,3 @@
1
+ class Sanitize
2
+ VERSION = '1.2.0.dev.20091104'
3
+ end
data/lib/sanitize.rb ADDED
@@ -0,0 +1,228 @@
1
+ # encoding: utf-8
2
+ #--
3
+ # Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
4
+ #
5
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ # of this software and associated documentation files (the 'Software'), to deal
7
+ # in the Software without restriction, including without limitation the rights
8
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ # copies of the Software, and to permit persons to whom the Software is
10
+ # furnished to do so, subject to the following conditions:
11
+ #
12
+ # The above copyright notice and this permission notice shall be included in all
13
+ # copies or substantial portions of the Software.
14
+ #
15
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ # SOFTWARE.
22
+ #++
23
+
24
+ require 'nokogiri'
25
+ require 'sanitize/version'
26
+ require 'sanitize/config'
27
+ require 'sanitize/config/restricted'
28
+ require 'sanitize/config/basic'
29
+ require 'sanitize/config/relaxed'
30
+
31
+ class Sanitize
32
+ attr_reader :config
33
+
34
+ # Matches an attribute value that could be treated by a browser as a URL
35
+ # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
36
+ # or more characters followed by a colon is considered a match, even if the
37
+ # colon is encoded as an entity and even if it's an incomplete entity (which
38
+ # IE6 and Opera will still parse).
39
+ REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
40
+
41
+ #--
42
+ # Class Methods
43
+ #++
44
+
45
+ # Returns a sanitized copy of _html_, using the settings in _config_ if
46
+ # specified.
47
+ def self.clean(html, config = {})
48
+ sanitize = Sanitize.new(config)
49
+ sanitize.clean(html)
50
+ end
51
+
52
+ # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
53
+ # were made.
54
+ def self.clean!(html, config = {})
55
+ sanitize = Sanitize.new(config)
56
+ sanitize.clean!(html)
57
+ end
58
+
59
+ # Sanitizes the specified Nokogiri::XML::Node and all its children.
60
+ def self.clean_node!(node, config = {})
61
+ sanitize = Sanitize.new(config)
62
+ sanitize.clean_node!(node)
63
+ end
64
+
65
+ #--
66
+ # Instance Methods
67
+ #++
68
+
69
+ # Returns a new Sanitize object initialized with the settings in _config_.
70
+ def initialize(config = {})
71
+ # Sanitize configuration.
72
+ @config = Config::DEFAULT.merge(config)
73
+ @config[:transformers] = Array(@config[:transformers])
74
+
75
+ # Specific nodes to whitelist (along with all their attributes). This array
76
+ # is generated at runtime by transformers, and is cleared before and after
77
+ # a fragment is cleaned (so it applies only to a specific fragment).
78
+ @whitelist_nodes = []
79
+ end
80
+
81
+ # Returns a sanitized copy of _html_.
82
+ def clean(html)
83
+ dupe = html.dup
84
+ clean!(dupe) || dupe
85
+ end
86
+
87
+ # Performs clean in place, returning _html_, or +nil+ if no changes were
88
+ # made.
89
+ def clean!(html)
90
+ @whitelist_nodes = []
91
+ fragment = Nokogiri::HTML::DocumentFragment.parse(html)
92
+ clean_node!(fragment)
93
+ @whitelist_nodes = []
94
+
95
+ output_method_params = {:encoding => 'utf-8', :indent => 0}
96
+
97
+ if @config[:output] == :xhtml
98
+ output_method = fragment.method(:to_xhtml)
99
+ output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
100
+ elsif @config[:output] == :html
101
+ output_method = fragment.method(:to_html)
102
+ else
103
+ raise Error, "unsupported output format: #{@config[:output]}"
104
+ end
105
+
106
+ result = output_method.call(output_method_params)
107
+
108
+ # Nokogiri 1.3.3 (and possibly earlier versions) always returns a US-ASCII
109
+ # string no matter what we ask for. This will be fixed in 1.4.0, but for
110
+ # now we have to hack around it to prevent errors.
111
+ result.force_encoding('utf-8') if RUBY_VERSION >= '1.9'
112
+
113
+ return result == html ? nil : html[0, html.length] = result
114
+ end
115
+
116
+ # Sanitizes the specified Nokogiri::XML::Node and all its children.
117
+ def clean_node!(node)
118
+ raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
119
+
120
+ node.traverse do |traversed_node|
121
+ if traversed_node.element?
122
+ clean_element!(traversed_node)
123
+ elsif traversed_node.comment?
124
+ traversed_node.unlink unless @config[:allow_comments]
125
+ elsif traversed_node.cdata?
126
+ traversed_node.replace(Nokogiri::XML::Text.new(traversed_node.text,
127
+ traversed_node.document))
128
+ end
129
+ end
130
+
131
+ node
132
+ end
133
+
134
+ private
135
+
136
+ def clean_element!(node)
137
+ # Run this node through all configured transformers.
138
+ transform = transform_element!(node)
139
+
140
+ # If this node is in the dynamic whitelist array (built at runtime by
141
+ # transformers), let it live with all of its attributes intact.
142
+ return if @whitelist_nodes.include?(node)
143
+
144
+ name = node.name.to_s.downcase
145
+
146
+ # Delete any element that isn't in the whitelist.
147
+ unless transform[:whitelist] || @config[:elements].include?(name)
148
+ node.children.each { |n| node.add_previous_sibling(n) }
149
+ node.unlink
150
+ return
151
+ end
152
+
153
+ attr_whitelist = (transform[:attr_whitelist] +
154
+ (@config[:attributes][name] || []) +
155
+ (@config[:attributes][:all] || [])).uniq
156
+
157
+ if attr_whitelist.empty?
158
+ # Delete all attributes from elements with no whitelisted attributes.
159
+ node.attribute_nodes.each {|attr| attr.remove }
160
+ else
161
+ # Delete any attribute that isn't in the whitelist for this element.
162
+ node.attribute_nodes.each do |attr|
163
+ attr.unlink unless attr_whitelist.include?(attr.name.downcase)
164
+ end
165
+
166
+ # Delete remaining attributes that use unacceptable protocols.
167
+ if @config[:protocols].has_key?(name)
168
+ protocol = @config[:protocols][name]
169
+
170
+ node.attribute_nodes.each do |attr|
171
+ attr_name = attr.name.downcase
172
+ next false unless protocol.has_key?(attr_name)
173
+
174
+ del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
175
+ !protocol[attr_name].include?($1.downcase)
176
+ else
177
+ !protocol[attr_name].include?(:relative)
178
+ end
179
+
180
+ attr.unlink if del
181
+ end
182
+ end
183
+ end
184
+
185
+ # Add required attributes.
186
+ if @config[:add_attributes].has_key?(name)
187
+ @config[:add_attributes][name].each do |key, val|
188
+ node[key] = val
189
+ end
190
+ end
191
+
192
+ transform
193
+ end
194
+
195
+ def transform_element!(node)
196
+ output = {
197
+ :attr_whitelist => [],
198
+ :node => node,
199
+ :whitelist => false
200
+ }
201
+
202
+ @config[:transformers].inject(node) do |transformer_node, transformer|
203
+ transform = transformer.call({
204
+ :config => @config,
205
+ :node => transformer_node
206
+ })
207
+
208
+ if transform.nil?
209
+ transformer_node
210
+ elsif transform.is_a?(Hash)
211
+ if transform[:whitelist_nodes].is_a?(Array)
212
+ @whitelist_nodes += transform[:whitelist_nodes]
213
+ @whitelist_nodes.uniq!
214
+ end
215
+
216
+ output[:attr_whitelist] += transform[:attr_whitelist] if transform[:attr_whitelist].is_a?(Array)
217
+ output[:whitelist] ||= true if transform[:whitelist]
218
+ output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
219
+ else
220
+ raise Error, "transformer output must be a Hash or nil"
221
+ end
222
+ end
223
+
224
+ node.replace(output[:node]) if node != output[:node]
225
+
226
+ return output
227
+ end
228
+ end
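`clean_node!` relies on Nokogiri's `traverse`, which yields children before their parent, and `clean_element!` removes a non-whitelisted element by moving its children in front of it before unlinking it. The plain-Ruby sketch below illustrates that idea on a toy tree. It is not the library's code: `ToyNode`, `strip_elements`, `render`, and the sample tree are hypothetical, and attributes, comments, and CDATA are ignored.

```ruby
# Hypothetical stand-in for an element: a name plus child nodes.
# Text nodes are represented by plain Strings.
ToyNode = Struct.new(:name, :children)

# Returns the list of nodes that should replace +node+. Children are handled
# first (deepest node upward). A disallowed element is replaced by its
# already-cleaned children; an allowed element is kept around them.
# +visited+ records element names in the order they are processed.
def strip_elements(node, allowed, visited = [])
  return [node] if node.is_a?(String)

  cleaned = node.children.flat_map do |child|
    strip_elements(child, allowed, visited)
  end
  visited << node.name

  allowed.include?(node.name) ? [ToyNode.new(node.name, cleaned)] : cleaned
end

# Serializes a list of toy nodes back to markup.
def render(nodes)
  nodes.map { |n|
    n.is_a?(String) ? n : "<#{n.name}>#{render(n.children)}</#{n.name}>"
  }.join
end

# Represents <div><span>foo</span><b>bar</b></div>
TREE = ToyNode.new('div', [
  ToyNode.new('span', ['foo']),
  ToyNode.new('b', ['bar'])
])

visited = []
render(strip_elements(TREE, ['b'], visited))  # => 'foo<b>bar</b>'
visited                                       # => ['span', 'b', 'div']
```

The visiting order matches the README's note that nodes reach transformers from the deepest node upward, so a transformer sees a `span` before the `div` that contains it.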
metadata ADDED
@@ -0,0 +1,92 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: darkhelmet-sanitize
3
+ version: !ruby/object:Gem::Version
4
+ version: 1.2.0.dev.20091104
5
+ platform: ruby
6
+ authors:
7
+ - Ryan Grove
8
+ autorequire:
9
+ bindir: bin
10
+ cert_chain: []
11
+
12
+ date: 2009-11-04 00:00:00 -07:00
13
+ default_executable:
14
+ dependencies:
15
+ - !ruby/object:Gem::Dependency
16
+ name: nokogiri
17
+ type: :runtime
18
+ version_requirement:
19
+ version_requirements: !ruby/object:Gem::Requirement
20
+ requirements:
21
+ - - ~>
22
+ - !ruby/object:Gem::Version
23
+ version: 1.4.0
24
+ version:
25
+ - !ruby/object:Gem::Dependency
26
+ name: bacon
27
+ type: :development
28
+ version_requirement:
29
+ version_requirements: !ruby/object:Gem::Requirement
30
+ requirements:
31
+ - - ~>
32
+ - !ruby/object:Gem::Version
33
+ version: 1.1.0
34
+ version:
35
+ - !ruby/object:Gem::Dependency
36
+ name: rake
37
+ type: :development
38
+ version_requirement:
39
+ version_requirements: !ruby/object:Gem::Requirement
40
+ requirements:
41
+ - - ~>
42
+ - !ruby/object:Gem::Version
43
+ version: 0.8.0
44
+ version:
45
+ description:
46
+ email: ryan@wonko.com
47
+ executables: []
48
+
49
+ extensions: []
50
+
51
+ extra_rdoc_files: []
52
+
53
+ files:
54
+ - HISTORY
55
+ - LICENSE
56
+ - README.rdoc
57
+ - lib/sanitize/config/basic.rb
58
+ - lib/sanitize/config/relaxed.rb
59
+ - lib/sanitize/config/restricted.rb
60
+ - lib/sanitize/config.rb
61
+ - lib/sanitize/version.rb
62
+ - lib/sanitize.rb
63
+ has_rdoc: true
64
+ homepage: http://github.com/rgrove/sanitize/
65
+ licenses: []
66
+
67
+ post_install_message:
68
+ rdoc_options: []
69
+
70
+ require_paths:
71
+ - lib
72
+ required_ruby_version: !ruby/object:Gem::Requirement
73
+ requirements:
74
+ - - ">="
75
+ - !ruby/object:Gem::Version
76
+ version: 1.8.6
77
+ version:
78
+ required_rubygems_version: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - ">"
81
+ - !ruby/object:Gem::Version
82
+ version: 1.3.1
83
+ version:
84
+ requirements: []
85
+
86
+ rubyforge_project: riposte
87
+ rubygems_version: 1.3.5
88
+ signing_key:
89
+ specification_version: 3
90
+ summary: Whitelist-based HTML sanitizer.
91
+ test_files: []
92
+