glebm-sanitize 1.2.1.1

data/HISTORY ADDED
@@ -0,0 +1,90 @@
1
+ Sanitize History
2
+ ================================================================================
3
+
4
+ Version 1.2.1 (2010-04-20)
5
+ * Added a :remove_contents config setting. If set to true, Sanitize will
6
+ remove the contents of all non-whitelisted elements in addition to the
7
+ elements themselves. If set to an Array of element names, Sanitize will
8
+ remove the contents of only those elements (when filtered), and leave the
9
+ contents of other filtered elements. [Thanks to Rafael Souza for the Array
10
+ option]
11
+ * Added an :output_encoding config setting to allow the character encoding for
12
+ HTML output to be specified. The default is 'utf-8'.
13
+ * The environment hash passed into transformers now includes a :node_name item
14
+ containing the lowercase name of the current HTML node (e.g. "div").
15
+ * Returning anything other than a Hash or nil from a transformer will now
16
+ raise a meaningful Sanitize::Error exception rather than an unintended
17
+ NameError.
18
+
19
+ Version 1.2.0 (2010-01-17)
20
+ * Requires Nokogiri ~> 1.4.1.
21
+ * Added support for transformers, which allow you to filter and alter nodes
22
+ using your own custom logic, on top of (or instead of) Sanitize's core
23
+ filter. See the README for details and examples.
24
+ * Added Sanitize.clean_node!, which sanitizes a Nokogiri::XML::Node and all
25
+ its children.
26
+ * Added elements <h1> through <h6> to the Relaxed whitelist. [Suggested by
27
+ David Reese]
28
+
29
+ Version 1.1.0 (2009-10-11)
30
+ * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
31
+ * Added an :output config setting to allow the output format to be specified.
32
+ Supported formats are :xhtml (the default) and :html (which outputs HTML4).
33
+ * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
34
+ path segments. [Peter Cooper]
35
+
36
+ Version 1.0.8 (2009-04-23)
37
+ * Added a workaround for an Hpricot bug that prevents attribute names from
38
+ being downcased in recent versions of Hpricot. This was exploitable to
39
+ prevent non-whitelisted protocols from being cleaned. [Reported by Ben
40
+ Wanicur]
41
+
42
+ Version 1.0.7 (2009-04-11)
43
+ * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
44
+ * Fixed a bug that caused named character entities containing digits (like
45
+ &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
46
+ Steinmetz]
47
+
48
+ Version 1.0.6 (2009-02-23)
49
+ * Removed htmlentities gem dependency.
50
+ * Existing well-formed character entity references in the input string are now
51
+ preserved rather than being decoded and re-encoded.
52
+ * The ' character is now encoded as &#39; instead of &apos; to prevent
53
+ problems in IE6.
54
+ * You can now specify the symbol :all in place of an element name in the
55
+ attributes config hash to allow certain attributes on all elements. [Thanks
56
+ to Mutwin Kraus]
57
+
58
+ Version 1.0.5 (2009-02-05)
59
+ * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
60
+ protocols from being cleaned when relative URLs were allowed. [Reported by
61
+ Dev Purkayastha]
62
+ * Fixed "undefined method `parent='" exceptions caused by parser changes in
63
+ edge Hpricot.
64
+
65
+ Version 1.0.4 (2009-01-16)
66
+ * Fixed a bug that made it possible to sneak a non-whitelisted element through
67
+ by repeating it several times in a row. All versions of Sanitize prior to
68
+ 1.0.4 are vulnerable. [Reported by Cristobal]
69
+
70
+ Version 1.0.3 (2009-01-15)
71
+ * Fixed a bug whereby incomplete Unicode or hex entities could be used to
72
+ prevent non-whitelisted protocols from being cleaned. Since IE6 and Opera
73
+ still decode the incomplete entities, users of those browsers may be
74
+ vulnerable to malicious script injection on websites using versions of
75
+ Sanitize prior to 1.0.3.
76
+
77
+ Version 1.0.2 (2009-01-04)
78
+ * Fixed a bug that caused an exception to be thrown when parsing a valueless
79
+ attribute that's expected to contain a URL.
80
+
81
+ Version 1.0.1 (2009-01-01)
82
+ * You can now specify :relative in a protocol config array to allow attributes
83
+ containing relative URLs with no protocol. The Basic and Relaxed configs
84
+ have been updated to allow relative URLs.
85
+ * Added a workaround for an Hpricot bug that causes HTML entities for
86
+ non-ASCII characters to be replaced by question marks, and all other
87
+ entities to be destructively decoded.
88
+
89
+ Version 1.0.0 (2008-12-25)
90
+ * First release.
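
Reviewer note on the 1.2.1 entry above: `:remove_contents` accepts either a boolean or an Array of element names. The sketch below restates that decision in plain Ruby so it can be run without the gem. The helper name `remove_contents?` is hypothetical and is not part of Sanitize's API; it only mirrors the behaviour the changelog describes.

```ruby
# Hypothetical helper, not part of Sanitize: decides whether the contents of a
# filtered (non-whitelisted) element should be dropped along with the element.
def remove_contents?(setting, element_name)
  if setting.is_a?(Array)
    # Array form: only the listed elements lose their contents.
    setting.include?(element_name)
  else
    # Boolean form: true drops the contents of every filtered element.
    !!setting
  end
end

puts remove_contents?(false, 'script')               # default setting
puts remove_contents?(true, 'script')                # drop contents everywhere
puts remove_contents?(['script', 'style'], 'script') # listed element
puts remove_contents?(['script', 'style'], 'div')    # unlisted element
```

With the Array form, text inside an unlisted filtered element such as `div` is kept, while text inside `script` or `style` is discarded.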
data/LICENSE ADDED
@@ -0,0 +1,18 @@
1
+ Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
4
+ this software and associated documentation files (the 'Software'), to deal in
5
+ the Software without restriction, including without limitation the rights to
6
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7
+ the Software, and to permit persons to whom the Software is furnished to do so,
8
+ subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
15
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
16
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
17
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
18
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,334 @@
1
+ = Sanitize
2
+
3
+ Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
4
+ elements and attributes, Sanitize will remove all unacceptable HTML from a
5
+ string.
6
+
7
+ Using a simple configuration syntax, you can tell Sanitize to allow certain
8
+ elements, certain attributes within those elements, and even certain URL
9
+ protocols within attributes that contain URLs. Any HTML elements or attributes
10
+ that you don't explicitly allow will be removed.
11
+
12
+ Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
13
+ of fragile regular expressions, Sanitize has no trouble dealing with malformed
14
+ or maliciously-formed HTML, and will always output valid HTML or XHTML.
15
+
16
+ *Author*:: Ryan Grove (mailto:ryan@wonko.com)
17
+ *Version*:: 1.2.1 (2010-04-20)
18
+ *Copyright*:: Copyright (c) 2010 Ryan Grove. All rights reserved.
19
+ *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
20
+ *Website*:: http://github.com/rgrove/sanitize
21
+
22
+ == Requires
23
+
24
+ * Nokogiri ~> 1.4.1
25
+ * libxml2 >= 2.7.2
26
+
27
+ == Installation
28
+
29
+ Latest stable release:
30
+
31
+ gem install sanitize
32
+
33
+ Latest development version:
34
+
35
+ gem install sanitize --pre
36
+
37
+ == Usage
38
+
39
+ If you don't specify any configuration options, Sanitize will use its strictest
40
+ settings by default, which means it will strip all HTML and leave only text
41
+ behind.
42
+
43
+ require 'rubygems'
44
+ require 'sanitize'
45
+
46
+ html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
47
+
48
+ Sanitize.clean(html) # => 'foo'
49
+
50
+ == Configuration
51
+
52
+ In addition to the ultra-safe default settings, Sanitize comes with three other
53
+ built-in modes.
54
+
55
+ === Sanitize::Config::RESTRICTED
56
+
57
+ Allows only very simple inline formatting markup. No links, images, or block
58
+ elements.
59
+
60
+ Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'
61
+
62
+ === Sanitize::Config::BASIC
63
+
64
+ Allows a variety of markup including formatting tags, links, and lists. Images
65
+ and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
66
+ protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
67
+ mitigate SEO spam.
68
+
69
+ Sanitize.clean(html, Sanitize::Config::BASIC)
70
+ # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
71
+
72
+ === Sanitize::Config::RELAXED
73
+
74
+ Allows an even wider variety of markup than BASIC, including images and tables.
75
+ Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
76
+ are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
77
+ added to links.
78
+
79
+ Sanitize.clean(html, Sanitize::Config::RELAXED)
80
+ # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
81
+
82
+ === Custom Configuration
83
+
84
+ If the built-in modes don't meet your needs, you can easily specify a custom
85
+ configuration:
86
+
87
+ Sanitize.clean(html, :elements => ['a', 'span'],
88
+ :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
89
+ :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
90
+
91
+ ==== :add_attributes (Hash)
92
+
93
+ Attributes to add to specific elements. If the attribute already exists, it will
94
+ be replaced with the value specified here. Specify all element names and
95
+ attributes in lowercase.
96
+
97
+ :add_attributes => {
98
+ 'a' => {'rel' => 'nofollow'}
99
+ }
100
+
101
+ ==== :attributes (Hash)
102
+
103
+ Attributes to allow for specific elements. Specify all element names and
104
+ attributes in lowercase.
105
+
106
+ :attributes => {
107
+ 'a' => ['href', 'title'],
108
+ 'blockquote' => ['cite'],
109
+ 'img' => ['alt', 'src', 'title']
110
+ }
111
+
112
+ If you'd like to allow certain attributes on all elements, use the symbol
113
+ <code>:all</code> instead of an element name.
114
+
115
+ :attributes => {
116
+ :all => ['class'],
117
+ 'a' => ['href', 'title']
118
+ }
119
+
120
+ ==== :allow_comments (boolean)
121
+
122
+ Whether or not to allow HTML comments. Allowing comments is strongly
123
+ discouraged, since IE allows script execution within conditional comments. The
124
+ default value is <code>false</code>.
125
+
126
+ ==== :elements (Array)
127
+
128
+ Array of element names to allow. Specify all names in lowercase.
129
+
130
+ :elements => [
131
+ 'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
132
+ 'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
133
+ 'sup', 'u', 'ul'
134
+ ]
135
+
136
+ ==== :output (Symbol)
137
+
138
+ Output format. Supported formats are <code>:html</code> and <code>:xhtml</code>,
139
+ defaulting to <code>:xhtml</code>.
140
+
141
+ ==== :output_encoding (String)
142
+
143
+ Character encoding to use for HTML output. Default is <code>'utf-8'</code>.
144
+
145
+ ==== :protocols (Hash)
146
+
147
+ URL protocols to allow in specific attributes. If an attribute is listed here
148
+ and contains a protocol other than those specified (or if it contains no
149
+ protocol at all), it will be removed.
150
+
151
+ :protocols => {
152
+ 'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
153
+ 'img' => {'src' => ['http', 'https']}
154
+ }
155
+
156
+ If you'd like to allow the use of relative URLs which don't have a protocol,
157
+ include the symbol <code>:relative</code> in the protocol array:
158
+
159
+ :protocols => {
160
+ 'a' => {'href' => ['http', 'https', :relative]}
161
+ }
162
+
163
+ ==== :remove_contents (boolean or Array)
164
+
165
+ If set to +true+, Sanitize will remove the contents of any non-whitelisted
166
+ elements in addition to the elements themselves. By default, Sanitize leaves the
167
+ safe parts of an element's contents behind when the element is removed.
168
+
169
+ If set to an Array of element names, then only the contents of the specified
170
+ elements (when filtered) will be removed, and the contents of all other filtered
171
+ elements will be left behind.
172
+
173
+ The default value is <code>false</code>.
174
+
175
+ ==== :transformers
176
+
177
+ See below.
178
+
179
+ === Transformers
180
+
181
+ Transformers allow you to filter and alter nodes using your own custom logic, on
182
+ top of (or instead of) Sanitize's core filter. A transformer is any object that
183
+ responds to <code>call()</code> (such as a lambda or proc) and returns either
184
+ <code>nil</code> or a Hash containing certain optional response values.
185
+
186
+ To use one or more transformers, pass them to the <code>:transformers</code>
187
+ config setting:
188
+
189
+ Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
190
+
191
+ ==== Input
192
+
193
+ Each registered transformer's <code>call()</code> method will be called once for
194
+ each element node in the HTML, and will receive as an argument an environment
195
+ Hash that contains the following items:
196
+
197
+ [<code>:config</code>]
198
+ The current Sanitize configuration Hash.
199
+
200
+ [<code>:node</code>]
201
+ A Nokogiri::XML::Node object representing an HTML element.
202
+
203
+ [<code>:node_name</code>]
204
+ The name of the current HTML node, always lowercase (e.g. "div" or "span").
205
+
206
+ ==== Processing
207
+
208
+ Each transformer has full access to the Nokogiri::XML::Node that's passed into
209
+ it and to the rest of the document via the node's <code>document()</code>
210
+ method. Any changes will be reflected instantly in the document and passed on to
211
+ subsequently-called transformers and to Sanitize itself. A transformer may even
212
+ call Sanitize internally to perform custom sanitization if needed.
213
+
214
+ Nodes are passed into transformers in the order in which they're traversed. It's
215
+ important to note that Nokogiri traverses markup from the deepest node upward,
216
+ not from the first node to the last node:
217
+
218
+ html = '<div><span>foo</span></div>'
219
+ transformer = lambda{|env| puts env[:node].name }
220
+
221
+ # Prints "span", then "div".
222
+ Sanitize.clean(html, :transformers => transformer)
223
+
224
+ Transformers have a tremendous amount of power, including the power to
225
+ completely bypass Sanitize's built-in filtering. Be careful!
226
+
227
+ ==== Output
228
+
229
+ A transformer may return either +nil+ or a Hash. A return value of +nil+
230
+ indicates that the transformer does not wish to act on the current node in any
231
+ way. A returned Hash may contain the following items, all of which are optional:
232
+
233
+ [<code>:attr_whitelist</code>]
234
+ Array of attribute names to add to the whitelist for the current node, in
235
+ addition to any whitelisted attributes already defined in the current config.
236
+
237
+ [<code>:node</code>]
238
+ A Nokogiri::XML::Node object that should replace the current node. All
239
+ subsequent transformers and Sanitize itself will receive this new node.
240
+
241
+ [<code>:whitelist</code>]
242
+ If _true_, the current node (and only the current node) will be whitelisted,
243
+ regardless of the current Sanitize config.
244
+
245
+ [<code>:whitelist_nodes</code>]
246
+ Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
247
+ document, regardless of the current Sanitize config.
248
+
249
+ ==== Example: Transformer to whitelist YouTube video embeds
250
+
251
+ The following example demonstrates how to create a Sanitize transformer that
252
+ will safely whitelist valid YouTube video embeds without having to blindly allow
253
+ other kinds of embedded content, which would be the case if you tried to do this
254
+ by just whitelisting all <code><object></code>, <code><embed></code>, and
255
+ <code><param></code> elements:
256
+
257
+ lambda do |env|
258
+ node = env[:node]
259
+ node_name = env[:node_name]
260
+ parent = node.parent
261
+
262
+ # Since the transformer receives the deepest nodes first, we look for a
263
+ # <param> element or an <embed> element whose parent is an <object>.
264
+ return nil unless (node_name == 'param' || node_name == 'embed') &&
265
+ parent.name.to_s.downcase == 'object'
266
+
267
+ if node_name == 'param'
268
+ # Quick XPath search to find the <param> node that contains the video URL.
269
+ return nil unless movie_node = parent.search('param[@name="movie"]')[0]
270
+ url = movie_node['value']
271
+ else
272
+ # Since this is an <embed>, the video URL is in the "src" attribute. No
273
+ # extra work needed.
274
+ url = node['src']
275
+ end
276
+
277
+ # Verify that the video URL is actually a valid YouTube video URL.
278
+ return nil unless url =~ /\Ahttp:\/\/(?:www\.)?youtube\.com\/v\//
279
+
280
+ # We're now certain that this is a YouTube embed, but we still need to run
281
+ # it through a special Sanitize step to ensure that no unwanted elements or
282
+ # attributes that don't belong in a YouTube embed can sneak in.
283
+ Sanitize.clean_node!(parent, {
284
+ :elements => ['embed', 'object', 'param'],
285
+ :attributes => {
286
+ 'embed' => ['allowfullscreen', 'allowscriptaccess', 'height', 'src', 'type', 'width'],
287
+ 'object' => ['height', 'width'],
288
+ 'param' => ['name', 'value']
289
+ }
290
+ })
291
+
292
+ # Now that we're sure that this is a valid YouTube embed and that there are
293
+ # no unwanted elements or attributes hidden inside it, we can tell Sanitize
294
+ # to whitelist the current node (<param> or <embed>) and its parent
295
+ # (<object>).
296
+ {:whitelist_nodes => [node, parent]}
297
+ end
298
+
299
+ == Contributors
300
+
301
+ The following lovely people have contributed to Sanitize in the form of patches
302
+ or ideas that later became code:
303
+
304
+ * Wilson Bilkovich <wilson@supremetyrant.com>
305
+ * Peter Cooper <git@peterc.org>
306
+ * Gabe da Silveira <gabe@websaviour.com>
307
+ * Ryan Grove <ryan@wonko.com>
308
+ * Adam Hooper <adam@adamhooper.com>
309
+ * Mutwin Kraus <mutle@blogage.de>
310
+ * Dev Purkayastha <dev.purkayastha@gmail.com>
311
+ * David Reese <work@whatcould.com>
312
+ * Rafael Souza <me@rafaelss.com>
313
+ * Ben Wanicur <bwanicur@verticalresponse.com>
314
+
315
+ == License
316
+
317
+ Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
318
+
319
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
320
+ this software and associated documentation files (the 'Software'), to deal in
321
+ the Software without restriction, including without limitation the rights to
322
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
323
+ the Software, and to permit persons to whom the Software is furnished to do so,
324
+ subject to the following conditions:
325
+
326
+ The above copyright notice and this permission notice shall be included in all
327
+ copies or substantial portions of the Software.
328
+
329
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
330
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
331
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
332
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
333
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
334
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
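
Reviewer note on the README's "Processing" section: it states that nodes reach transformers from the deepest node upward. The sketch below illustrates that post-order walk with plain Hashes standing in for parsed elements. The `traverse` helper and the Hash layout are assumptions made for illustration; this is not Nokogiri code and does not need the gem.

```ruby
# Hypothetical stand-in for a node walk: visit all descendants first,
# then the node itself (post-order).
def traverse(node, &block)
  node[:children].each { |child| traverse(child, &block) }
  block.call(node)
end

# Element-only equivalent of '<div><span>foo</span></div>'.
span = { :name => 'span', :children => [] }
div  = { :name => 'div',  :children => [span] }

visited = []
traverse(div) { |node| visited << node[:name] }

puts visited.inspect # => ["span", "div"]
```

A transformer that needs to act on a container therefore sees the container's children before the container itself, which is why the YouTube example inspects `node.parent` rather than waiting for the `<object>` element.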
@@ -0,0 +1,247 @@
1
+ # encoding: utf-8
2
+ #--
3
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
4
+ #
5
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ # of this software and associated documentation files (the 'Software'), to deal
7
+ # in the Software without restriction, including without limitation the rights
8
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ # copies of the Software, and to permit persons to whom the Software is
10
+ # furnished to do so, subject to the following conditions:
11
+ #
12
+ # The above copyright notice and this permission notice shall be included in all
13
+ # copies or substantial portions of the Software.
14
+ #
15
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ # SOFTWARE.
22
+ #++
23
+
24
+ require 'nokogiri'
25
+ require 'sanitize/version'
26
+ require 'sanitize/config'
27
+ require 'sanitize/config/restricted'
28
+ require 'sanitize/config/basic'
29
+ require 'sanitize/config/relaxed'
30
+
31
+ class Sanitize
32
+ attr_reader :config
33
+
34
+ # Matches an attribute value that could be treated by a browser as a URL
35
+ # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
36
+ # or more characters followed by a colon is considered a match, even if the
37
+ # colon is encoded as an entity and even if it's an incomplete entity (which
38
+ # IE6 and Opera will still parse).
39
+ REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
40
+
41
+ #--
42
+ # Class Methods
43
+ #++
44
+
45
+ # Returns a sanitized copy of _html_, using the settings in _config_ if
46
+ # specified.
47
+ def self.clean(html, config = {})
48
+ sanitize = Sanitize.new(config)
49
+ sanitize.clean(html)
50
+ end
51
+
52
+ # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
53
+ # were made.
54
+ def self.clean!(html, config = {})
55
+ sanitize = Sanitize.new(config)
56
+ sanitize.clean!(html)
57
+ end
58
+
59
+ # Sanitizes the specified Nokogiri::XML::Node and all its children.
60
+ def self.clean_node!(node, config = {})
61
+ sanitize = Sanitize.new(config)
62
+ sanitize.clean_node!(node)
63
+ end
64
+
65
+ #--
66
+ # Instance Methods
67
+ #++
68
+
69
+ # Returns a new Sanitize object initialized with the settings in _config_.
70
+ def initialize(config = {})
71
+ # Sanitize configuration.
72
+ @config = Config::DEFAULT.merge(config)
73
+ @config[:transformers] = Array(@config[:transformers].dup)
74
+
75
+ # Convert the list of allowed elements to a Hash for faster lookup.
76
+ @allowed_elements = {}
77
+ @config[:elements].each {|el| @allowed_elements[el] = true }
78
+
79
+ # Convert the list of :remove_contents elements to a Hash for faster lookup.
80
+ @remove_all_contents = false
81
+ @remove_element_contents = {}
82
+
83
+ if @config[:remove_contents].is_a?(Array)
84
+ @config[:remove_contents].each {|el| @remove_element_contents[el] = true }
85
+ else
86
+ @remove_all_contents = !!@config[:remove_contents]
87
+ end
88
+
89
+ # Specific nodes to whitelist (along with all their attributes). This array
90
+ # is generated at runtime by transformers, and is cleared before and after
91
+ # a fragment is cleaned (so it applies only to a specific fragment).
92
+ @whitelist_nodes = []
93
+ end
94
+
95
+ # Returns a sanitized copy of _html_.
96
+ def clean(html)
97
+ if html
98
+ dupe = html.dup
99
+ clean!(dupe) || dupe
100
+ end
101
+ end
102
+
103
+ # Performs clean in place, returning _html_, or +nil+ if no changes were
104
+ # made.
105
+ def clean!(html)
106
+ fragment = Nokogiri::HTML::DocumentFragment.parse(html)
107
+ clean_node!(fragment)
108
+
109
+ output_method_params = {:encoding => @config[:output_encoding], :indent => 0}
110
+
111
+ if @config[:output] == :xhtml
112
+ output_method = fragment.method(:to_xhtml)
113
+ output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
114
+ elsif @config[:output] == :html
115
+ output_method = fragment.method(:to_html)
116
+ else
117
+ raise Error, "unsupported output format: #{@config[:output]}"
118
+ end
119
+
120
+ result = output_method.call(output_method_params)
121
+
122
+ return result == html ? nil : html[0, html.length] = result
123
+ end
124
+
125
+ # Sanitizes the specified Nokogiri::XML::Node and all its children.
126
+ def clean_node!(node)
127
+ raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
128
+
129
+ @whitelist_nodes = []
130
+
131
+ node.traverse do |child|
132
+ if child.element?
133
+ clean_element!(child)
134
+ elsif child.comment?
135
+ child.unlink unless @config[:allow_comments]
136
+ elsif child.cdata?
137
+ child.replace(Nokogiri::XML::Text.new(child.text, child.document))
138
+ end
139
+ end
140
+
141
+ @whitelist_nodes = []
142
+
143
+ node
144
+ end
145
+
146
+ private
147
+
148
+ def clean_element!(node)
149
+ # Run this node through all configured transformers.
150
+ transform = transform_element!(node)
151
+
152
+ # If this node is in the dynamic whitelist array (built at runtime by
153
+ # transformers), let it live with all of its attributes intact.
154
+ return if @whitelist_nodes.include?(node)
155
+
156
+ name = node.name.to_s.downcase
157
+
158
+ # Delete any element that isn't in the whitelist.
159
+ unless transform[:whitelist] || @allowed_elements[name]
160
+ unless @remove_all_contents || @remove_element_contents[name]
161
+ node.children.each { |n| node.add_previous_sibling(n) }
162
+ end
163
+
164
+ node.unlink
165
+
166
+ return
167
+ end
168
+
169
+ attr_whitelist = (transform[:attr_whitelist] +
170
+ (@config[:attributes][name] || []) +
171
+ (@config[:attributes][:all] || [])).uniq
172
+
173
+ if attr_whitelist.empty?
174
+ # Delete all attributes from elements with no whitelisted attributes.
175
+ node.attribute_nodes.each {|attr| attr.remove }
176
+ else
177
+ # Delete any attribute that isn't in the whitelist for this element.
178
+ node.attribute_nodes.each do |attr|
179
+ attr.unlink unless attr_whitelist.include?(attr.name.downcase)
180
+ end
181
+
182
+ # Delete remaining attributes that use unacceptable protocols.
183
+ if @config[:protocols].has_key?(name)
184
+ protocol = @config[:protocols][name]
185
+
186
+ node.attribute_nodes.each do |attr|
187
+ attr_name = attr.name.downcase
188
+ next false unless protocol.has_key?(attr_name)
189
+
190
+ del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
191
+ !protocol[attr_name].include?($1.downcase)
192
+ else
193
+ !protocol[attr_name].include?(:relative)
194
+ end
195
+
196
+ attr.unlink if del
197
+ end
198
+ end
199
+ end
200
+
201
+ # Add required attributes.
202
+ if @config[:add_attributes].has_key?(name)
203
+ @config[:add_attributes][name].each do |key, val|
204
+ node[key] = val
205
+ end
206
+ end
207
+
208
+ transform
209
+ end
210
+
211
+ def transform_element!(node)
212
+ output = {
213
+ :attr_whitelist => [],
214
+ :node => node,
215
+ :whitelist => false
216
+ }
217
+
218
+ @config[:transformers].inject(node) do |transformer_node, transformer|
219
+ transform = transformer.call({
220
+ :config => @config,
221
+ :node => transformer_node,
222
+ :node_name => transformer_node.name.downcase
223
+ })
224
+
225
+ if transform.nil?
226
+ transformer_node
227
+ elsif transform.is_a?(Hash)
228
+ if transform[:whitelist_nodes].is_a?(Array)
229
+ @whitelist_nodes += transform[:whitelist_nodes]
230
+ @whitelist_nodes.uniq!
231
+ end
232
+
233
+ output[:attr_whitelist] += transform[:attr_whitelist] if transform[:attr_whitelist].is_a?(Array)
234
+ output[:whitelist] ||= true if transform[:whitelist]
235
+ output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
236
+ else
237
+ raise Error, "transformer output must be a Hash or nil"
238
+ end
239
+ end
240
+
241
+ node.replace(output[:node]) if node != output[:node]
242
+
243
+ return output
244
+ end
245
+
246
+ class Error < StandardError; end
247
+ end
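
Reviewer note on the protocol check in `clean_element!` above: an attribute is removed when its value matches `REGEX_PROTOCOL` with a protocol that is not listed, or when it has no protocol and `:relative` is not listed. The sketch below isolates that decision in plain Ruby. The regular expression is copied from the file above; the method name `protocol_allowed?` is hypothetical and does not exist in Sanitize.

```ruby
# Hypothetical helper mirroring the protocol decision; true means "keep".
def protocol_allowed?(value, allowed)
  # Same pattern as REGEX_PROTOCOL in the file above.
  regex = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

  if value.to_s.downcase =~ regex
    # A protocol prefix was found: it must be explicitly listed.
    allowed.include?($1.downcase)
  else
    # No protocol prefix: only acceptable when relative URLs are allowed.
    allowed.include?(:relative)
  end
end

puts protocol_allowed?('http://foo.com/', ['http', 'https'])        # listed protocol
puts protocol_allowed?('javascript:alert(1)', ['http', 'https'])    # unlisted protocol
puts protocol_allowed?('javascript&#58;alert(1)', ['http'])         # entity-encoded colon
puts protocol_allowed?('/wiki/Foo:Bar', ['http', :relative])        # colon in a path segment
puts protocol_allowed?('/wiki/Foo:Bar', ['http'])                   # relative not allowed
```

The character class in the first group excludes `/`, so a colon that appears after a path separator is not treated as a protocol delimiter. That matches the 1.1.0 changelog entry about URLs with colons in path segments.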
@@ -0,0 +1,70 @@
1
+ #--
2
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+ class Sanitize
24
+ module Config
25
+ DEFAULT = {
26
+ # Whether or not to allow HTML comments. Allowing comments is strongly
27
+ # discouraged, since IE allows script execution within conditional
28
+ # comments.
29
+ :allow_comments => false,
30
+
31
+ # HTML attributes to add to specific elements. By default, no attributes
32
+ # are added.
33
+ :add_attributes => {},
34
+
35
+ # HTML attributes to allow in specific elements. By default, no attributes
36
+ # are allowed.
37
+ :attributes => {},
38
+
39
+ # HTML elements to allow. By default, no elements are allowed (which means
40
+ # that all HTML will be stripped).
41
+ :elements => [],
42
+
43
+ # Output format. Supported formats are :html and :xhtml (which is the
44
+ # default).
45
+ :output => :xhtml,
46
+
47
+ # Character encoding to use for HTML output. Default is 'utf-8'.
48
+ :output_encoding => 'utf-8',
49
+
50
+ # URL handling protocols to allow in specific attributes. By default, no
51
+ # protocols are allowed. Use :relative in place of a protocol if you want
52
+ # to allow relative URLs sans protocol.
53
+ :protocols => {},
54
+
55
+ # If this is true, Sanitize will remove the contents of any filtered
56
+ # elements in addition to the elements themselves. By default, Sanitize
57
+ # leaves the safe parts of an element's contents behind when the element
58
+ # is removed.
59
+ #
60
+ # If this is an Array of element names, then only the contents of the
61
+ # specified elements (when filtered) will be removed, and the contents of
62
+ # all other filtered elements will be left behind.
63
+ :remove_contents => false,
64
+
65
+ # Transformers allow you to filter or alter nodes using custom logic. See
66
+ # README.rdoc for details and examples.
67
+ :transformers => []
68
+ }
69
+ end
70
+ end
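
Reviewer note on `Config::DEFAULT` above: the constructor in this release combines it with the caller's settings via `Config::DEFAULT.merge(config)`, which is Ruby's shallow `Hash#merge`. The sketch below shows what that implies, using a trimmed copy of the defaults. The local names `defaults`, `custom` and `merged` are illustrative only, and the snippet runs without the gem.

```ruby
# Trimmed copy of the defaults shown above, for illustration only.
defaults = {
  :allow_comments  => false,
  :attributes      => {},
  :elements        => [],
  :output          => :xhtml,
  :output_encoding => 'utf-8',
  :remove_contents => false
}

# Settings a caller might pass to Sanitize.clean.
custom = {
  :elements   => ['a'],
  :attributes => { 'a' => ['href'] }
}

merged = defaults.merge(custom)

puts merged[:elements].inspect   # taken from custom
puts merged[:output].inspect     # inherited from defaults
puts defaults[:elements].inspect # merge returns a new Hash; defaults are untouched
```

Because the merge is shallow, a key supplied by the caller replaces the default value for that key as a whole. Passing `:attributes` together with a built-in mode would therefore require merging the nested Hashes by hand before calling Sanitize.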
@@ -0,0 +1,49 @@
+ #--
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the 'Software'), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #++
+
+ class Sanitize
+   module Config
+     BASIC = {
+       :elements => [
+         'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
+         'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
+         'sup', 'u', 'ul'],
+
+       :attributes => {
+         'a' => ['href'],
+         'blockquote' => ['cite'],
+         'q' => ['cite']
+       },
+
+       :add_attributes => {
+         'a' => {'rel' => 'nofollow'}
+       },
+
+       :protocols => {
+         'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
+           :relative]},
+         'blockquote' => {'cite' => ['http', 'https', :relative]},
+         'q' => {'cite' => ['http', 'https', :relative]}
+       }
+     }
+   end
+ end
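The :protocols entries above whitelist URL schemes per element and attribute, with :relative standing in for URLs that carry no protocol. A rough Ruby sketch of how such a list could be checked follows. The helper `protocol_allowed?` and its regular expression are assumptions made for illustration; Sanitize's own matching logic may differ.

```ruby
# Protocol whitelist for <a href>, copied from the BASIC config above.
ALLOWED_HREF_PROTOCOLS = ['ftp', 'http', 'https', 'mailto', :relative]

# Hypothetical helper, NOT Sanitize's implementation: returns true when the
# URL's scheme is whitelisted, or when the URL has no scheme and :relative
# is present in the whitelist.
def protocol_allowed?(url, allowed)
  match = url.strip.match(/\A([a-z][a-z0-9+.\-]*):/i)
  if match
    allowed.include?(match[1].downcase) # explicit scheme: must be listed
  else
    allowed.include?(:relative)         # no scheme: treated as a relative URL
  end
end

protocol_allowed?('http://example.com/', ALLOWED_HREF_PROTOCOLS) # allowed scheme
protocol_allowed?('javascript:alert(1)', ALLOWED_HREF_PROTOCOLS) # scheme not listed
protocol_allowed?('/about', ALLOWED_HREF_PROTOCOLS)              # relative URL
protocol_allowed?('/about', ['http', 'https'])                   # relative URLs not allowed here
```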
@@ -0,0 +1,57 @@
+ #--
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the 'Software'), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #++
+
+ class Sanitize
+   module Config
+     RELAXED = {
+       :elements => [
+         'a', 'b', 'blockquote', 'br', 'caption', 'cite', 'code', 'col',
+         'colgroup', 'dd', 'dl', 'dt', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
+         'i', 'img', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong',
+         'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'u',
+         'ul'],
+
+       :attributes => {
+         'a' => ['href', 'title'],
+         'blockquote' => ['cite'],
+         'col' => ['span', 'width'],
+         'colgroup' => ['span', 'width'],
+         'img' => ['align', 'alt', 'height', 'src', 'title', 'width'],
+         'ol' => ['start', 'type'],
+         'q' => ['cite'],
+         'table' => ['summary', 'width'],
+         'td' => ['abbr', 'axis', 'colspan', 'rowspan', 'width'],
+         'th' => ['abbr', 'axis', 'colspan', 'rowspan', 'scope',
+           'width'],
+         'ul' => ['type']
+       },
+
+       :protocols => {
+         'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
+           :relative]},
+         'blockquote' => {'cite' => ['http', 'https', :relative]},
+         'img' => {'src' => ['http', 'https', :relative]},
+         'q' => {'cite' => ['http', 'https', :relative]}
+       }
+     }
+   end
+ end
@@ -0,0 +1,29 @@
+ #--
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
+ #
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
+ # of this software and associated documentation files (the 'Software'), to deal
+ # in the Software without restriction, including without limitation the rights
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ # copies of the Software, and to permit persons to whom the Software is
+ # furnished to do so, subject to the following conditions:
+ #
+ # The above copyright notice and this permission notice shall be included in all
+ # copies or substantial portions of the Software.
+ #
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ # SOFTWARE.
+ #++
+
+ class Sanitize
+   module Config
+     RESTRICTED = {
+       :elements => ['b', 'em', 'i', 'strong', 'u']
+     }
+   end
+ end
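RESTRICTED sets only :elements, so every other setting has to come from somewhere else. The Ruby sketch below shows one way a sparse config can be layered over defaults with Hash#merge. Whether Sanitize combines configs exactly this way is an assumption here; the DEFAULTS hash is abbreviated to the three settings visible in this diff, and the constants are local stand-ins rather than the gem's own.

```ruby
# Abbreviated stand-in for the default config: only the settings visible in
# this diff are listed. The real default hash contains more keys.
DEFAULTS = {
  :protocols       => {},
  :remove_contents => false,
  :transformers    => []
}

# Local copy of the RESTRICTED config shown above.
RESTRICTED = {
  :elements => ['b', 'em', 'i', 'strong', 'u']
}

# Hash#merge returns a new Hash; keys from the argument win on conflict.
config = DEFAULTS.merge(RESTRICTED)

# A caller can layer its own overrides on top in the same way.
custom = config.merge(:remove_contents => ['script', 'style'])

puts config[:elements].inspect
puts custom[:remove_contents].inspect
```

Because Hash#merge does not mutate its receiver, `config` keeps `:remove_contents => false` after `custom` is built.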
@@ -0,0 +1,3 @@
+ class Sanitize
+   VERSION = '1.2.1.1'
+ end
metadata ADDED
@@ -0,0 +1,124 @@
+ --- !ruby/object:Gem::Specification
+ name: glebm-sanitize
+ version: !ruby/object:Gem::Version
+   hash: 73
+   prerelease: false
+   segments:
+   - 1
+   - 2
+   - 1
+   - 1
+   version: 1.2.1.1
+ platform: ruby
+ authors:
+ - Ryan Grove
+ autorequire:
+ bindir: bin
+ cert_chain: []
+
+ date: 2010-07-19 00:00:00 +02:00
+ default_executable:
+ dependencies:
+ - !ruby/object:Gem::Dependency
+   name: glebm-nokogiri
+   prerelease: false
+   requirement: &id001 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ">="
+       - !ruby/object:Gem::Version
+         hash: 7
+         segments:
+         - 1
+         - 4
+         version: "1.4"
+   type: :runtime
+   version_requirements: *id001
+ - !ruby/object:Gem::Dependency
+   name: bacon
+   prerelease: false
+   requirement: &id002 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         hash: 19
+         segments:
+         - 1
+         - 1
+         - 0
+         version: 1.1.0
+   type: :development
+   version_requirements: *id002
+ - !ruby/object:Gem::Dependency
+   name: rake
+   prerelease: false
+   requirement: &id003 !ruby/object:Gem::Requirement
+     none: false
+     requirements:
+     - - ~>
+       - !ruby/object:Gem::Version
+         hash: 63
+         segments:
+         - 0
+         - 8
+         - 0
+         version: 0.8.0
+   type: :development
+   version_requirements: *id003
+ description:
+ email: glex.spb@gmail.com
+ executables: []
+
+ extensions: []
+
+ extra_rdoc_files: []
+
+ files:
+ - HISTORY
+ - LICENSE
+ - README.rdoc
+ - lib/sanitize/config/restricted.rb
+ - lib/sanitize/config/basic.rb
+ - lib/sanitize/config/relaxed.rb
+ - lib/sanitize/config.rb
+ - lib/sanitize/version.rb
+ - lib/sanitize.rb
+ has_rdoc: true
+ homepage: http://github.com/rgrove/sanitize/
+ licenses: []
+
+ post_install_message:
+ rdoc_options: []
+
+ require_paths:
+ - lib
+ required_ruby_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       hash: 59
+       segments:
+       - 1
+       - 8
+       - 6
+       version: 1.8.6
+ required_rubygems_version: !ruby/object:Gem::Requirement
+   none: false
+   requirements:
+   - - ">="
+     - !ruby/object:Gem::Version
+       hash: 3
+       segments:
+       - 0
+       version: "0"
+ requirements: []
+
+ rubyforge_project: riposte
+ rubygems_version: 1.3.7
+ signing_key:
+ specification_version: 3
+ summary: Whitelist-based HTML sanitizer.
+ test_files: []
+
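The dependency entries in the gemspec above use two requirement operators, ">=" and "~>". As an illustration of what those constraints accept, the following Ruby sketch evaluates them with RubyGems' own Gem::Requirement and Gem::Version classes. The version numbers being tested are arbitrary examples, not versions named in this gem.

```ruby
require 'rubygems'

# Requirements copied from the gemspec's dependency entries.
nokogiri_req = Gem::Requirement.new('>= 1.4')   # glebm-nokogiri (runtime)
bacon_req    = Gem::Requirement.new('~> 1.1.0') # bacon (development)
rake_req     = Gem::Requirement.new('~> 0.8.0') # rake (development)

# ">= 1.4" accepts 1.4 and anything newer.
puts nokogiri_req.satisfied_by?(Gem::Version.new('1.4.1'))

# "~> 1.1.0" (the pessimistic operator) accepts 1.1.x but not 1.2.0.
puts bacon_req.satisfied_by?(Gem::Version.new('1.1.9'))
puts bacon_req.satisfied_by?(Gem::Version.new('1.2.0'))

# "~> 0.8.0" likewise stops short of 0.9.0.
puts rake_req.satisfied_by?(Gem::Version.new('0.8.7'))
puts rake_req.satisfied_by?(Gem::Version.new('0.9.0'))
```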