glebm-sanitize 1.2.1.1

data/HISTORY ADDED
@@ -0,0 +1,90 @@
1
+ Sanitize History
2
+ ================================================================================
3
+
4
+ Version 1.2.1 (2010-04-20)
5
+ * Added a :remove_contents config setting. If set to true, Sanitize will
6
+ remove the contents of all non-whitelisted elements in addition to the
7
+ elements themselves. If set to an Array of element names, Sanitize will
8
+ remove the contents of only those elements (when filtered), and leave the
9
+ contents of other filtered elements. [Thanks to Rafael Souza for the Array
10
+ option]
11
+ * Added an :output_encoding config setting to allow the character encoding for
12
+ HTML output to be specified. The default is 'utf-8'.
13
+ * The environment hash passed into transformers now includes a :node_name item
14
+ containing the lowercase name of the current HTML node (e.g. "div").
15
+ * Returning anything other than a Hash or nil from a transformer will now
16
+ raise a meaningful Sanitize::Error exception rather than an unintended
17
+ NameError.
18
+
19
+ Version 1.2.0 (2010-01-17)
20
+ * Requires Nokogiri ~> 1.4.1.
21
+ * Added support for transformers, which allow you to filter and alter nodes
22
+ using your own custom logic, on top of (or instead of) Sanitize's core
23
+ filter. See the README for details and examples.
24
+ * Added Sanitize.clean_node!, which sanitizes a Nokogiri::XML::Node and all
25
+ its children.
26
+ * Added elements <h1> through <h6> to the Relaxed whitelist. [Suggested by
27
+ David Reese]
28
+
29
+ Version 1.1.0 (2009-10-11)
30
+ * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
31
+ * Added an :output config setting to allow the output format to be specified.
32
+ Supported formats are :xhtml (the default) and :html (which outputs HTML4).
33
+ * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
34
+ path segments. [Peter Cooper]
35
+
36
+ Version 1.0.8 (2009-04-23)
37
+ * Added a workaround for an Hpricot bug that prevents attribute names from
38
+ being downcased in recent versions of Hpricot. This was exploitable to
39
+ prevent non-whitelisted protocols from being cleaned. [Reported by Ben
40
+ Wanicur]
41
+
42
+ Version 1.0.7 (2009-04-11)
43
+ * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
44
+ * Fixed a bug that caused named character entities containing digits (like
45
+ &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
46
+ Steinmetz]
47
+
48
+ Version 1.0.6 (2009-02-23)
49
+ * Removed htmlentities gem dependency.
50
+ * Existing well-formed character entity references in the input string are now
51
+ preserved rather than being decoded and re-encoded.
52
+ * The ' character is now encoded as &#39; instead of &apos; to prevent
53
+ problems in IE6.
54
+ * You can now specify the symbol :all in place of an element name in the
55
+ attributes config hash to allow certain attributes on all elements. [Thanks
56
+ to Mutwin Kraus]
57
+
58
+ Version 1.0.5 (2009-02-05)
59
+ * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
60
+ protocols from being cleaned when relative URLs were allowed. [Reported by
61
+ Dev Purkayastha]
62
+ * Fixed "undefined method `parent='" exceptions caused by parser changes in
63
+ edge Hpricot.
64
+
65
+ Version 1.0.4 (2009-01-16)
66
+ * Fixed a bug that made it possible to sneak a non-whitelisted element through
67
+ by repeating it several times in a row. All versions of Sanitize prior to
68
+ 1.0.4 are vulnerable. [Reported by Cristobal]
69
+
70
+ Version 1.0.3 (2009-01-15)
71
+ * Fixed a bug whereby incomplete Unicode or hex entities could be used to
72
+ prevent non-whitelisted protocols from being cleaned. Since IE6 and Opera
73
+ still decode the incomplete entities, users of those browsers may be
74
+ vulnerable to malicious script injection on websites using versions of
75
+ Sanitize prior to 1.0.3.
76
+
77
+ Version 1.0.2 (2009-01-04)
78
+ * Fixed a bug that caused an exception to be thrown when parsing a valueless
79
+ attribute that's expected to contain a URL.
80
+
81
+ Version 1.0.1 (2009-01-01)
82
+ * You can now specify :relative in a protocol config array to allow attributes
83
+ containing relative URLs with no protocol. The Basic and Relaxed configs
84
+ have been updated to allow relative URLs.
85
+ * Added a workaround for an Hpricot bug that causes HTML entities for
86
+ non-ASCII characters to be replaced by question marks, and all other
87
+ entities to be destructively decoded.
88
+
89
+ Version 1.0.0 (2008-12-25)
90
+ * First release.
data/LICENSE ADDED
@@ -0,0 +1,18 @@
1
+ Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
2
+
3
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
4
+ this software and associated documentation files (the 'Software'), to deal in
5
+ the Software without restriction, including without limitation the rights to
6
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
7
+ the Software, and to permit persons to whom the Software is furnished to do so,
8
+ subject to the following conditions:
9
+
10
+ The above copyright notice and this permission notice shall be included in all
11
+ copies or substantial portions of the Software.
12
+
13
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
14
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
15
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
16
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
17
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
18
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,334 @@
1
+ = Sanitize
2
+
3
+ Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
4
+ elements and attributes, Sanitize will remove all unacceptable HTML from a
5
+ string.
6
+
7
+ Using a simple configuration syntax, you can tell Sanitize to allow certain
8
+ elements, certain attributes within those elements, and even certain URL
9
+ protocols within attributes that contain URLs. Any HTML elements or attributes
10
+ that you don't explicitly allow will be removed.
11
+
12
+ Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
13
+ of fragile regular expressions, Sanitize has no trouble dealing with malformed
14
+ or maliciously-formed HTML, and will always output valid HTML or XHTML.
15
+
16
+ *Author*:: Ryan Grove (mailto:ryan@wonko.com)
17
+ *Version*:: 1.2.1 (2010-04-20)
18
+ *Copyright*:: Copyright (c) 2010 Ryan Grove. All rights reserved.
19
+ *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
20
+ *Website*:: http://github.com/rgrove/sanitize
21
+
22
+ == Requires
23
+
24
+ * Nokogiri ~> 1.4.1
25
+ * libxml2 >= 2.7.2
26
+
27
+ == Installation
28
+
29
+ Latest stable release:
30
+
31
+ gem install sanitize
32
+
33
+ Latest development version:
34
+
35
+ gem install sanitize --pre
36
+
37
+ == Usage
38
+
39
+ If you don't specify any configuration options, Sanitize will use its strictest
40
+ settings by default, which means it will strip all HTML and leave only text
41
+ behind.
42
+
43
+ require 'rubygems'
44
+ require 'sanitize'
45
+
46
+ html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
47
+
48
+ Sanitize.clean(html) # => 'foo'
49
+
50
+ == Configuration
51
+
52
+ In addition to the ultra-safe default settings, Sanitize comes with three other
53
+ built-in modes.
54
+
55
+ === Sanitize::Config::RESTRICTED
56
+
57
+ Allows only very simple inline formatting markup. No links, images, or block
58
+ elements.
59
+
60
+ Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'
61
+
62
+ === Sanitize::Config::BASIC
63
+
64
+ Allows a variety of markup including formatting tags, links, and lists. Images
65
+ and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
66
+ protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
67
+ mitigate SEO spam.
68
+
69
+ Sanitize.clean(html, Sanitize::Config::BASIC)
70
+ # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
71
+
72
+ === Sanitize::Config::RELAXED
73
+
74
+ Allows an even wider variety of markup than BASIC, including images and tables.
75
+ Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
76
+ are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
77
+ added to links.
78
+
79
+ Sanitize.clean(html, Sanitize::Config::RELAXED)
80
+ # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg" />'
81
+
82
+ === Custom Configuration
83
+
84
+ If the built-in modes don't meet your needs, you can easily specify a custom
85
+ configuration:
86
+
87
+   Sanitize.clean(html, :elements => ['a', 'span'],
88
+       :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
89
+       :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
90
+
91
+ ==== :add_attributes (Hash)
92
+
93
+ Attributes to add to specific elements. If the attribute already exists, it will
94
+ be replaced with the value specified here. Specify all element names and
95
+ attributes in lowercase.
96
+
97
+ :add_attributes => {
98
+ 'a' => {'rel' => 'nofollow'}
99
+ }
100
+
101
+ ==== :attributes (Hash)
102
+
103
+ Attributes to allow for specific elements. Specify all element names and
104
+ attributes in lowercase.
105
+
106
+ :attributes => {
107
+ 'a' => ['href', 'title'],
108
+ 'blockquote' => ['cite'],
109
+ 'img' => ['alt', 'src', 'title']
110
+ }
111
+
112
+ If you'd like to allow certain attributes on all elements, use the symbol
113
+ <code>:all</code> instead of an element name.
114
+
115
+ :attributes => {
116
+ :all => ['class'],
117
+ 'a' => ['href', 'title']
118
+ }
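How the <code>:all</code> list combines with a per-element list can be sketched in plain Ruby. The helper name `attribute_whitelist` is hypothetical; the expression mirrors the union that the Sanitize class later in this diff computes for each element, but this is a standalone illustration that does not load the gem.

```ruby
# Hypothetical helper (not part of Sanitize): builds the list of attributes
# allowed on one element from the :attributes config hash.
def attribute_whitelist(attributes, element_name, extra = [])
  # Per-element attributes plus the :all attributes, without duplicates.
  (extra + (attributes[element_name] || []) + (attributes[:all] || [])).uniq
end

attributes = {
  :all => ['class'],
  'a'  => ['href', 'title']
}

attribute_whitelist(attributes, 'a')    # => ["href", "title", "class"]
attribute_whitelist(attributes, 'span') # => ["class"]
```

An element with no entry of its own still picks up whatever is listed under <code>:all</code>.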
119
+
120
+ ==== :allow_comments (boolean)
121
+
122
+ Whether or not to allow HTML comments. Allowing comments is strongly
123
+ discouraged, since IE allows script execution within conditional comments. The
124
+ default value is <code>false</code>.
125
+
126
+ ==== :elements (Array)
127
+
128
+ Array of element names to allow. Specify all names in lowercase.
129
+
130
+ :elements => [
131
+ 'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
132
+ 'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
133
+ 'sup', 'u', 'ul'
134
+ ]
135
+
136
+ ==== :output (Symbol)
137
+
138
+ Output format. Supported formats are <code>:html</code> and <code>:xhtml</code>,
139
+ defaulting to <code>:xhtml</code>.
140
+
141
+ ==== :output_encoding (String)
142
+
143
+ Character encoding to use for HTML output. Default is <code>'utf-8'</code>.
144
+
145
+ ==== :protocols (Hash)
146
+
147
+ URL protocols to allow in specific attributes. If an attribute is listed here
148
+ and contains a protocol other than those specified (or if it contains no
149
+ protocol at all), it will be removed.
150
+
151
+ :protocols => {
152
+ 'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
153
+ 'img' => {'src' => ['http', 'https']}
154
+ }
155
+
156
+ If you'd like to allow the use of relative URLs which don't have a protocol,
157
+ include the symbol <code>:relative</code> in the protocol array:
158
+
159
+ :protocols => {
160
+ 'a' => {'href' => ['http', 'https', :relative]}
161
+ }
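The keep-or-remove decision for a URL attribute can be sketched as follows. The regular expression is copied from the Sanitize class later in this diff; the helper name `remove_attribute?` and the sample values are assumptions made for illustration, and the sketch runs without the gem.

```ruby
# Copied from the Sanitize class in this diff.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper: true when an attribute value should be removed,
# given the protocols allowed for that attribute.
def remove_attribute?(value, allowed)
  if value.to_s.downcase =~ REGEX_PROTOCOL
    # A protocol prefix was found; it must be in the allowed list.
    !allowed.include?($1.downcase)
  else
    # No protocol prefix: the value is kept only if :relative is allowed.
    !allowed.include?(:relative)
  end
end

strict  = ['http', 'https']
lenient = ['http', 'https', :relative]

remove_attribute?('http://foo.com/', strict)        # => false
remove_attribute?('javascript:alert(1)', strict)    # => true
remove_attribute?('javascript&#58alert(1)', strict) # => true (incomplete entity)
remove_attribute?('/about', strict)                 # => true
remove_attribute?('/about', lenient)                # => false
remove_attribute?('docs/page:2', lenient)           # => false
```

The last call shows why a colon inside a path segment is not mistaken for a protocol: the slash before it cannot be part of the captured prefix.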
162
+
163
+ ==== :remove_contents (boolean or Array)
164
+
165
+ If set to +true+, Sanitize will remove the contents of any non-whitelisted
166
+ elements in addition to the elements themselves. By default, Sanitize leaves the
167
+ safe parts of an element's contents behind when the element is removed.
168
+
169
+ If set to an Array of element names, then only the contents of the specified
170
+ elements (when filtered) will be removed, and the contents of all other filtered
171
+ elements will be left behind.
172
+
173
+ The default value is <code>false</code>.
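The three accepted shapes of <code>:remove_contents</code> (false, true, or an Array) reduce to one question per filtered element. A minimal sketch, assuming a hypothetical helper named `remove_contents?` that is not part of the gem:

```ruby
# Hypothetical helper: should the contents of a filtered element be dropped
# along with the element itself?
def remove_contents?(setting, element_name)
  if setting.is_a?(Array)
    # Array form: only the listed elements lose their contents.
    setting.include?(element_name)
  else
    # Boolean form: applies to every filtered element.
    !!setting
  end
end

remove_contents?(false, 'script')               # => false
remove_contents?(true, 'script')                # => true
remove_contents?(['script', 'style'], 'script') # => true
remove_contents?(['script', 'style'], 'div')    # => false
```

The Array form is useful when the text inside most stripped elements should survive but the text inside elements such as <code>script</code> should not.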
174
+
175
+ ==== :transformers
176
+
177
+ See below.
178
+
179
+ === Transformers
180
+
181
+ Transformers allow you to filter and alter nodes using your own custom logic, on
182
+ top of (or instead of) Sanitize's core filter. A transformer is any object that
183
+ responds to <code>call()</code> (such as a lambda or proc) and returns either
184
+ <code>nil</code> or a Hash containing certain optional response values.
185
+
186
+ To use one or more transformers, pass them to the <code>:transformers</code>
187
+ config setting:
188
+
189
+ Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
190
+
191
+ ==== Input
192
+
193
+ Each registered transformer's <code>call()</code> method will be called once for
194
+ each element node in the HTML, and will receive as an argument an environment
195
+ Hash that contains the following items:
196
+
197
+ [<code>:config</code>]
198
+ The current Sanitize configuration Hash.
199
+
200
+ [<code>:node</code>]
201
+ A Nokogiri::XML::Node object representing an HTML element.
202
+
203
+ [<code>:node_name</code>]
204
+ The name of the current HTML node, always lowercase (e.g. "div" or "span").
205
+
206
+ ==== Processing
207
+
208
+ Each transformer has full access to the Nokogiri::XML::Node that's passed into
209
+ it and to the rest of the document via the node's <code>document()</code>
210
+ method. Any changes will be reflected instantly in the document and passed on to
211
+ subsequently-called transformers and to Sanitize itself. A transformer may even
212
+ call Sanitize internally to perform custom sanitization if needed.
213
+
214
+ Nodes are passed into transformers in the order in which they're traversed. It's
215
+ important to note that Nokogiri traverses markup from the deepest node upward,
216
+ not from the first node to the last node:
217
+
218
+ html = '<div><span>foo</span></div>'
219
+ transformer = lambda{|env| puts env[:node].name }
220
+
221
+ # Prints "span", then "div".
222
+ Sanitize.clean(html, :transformers => transformer)
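The traversal order described above (children first, then their parent) is a post-order walk. The sketch below models it with a plain Struct instead of Nokogiri nodes; `Node` and `traverse` here are stand-ins written for this illustration only.

```ruby
# Stand-in for a parsed element: a name plus a list of child nodes.
Node = Struct.new(:name, :children)

# Post-order walk: visit every child subtree, then the node itself.
def traverse(node, &block)
  node.children.each { |child| traverse(child, &block) }
  block.call(node)
end

# Models <div><span></span><p><b></b></p></div>
tree = Node.new('div', [
  Node.new('span', []),
  Node.new('p', [Node.new('b', [])])
])

visited = []
traverse(tree) { |n| visited << n.name }
visited # => ["span", "b", "p", "div"]
```

A practical consequence is that a transformer deciding something about a parent element has to do so when it sees one of the children, since the parent itself arrives last.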
223
+
224
+ Transformers have a tremendous amount of power, including the power to
225
+ completely bypass Sanitize's built-in filtering. Be careful!
226
+
227
+ ==== Output
228
+
229
+ A transformer may return either +nil+ or a Hash. A return value of +nil+
230
+ indicates that the transformer does not wish to act on the current node in any
231
+ way. A returned Hash may contain the following items, all of which are optional:
232
+
233
+ [<code>:attr_whitelist</code>]
234
+ Array of attribute names to add to the whitelist for the current node, in
235
+ addition to any whitelisted attributes already defined in the current config.
236
+
237
+ [<code>:node</code>]
238
+ A Nokogiri::XML::Node object that should replace the current node. All
239
+ subsequent transformers and Sanitize itself will receive this new node.
240
+
241
+ [<code>:whitelist</code>]
242
+ If _true_, the current node (and only the current node) will be whitelisted,
243
+ regardless of the current Sanitize config.
244
+
245
+ [<code>:whitelist_nodes</code>]
246
+ Array of specific Nokogiri::XML::Node objects to whitelist, anywhere in the
247
+ document, regardless of the current Sanitize config.
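To make the return-value contract concrete, here is a simplified sketch of how the results of several transformers might be combined. The names `run_transformers` and `TransformerError` are hypothetical, the environment Hash carries only <code>:node_name</code>, and node replacement via <code>:node</code> is left out, so this is a model of the contract rather than the gem's implementation.

```ruby
class TransformerError < StandardError; end

# Hypothetical stand-in: calls each transformer and merges what it returns.
def run_transformers(transformers, env)
  output = { :attr_whitelist => [], :whitelist => false, :whitelist_nodes => [] }

  transformers.each do |transformer|
    result = transformer.call(env)
    next if result.nil? # nil means "no opinion about this node"

    unless result.is_a?(Hash)
      raise TransformerError, 'transformer output must be a Hash or nil'
    end

    output[:attr_whitelist]  += result[:attr_whitelist]  if result[:attr_whitelist].is_a?(Array)
    output[:whitelist_nodes] |= result[:whitelist_nodes] if result[:whitelist_nodes].is_a?(Array)
    output[:whitelist] = true if result[:whitelist]
  end

  output
end

# Two example transformers; both are plain lambdas.
allow_data_id = lambda { |env| env[:node_name] == 'span' ? { :attr_whitelist => ['data-id'] } : nil }
allow_node    = lambda { |env| { :whitelist => true } }

result = run_transformers([allow_data_id, allow_node], { :node_name => 'span' })
result[:attr_whitelist] # => ["data-id"]
result[:whitelist]      # => true
```

Every key is optional, so a transformer that only wants to permit one extra attribute can return a Hash with just <code>:attr_whitelist</code>.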
248
+
249
+ ==== Example: Transformer to whitelist YouTube video embeds
250
+
251
+ The following example demonstrates how to create a Sanitize transformer that
252
+ will safely whitelist valid YouTube video embeds without having to blindly allow
253
+ other kinds of embedded content, which would be the case if you tried to do this
254
+ by just whitelisting all <code><object></code>, <code><embed></code>, and
255
+ <code><param></code> elements:
256
+
257
+   lambda do |env|
258
+     node = env[:node]
259
+     node_name = env[:node_name]
260
+     parent = node.parent
261
+
262
+     # Since the transformer receives the deepest nodes first, we look for a
263
+     # <param> element or an <embed> element whose parent is an <object>.
264
+     return nil unless (node_name == 'param' || node_name == 'embed') &&
265
+         parent.name.to_s.downcase == 'object'
266
+
267
+     if node_name == 'param'
268
+       # Quick XPath search to find the <param> node that contains the video URL.
269
+       return nil unless movie_node = parent.search('param[@name="movie"]')[0]
270
+       url = movie_node['value']
271
+     else
272
+       # Since this is an <embed>, the video URL is in the "src" attribute. No
273
+       # extra work needed.
274
+       url = node['src']
275
+     end
276
+
277
+     # Verify that the video URL is actually a valid YouTube video URL.
278
+     return nil unless url =~ /^http:\/\/(?:www\.)?youtube\.com\/v\//
279
+
280
+     # We're now certain that this is a YouTube embed, but we still need to run
281
+     # it through a special Sanitize step to ensure that no unwanted elements or
282
+     # attributes that don't belong in a YouTube embed can sneak in.
283
+     Sanitize.clean_node!(parent, {
284
+       :elements => ['embed', 'object', 'param'],
285
+       :attributes => {
286
+         'embed' => ['allowfullscreen', 'allowscriptaccess', 'height', 'src', 'type', 'width'],
287
+         'object' => ['height', 'width'],
288
+         'param' => ['name', 'value']
289
+       }
290
+     })
291
+
292
+     # Now that we're sure that this is a valid YouTube embed and that there are
293
+     # no unwanted elements or attributes hidden inside it, we can tell Sanitize
294
+     # to whitelist the current node (<param> or <embed>) and its parent
295
+     # (<object>).
296
+     {:whitelist_nodes => [node, parent]}
297
+   end
298
+
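One detail of the URL check in the example above may be worth a second look. In Ruby, `^` matches at the start of any line, not only at the start of the string, so a value containing a newline can satisfy the pattern on its second line. The sketch below compares the README's pattern with a `\A`-anchored variant; the variant, the constant names, and the `matches?` helper are assumptions made for illustration and are not part of the package.

```ruby
# Pattern as written in the README example (line-anchored).
LINE_ANCHORED   = /^http:\/\/(?:www\.)?youtube\.com\/v\//
# Assumed hardening: \A anchors at the very start of the string.
STRING_ANCHORED = /\Ahttp:\/\/(?:www\.)?youtube\.com\/v\//

# Hypothetical helper returning a plain boolean.
def matches?(pattern, value)
  !(value =~ pattern).nil?
end

good   = 'http://www.youtube.com/v/abc123'
sneaky = "javascript:alert(1)\nhttp://www.youtube.com/v/abc123"

matches?(LINE_ANCHORED, good)     # => true
matches?(STRING_ANCHORED, good)   # => true
matches?(LINE_ANCHORED, sneaky)   # => true
matches?(STRING_ANCHORED, sneaky) # => false
```

Whether such a value can do any harm depends on how browsers treat it, which this sketch does not attempt to show.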
299
+ == Contributors
300
+
301
+ The following lovely people have contributed to Sanitize in the form of patches
302
+ or ideas that later became code:
303
+
304
+ * Wilson Bilkovich <wilson@supremetyrant.com>
305
+ * Peter Cooper <git@peterc.org>
306
+ * Gabe da Silveira <gabe@websaviour.com>
307
+ * Ryan Grove <ryan@wonko.com>
308
+ * Adam Hooper <adam@adamhooper.com>
309
+ * Mutwin Kraus <mutle@blogage.de>
310
+ * Dev Purkayastha <dev.purkayastha@gmail.com>
311
+ * David Reese <work@whatcould.com>
312
+ * Rafael Souza <me@rafaelss.com>
313
+ * Ben Wanicur <bwanicur@verticalresponse.com>
314
+
315
+ == License
316
+
317
+ Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
318
+
319
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
320
+ this software and associated documentation files (the 'Software'), to deal in
321
+ the Software without restriction, including without limitation the rights to
322
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
323
+ the Software, and to permit persons to whom the Software is furnished to do so,
324
+ subject to the following conditions:
325
+
326
+ The above copyright notice and this permission notice shall be included in all
327
+ copies or substantial portions of the Software.
328
+
329
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
330
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
331
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
332
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
333
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
334
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
@@ -0,0 +1,247 @@
1
+ # encoding: utf-8
2
+ #--
3
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
4
+ #
5
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ # of this software and associated documentation files (the 'Software'), to deal
7
+ # in the Software without restriction, including without limitation the rights
8
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ # copies of the Software, and to permit persons to whom the Software is
10
+ # furnished to do so, subject to the following conditions:
11
+ #
12
+ # The above copyright notice and this permission notice shall be included in all
13
+ # copies or substantial portions of the Software.
14
+ #
15
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ # SOFTWARE.
22
+ #++
23
+
24
+ require 'nokogiri'
25
+ require 'sanitize/version'
26
+ require 'sanitize/config'
27
+ require 'sanitize/config/restricted'
28
+ require 'sanitize/config/basic'
29
+ require 'sanitize/config/relaxed'
30
+
31
+ class Sanitize
32
+ attr_reader :config
33
+
34
+ # Matches an attribute value that could be treated by a browser as a URL
35
+ # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
36
+ # or more characters followed by a colon is considered a match, even if the
37
+ # colon is encoded as an entity and even if it's an incomplete entity (which
38
+ # IE6 and Opera will still parse).
39
+ REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
40
+
41
+ #--
42
+ # Class Methods
43
+ #++
44
+
45
+ # Returns a sanitized copy of _html_, using the settings in _config_ if
46
+ # specified.
47
+ def self.clean(html, config = {})
48
+ sanitize = Sanitize.new(config)
49
+ sanitize.clean(html)
50
+ end
51
+
52
+ # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
53
+ # were made.
54
+ def self.clean!(html, config = {})
55
+ sanitize = Sanitize.new(config)
56
+ sanitize.clean!(html)
57
+ end
58
+
59
+ # Sanitizes the specified Nokogiri::XML::Node and all its children.
60
+ def self.clean_node!(node, config = {})
61
+ sanitize = Sanitize.new(config)
62
+ sanitize.clean_node!(node)
63
+ end
64
+
65
+ #--
66
+ # Instance Methods
67
+ #++
68
+
69
+ # Returns a new Sanitize object initialized with the settings in _config_.
70
+ def initialize(config = {})
71
+ # Sanitize configuration.
72
+ @config = Config::DEFAULT.merge(config)
73
+ @config[:transformers] = Array(@config[:transformers].dup)
74
+
75
+ # Convert the list of allowed elements to a Hash for faster lookup.
76
+ @allowed_elements = {}
77
+ @config[:elements].each {|el| @allowed_elements[el] = true }
78
+
79
+ # Convert the list of :remove_contents elements to a Hash for faster lookup.
80
+ @remove_all_contents = false
81
+ @remove_element_contents = {}
82
+
83
+ if @config[:remove_contents].is_a?(Array)
84
+ @config[:remove_contents].each {|el| @remove_element_contents[el] = true }
85
+ else
86
+ @remove_all_contents = !!@config[:remove_contents]
87
+ end
88
+
89
+ # Specific nodes to whitelist (along with all their attributes). This array
90
+ # is generated at runtime by transformers, and is cleared before and after
91
+ # a fragment is cleaned (so it applies only to a specific fragment).
92
+ @whitelist_nodes = []
93
+ end
94
+
95
+ # Returns a sanitized copy of _html_.
96
+ def clean(html)
97
+ if html
98
+ dupe = html.dup
99
+ clean!(dupe) || dupe
100
+ end
101
+ end
102
+
103
+ # Performs clean in place, returning _html_, or +nil+ if no changes were
104
+ # made.
105
+ def clean!(html)
106
+ fragment = Nokogiri::HTML::DocumentFragment.parse(html)
107
+ clean_node!(fragment)
108
+
109
+ output_method_params = {:encoding => @config[:output_encoding], :indent => 0}
110
+
111
+ if @config[:output] == :xhtml
112
+ output_method = fragment.method(:to_xhtml)
113
+ output_method_params[:save_with] = Nokogiri::XML::Node::SaveOptions::AS_XHTML
114
+ elsif @config[:output] == :html
115
+ output_method = fragment.method(:to_html)
116
+ else
117
+ raise Error, "unsupported output format: #{@config[:output]}"
118
+ end
119
+
120
+ result = output_method.call(output_method_params)
121
+
122
+ return result == html ? nil : html[0, html.length] = result
123
+ end
124
+
125
+ # Sanitizes the specified Nokogiri::XML::Node and all its children.
126
+ def clean_node!(node)
127
+ raise ArgumentError unless node.is_a?(Nokogiri::XML::Node)
128
+
129
+ @whitelist_nodes = []
130
+
131
+ node.traverse do |child|
132
+ if child.element?
133
+ clean_element!(child)
134
+ elsif child.comment?
135
+ child.unlink unless @config[:allow_comments]
136
+ elsif child.cdata?
137
+ child.replace(Nokogiri::XML::Text.new(child.text, child.document))
138
+ end
139
+ end
140
+
141
+ @whitelist_nodes = []
142
+
143
+ node
144
+ end
145
+
146
+ private
147
+
148
+ def clean_element!(node)
149
+ # Run this node through all configured transformers.
150
+ transform = transform_element!(node)
151
+
152
+ # If this node is in the dynamic whitelist array (built at runtime by
153
+ # transformers), let it live with all of its attributes intact.
154
+ return if @whitelist_nodes.include?(node)
155
+
156
+ name = node.name.to_s.downcase
157
+
158
+ # Delete any element that isn't in the whitelist.
159
+ unless transform[:whitelist] || @allowed_elements[name]
160
+ unless @remove_all_contents || @remove_element_contents[name]
161
+ node.children.each { |n| node.add_previous_sibling(n) }
162
+ end
163
+
164
+ node.unlink
165
+
166
+ return
167
+ end
168
+
169
+ attr_whitelist = (transform[:attr_whitelist] +
170
+ (@config[:attributes][name] || []) +
171
+ (@config[:attributes][:all] || [])).uniq
172
+
173
+ if attr_whitelist.empty?
174
+ # Delete all attributes from elements with no whitelisted attributes.
175
+ node.attribute_nodes.each {|attr| attr.remove }
176
+ else
177
+ # Delete any attribute that isn't in the whitelist for this element.
178
+ node.attribute_nodes.each do |attr|
179
+ attr.unlink unless attr_whitelist.include?(attr.name.downcase)
180
+ end
181
+
182
+ # Delete remaining attributes that use unacceptable protocols.
183
+ if @config[:protocols].has_key?(name)
184
+ protocol = @config[:protocols][name]
185
+
186
+ node.attribute_nodes.each do |attr|
187
+ attr_name = attr.name.downcase
188
+ next false unless protocol.has_key?(attr_name)
189
+
190
+ del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
191
+ !protocol[attr_name].include?($1.downcase)
192
+ else
193
+ !protocol[attr_name].include?(:relative)
194
+ end
195
+
196
+ attr.unlink if del
197
+ end
198
+ end
199
+ end
200
+
201
+ # Add required attributes.
202
+ if @config[:add_attributes].has_key?(name)
203
+ @config[:add_attributes][name].each do |key, val|
204
+ node[key] = val
205
+ end
206
+ end
207
+
208
+ transform
209
+ end
210
+
211
+ def transform_element!(node)
212
+ output = {
213
+ :attr_whitelist => [],
214
+ :node => node,
215
+ :whitelist => false
216
+ }
217
+
218
+ @config[:transformers].inject(node) do |transformer_node, transformer|
219
+ transform = transformer.call({
220
+ :config => @config,
221
+ :node => transformer_node,
222
+ :node_name => transformer_node.name.downcase
223
+ })
224
+
225
+ if transform.nil?
226
+ transformer_node
227
+ elsif transform.is_a?(Hash)
228
+ if transform[:whitelist_nodes].is_a?(Array)
229
+ @whitelist_nodes += transform[:whitelist_nodes]
230
+ @whitelist_nodes.uniq!
231
+ end
232
+
233
+ output[:attr_whitelist] += transform[:attr_whitelist] if transform[:attr_whitelist].is_a?(Array)
234
+ output[:whitelist] ||= true if transform[:whitelist]
235
+ output[:node] = transform[:node].is_a?(Nokogiri::XML::Node) ? transform[:node] : output[:node]
236
+ else
237
+ raise Error, "transformer output must be a Hash or nil"
238
+ end
239
+ end
240
+
241
+ node.replace(output[:node]) if node != output[:node]
242
+
243
+ return output
244
+ end
245
+
246
+ class Error < StandardError; end
247
+ end
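The pairing of `clean` and `clean!` in the class above follows Ruby's bang-method convention, as in `String#gsub!`: the in-place variant returns `nil` when nothing changed, and the copying variant falls back to the untouched duplicate. The sketch below isolates that pattern; `tidy`, `tidy!`, and `tidy_copy` are hypothetical names, and whitespace stripping stands in for the real parse-and-filter step.

```ruby
# Stand-in for the real sanitizing step; it only trims whitespace.
def tidy(text)
  text.strip
end

# In-place variant: mutates its argument, returns nil if nothing changed.
def tidy!(text)
  result = tidy(text)
  return nil if result == text

  # Overwrite the whole string in place so the caller's object changes.
  text[0, text.length] = result
  text
end

# Copying variant: never mutates its argument, always returns a string.
def tidy_copy(text)
  copy = text.dup
  tidy!(copy) || copy
end

padded = '  hi  '.dup
tidy!(padded)        # => "hi" (padded itself is now "hi")
tidy!(padded)        # => nil, because nothing was left to change
tidy_copy('  hi  ')  # => "hi"
```

Callers of an in-place method written this way should test the return value for `nil` rather than use it as the cleaned string.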
@@ -0,0 +1,70 @@
1
+ #--
2
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+ class Sanitize
24
+ module Config
25
+ DEFAULT = {
26
+ # Whether or not to allow HTML comments. Allowing comments is strongly
27
+ # discouraged, since IE allows script execution within conditional
28
+ # comments.
29
+ :allow_comments => false,
30
+
31
+ # HTML attributes to add to specific elements. By default, no attributes
32
+ # are added.
33
+ :add_attributes => {},
34
+
35
+ # HTML attributes to allow in specific elements. By default, no attributes
36
+ # are allowed.
37
+ :attributes => {},
38
+
39
+ # HTML elements to allow. By default, no elements are allowed (which means
40
+ # that all HTML will be stripped).
41
+ :elements => [],
42
+
43
+ # Output format. Supported formats are :html and :xhtml (which is the
44
+ # default).
45
+ :output => :xhtml,
46
+
47
+ # Character encoding to use for HTML output. Default is 'utf-8'.
48
+ :output_encoding => 'utf-8',
49
+
50
+ # URL handling protocols to allow in specific attributes. By default, no
51
+ # protocols are allowed. Use :relative in place of a protocol if you want
52
+ # to allow relative URLs sans protocol.
53
+ :protocols => {},
54
+
55
+ # If this is true, Sanitize will remove the contents of any filtered
56
+ # elements in addition to the elements themselves. By default, Sanitize
57
+ # leaves the safe parts of an element's contents behind when the element
58
+ # is removed.
59
+ #
60
+ # If this is an Array of element names, then only the contents of the
61
+ # specified elements (when filtered) will be removed, and the contents of
62
+ # all other filtered elements will be left behind.
63
+ :remove_contents => false,
64
+
65
+ # Transformers allow you to filter or alter nodes using custom logic. See
66
+ # README.rdoc for details and examples.
67
+ :transformers => []
68
+ }
69
+ end
70
+ end
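The Sanitize constructor earlier in this diff combines these defaults with the caller's settings through `Hash#merge`, which is a shallow merge: a key supplied by the caller replaces the default value wholesale. The sketch below uses a trimmed copy of the defaults under the assumed name `DEFAULTS` so that it runs without the gem.

```ruby
# Trimmed copy of the default settings, for illustration only.
DEFAULTS = {
  :allow_comments => false,
  :attributes     => {},
  :elements       => [],
  :output         => :xhtml
}

custom = { :elements => ['a'], :attributes => { 'a' => ['href'] } }

# Keys present in custom win; everything else falls back to the default.
config = DEFAULTS.merge(custom)

config[:elements]   # => ["a"]
config[:output]     # => :xhtml
DEFAULTS[:elements] # => [] (merge returns a new Hash)
```

Because the merge is shallow, passing <code>:attributes</code> replaces the whole attributes Hash rather than adding to an existing one.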
@@ -0,0 +1,49 @@
1
+ #--
2
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+ class Sanitize
24
+ module Config
25
+ BASIC = {
26
+ :elements => [
27
+ 'a', 'b', 'blockquote', 'br', 'cite', 'code', 'dd', 'dl', 'dt', 'em',
28
+ 'i', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong', 'sub',
29
+ 'sup', 'u', 'ul'],
30
+
31
+ :attributes => {
32
+ 'a' => ['href'],
33
+ 'blockquote' => ['cite'],
34
+ 'q' => ['cite']
35
+ },
36
+
37
+ :add_attributes => {
38
+ 'a' => {'rel' => 'nofollow'}
39
+ },
40
+
41
+ :protocols => {
42
+ 'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
43
+ :relative]},
44
+ 'blockquote' => {'cite' => ['http', 'https', :relative]},
45
+ 'q' => {'cite' => ['http', 'https', :relative]}
46
+ }
47
+ }
48
+ end
49
+ end
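The :protocols table in BASIC maps an element and attribute to the URL schemes it may carry, with :relative standing in for URLs that have no scheme. The sketch below shows one way such a table could be consulted. It is not Sanitize's actual implementation: `protocol_allowed?` and `SCHEME_PATTERN` are hypothetical names, and treating an unlisted attribute as unrestricted is an assumption of this sketch.

```ruby
# Copy of the :protocols table from the BASIC config shown above.
BASIC_PROTOCOLS = {
  'a'          => {'href' => ['ftp', 'http', 'https', 'mailto', :relative]},
  'blockquote' => {'cite' => ['http', 'https', :relative]},
  'q'          => {'cite' => ['http', 'https', :relative]}
}

# Matches a leading URL scheme such as "http:" or "javascript:".
SCHEME_PATTERN = /\A\s*([a-z][a-z0-9+.\-]*):/i

# Hypothetical helper (NOT Sanitize's own code).
def protocol_allowed?(protocols, element, attribute, value)
  allowed = protocols.fetch(element, {})[attribute]
  # Assumption: attributes without a rule are not restricted by this check.
  return true if allowed.nil?

  match = SCHEME_PATTERN.match(value.to_s)
  if match
    allowed.include?(match[1].downcase)
  else
    # No scheme present, so the URL is relative.
    allowed.include?(:relative)
  end
end

puts protocol_allowed?(BASIC_PROTOCOLS, 'a', 'href', 'https://example.com/') # true
puts protocol_allowed?(BASIC_PROTOCOLS, 'a', 'href', 'javascript:alert(1)')  # false
puts protocol_allowed?(BASIC_PROTOCOLS, 'q', 'cite', '/docs/page.html')      # true
puts protocol_allowed?(BASIC_PROTOCOLS, 'q', 'cite', 'mailto:a@example.com') # false
```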
@@ -0,0 +1,57 @@
1
+ #--
2
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+ class Sanitize
24
+ module Config
25
+ RELAXED = {
26
+ :elements => [
27
+ 'a', 'b', 'blockquote', 'br', 'caption', 'cite', 'code', 'col',
28
+ 'colgroup', 'dd', 'dl', 'dt', 'em', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6',
29
+ 'i', 'img', 'li', 'ol', 'p', 'pre', 'q', 'small', 'strike', 'strong',
30
+ 'sub', 'sup', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'tr', 'u',
31
+ 'ul'],
32
+
33
+ :attributes => {
34
+ 'a' => ['href', 'title'],
35
+ 'blockquote' => ['cite'],
36
+ 'col' => ['span', 'width'],
37
+ 'colgroup' => ['span', 'width'],
38
+ 'img' => ['align', 'alt', 'height', 'src', 'title', 'width'],
39
+ 'ol' => ['start', 'type'],
40
+ 'q' => ['cite'],
41
+ 'table' => ['summary', 'width'],
42
+ 'td' => ['abbr', 'axis', 'colspan', 'rowspan', 'width'],
43
+ 'th' => ['abbr', 'axis', 'colspan', 'rowspan', 'scope',
44
+ 'width'],
45
+ 'ul' => ['type']
46
+ },
47
+
48
+ :protocols => {
49
+ 'a' => {'href' => ['ftp', 'http', 'https', 'mailto',
50
+ :relative]},
51
+ 'blockquote' => {'cite' => ['http', 'https', :relative]},
52
+ 'img' => {'src' => ['http', 'https', :relative]},
53
+ 'q' => {'cite' => ['http', 'https', :relative]}
54
+ }
55
+ }
56
+ end
57
+ end
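To see at a glance what RELAXED permits beyond BASIC, the two :elements lists can be compared with plain Array subtraction. The constant names below are chosen for this illustration only; the element lists are copied from the two configs shown above.

```ruby
# :elements from the BASIC config.
BASIC_ELEMENTS = %w[
  a b blockquote br cite code dd dl dt em
  i li ol p pre q small strike strong sub
  sup u ul
]

# :elements from the RELAXED config.
RELAXED_ELEMENTS = %w[
  a b blockquote br caption cite code col
  colgroup dd dl dt em h1 h2 h3 h4 h5 h6
  i img li ol p pre q small strike strong
  sub sup table tbody td tfoot th thead tr u
  ul
]

# Elements that RELAXED allows and BASIC does not.
ADDED_BY_RELAXED = RELAXED_ELEMENTS - BASIC_ELEMENTS

puts ADDED_BY_RELAXED.join(' ')
# caption col colgroup h1 h2 h3 h4 h5 h6 img table tbody td tfoot th thead tr

# Every BASIC element is also in RELAXED.
puts (BASIC_ELEMENTS - RELAXED_ELEMENTS).empty? # true
```

The element lists are not the only difference: RELAXED as shown has no :add_attributes entry, so it does not add rel="nofollow" to links the way BASIC does.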
@@ -0,0 +1,29 @@
1
+ #--
2
+ # Copyright (c) 2010 Ryan Grove <ryan@wonko.com>
3
+ #
4
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
5
+ # of this software and associated documentation files (the 'Software'), to deal
6
+ # in the Software without restriction, including without limitation the rights
7
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
8
+ # copies of the Software, and to permit persons to whom the Software is
9
+ # furnished to do so, subject to the following conditions:
10
+ #
11
+ # The above copyright notice and this permission notice shall be included in all
12
+ # copies or substantial portions of the Software.
13
+ #
14
+ # THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
15
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
16
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
17
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
18
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
19
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
20
+ # SOFTWARE.
21
+ #++
22
+
23
+ class Sanitize
24
+ module Config
25
+ RESTRICTED = {
26
+ :elements => ['b', 'em', 'i', 'strong', 'u']
27
+ }
28
+ end
29
+ end
@@ -0,0 +1,3 @@
1
+ class Sanitize
2
+ VERSION = '1.2.1.1'
3
+ end
metadata ADDED
@@ -0,0 +1,124 @@
1
+ --- !ruby/object:Gem::Specification
2
+ name: glebm-sanitize
3
+ version: !ruby/object:Gem::Version
4
+ hash: 73
5
+ prerelease: false
6
+ segments:
7
+ - 1
8
+ - 2
9
+ - 1
10
+ - 1
11
+ version: 1.2.1.1
12
+ platform: ruby
13
+ authors:
14
+ - Ryan Grove
15
+ autorequire:
16
+ bindir: bin
17
+ cert_chain: []
18
+
19
+ date: 2010-07-19 00:00:00 +02:00
20
+ default_executable:
21
+ dependencies:
22
+ - !ruby/object:Gem::Dependency
23
+ name: glebm-nokogiri
24
+ prerelease: false
25
+ requirement: &id001 !ruby/object:Gem::Requirement
26
+ none: false
27
+ requirements:
28
+ - - ">="
29
+ - !ruby/object:Gem::Version
30
+ hash: 7
31
+ segments:
32
+ - 1
33
+ - 4
34
+ version: "1.4"
35
+ type: :runtime
36
+ version_requirements: *id001
37
+ - !ruby/object:Gem::Dependency
38
+ name: bacon
39
+ prerelease: false
40
+ requirement: &id002 !ruby/object:Gem::Requirement
41
+ none: false
42
+ requirements:
43
+ - - ~>
44
+ - !ruby/object:Gem::Version
45
+ hash: 19
46
+ segments:
47
+ - 1
48
+ - 1
49
+ - 0
50
+ version: 1.1.0
51
+ type: :development
52
+ version_requirements: *id002
53
+ - !ruby/object:Gem::Dependency
54
+ name: rake
55
+ prerelease: false
56
+ requirement: &id003 !ruby/object:Gem::Requirement
57
+ none: false
58
+ requirements:
59
+ - - ~>
60
+ - !ruby/object:Gem::Version
61
+ hash: 63
62
+ segments:
63
+ - 0
64
+ - 8
65
+ - 0
66
+ version: 0.8.0
67
+ type: :development
68
+ version_requirements: *id003
69
+ description:
70
+ email: glex.spb@gmail.com
71
+ executables: []
72
+
73
+ extensions: []
74
+
75
+ extra_rdoc_files: []
76
+
77
+ files:
78
+ - HISTORY
79
+ - LICENSE
80
+ - README.rdoc
81
+ - lib/sanitize/config/restricted.rb
82
+ - lib/sanitize/config/basic.rb
83
+ - lib/sanitize/config/relaxed.rb
84
+ - lib/sanitize/config.rb
85
+ - lib/sanitize/version.rb
86
+ - lib/sanitize.rb
87
+ has_rdoc: true
88
+ homepage: http://github.com/rgrove/sanitize/
89
+ licenses: []
90
+
91
+ post_install_message:
92
+ rdoc_options: []
93
+
94
+ require_paths:
95
+ - lib
96
+ required_ruby_version: !ruby/object:Gem::Requirement
97
+ none: false
98
+ requirements:
99
+ - - ">="
100
+ - !ruby/object:Gem::Version
101
+ hash: 59
102
+ segments:
103
+ - 1
104
+ - 8
105
+ - 6
106
+ version: 1.8.6
107
+ required_rubygems_version: !ruby/object:Gem::Requirement
108
+ none: false
109
+ requirements:
110
+ - - ">="
111
+ - !ruby/object:Gem::Version
112
+ hash: 3
113
+ segments:
114
+ - 0
115
+ version: "0"
116
+ requirements: []
117
+
118
+ rubyforge_project: riposte
119
+ rubygems_version: 1.3.7
120
+ signing_key:
121
+ specification_version: 3
122
+ summary: Whitelist-based HTML sanitizer.
123
+ test_files: []
124
+
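The dependency entries in the metadata above use the RubyGems operators ">=" and "~>". As a small sketch of what those constraints accept, they can be evaluated with Gem::Requirement and Gem::Version from RubyGems itself. The `REQUIREMENTS` table and the `satisfies?` helper are names made up for this illustration, and the candidate version numbers are arbitrary examples rather than a list of real releases.

```ruby
require 'rubygems'

# Requirements copied from the gemspec metadata above.
REQUIREMENTS = {
  'glebm-nokogiri' => Gem::Requirement.new('>= 1.4'),
  'bacon'          => Gem::Requirement.new('~> 1.1.0'),
  'rake'           => Gem::Requirement.new('~> 0.8.0')
}

# Hypothetical helper: does the given version string meet the requirement?
def satisfies?(name, version)
  REQUIREMENTS.fetch(name).satisfied_by?(Gem::Version.new(version))
end

puts satisfies?('glebm-nokogiri', '1.4.0') # true
puts satisfies?('glebm-nokogiri', '1.3.3') # false
puts satisfies?('bacon', '1.1.9')          # true  (~> 1.1.0 allows 1.1.x)
puts satisfies?('bacon', '1.2.0')          # false (~> 1.1.0 excludes 1.2)
puts satisfies?('rake', '0.8.7')           # true
puts satisfies?('rake', '0.9.0')           # false

# The four-part version of this package parses into the segments listed above.
p Gem::Version.new('1.2.1.1').segments     # [1, 2, 1, 1]
```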