sanitize 2.0.6 → 2.1.0

checksums.yaml CHANGED
@@ -1,7 +1,7 @@
1
1
  ---
2
2
  SHA1:
3
- metadata.gz: 240d390dd3a6813197ab1e3ccafb42f2103bf136
4
- data.tar.gz: 5813a179d76ec2e44a7eb0bd0a7582c23ea0696b
3
+ metadata.gz: a1be4f7e5790c7e0fa8943b793803e507bbaa2ce
4
+ data.tar.gz: a879b798b76f4bfff12532e4779bb418a89d4500
5
5
  SHA512:
6
- metadata.gz: dbc7db8d41dbac5be557a50ab69d096fe5373cd8310196b4666e7e0d7fb3f12c5138e7605c86b4ccc713b500e9ad2c0e3e06374b891c12b0fed7b2949c90868c
7
- data.tar.gz: 401cdf8549edce7742fb6b0498aa82da817926537f9eca32ec87b400290541aa1051645c7dea61de8edd128f8302a42d39359e5b1bc30d4a3fb25de673024c96
6
+ metadata.gz: ecdbc579a9ed3f737539118ac5b6c17612a736268263fafd03b9daf39da433309a11e090494c2008859edc16c278dcc1ea63ea52b5693479c625b825bbbfbc80
7
+ data.tar.gz: 4fff69ad6c6812fb6aac4c492a7644f196faeb82039096dcd204461b07872a05d97c02e0b92237fc65b36891783256e84ee335fc83b03365e92ec5e07a2af57e
data/HISTORY.md CHANGED
@@ -1,6 +1,22 @@
1
1
  Sanitize History
2
2
  ================================================================================
3
3
 
4
+ Version 2.1.0 (2014-01-13)
5
+ --------------------------
6
+
7
+ * Added support for whitelisting arbitrary HTML5 `data-*` attributes. Use the
8
+ symbol `:data` instead of an attribute name in the `:attributes` config to
9
+ indicate that arbitrary data attributes should be allowed on an element.
10
+
11
+ * Added the following elements to the relaxed config: `address`, `bdi`, `hr`,
12
+ and `summary`.
13
+
14
+ * Fixed: A colon (`:`) character in a URL fragment identifier such as `#foo:1`
15
+ was incorrectly treated as a protocol delimiter. [@heathd - #87][87]
16
+
17
+ [87]:https://github.com/rgrove/sanitize/pull/87
18
+
19
+
4
20
  Version 2.0.6 (2013-07-10)
5
21
  --------------------------
6
22
 
data/LICENSE CHANGED
@@ -1,4 +1,4 @@
1
- Copyright (c) 2013 Ryan Grove <ryan@wonko.com>
1
+ Copyright (c) 2014 Ryan Grove <ryan@wonko.com>
2
2
 
3
3
  Permission is hereby granted, free of charge, to any person obtaining a copy of
4
4
  this software and associated documentation files (the 'Software'), to deal in
data/README.md ADDED
@@ -0,0 +1,399 @@
1
+ Sanitize
2
+ ========
3
+
4
+ Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
5
+ elements and attributes, Sanitize will remove all unacceptable HTML from a
6
+ string.
7
+
8
+ Using a simple configuration syntax, you can tell Sanitize to allow certain
9
+ elements, certain attributes within those elements, and even certain URL
10
+ protocols within attributes that contain URLs. Any HTML elements or attributes
11
+ that you don't explicitly allow will be removed.
12
+
13
+ Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
14
+ of fragile regular expressions, Sanitize has no trouble dealing with malformed
15
+ or maliciously-formed HTML and returning safe output.
16
+
17
+ [![Build Status](https://travis-ci.org/rgrove/sanitize.png?branch=master)](https://travis-ci.org/rgrove/sanitize?branch=master)
18
+
19
+ Installation
20
+ -------------
21
+
22
+ ```
23
+ gem install sanitize
24
+ ```
25
+
26
+ Usage
27
+ -----
28
+
29
+ If you don't specify any configuration options, Sanitize will use its strictest
30
+ settings by default, which means it will strip all HTML and leave only text
31
+ behind.
32
+
33
+ ```ruby
34
+ require 'rubygems'
35
+ require 'sanitize'
36
+
37
+ html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'
38
+
39
+ Sanitize.clean(html) # => 'foo'
40
+
41
+ # or sanitize an entire HTML document (example assumes _html_ is whitelisted)
42
+ html = '<!DOCTYPE html><html><b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg"></html>'
43
+ Sanitize.clean_document(html) # => '<!DOCTYPE html>\n<html>foo</html>\n'
44
+ ```
45
+
46
+ Configuration
47
+ -------------
48
+
49
+ In addition to the ultra-safe default settings, Sanitize comes with three other
50
+ built-in modes.
51
+
52
+ ### Sanitize::Config::RESTRICTED
53
+
54
+ Allows only very simple inline formatting markup. No links, images, or block
55
+ elements.
56
+
57
+ ```ruby
58
+ Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'
59
+ ```
60
+
61
+ ### Sanitize::Config::BASIC
62
+
63
+ Allows a variety of markup including formatting tags, links, and lists. Images
64
+ and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
65
+ protocols, and a `rel="nofollow"` attribute is added to all links to
66
+ mitigate SEO spam.
67
+
68
+ ```ruby
69
+ Sanitize.clean(html, Sanitize::Config::BASIC)
70
+ # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
71
+ ```
72
+
73
+ ### Sanitize::Config::RELAXED
74
+
75
+ Allows an even wider variety of markup than BASIC, including images and tables.
76
+ Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
77
+ are limited to HTTP and HTTPS. In this mode, `rel="nofollow"` is not added to
78
+ links.
79
+
80
+ ```ruby
81
+ Sanitize.clean(html, Sanitize::Config::RELAXED)
82
+ # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'
83
+ ```
84
+
85
+ ### Custom Configuration
86
+
87
+ If the built-in modes don't meet your needs, you can easily specify a custom
88
+ configuration:
89
+
90
+ ```ruby
91
+ Sanitize.clean(html, :elements => ['a', 'span'],
92
+ :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
93
+ :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
94
+ ```
95
+
96
+ #### :add_attributes (Hash)
97
+
98
+ Attributes to add to specific elements. If the attribute already exists, it will
99
+ be replaced with the value specified here. Specify all element names and
100
+ attributes in lowercase.
101
+
102
+ ```ruby
103
+ :add_attributes => {
104
+ 'a' => {'rel' => 'nofollow'}
105
+ }
106
+ ```
107
+
108
+ #### :allow_comments (boolean)
109
+
110
+ Whether or not to allow HTML comments. Allowing comments is strongly
111
+ discouraged, since IE allows script execution within conditional comments. The
112
+ default value is `false`.
113
+
114
+ #### :attributes (Hash)
115
+
116
+ Attributes to allow for specific elements. Specify all element names and
117
+ attributes in lowercase.
118
+
119
+ ```ruby
120
+ :attributes => {
121
+ 'a' => ['href', 'title'],
122
+ 'blockquote' => ['cite'],
123
+ 'img' => ['alt', 'src', 'title']
124
+ }
125
+ ```
126
+
127
+ If you'd like to allow certain attributes on all elements, use the symbol
128
+ `:all` instead of an element name.
129
+
130
+ ```ruby
131
+ # Allow the class attribute on all elements.
132
+ :attributes => {
133
+ :all => ['class'],
134
+ 'a' => ['href', 'title']
135
+ }
136
+ ```
137
+
138
+ To allow arbitrary HTML5 `data-*` attributes, use the symbol
139
+ `:data` in place of an attribute name.
140
+
141
+ ```ruby
142
+ # Allow arbitrary HTML5 data-* attributes on <div> elements.
143
+ :attributes => {
144
+ 'div' => [:data]
145
+ }
146
+ ```
147
+
148
+ #### :elements (Array)
149
+
150
+ Array of element names to allow. Specify all names in lowercase.
151
+
152
+ ```ruby
153
+ :elements => %w[
154
+ a abbr b blockquote br cite code dd dfn dl dt em i kbd li mark ol p pre
155
+ q s samp small strike strong sub sup time u ul var
156
+ ]
157
+ ```
158
+
159
+ #### :output (Symbol)
160
+
161
+ Output format. Supported formats are `:html` and `:xhtml`,
162
+ defaulting to `:html`.
163
+
164
+ #### :output_encoding (String)
165
+
166
+ Character encoding to use for HTML output. Default is `utf-8`.
167
+
168
+ #### :protocols (Hash)
169
+
170
+ URL protocols to allow in specific attributes. If an attribute is listed here
171
+ and contains a protocol other than those specified (or if it contains no
172
+ protocol at all), it will be removed.
173
+
174
+ ```ruby
175
+ :protocols => {
176
+ 'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
177
+ 'img' => {'src' => ['http', 'https']}
178
+ }
179
+ ```
180
+
181
+ If you'd like to allow the use of relative URLs which don't have a protocol,
182
+ include the symbol `:relative` in the protocol array:
183
+
184
+ ```ruby
185
+ :protocols => {
186
+ 'a' => {'href' => ['http', 'https', :relative]}
187
+ }
188
+ ```
189
+
190
+ #### :remove_contents (boolean or Array)
191
+
192
+ If set to +true+, Sanitize will remove the contents of any non-whitelisted
193
+ elements in addition to the elements themselves. By default, Sanitize leaves the
194
+ safe parts of an element's contents behind when the element is removed.
195
+
196
+ If set to an array of element names, then only the contents of the specified
197
+ elements (when filtered) will be removed, and the contents of all other filtered
198
+ elements will be left behind.
199
+
200
+ The default value is `false`.
201
+
202
+ #### :transformers
203
+
204
+ Custom transformer or array of custom transformers to run using depth-first
205
+ traversal. See the Transformers section below for details.
206
+
207
+ #### :transformers_breadth
208
+
209
+ Custom transformer or array of custom transformers to run using breadth-first
210
+ traversal. See the Transformers section below for details.
211
+
212
+ #### :whitespace_elements (Array)
213
+
214
+ Array of lowercase element names that should be replaced with whitespace when
215
+ removed in order to preserve readability. For example,
216
+ `foo<div>bar</div>baz` will become
217
+ `foo bar baz` when the `<div>` is removed.
218
+
219
+ By default, the following elements are included in the
220
+ `:whitespace_elements` array:
221
+
222
+ ```
223
+ address article aside blockquote br dd div dl dt footer h1 h2 h3 h4 h5
224
+ h6 header hgroup hr li nav ol p pre section ul
225
+ ```
226
+
227
+ ### Transformers
228
+
229
+ Transformers allow you to filter and modify nodes using your own custom logic,
230
+ on top of (or instead of) Sanitize's core filter. A transformer is any object
231
+ that responds to `call()` (such as a lambda or proc).
232
+
233
+ To use one or more transformers, pass them to the `:transformers`
234
+ config setting. You may pass a single transformer or an array of transformers.
235
+
236
+ ```ruby
237
+ Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
238
+ ```
239
+
240
+ #### Input
241
+
242
+ Each registered transformer's `call()` method will be called once for
243
+ each node in the HTML (including elements, text nodes, comments, etc.), and will
244
+ receive as an argument an environment Hash that contains the following items:
245
+
246
+ * **:config** - The current Sanitize configuration Hash.
247
+
248
+ * **:is_whitelisted** - `true` if the current node has been whitelisted by a
249
+ previous transformer, `false` otherwise. It's generally bad form to remove
250
+ a node that a previous transformer has whitelisted.
251
+
252
+ * **:node** - A `Nokogiri::XML::Node` object representing an HTML node. The
253
+ node may be an element, a text node, a comment, a CDATA node, or a document
254
+ fragment. Use Nokogiri's inspection methods (`element?`, `text?`, etc.) to
255
+ selectively ignore node types you aren't interested in.
256
+
257
+ * **:node_name** - The name of the current HTML node, always lowercase (e.g.
258
+ "div" or "span"). For non-element nodes, the name will be something like
259
+ "text", "comment", "#cdata-section", "#document-fragment", etc.
260
+
261
+ * **:node_whitelist** - Set of `Nokogiri::XML::Node` objects in the current
262
+ document that have been whitelisted by previous transformers, if any. It's
263
+ generally bad form to remove a node that a previous transformer has
264
+ whitelisted.
265
+
266
+ * **:traversal_mode** - Current node traversal mode, either `:depth` for
267
+ depth-first (the default mode) or `:breadth` for breadth-first.
268
+
269
+ #### Output
270
+
271
+ A transformer doesn't have to return anything, but may optionally return a Hash,
272
+ which may contain the following items:
273
+
274
+ * **:node_whitelist** - Array or Set of specific Nokogiri::XML::Node objects
275
+ to add to the document's whitelist, bypassing the current Sanitize config.
276
+ These specific nodes and all their attributes will be whitelisted, but
277
+ their children will not be.
278
+
279
+ If a transformer returns anything other than a Hash, the return value will be
280
+ ignored.
281
+
282
+ #### Processing
283
+
284
+ Each transformer has full access to the `Nokogiri::XML::Node` that's passed into
285
+ it and to the rest of the document via the node's `document()` method. Any
286
+ changes made to the current node or to the document will be reflected instantly
287
+ in the document and passed on to subsequently called transformers and to
288
+ Sanitize itself. A transformer may even call Sanitize internally to perform
289
+ custom sanitization if needed.
290
+
291
+ Nodes are passed into transformers in the order in which they're traversed. By
292
+ default, depth-first traversal is used, meaning that markup is traversed from
293
+ the deepest node upward (not from the first node to the last node):
294
+
295
+ ```ruby
296
+ html = '<div><span>foo</span></div>'
297
+ transformer = lambda{|env| puts env[:node_name] }
298
+
299
+ # Prints "text", "span", "div", "#document-fragment".
300
+ Sanitize.clean(html, :transformers => transformer)
301
+ ```
302
+
303
+ You may use the `:transformers_breadth` config to specify one or more
304
+ transformers that should traverse nodes in breadth-first mode:
305
+
306
+ ```ruby
307
+ html = '<div><span>foo</span></div>'
308
+ transformer = lambda{|env| puts env[:node_name] }
309
+
310
+ # Prints "#document-fragment", "div", "span", "text".
311
+ Sanitize.clean(html, :transformers_breadth => transformer)
312
+ ```
313
+
314
+ Transformers have a tremendous amount of power, including the power to
315
+ completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
316
+ your own hands.
317
+
318
+ #### Example: Transformer to whitelist YouTube video embeds
319
+
320
+ The following example demonstrates how to create a depth-first Sanitize
321
+ transformer that will safely whitelist valid YouTube video embeds without having
322
+ to blindly allow other kinds of embedded content, which would be the case if you
323
+ tried to do this by just whitelisting all `<iframe>` elements:
324
+
325
+ ```ruby
326
+ lambda do |env|
327
+ node = env[:node]
328
+ node_name = env[:node_name]
329
+
330
+ # Don't continue if this node is already whitelisted or is not an element.
331
+ return if env[:is_whitelisted] || !node.element?
332
+
333
+ # Don't continue unless the node is an iframe.
334
+ return unless node_name == 'iframe'
335
+
336
+ # Verify that the video URL is actually a valid YouTube video URL.
337
+ return unless node['src'] =~ /\A(https?:)?\/\/(?:www\.)?youtube(?:-nocookie)?\.com\//
338
+
339
+ # We're now certain that this is a YouTube embed, but we still need to run
340
+ # it through a special Sanitize step to ensure that no unwanted elements or
341
+ # attributes that don't belong in a YouTube embed can sneak in.
342
+ Sanitize.clean_node!(node, {
343
+ :elements => %w[iframe],
344
+
345
+ :attributes => {
346
+ 'iframe' => %w[allowfullscreen frameborder height src width]
347
+ }
348
+ })
349
+
350
+ # Now that we're sure that this is a valid YouTube embed and that there are
351
+ # no unwanted elements or attributes hidden inside it, we can tell Sanitize
352
+ # to whitelist the current node.
353
+ {:node_whitelist => [node]}
354
+ end
355
+ ```
356
+
357
+ Contributors
358
+ ------------
359
+
360
+ Sanitize was created and is maintained by Ryan Grove (ryan@wonko.com).
361
+
362
+ The following lovely people have also contributed to Sanitize:
363
+
364
+ * Ben Anderson
365
+ * Wilson Bilkovich
366
+ * Peter Cooper
367
+ * Gabe da Silveira
368
+ * Nicholas Evans
369
+ * Nils Gemeinhardt
370
+ * Adam Hooper
371
+ * Mutwin Kraus
372
+ * Eaden McKee
373
+ * Dev Purkayastha
374
+ * David Reese
375
+ * Ardie Saeidi
376
+ * Rafael Souza
377
+ * Ben Wanicur
378
+
379
+ License
380
+ -------
381
+
382
+ Copyright (c) 2014 Ryan Grove (ryan@wonko.com)
383
+
384
+ Permission is hereby granted, free of charge, to any person obtaining a copy of
385
+ this software and associated documentation files (the 'Software'), to deal in
386
+ the Software without restriction, including without limitation the rights to
387
+ use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
388
+ the Software, and to permit persons to whom the Software is furnished to do so,
389
+ subject to the following conditions:
390
+
391
+ The above copyright notice and this permission notice shall be included in all
392
+ copies or substantial portions of the Software.
393
+
394
+ THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
395
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
396
+ FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
397
+ COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
398
+ IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
399
+ CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
data/lib/sanitize.rb CHANGED
@@ -36,12 +36,26 @@ require 'sanitize/transformers/clean_element'
36
36
  class Sanitize
37
37
  attr_reader :config
38
38
 
39
+ # Matches a valid HTML5 data attribute name. The unicode ranges included here
40
+ # are a conservative subset of the full range of characters that are
41
+ # technically allowed, with the intent of matching the most common characters
42
+ # used in data attribute names while excluding uncommon or potentially
43
+ # misleading characters, or characters with the potential to be normalized
44
+ # into unsafe or confusing forms.
45
+ #
46
+ # If you need data attr names with characters that aren't included here (such
47
+ # as combining marks, full-width characters, or CJK), please consider creating
48
+ # a custom transformer to validate attributes according to your needs.
49
+ #
50
+ # http://www.whatwg.org/specs/web-apps/current-work/multipage/elements.html#embedding-custom-non-visible-data-with-the-data-*-attributes
51
+ REGEX_DATA_ATTR = /\Adata-(?!xml)[a-z_][\w.\u00E0-\u00F6\u00F8-\u017F\u01DD-\u02AF-]*\z/u
52
+
39
53
  # Matches an attribute value that could be treated by a browser as a URL
40
54
  # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
41
55
  # or more characters followed by a colon is considered a match, even if the
42
56
  # colon is encoded as an entity and even if it's an incomplete entity (which
43
57
  # IE6 and Opera will still parse).
44
- REGEX_PROTOCOL = /\A([^\/]*?)(?:\:|&#0*58|&#x0*3a)/i
58
+ REGEX_PROTOCOL = /\A([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i
45
59
 
46
60
  #--
47
61
  # Class Methods
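
Reviewer note: the effect of adding `#` to the excluded character class can be checked with plain Ruby, without loading the gem. The sketch below copies both patterns from this hunk. The constant names `OLD_REGEX_PROTOCOL` and `NEW_REGEX_PROTOCOL` and the helper `captured_protocol` are illustrative names that do not exist in Sanitize, and how the library consumes the capture is not shown in this diff.

```ruby
# Both patterns are copied verbatim from the hunk above. The constant names
# and the helper are hypothetical and exist only for this comparison.
OLD_REGEX_PROTOCOL = /\A([^\/]*?)(?:\:|&#0*58|&#x0*3a)/i
NEW_REGEX_PROTOCOL = /\A([^\/#]*?)(?:\:|&#0*58|&#x0*3a)/i

# Returns the text captured before the first colon (or encoded colon),
# or nil when the value does not look like it carries a protocol.
def captured_protocol(regex, value)
  match = regex.match(value)
  match && match[1].downcase
end

['#fn:1', 'somepage#fn:1', 'http://foo.com/', 'JavaScript&#58;alert(1)'].each do |value|
  old_result = captured_protocol(OLD_REGEX_PROTOCOL, value)
  new_result = captured_protocol(NEW_REGEX_PROTOCOL, value)
  puts "#{value.inspect}: old=#{old_result.inspect} new=#{new_result.inspect}"
end
```

With the old pattern, `#fn:1` produces the capture `#fn`, so the fragment looks like an unknown protocol. With the new pattern neither fragment link matches at all, while the `http` prefix and the entity-encoded colon after `JavaScript` are still detected.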
@@ -99,7 +113,7 @@ class Sanitize
99
113
  Transformers::CleanElement.new(@config)
100
114
  end
101
115
 
102
- # Returns a sanitized copy of _html_.
116
+ # Returns a sanitized copy of the given _html_ fragment.
103
117
  def clean(html)
104
118
  if html
105
119
  dupe = html.dup
@@ -129,12 +143,15 @@ class Sanitize
129
143
  return result == html ? nil : html[0, html.length] = result
130
144
  end
131
145
 
146
+ # Returns a sanitized copy of the given full _html_ document.
132
147
  def clean_document(html)
133
148
  unless html.nil?
134
149
  clean_document!(html.dup) || html
135
150
  end
136
151
  end
137
152
 
153
+ # Performs clean_document in place, returning _html_, or +nil+ if no changes
154
+ # were made.
138
155
  def clean_document!(html)
139
156
  if !@config[:elements].include?('html') && !@config[:remove_contents]
140
157
  raise 'You must have the HTML element whitelisted to call #clean_document unless remove_contents is set to true'
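
Reviewer note: the two comments added in this hunk describe the usual Ruby convention in which the bang method mutates its argument and returns `nil` when nothing changed, and the non-bang method works on a copy and falls back to the original. A minimal sketch of that convention, using the standard library's `String#squeeze!` as a stand-in for the sanitizing step; `squeeze_copy` is a hypothetical helper, not part of Sanitize:

```ruby
# String#squeeze! follows the same contract as the documented clean_document!:
# it modifies the receiver and returns it, or returns nil if nothing changed.
# squeeze_copy is a hypothetical helper that mirrors the shape of
# clean_document: operate on a dup, fall back to the original when unchanged.
def squeeze_copy(text)
  return nil if text.nil?
  text.dup.squeeze!(' ') || text
end

original = 'foo   bar'
cleaned  = squeeze_copy(original)

puts cleaned   # foo bar
puts original  # foo   bar (the caller's string is left untouched)
```

The `|| text` fallback matters because the bang method's `nil` return would otherwise leak out to callers who expect a string.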
data/lib/sanitize/config.rb CHANGED
@@ -34,7 +34,8 @@ class Sanitize
34
34
  :add_attributes => {},
35
35
 
36
36
  # HTML attributes to allow in specific elements. By default, no attributes
37
- # are allowed.
37
+ # are allowed. Use the symbol :data to indicate that arbitrary HTML5
38
+ # data-* attributes should be allowed.
38
39
  :attributes => {},
39
40
 
40
41
  # HTML elements to allow. By default, no elements are allowed (which means
data/lib/sanitize/config/relaxed.rb CHANGED
@@ -24,10 +24,10 @@ class Sanitize
24
24
  module Config
25
25
  RELAXED = {
26
26
  :elements => %w[
27
- a abbr b bdo blockquote br caption cite code col colgroup dd del dfn dl
28
- dt em figcaption figure h1 h2 h3 h4 h5 h6 hgroup i img ins kbd li mark
29
- ol p pre q rp rt ruby s samp small strike strong sub sup table tbody td
30
- tfoot th thead time tr u ul var wbr
27
+ a abbr address b bdi bdo blockquote br caption cite code col colgroup dd
28
+ del dfn dl dt em figcaption figure h1 h2 h3 h4 h5 h6 hgroup hr i img ins
29
+ kbd li mark ol p pre q rp rt ruby s samp small strike strong sub summary
30
+ sup table tbody td tfoot th thead time tr u ul var wbr
31
31
  ],
32
32
 
33
33
  :attributes => {
data/lib/sanitize/transformers/clean_element.rb CHANGED
@@ -49,13 +49,29 @@ class Sanitize; module Transformers
49
49
  attr_whitelist = Set.new((@attributes[name] || []) +
50
50
  (@attributes[:all] || []))
51
51
 
52
+ allow_data_attributes = attr_whitelist.include?(:data)
53
+
52
54
  if attr_whitelist.empty?
53
55
  # Delete all attributes from elements with no whitelisted attributes.
54
56
  node.attribute_nodes.each {|attr| attr.unlink }
55
57
  else
56
- # Delete any attribute that isn't in the whitelist for this element.
58
+ # Delete any attribute that isn't allowed on this element.
57
59
  node.attribute_nodes.each do |attr|
58
- attr.unlink unless attr_whitelist.include?(attr.name.downcase)
60
+ attr_name = attr.name.downcase
61
+
62
+ unless attr_whitelist.include?(attr_name)
63
+ # The attribute isn't explicitly whitelisted.
64
+
65
+ if allow_data_attributes && attr_name.start_with?('data-')
66
+ # Arbitrary data attributes are allowed. Verify that the attribute
67
+ # is a valid data attribute.
68
+ attr.unlink unless attr_name =~ REGEX_DATA_ATTR
69
+ else
70
+ # Either the attribute isn't a data attribute, or arbitrary data
71
+ # attributes aren't allowed. Remove the attribute.
72
+ attr.unlink
73
+ end
74
+ end
59
75
  end
60
76
 
61
77
  # Delete remaining attributes that use unacceptable protocols.
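
Reviewer note: the new branch can be exercised in isolation. The sketch below copies `REGEX_DATA_ATTR` from the `lib/sanitize.rb` hunk and restates the keep-or-unlink decision as a predicate. `keep_attribute?` is a hypothetical helper written for this note; the real code works on Nokogiri attribute nodes and calls `unlink` instead of returning a boolean.

```ruby
require 'set'

# Copied from the lib/sanitize.rb hunk in this diff.
REGEX_DATA_ATTR = /\Adata-(?!xml)[a-z_][\w.\u00E0-\u00F6\u00F8-\u017F\u01DD-\u02AF-]*\z/u

# Hypothetical predicate mirroring the decision made in the hunk above:
# explicitly whitelisted names are kept, and data-* names are kept only when
# :data is in the whitelist and the name matches REGEX_DATA_ATTR.
def keep_attribute?(attr_name, attr_whitelist)
  name = attr_name.downcase
  return true if attr_whitelist.include?(name)

  attr_whitelist.include?(:data) &&
    name.start_with?('data-') &&
    !(name =~ REGEX_DATA_ATTR).nil?
end

whitelist = Set.new(['class', :data])

%w[class data-foo data-foo.bar-baz data- data-xml data-xmlfoo data-f:oo data-1foo onclick].each do |name|
  puts "#{name}: #{keep_attribute?(name, whitelist)}"
end
```

Only `class`, `data-foo`, and `data-foo.bar-baz` are kept. The empty suffix, the reserved `xml` prefix, the colon, and the leading digit are all rejected by the pattern, and `onclick` is rejected because it is neither whitelisted nor a data attribute.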
data/lib/sanitize/version.rb CHANGED
@@ -1,3 +1,3 @@
1
1
  class Sanitize
2
- VERSION = '2.0.6'
2
+ VERSION = '2.1.0'
3
3
  end
data/test/test_sanitize.rb CHANGED
@@ -344,6 +344,16 @@ describe 'Custom configs' do
344
344
  Sanitize.clean(input, { :elements => ['a'], :attributes => {'a' => ['href']}, :protocols => { 'a' => { 'href' => [:relative] }} }).must_equal(input)
345
345
  end
346
346
 
347
+ it 'should allow relative URLs containing colons where the colon is part of an anchor' do
348
+ input = '<a href="#fn:1">Footnote 1</a>'
349
+ Sanitize.clean(input, { :elements => ['a'], :attributes => {'a' => ['href']}, :protocols => { 'a' => { 'href' => [:relative] }} }).must_equal(input)
350
+ end
351
+
352
+ it 'should allow relative URLs containing colons where the colon is part of an anchor' do
353
+ input = '<a href="somepage#fn:1">Footnote 1</a>'
354
+ Sanitize.clean(input, { :elements => ['a'], :attributes => {'a' => ['href']}, :protocols => { 'a' => { 'href' => [:relative] }} }).must_equal(input)
355
+ end
356
+
347
357
  it 'should output HTML when :output == :html' do
348
358
  input = 'foo<br/>bar<br>baz'
349
359
  Sanitize.clean(input, :elements => ['br'], :output => :html).must_equal('foo<br>bar<br>baz')
@@ -366,6 +376,51 @@ describe 'Custom configs' do
366
376
  Sanitize.clean(html).must_equal("foo\302\240bar")
367
377
  Sanitize.clean(html, :output_encoding => 'ASCII').must_equal("foo&#160;bar")
368
378
  end
379
+
380
+ it 'should not allow arbitrary HTML5 data attributes by default' do
381
+ config = {
382
+ :elements => ['b']
383
+ }
384
+
385
+ Sanitize.clean('<b data-foo="bar"></b>', config)
386
+ .must_equal('<b></b>')
387
+
388
+ config[:attributes] = {'b' => ['class']}
389
+
390
+ Sanitize.clean('<b class="foo" data-foo="bar"></b>', config)
391
+ .must_equal('<b class="foo"></b>')
392
+ end
393
+
394
+ it 'should allow arbitrary HTML5 data attributes when the :attributes config includes :data' do
395
+ config = {
396
+ :attributes => {'b' => [:data]},
397
+ :elements => ['b']
398
+ }
399
+
400
+ Sanitize.clean('<b data-foo="valid" data-bar="valid"></b>', config)
401
+ .must_equal('<b data-foo="valid" data-bar="valid"></b>')
402
+
403
+ Sanitize.clean('<b data-="invalid"></b>', config)
404
+ .must_equal('<b></b>')
405
+
406
+ Sanitize.clean('<b data-="invalid"></b>', config)
407
+ .must_equal('<b></b>')
408
+
409
+ Sanitize.clean('<b data-xml="invalid"></b>', config)
410
+ .must_equal('<b></b>')
411
+
412
+ Sanitize.clean('<b data-xmlfoo="invalid"></b>', config)
413
+ .must_equal('<b></b>')
414
+
415
+ Sanitize.clean('<b data-f:oo="valid"></b>', config)
416
+ .must_equal('<b></b>')
417
+
418
+ Sanitize.clean('<b data-f/oo="partial"></b>', config)
419
+ .must_equal('<b data-f></b>') # Nokogiri quirk; not ideal, but harmless
420
+
421
+ Sanitize.clean('<b data-éfoo="valid"></b>', config)
422
+ .must_equal('<b></b>') # Another annoying Nokogiri quirk.
423
+ end
369
424
  end
370
425
 
371
426
  describe 'Sanitize.clean' do
metadata CHANGED
@@ -1,57 +1,85 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: sanitize
3
3
  version: !ruby/object:Gem::Version
4
- version: 2.0.6
4
+ version: 2.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ryan Grove
8
8
  autorequire:
9
9
  bindir: bin
10
10
  cert_chain: []
11
- date: 2013-07-11 00:00:00.000000000 Z
11
+ date: 2014-01-13 00:00:00.000000000 Z
12
12
  dependencies:
13
13
  - !ruby/object:Gem::Dependency
14
14
  name: nokogiri
15
15
  requirement: !ruby/object:Gem::Requirement
16
16
  requirements:
17
- - - '>='
17
+ - - ">="
18
18
  - !ruby/object:Gem::Version
19
19
  version: 1.4.4
20
20
  type: :runtime
21
21
  prerelease: false
22
22
  version_requirements: !ruby/object:Gem::Requirement
23
23
  requirements:
24
- - - '>='
24
+ - - ">="
25
25
  - !ruby/object:Gem::Version
26
26
  version: 1.4.4
27
27
  - !ruby/object:Gem::Dependency
28
28
  name: minitest
29
29
  requirement: !ruby/object:Gem::Requirement
30
30
  requirements:
31
- - - '>='
31
+ - - "~>"
32
32
  - !ruby/object:Gem::Version
33
- version: 2.0.0
33
+ version: '4.7'
34
34
  type: :development
35
35
  prerelease: false
36
36
  version_requirements: !ruby/object:Gem::Requirement
37
37
  requirements:
38
- - - '>='
38
+ - - "~>"
39
39
  - !ruby/object:Gem::Version
40
- version: 2.0.0
40
+ version: '4.7'
41
41
  - !ruby/object:Gem::Dependency
42
42
  name: rake
43
43
  requirement: !ruby/object:Gem::Requirement
44
44
  requirements:
45
- - - '>='
45
+ - - "~>"
46
46
  - !ruby/object:Gem::Version
47
- version: '0.9'
47
+ version: '10.1'
48
48
  type: :development
49
49
  prerelease: false
50
50
  version_requirements: !ruby/object:Gem::Requirement
51
51
  requirements:
52
- - - '>='
52
+ - - "~>"
53
53
  - !ruby/object:Gem::Version
54
- version: '0.9'
54
+ version: '10.1'
55
+ - !ruby/object:Gem::Dependency
56
+ name: redcarpet
57
+ requirement: !ruby/object:Gem::Requirement
58
+ requirements:
59
+ - - "~>"
60
+ - !ruby/object:Gem::Version
61
+ version: 3.0.0
62
+ type: :development
63
+ prerelease: false
64
+ version_requirements: !ruby/object:Gem::Requirement
65
+ requirements:
66
+ - - "~>"
67
+ - !ruby/object:Gem::Version
68
+ version: 3.0.0
69
+ - !ruby/object:Gem::Dependency
70
+ name: yard
71
+ requirement: !ruby/object:Gem::Requirement
72
+ requirements:
73
+ - - "~>"
74
+ - !ruby/object:Gem::Version
75
+ version: 0.8.7
76
+ type: :development
77
+ prerelease: false
78
+ version_requirements: !ruby/object:Gem::Requirement
79
+ requirements:
80
+ - - "~>"
81
+ - !ruby/object:Gem::Version
82
+ version: 0.8.7
55
83
  description:
56
84
  email: ryan@wonko.com
57
85
  executables: []
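
Reviewer note: this hunk moves the development dependencies from open-ended `>=` constraints to pessimistic `~>` constraints. The sketch below uses RubyGems' own `Gem::Requirement` and `Gem::Version` classes to show which versions each constraint accepts. The candidate version numbers are arbitrary examples chosen for illustration, not versions referenced by this release.

```ruby
require 'rubygems'

# Constraint strings are taken from the metadata hunk above; the candidate
# versions are arbitrary examples, not versions named anywhere in this diff.
checks = {
  '>= 2.0.0' => %w[2.0.0 5.0.0],  # old minitest constraint: no upper bound
  '~> 4.7'   => %w[4.7.5 5.0.0],  # new minitest constraint: >= 4.7, < 5.0
  '~> 10.1'  => %w[10.9 11.0],    # new rake constraint:     >= 10.1, < 11.0
  '~> 3.0.0' => %w[3.0.9 3.1.0]   # redcarpet constraint:    >= 3.0.0, < 3.1
}

checks.each do |constraint, versions|
  requirement = Gem::Requirement.new(constraint)
  versions.each do |version|
    accepted = requirement.satisfied_by?(Gem::Version.new(version))
    puts "#{constraint} accepts #{version}? #{accepted}"
  end
end
```

The practical difference is that the old `>= 2.0.0` constraint would accept a future major release of minitest, while `~> 4.7` stops at the next major version.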
@@ -60,16 +88,16 @@ extra_rdoc_files: []
60
88
  files:
61
89
  - HISTORY.md
62
90
  - LICENSE
63
- - README.rdoc
91
+ - README.md
92
+ - lib/sanitize.rb
93
+ - lib/sanitize/config.rb
64
94
  - lib/sanitize/config/basic.rb
65
95
  - lib/sanitize/config/relaxed.rb
66
96
  - lib/sanitize/config/restricted.rb
67
- - lib/sanitize/config.rb
68
97
  - lib/sanitize/transformers/clean_cdata.rb
69
98
  - lib/sanitize/transformers/clean_comment.rb
70
99
  - lib/sanitize/transformers/clean_element.rb
71
100
  - lib/sanitize/version.rb
72
- - lib/sanitize.rb
73
101
  - test/test_sanitize.rb
74
102
  homepage: https://github.com/rgrove/sanitize/
75
103
  licenses: []
@@ -80,17 +108,17 @@ require_paths:
80
108
  - lib
81
109
  required_ruby_version: !ruby/object:Gem::Requirement
82
110
  requirements:
83
- - - '>='
111
+ - - ">="
84
112
  - !ruby/object:Gem::Version
85
113
  version: 1.9.2
86
114
  required_rubygems_version: !ruby/object:Gem::Requirement
87
115
  requirements:
88
- - - '>='
116
+ - - ">="
89
117
  - !ruby/object:Gem::Version
90
118
  version: 1.2.0
91
119
  requirements: []
92
120
  rubyforge_project:
93
- rubygems_version: 2.0.0
121
+ rubygems_version: 2.2.0
94
122
  signing_key:
95
123
  specification_version: 4
96
124
  summary: Whitelist-based HTML sanitizer.
data/README.rdoc REMOVED
@@ -1,367 +0,0 @@
1
- = Sanitize
2
-
3
- Sanitize is a whitelist-based HTML sanitizer. Given a list of acceptable
4
- elements and attributes, Sanitize will remove all unacceptable HTML from a
5
- string.
6
-
7
- Using a simple configuration syntax, you can tell Sanitize to allow certain
8
- elements, certain attributes within those elements, and even certain URL
9
- protocols within attributes that contain URLs. Any HTML elements or attributes
10
- that you don't explicitly allow will be removed.
11
-
12
- Because it's based on Nokogiri, a full-fledged HTML parser, rather than a bunch
13
- of fragile regular expressions, Sanitize has no trouble dealing with malformed
14
- or maliciously-formed HTML, and will always output valid HTML or XHTML.
15
-
16
- *Author*:: Ryan Grove (mailto:ryan@wonko.com)
17
- *Version*:: 2.0.6 (2013-07-10)
18
- *Copyright*:: Copyright (c) 2013 Ryan Grove. All rights reserved.
19
- *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
20
- *Website*:: http://github.com/rgrove/sanitize
21
-
22
- {<img src="https://secure.travis-ci.org/rgrove/sanitize.png?branch=master" alt="Build Status" />}[http://travis-ci.org/rgrove/sanitize]
23
-
24
- == Installation
25
-
26
- Latest stable release:
27
-
28
- gem install sanitize
29
-
30
- Latest development version:
31
-
32
- gem install sanitize --pre
33
-
34
- == Usage
35
-
36
- If you don't specify any configuration options, Sanitize will use its strictest
37
- settings by default, which means it will strip all HTML and leave only text
38
- behind.
39
-
40
- require 'rubygems'
41
- require 'sanitize'
42
-
43
- html = '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'
44
-
45
- Sanitize.clean(html) # => 'foo'
46
-
47
- ...
48
-
49
- # or sanitize an entire HTML document (example assumes _html_ is whitelisted)
50
- html = '<!DOCTYPE html><html><b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg"></html>'
51
- Sanitize.clean_document(html) # => '<!DOCTYPE html>\n<html>foo</html>\n'
52
-
53
- == Configuration
54
-
55
- In addition to the ultra-safe default settings, Sanitize comes with three other
56
- built-in modes.
57
-
58
- === Sanitize::Config::RESTRICTED
59
-
60
- Allows only very simple inline formatting markup. No links, images, or block
61
- elements.
62
-
63
- Sanitize.clean(html, Sanitize::Config::RESTRICTED) # => '<b>foo</b>'
64
-
65
- === Sanitize::Config::BASIC
66
-
67
- Allows a variety of markup including formatting tags, links, and lists. Images
68
- and tables are not allowed, links are limited to FTP, HTTP, HTTPS, and mailto
69
- protocols, and a <code>rel="nofollow"</code> attribute is added to all links to
70
- mitigate SEO spam.
71
-
72
- Sanitize.clean(html, Sanitize::Config::BASIC)
73
- # => '<b><a href="http://foo.com/" rel="nofollow">foo</a></b>'
74
-
75
- === Sanitize::Config::RELAXED
76
-
77
- Allows an even wider variety of markup than BASIC, including images and tables.
78
- Links are still limited to FTP, HTTP, HTTPS, and mailto protocols, while images
79
- are limited to HTTP and HTTPS. In this mode, <code>rel="nofollow"</code> is not
80
- added to links.
81
-
82
- Sanitize.clean(html, Sanitize::Config::RELAXED)
83
- # => '<b><a href="http://foo.com/">foo</a></b><img src="http://foo.com/bar.jpg">'
84
-
85
- === Custom Configuration
86
-
87
- If the built-in modes don't meet your needs, you can easily specify a custom
88
- configuration:
89
-
90
- Sanitize.clean(html, :elements => ['a', 'span'],
91
- :attributes => {'a' => ['href', 'title'], 'span' => ['class']},
92
- :protocols => {'a' => {'href' => ['http', 'https', 'mailto']}})
93
-
94
- ==== :add_attributes (Hash)
95
-
96
- Attributes to add to specific elements. If the attribute already exists, it will
97
- be replaced with the value specified here. Specify all element names and
98
- attributes in lowercase.
99
-
100
- :add_attributes => {
101
- 'a' => {'rel' => 'nofollow'}
102
- }
103
-
104
- ==== :attributes (Hash)
105
-
106
- Attributes to allow for specific elements. Specify all element names and
107
- attributes in lowercase.
108
-
109
- :attributes => {
110
- 'a' => ['href', 'title'],
111
- 'blockquote' => ['cite'],
112
- 'img' => ['alt', 'src', 'title']
113
- }
114
-
115
- If you'd like to allow certain attributes on all elements, use the symbol
116
- <code>:all</code> instead of an element name.
117
-
118
- :attributes => {
119
- :all => ['class'],
120
- 'a' => ['href', 'title']
121
- }
122
-
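One way to picture how <code>:all</code> combines with per-element entries is the plain-Ruby sketch below. The helper name +allowed_attributes+ is hypothetical and is not part of Sanitize; it only illustrates the lookup described above, under the assumption that the two lists are simply unioned.

```ruby
# Hypothetical helper, not part of Sanitize: returns the attribute names
# permitted on +element+ given an :attributes-style Hash.
def allowed_attributes(attributes_config, element)
  # Attributes listed under :all apply to every element; per-element
  # entries are added on top of them.
  (attributes_config[:all] || []) | (attributes_config[element] || [])
end

config = {
  :all => ['class'],
  'a'  => ['href', 'title']
}

p allowed_attributes(config, 'a')    # ["class", "href", "title"]
p allowed_attributes(config, 'span') # ["class"]
```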
123
- ==== :allow_comments (boolean)
124
-
125
- Whether or not to allow HTML comments. Allowing comments is strongly
126
- discouraged, since IE allows script execution within conditional comments. The
127
- default value is <code>false</code>.
128
-
129
- ==== :elements (Array)
130
-
131
- Array of element names to allow. Specify all names in lowercase.
132
-
133
- :elements => %w[
134
- a abbr b blockquote br cite code dd dfn dl dt em i kbd li mark ol p pre
135
- q s samp small strike strong sub sup time u ul var
136
- ]
137
-
138
- ==== :output (Symbol)
139
-
140
- Output format. Supported formats are <code>:html</code> and <code>:xhtml</code>,
141
- defaulting to <code>:html</code>.
142
-
143
- ==== :output_encoding (String)
144
-
145
- Character encoding to use for HTML output. Default is <code>utf-8</code>.
146
-
147
- ==== :protocols (Hash)
148
-
149
- URL protocols to allow in specific attributes. If an attribute is listed here
150
- and contains a protocol other than those specified (or if it contains no
151
- protocol at all), it will be removed.
152
-
153
- :protocols => {
154
- 'a' => {'href' => ['ftp', 'http', 'https', 'mailto']},
155
- 'img' => {'src' => ['http', 'https']}
156
- }
157
-
158
- If you'd like to allow the use of relative URLs which don't have a protocol,
159
- include the symbol <code>:relative</code> in the protocol array:
160
-
161
- :protocols => {
162
- 'a' => {'href' => ['http', 'https', :relative]}
163
- }
164
-
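The protocol check can be sketched in plain Ruby as follows. This is a simplified illustration, not Sanitize's actual implementation: the helper name +url_allowed?+ and the regular expression are assumptions made for the example, and a production check would also need to handle encoded and control characters.

```ruby
# Hypothetical helper, not Sanitize's code: decides whether +url+ passes
# a :protocols-style whitelist such as ['http', 'https', :relative].
def url_allowed?(url, protocols)
  # A protocol is whatever precedes the first colon, provided no "/",
  # "#", or "?" comes before that colon. This keeps a fragment such as
  # "#foo:1" from being mistaken for a URL with a protocol.
  if url =~ /\A\s*([^\/#?:]*):/
    protocols.include?(Regexp.last_match(1).downcase)
  else
    # No protocol at all: allowed only when :relative is whitelisted.
    protocols.include?(:relative)
  end
end

protocols = ['http', 'https', :relative]

p url_allowed?('http://foo.com/', protocols)       # true
p url_allowed?('javascript:alert(1)', protocols)   # false
p url_allowed?('/foo/bar', protocols)              # true
p url_allowed?('/foo/bar', ['http', 'https'])      # false
```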
165
- ==== :remove_contents (boolean or Array)
166
-
167
- If set to +true+, Sanitize will remove the contents of any non-whitelisted
168
- elements in addition to the elements themselves. By default, Sanitize leaves the
169
- safe parts of an element's contents behind when the element is removed.
170
-
171
- If set to an array of element names, then only the contents of the specified
172
- elements (when filtered) will be removed, and the contents of all other filtered
173
- elements will be left behind.
174
-
175
- The default value is <code>false</code>.
176
-
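The three accepted forms of this setting can be summarized with a small plain-Ruby sketch. The helper +remove_contents?+ is hypothetical and exists only to illustrate the rules above; it is not how Sanitize is implemented.

```ruby
# Hypothetical helper, not part of Sanitize: reports whether the contents
# of a filtered element would be dropped along with the element itself.
def remove_contents?(setting, element_name)
  case setting
  when true  then true                           # drop contents of every filtered element
  when Array then setting.include?(element_name) # drop only for the listed elements
  else false                                     # default: keep the safe contents
  end
end

p remove_contents?(true, 'script')                 # true
p remove_contents?(['script', 'style'], 'script')  # true
p remove_contents?(['script', 'style'], 'div')     # false
p remove_contents?(false, 'script')                # false
```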
177
- ==== :transformers
178
-
179
- Custom transformer or array of custom transformers to run using depth-first
180
- traversal. See the Transformers section below for details.
181
-
182
- ==== :transformers_breadth
183
-
184
- Custom transformer or array of custom transformers to run using breadth-first
185
- traversal. See the Transformers section below for details.
186
-
187
- ==== :whitespace_elements (Array)
188
-
189
- Array of lowercase element names that should be replaced with whitespace when
190
- removed in order to preserve readability. For example,
191
- <code>foo<div>bar</div>baz</code> will become
192
- <code>foo bar baz</code> when the <code><div></code> is removed.
193
-
194
- By default, the following elements are included in the
195
- <code>:whitespace_elements</code> array:
196
-
197
- address article aside blockquote br dd div dl dt footer h1 h2 h3 h4 h5
198
- h6 header hgroup hr li nav ol p pre section ul
199
-
200
- === Transformers
201
-
202
- Transformers allow you to filter and modify nodes using your own custom logic,
203
- on top of (or instead of) Sanitize's core filter. A transformer is any object
204
- that responds to <code>call()</code> (such as a lambda or proc).
205
-
206
- To use one or more transformers, pass them to the <code>:transformers</code>
207
- config setting. You may pass a single transformer or an array of transformers.
208
-
209
- Sanitize.clean(html, :transformers => [transformer_one, transformer_two])
210
-
211
- ==== Input
212
-
213
- Each registered transformer's <code>call()</code> method will be called once for
214
- each node in the HTML (including elements, text nodes, comments, etc.), and will
215
- receive as an argument an environment Hash that contains the following items:
216
-
217
- [<code>:config</code>]
218
- The current Sanitize configuration Hash.
219
-
220
- [<code>:is_whitelisted</code>]
221
- <code>true</code> if the current node has been whitelisted by a previous
222
- transformer, <code>false</code> otherwise. It's generally bad form to remove a
223
- node that a previous transformer has whitelisted.
224
-
225
- [<code>:node</code>]
226
- A Nokogiri::XML::Node object representing an HTML node. The node may be an
227
- element, a text node, a comment, a CDATA node, or a document fragment. Use
228
- Nokogiri's inspection methods (<code>element?</code>, <code>text?</code>,
229
- etc.) to selectively ignore node types you aren't interested in.
230
-
231
- [<code>:node_name</code>]
232
- The name of the current HTML node, always lowercase (e.g. "div" or "span").
233
- For non-element nodes, the name will be something like "text", "comment",
234
- "#cdata-section", "#document-fragment", etc.
235
-
236
- [<code>:node_whitelist</code>]
237
- Set of Nokogiri::XML::Node objects in the current document that have been
238
- whitelisted by previous transformers, if any. It's generally bad form to
239
- remove a node that a previous transformer has whitelisted.
240
-
241
- [<code>:traversal_mode</code>]
242
- Current node traversal mode, either <code>:depth</code> for depth-first (the
243
- default mode) or <code>:breadth</code> for breadth-first.
244
-
245
- ==== Output
246
-
247
- A transformer doesn't have to return anything, but may optionally return a Hash,
248
- which may contain the following items:
249
-
250
- [<code>:node_whitelist</code>]
251
- Array or Set of specific Nokogiri::XML::Node objects to add to the document's
252
- whitelist, bypassing the current Sanitize config. These specific nodes and all
253
- their attributes will be whitelisted, but their children will not be.
254
-
255
- If a transformer returns anything other than a Hash, the return value will be
256
- ignored.
257
-
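The handling of return values described above might be pictured like this. The helper +merge_transformer_result+ is a hypothetical name, and symbols stand in for Nokogiri nodes so the sketch runs without any gems; Sanitize's own bookkeeping may differ.

```ruby
require 'set'

# Hypothetical helper, not Sanitize's code: folds one transformer's return
# value into the running whitelist Set (modified in place and returned).
def merge_transformer_result(whitelist, result)
  # Anything other than a Hash (including nil) is ignored.
  return whitelist unless result.is_a?(Hash)

  # :node_whitelist may be an Array or a Set; Array() normalizes both,
  # and turns a missing key (nil) into an empty list.
  whitelist.merge(Array(result[:node_whitelist]))
end

whitelist = Set.new
merge_transformer_result(whitelist, { :node_whitelist => [:iframe_node] })
merge_transformer_result(whitelist, 'ignored')
merge_transformer_result(whitelist, nil)

p whitelist.to_a # [:iframe_node]
```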
258
- ==== Processing
259
-
260
- Each transformer has full access to the Nokogiri::XML::Node that's passed into
261
- it and to the rest of the document via the node's <code>document()</code>
262
- method. Any changes made to the current node or to the document will be
263
- reflected instantly in the document and passed on to subsequently-called
264
- transformers and to Sanitize itself. A transformer may even call Sanitize
265
- internally to perform custom sanitization if needed.
266
-
267
- Nodes are passed into transformers in the order in which they're traversed. By
268
- default, depth-first traversal is used, meaning that markup is traversed from
269
- the deepest node upward (not from the first node to the last node):
270
-
271
- html = '<div><span>foo</span></div>'
272
- transformer = lambda{|env| puts env[:node_name] }
273
-
274
- # Prints "text", "span", "div", "#document-fragment".
275
- Sanitize.clean(html, :transformers => transformer)
276
-
277
- You may use the <code>:transformers_breadth</code> config to specify one or more
278
- transformers that should traverse nodes in breadth-first mode:
279
-
280
- html = '<div><span>foo</span></div>'
281
- transformer = lambda{|env| puts env[:node_name] }
282
-
283
- # Prints "#document-fragment", "div", "span", "text".
284
- Sanitize.clean(html, :transformers_breadth => transformer)
285
-
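To make the two visiting orders concrete without depending on Sanitize or Nokogiri, here is a minimal sketch that walks a hand-built tree mirroring <code><div><span>foo</span></div></code>. The +Node+ struct and both traversal methods are stand-ins invented for this example.

```ruby
# Stand-in for a parsed node; this is not a Nokogiri class.
Node = Struct.new(:name, :children)

# Depth-first as described above: children are visited before their parent.
def depth_first(node, &block)
  node.children.each { |child| depth_first(child, &block) }
  block.call(node)
end

# Breadth-first: nodes are visited level by level, starting at the root.
def breadth_first(root)
  queue = [root]
  until queue.empty?
    node = queue.shift
    yield node
    queue.concat(node.children)
  end
end

tree = Node.new('#document-fragment', [
  Node.new('div', [
    Node.new('span', [Node.new('text', [])])
  ])
])

depth = []
depth_first(tree) { |n| depth << n.name }
breadth = []
breadth_first(tree) { |n| breadth << n.name }

p depth   # ["text", "span", "div", "#document-fragment"]
p breadth # ["#document-fragment", "div", "span", "text"]
```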
286
- Transformers have a tremendous amount of power, including the power to
287
- completely bypass Sanitize's built-in filtering. Be careful! Your safety is in
288
- your own hands.
289
-
290
- ==== Example: Transformer to whitelist YouTube video embeds
291
-
292
- The following example demonstrates how to create a depth-first Sanitize
293
- transformer that will safely whitelist valid YouTube video embeds without having
294
- to blindly allow other kinds of embedded content, which would be the case if you
295
- tried to do this by just whitelisting all <code><iframe></code> elements:
296
-
297
- lambda do |env|
298
- node = env[:node]
299
- node_name = env[:node_name]
300
-
301
- # Don't continue if this node is already whitelisted or is not an element.
302
- return if env[:is_whitelisted] || !node.element?
303
-
304
- # Don't continue unless the node is an iframe.
305
- return unless node_name == 'iframe'
306
-
307
- # Verify that the video URL is actually a valid YouTube video URL.
308
- return unless node['src'] =~ /\Ahttps?:\/\/(?:www\.)?youtube(?:-nocookie)?\.com\//
309
-
310
- # We're now certain that this is a YouTube embed, but we still need to run
311
- # it through a special Sanitize step to ensure that no unwanted elements or
312
- # attributes that don't belong in a YouTube embed can sneak in.
313
- Sanitize.clean_node!(node, {
314
- :elements => %w[iframe],
315
-
316
- :attributes => {
317
- 'iframe' => %w[allowfullscreen frameborder height src width]
318
- }
319
- })
320
-
321
- # Now that we're sure that this is a valid YouTube embed and that there are
322
- # no unwanted elements or attributes hidden inside it, we can tell Sanitize
323
- # to whitelist the current node.
324
- {:node_whitelist => [node]}
325
- end
326
-
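The URL check is the part of this transformer that decides what gets through, so it can be worth exercising the pattern on its own. The snippet below reuses the regular expression from the example above; the constant name +YOUTUBE_SRC+ and the sample URLs are made up for illustration.

```ruby
# Same pattern as in the transformer above, pulled out into a constant.
YOUTUBE_SRC = /\Ahttps?:\/\/(?:www\.)?youtube(?:-nocookie)?\.com\//

samples = [
  'http://www.youtube.com/embed/abc123',       # plain embed URL
  'https://youtube-nocookie.com/embed/abc123', # privacy-enhanced domain
  'http://www.youtube.com.evil.com/embed/x',   # look-alike host
  'http://evil.com/?http://www.youtube.com/',  # YouTube URL buried in a query
  '//www.youtube.com/embed/abc123'             # protocol-relative URL
]

samples.each do |src|
  puts "#{src.match?(YOUTUBE_SRC) ? 'allow ' : 'reject'} #{src}"
end
```

Only the first two samples are allowed. The <code>\A</code> anchor and the slash required directly after <code>.com</code> are what reject the look-alike host and the buried URL; protocol-relative URLs are rejected because the pattern requires an explicit <code>http</code> or <code>https</code> scheme.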
327
- == Contributors
328
-
329
- Sanitize was created and is maintained by Ryan Grove (ryan@wonko.com).
330
-
331
- The following lovely people have also contributed to Sanitize:
332
-
333
- * Ben Anderson
334
- * Wilson Bilkovich
335
- * Peter Cooper
336
- * Gabe da Silveira
337
- * Nicholas Evans
338
- * Nils Gemeinhardt
339
- * Adam Hooper
340
- * Mutwin Kraus
341
- * Eaden McKee
342
- * Dev Purkayastha
343
- * David Reese
344
- * Ardie Saeidi
345
- * Rafael Souza
346
- * Ben Wanicur
347
-
348
- == License
349
-
350
- Copyright (c) 2013 Ryan Grove (ryan@wonko.com)
351
-
352
- Permission is hereby granted, free of charge, to any person obtaining a copy of
353
- this software and associated documentation files (the 'Software'), to deal in
354
- the Software without restriction, including without limitation the rights to
355
- use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
356
- the Software, and to permit persons to whom the Software is furnished to do so,
357
- subject to the following conditions:
358
-
359
- The above copyright notice and this permission notice shall be included in all
360
- copies or substantial portions of the Software.
361
-
362
- THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
363
- IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
364
- FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
365
- COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
366
- IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
367
- CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.