adamh-sanitize 1.0.4.4 → 1.1.0
- data/HISTORY +36 -0
- data/README.rdoc +25 -5
- data/lib/sanitize.rb +77 -64
- metadata +3 -13
data/HISTORY
CHANGED
@@ -1,6 +1,42 @@
|
|
1
1
|
Sanitize History
|
2
2
|
================================================================================
|
3
3
|
|
4
|
+
Version 1.1.0
|
5
|
+
* Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
|
6
|
+
|
7
|
+
Version 1.0.8.1 (git)
|
8
|
+
* Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
|
9
|
+
path segments. [Peter Cooper]
|
10
|
+
|
11
|
+
Version 1.0.8 (2009-04-23)
|
12
|
+
* Added a workaround for an Hpricot bug that prevents attribute names from
|
13
|
+
being downcased in recent versions of Hpricot. This was exploitable to
|
14
|
+
prevent non-whitelisted protocols from being cleaned. [Reported by Ben
|
15
|
+
Wanicur]
|
16
|
+
|
17
|
+
Version 1.0.7 (2009-04-11)
|
18
|
+
* Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
|
19
|
+
* Fixed a bug that caused named character entities containing digits (like
|
20
|
+
&sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
|
21
|
+
Steinmetz]
|
22
|
+
|
23
|
+
Version 1.0.6 (2009-02-23)
|
24
|
+
* Removed htmlentities gem dependency.
|
25
|
+
* Existing well-formed character entity references in the input string are now
|
26
|
+
preserved rather than being decoded and re-encoded.
|
27
|
+
* The ' character is now encoded as &#39; instead of &apos; to prevent
|
28
|
+
problems in IE6.
|
29
|
+
* You can now specify the symbol :all in place of an element name in the
|
30
|
+
attributes config hash to allow certain attributes on all elements. [Thanks
|
31
|
+
to Mutwin Kraus]
|
32
|
+
|
33
|
+
Version 1.0.5 (2009-02-05)
|
34
|
+
* Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
|
35
|
+
protocols from being cleaned when relative URLs were allowed. [Reported by
|
36
|
+
Dev Purkayastha]
|
37
|
+
* Fixed "undefined method `parent='" exceptions caused by parser changes in
|
38
|
+
edge Hpricot.
|
39
|
+
|
4
40
|
Version 1.0.4 (2009-01-16)
|
5
41
|
* Fixed a bug that made it possible to sneak a non-whitelisted element through
|
6
42
|
by repeating it several times in a row. All versions of Sanitize prior to
|
data/README.rdoc
CHANGED
@@ -9,13 +9,13 @@ elements, certain attributes within those elements, and even certain URL
|
|
9
9
|
protocols within attributes that contain URLs. Any HTML elements or attributes
|
10
10
|
that you don't explicitly allow will be removed.
|
11
11
|
|
12
|
-
Because it's based on
|
12
|
+
Because it's based on nokogiri, a full-fledged HTML parser, rather than a bunch
|
13
13
|
of fragile regular expressions, Sanitize has no trouble dealing with malformed
|
14
14
|
or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
|
15
15
|
caution.
|
16
16
|
|
17
17
|
*Author*:: Ryan Grove (mailto:ryan@wonko.com)
|
18
|
-
*Version*:: 1.0.
|
18
|
+
*Version*:: 1.0.8 (2009-04-23)
|
19
19
|
*Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
|
20
20
|
*License*:: MIT License (http://opensource.org/licenses/mit-license.php)
|
21
21
|
*Website*:: http://github.com/rgrove/sanitize
|
@@ -23,8 +23,7 @@ caution.
|
|
23
23
|
== Requires
|
24
24
|
|
25
25
|
* RubyGems
|
26
|
-
*
|
27
|
-
* HTMLEntities 4.0.0+
|
26
|
+
* nokogiri
|
28
27
|
|
29
28
|
== Usage
|
30
29
|
|
@@ -100,6 +99,14 @@ attributes in lowercase.
|
|
100
99
|
'img' => ['alt', 'src', 'title']
|
101
100
|
}
|
102
101
|
|
102
|
+
If you'd like to allow certain attributes on all elements, use the symbol
|
103
|
+
<code>:all</code> instead of an element name.
|
104
|
+
|
105
|
+
:attributes => {
|
106
|
+
:all => ['class'],
|
107
|
+
'a' => ['href', 'title']
|
108
|
+
}
|
109
|
+
|
103
110
|
==== :add_attributes
|
104
111
|
|
105
112
|
Attributes to add to specific elements. If the attribute already exists, it will
|
@@ -122,12 +129,25 @@ protocol at all), it will be removed.
|
|
122
129
|
}
|
123
130
|
|
124
131
|
If you'd like to allow the use of relative URLs which don't have a protocol,
|
125
|
-
include the
|
132
|
+
include the symbol <code>:relative</code> in the protocol array:
|
126
133
|
|
127
134
|
:protocols => {
|
128
135
|
'a' => {'href' => ['http', 'https', :relative]}
|
129
136
|
}
|
130
137
|
|
138
|
+
|
139
|
+
== Contributors
|
140
|
+
|
141
|
+
The following lovely people have contributed to Sanitize in the form of patches
|
142
|
+
or ideas that later became code:
|
143
|
+
|
144
|
+
* Peter Cooper <git@peterc.org>
|
145
|
+
* Ryan Grove <ryan@wonko.com>
|
146
|
+
* Adam Hooper <adam@adamhooper.com>
|
147
|
+
* Mutwin Kraus <mutle@blogage.de>
|
148
|
+
* Dev Purkayastha <dev.purkayastha@gmail.com>
|
149
|
+
* Ben Wanicur <bwanicur@verticalresponse.com>
|
150
|
+
|
131
151
|
== License
|
132
152
|
|
133
153
|
Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
|
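As a rough illustration of the two README options touched by this diff (:all in the attributes hash and :relative in a protocol array), the following standalone Ruby sketch mirrors the lookup logic without requiring Nokogiri. The helper names allowed_attributes and protocol_allowed? and the SAMPLE_CONFIG constant are hypothetical and exist only for this example. The regular expression is the one shipped in lib/sanitize.rb in this release.

```ruby
# Protocol regex as shipped in lib/sanitize.rb 1.1.0: captures whatever
# precedes a colon, including colons written as decimal or hex entities.
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical config, for illustration only.
SAMPLE_CONFIG = {
  :attributes => {
    :all => ['class'],
    'a'  => ['href', 'title']
  },
  :protocols => {
    'a' => {'href' => ['http', 'https', :relative]}
  }
}

# Hypothetical helper: element-specific attributes plus the :all attributes.
def allowed_attributes(config, element)
  ((config[:attributes][element] || []) +
   (config[:attributes][:all] || [])).uniq
end

# Hypothetical helper: true if the value's protocol is whitelisted, or if
# the value has no protocol prefix and :relative is in the list.
def protocol_allowed?(value, allowed)
  if value.to_s.downcase =~ REGEX_PROTOCOL
    allowed.include?($1.downcase)
  else
    allowed.include?(:relative)
  end
end

href_protocols = SAMPLE_CONFIG[:protocols]['a']['href']

p allowed_attributes(SAMPLE_CONFIG, 'a')                      # ["href", "title", "class"]
p allowed_attributes(SAMPLE_CONFIG, 'p')                      # ["class"]
p protocol_allowed?('http://example.com/', href_protocols)    # true
p protocol_allowed?('javascript:alert(1)', href_protocols)    # false
p protocol_allowed?('/path/with:colon', href_protocols)       # true
```

The last call shows the case mentioned in the 1.0.8.1 history entry: a colon inside a path segment is not treated as a protocol separator, because "/" is not in the character class that precedes the colon.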
data/lib/sanitize.rb
CHANGED
@@ -26,11 +26,9 @@ $:.uniq!
|
|
26
26
|
|
27
27
|
require 'rubygems'
|
28
28
|
|
29
|
-
gem '
|
30
|
-
gem 'htmlentities', '~> 4.0.0'
|
29
|
+
gem 'nokogiri', '~> 1.3.3'
|
31
30
|
|
32
|
-
require '
|
33
|
-
require 'htmlentities'
|
31
|
+
require 'nokogiri'
|
34
32
|
require 'sanitize/config'
|
35
33
|
require 'sanitize/config/restricted'
|
36
34
|
require 'sanitize/config/basic'
|
@@ -38,30 +36,24 @@ require 'sanitize/config/relaxed'
|
|
38
36
|
|
39
37
|
class Sanitize
|
40
38
|
|
39
|
+
# Characters that should be replaced with entities in text nodes.
|
40
|
+
ENTITY_MAP = {
|
41
|
+
'<' => '&lt;',
|
42
|
+
'>' => '&gt;',
|
43
|
+
'"' => '&quot;',
|
44
|
+
"'" => '&#39;'
|
45
|
+
}
|
46
|
+
|
47
|
+
# Matches an unencoded ampersand that is not part of a valid character entity
|
48
|
+
# reference.
|
49
|
+
REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i
|
50
|
+
|
41
51
|
# Matches an attribute value that could be treated by a browser as a URL
|
42
|
-
# with a protocol prefix, such as "http:" or "javascript:". Any string of
|
52
|
+
# with a protocol prefix, such as "http:" or "javascript:". Any string of zero
|
43
53
|
# or more characters followed by a colon is considered a match, even if the
|
44
54
|
# colon is encoded as an entity and even if it's an incomplete entity (which
|
45
55
|
# IE6 and Opera will still parse).
|
46
|
-
REGEX_PROTOCOL = /^([
|
47
|
-
|
48
|
-
#--
|
49
|
-
# Class Methods
|
50
|
-
#++
|
51
|
-
|
52
|
-
# Returns a sanitized copy of _html_, using the settings in _config_ if
|
53
|
-
# specified.
|
54
|
-
def self.clean(html, config = {})
|
55
|
-
sanitize = Sanitize.new(config)
|
56
|
-
sanitize.clean(html)
|
57
|
-
end
|
58
|
-
|
59
|
-
# Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
|
60
|
-
# were made.
|
61
|
-
def self.clean!(html, config = {})
|
62
|
-
sanitize = Sanitize.new(config)
|
63
|
-
sanitize.clean!(html)
|
64
|
-
end
|
56
|
+
REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
|
65
57
|
|
66
58
|
#--
|
67
59
|
# Instance Methods
|
@@ -81,77 +73,98 @@ class Sanitize
|
|
81
73
|
# Performs clean in place, returning _html_, or +nil+ if no changes were
|
82
74
|
# made.
|
83
75
|
def clean!(html)
|
84
|
-
fragment =
|
85
|
-
|
86
|
-
fragment.search('*') do |node|
|
87
|
-
if node.bogusetag? || node.doctype? || node.procins? || node.xmldecl?
|
88
|
-
node.parent.altered!
|
89
|
-
node.parent.children[node.parent.children.index(node), 1] = []
|
90
|
-
next
|
91
|
-
end
|
76
|
+
fragment = Nokogiri::HTML::DocumentFragment.parse(html)
|
92
77
|
|
78
|
+
fragment.traverse do |node|
|
93
79
|
if node.comment?
|
94
|
-
unless @config[:allow_comments]
|
95
|
-
|
96
|
-
node.parent.children[node.parent.children.index(node), 1] = []
|
97
|
-
end
|
98
|
-
elsif node.elem?
|
80
|
+
node.unlink unless @config[:allow_comments]
|
81
|
+
elsif node.element?
|
99
82
|
name = node.name.to_s.downcase
|
100
83
|
|
101
84
|
# Delete any element that isn't in the whitelist.
|
102
85
|
unless @config[:elements].include?(name)
|
103
|
-
node.
|
86
|
+
node.children.each { |n| node.add_previous_sibling(n) }
|
87
|
+
node.unlink
|
104
88
|
next
|
105
89
|
end
|
106
90
|
|
107
|
-
|
108
|
-
|
91
|
+
attr_whitelist = ((@config[:attributes][name] || []) +
|
92
|
+
(@config[:attributes][:all] || [])).uniq
|
93
|
+
|
94
|
+
if attr_whitelist.empty?
|
95
|
+
# Delete all attributes from elements with no whitelisted
|
96
|
+
# attributes.
|
97
|
+
node.attribute_nodes.each { |attr| attr.remove }
|
98
|
+
else
|
109
99
|
# Delete any attribute that isn't in the whitelist for this element.
|
110
|
-
node.
|
111
|
-
|
100
|
+
node.attribute_nodes.each do |attr|
|
101
|
+
attr.unlink unless attr_whitelist.include?(attr.name.downcase)
|
112
102
|
end
|
113
103
|
|
114
104
|
# Delete remaining attributes that use unacceptable protocols.
|
115
105
|
if @config[:protocols].has_key?(name)
|
116
106
|
protocol = @config[:protocols][name]
|
117
107
|
|
118
|
-
node.
|
119
|
-
|
120
|
-
next
|
108
|
+
node.attribute_nodes.each do |attr|
|
109
|
+
attr_name = attr.name.downcase
|
110
|
+
next false unless protocol.has_key?(attr_name)
|
121
111
|
|
122
|
-
if value.to_s.downcase =~ REGEX_PROTOCOL
|
123
|
-
!protocol[
|
112
|
+
del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
|
113
|
+
!protocol[attr_name].include?($1.downcase)
|
124
114
|
else
|
125
|
-
!protocol[
|
115
|
+
!protocol[attr_name].include?(:relative)
|
126
116
|
end
|
117
|
+
|
118
|
+
attr.unlink if del
|
127
119
|
end
|
128
120
|
end
|
129
|
-
else
|
130
|
-
# Delete all attributes from elements with no whitelisted
|
131
|
-
# attributes.
|
132
|
-
node.raw_attributes = {}
|
133
121
|
end
|
134
122
|
|
135
123
|
# Add required attributes.
|
136
124
|
if @config[:add_attributes].has_key?(name)
|
137
|
-
|
125
|
+
@config[:add_attributes][name].each do |key, val|
|
126
|
+
node[key] = val
|
127
|
+
end
|
138
128
|
end
|
129
|
+
elsif node.cdata?
|
130
|
+
node.replace(Nokogiri::XML::Text.new(node.text, node.document))
|
139
131
|
end
|
140
132
|
end
|
141
133
|
|
142
|
-
|
143
|
-
|
144
|
-
|
145
|
-
# burning desire to decode all entities.
|
146
|
-
coder = HTMLEntities.new
|
134
|
+
result = fragment.to_xhtml(:encoding => 'UTF-8', :indent => 0).gsub(/>\n/, '>')
|
135
|
+
return result == html ? nil : html[0, html.length] = result
|
136
|
+
end
|
147
137
|
|
148
|
-
|
149
|
-
|
150
|
-
|
151
|
-
|
138
|
+
#--
|
139
|
+
# Class Methods
|
140
|
+
#++
|
141
|
+
|
142
|
+
class << self
|
143
|
+
# Returns a sanitized copy of _html_, using the settings in _config_ if
|
144
|
+
# specified.
|
145
|
+
def clean(html, config = {})
|
146
|
+
sanitize = Sanitize.new(config)
|
147
|
+
sanitize.clean(html)
|
152
148
|
end
|
153
149
|
|
154
|
-
|
155
|
-
|
150
|
+
# Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
|
151
|
+
# were made.
|
152
|
+
def clean!(html, config = {})
|
153
|
+
sanitize = Sanitize.new(config)
|
154
|
+
sanitize.clean!(html)
|
155
|
+
end
|
156
|
+
|
157
|
+
# Encodes special HTML characters (<, >, ", ', and &) in _html_ as entity
|
158
|
+
# references and returns the encoded string.
|
159
|
+
def encode_html(html)
|
160
|
+
str = html.dup
|
161
|
+
|
162
|
+
# Encode special chars.
|
163
|
+
ENTITY_MAP.each {|char, entity| str.gsub!(char, entity) }
|
164
|
+
|
165
|
+
# Convert unencoded ampersands to entity references.
|
166
|
+
str.gsub(REGEX_AMPERSAND, '&amp;')
|
167
|
+
end
|
156
168
|
end
|
169
|
+
|
157
170
|
end
|
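The new encode_html class method can be tried in isolation. The sketch below uses the entity map and ampersand regex as shipped in lib/sanitize.rb 1.1.0, but defines encode_html as a top-level method rather than Sanitize.encode_html. That is an assumption made so the example runs without the gem or Nokogiri installed.

```ruby
# Characters that should be replaced with entities in text nodes.
ENTITY_MAP = {
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&#39;'
}

# Matches an ampersand that is not part of a well-formed character entity
# reference (named, decimal, or hexadecimal).
REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i

# Top-level stand-in for Sanitize.encode_html (assumption: same body,
# different receiver).
def encode_html(html)
  str = html.dup

  # Encode <, >, " and ' first. The entities produced here are valid
  # references, so the ampersand pass below leaves them alone.
  ENTITY_MAP.each { |char, entity| str.gsub!(char, entity) }

  # Convert the remaining bare ampersands to entity references.
  str.gsub(REGEX_AMPERSAND, '&amp;')
end

puts encode_html('a < b & c')     # a &lt; b &amp; c
puts encode_html("&sup2; it's")   # &sup2; it&#39;s
```

The second call shows two history entries at once: an existing entity containing a digit is preserved, and the apostrophe becomes a numeric reference.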
metadata
CHANGED
@@ -1,7 +1,7 @@
|
|
1
1
|
--- !ruby/object:Gem::Specification
|
2
2
|
name: adamh-sanitize
|
3
3
|
version: !ruby/object:Gem::Version
|
4
|
-
version: 1.0
|
4
|
+
version: 1.1.0
|
5
5
|
platform: ruby
|
6
6
|
authors:
|
7
7
|
- Ryan Grove
|
@@ -13,24 +13,14 @@ date: 2009-05-16 00:00:00 -07:00
|
|
13
13
|
default_executable:
|
14
14
|
dependencies:
|
15
15
|
- !ruby/object:Gem::Dependency
|
16
|
-
name:
|
16
|
+
name: nokogiri
|
17
17
|
type: :runtime
|
18
18
|
version_requirement:
|
19
19
|
version_requirements: !ruby/object:Gem::Requirement
|
20
20
|
requirements:
|
21
21
|
- - ~>
|
22
22
|
- !ruby/object:Gem::Version
|
23
|
-
version:
|
24
|
-
version:
|
25
|
-
- !ruby/object:Gem::Dependency
|
26
|
-
name: htmlentities
|
27
|
-
type: :runtime
|
28
|
-
version_requirement:
|
29
|
-
version_requirements: !ruby/object:Gem::Requirement
|
30
|
-
requirements:
|
31
|
-
- - ~>
|
32
|
-
- !ruby/object:Gem::Version
|
33
|
-
version: 4.0.0
|
23
|
+
version: 1.3.3
|
34
24
|
version:
|
35
25
|
description:
|
36
26
|
email: ryan@wonko.com
|
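The gemspec now pins Nokogiri with the pessimistic constraint ~> 1.3.3. A small sketch using RubyGems' own Gem::Requirement and Gem::Version classes shows which versions that constraint accepts. The version strings tested are arbitrary examples, not releases this diff refers to.

```ruby
require 'rubygems'

# Same constraint string as in lib/sanitize.rb and the gemspec metadata.
requirement = Gem::Requirement.new('~> 1.3.3')

# "~> 1.3.3" means ">= 1.3.3 and < 1.4".
%w[1.3.2 1.3.3 1.3.9 1.4.0].each do |version|
  ok = requirement.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{ok ? 'accepted' : 'rejected'}"
end
```

With this constraint, 1.3.3 and 1.3.9 are accepted, while 1.3.2 and 1.4.0 are rejected.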