adamh-sanitize 1.0.4.4 → 1.1.0

Files changed (4)
  1. data/HISTORY +36 -0
  2. data/README.rdoc +25 -5
  3. data/lib/sanitize.rb +77 -64
  4. metadata +3 -13
data/HISTORY CHANGED
@@ -1,6 +1,42 @@
  Sanitize History
  ================================================================================

+ Version 1.1.0
+ * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
+
+ Version 1.0.8.1 (git)
+ * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
+   path segments. [Peter Cooper]
+
+ Version 1.0.8 (2009-04-23)
+ * Added a workaround for an Hpricot bug that prevents attribute names from
+   being downcased in recent versions of Hpricot. This was exploitable to
+   prevent non-whitelisted protocols from being cleaned. [Reported by Ben
+   Wanicur]
+
+ Version 1.0.7 (2009-04-11)
+ * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
+ * Fixed a bug that caused named character entities containing digits (like
+   &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
+   Steinmetz]
+
+ Version 1.0.6 (2009-02-23)
+ * Removed htmlentities gem dependency.
+ * Existing well-formed character entity references in the input string are now
+   preserved rather than being decoded and re-encoded.
+ * The ' character is now encoded as &#39; instead of &apos; to prevent
+   problems in IE6.
+ * You can now specify the symbol :all in place of an element name in the
+   attributes config hash to allow certain attributes on all elements. [Thanks
+   to Mutwin Kraus]
+
+ Version 1.0.5 (2009-02-05)
+ * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
+   protocols from being cleaned when relative URLs were allowed. [Reported by
+   Dev Purkayastha]
+ * Fixed "undefined method `parent='" exceptions caused by parser changes in
+   edge Hpricot.
+
  Version 1.0.4 (2009-01-16)
  * Fixed a bug that made it possible to sneak a non-whitelisted element through
    by repeating it several times in a row. All versions of Sanitize prior to
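
Note on the 1.0.8.1 entry: the effect of the protocol regex change can be checked in isolation. The sketch below copies the removed and the added REGEX_PROTOCOL patterns from the lib/sanitize.rb part of this diff. The names OLD_REGEX_PROTOCOL, NEW_REGEX_PROTOCOL and protocol_of are made up for illustration and are not part of Sanitize.

```ruby
# Patterns copied from the removed (-) and added (+) REGEX_PROTOCOL lines.
# The OLD_/NEW_ constant names are hypothetical.
OLD_REGEX_PROTOCOL = /^([^:]+)(?:\:|&#0*58|&#x0*3a)(?:[^0-9a-f]|$)/i
NEW_REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper: returns the captured protocol prefix, or nil when the
# value does not look like it starts with a protocol (i.e. a relative URL).
def protocol_of(value, regex)
  match = regex.match(value.downcase)
  match && match[1]
end

# A relative URL with a colon in a path segment.
protocol_of('/wiki/Talk:Main', OLD_REGEX_PROTOCOL)  # => "/wiki/talk"
protocol_of('/wiki/Talk:Main', NEW_REGEX_PROTOCOL)  # => nil

# Ordinary and entity-encoded protocol prefixes under the new pattern.
protocol_of('http://example.com/', NEW_REGEX_PROTOCOL)      # => "http"
protocol_of('javascript&#58;alert(1)', NEW_REGEX_PROTOCOL)  # => "javascript"
```

With the old pattern the path prefix is captured as if it were a protocol, so an attribute holding such a URL would fail the protocol whitelist. The new pattern only accepts characters that can plausibly appear in a (possibly entity-encoded) scheme, so the same value is treated as having no protocol.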
data/README.rdoc CHANGED
@@ -9,13 +9,13 @@ elements, certain attributes within those elements, and even certain URL
  protocols within attributes that contain URLs. Any HTML elements or attributes
  that you don't explicitly allow will be removed.

- Because it's based on Hpricot, a full-fledged HTML parser, rather than a bunch
+ Because it's based on nokogiri, a full-fledged HTML parser, rather than a bunch
  of fragile regular expressions, Sanitize has no trouble dealing with malformed
  or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
  caution.

  *Author*:: Ryan Grove (mailto:ryan@wonko.com)
- *Version*:: 1.0.4 (2009-01-16)
+ *Version*:: 1.0.8 (2009-04-23)
  *Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
  *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
  *Website*:: http://github.com/rgrove/sanitize
@@ -23,8 +23,7 @@ caution.
  == Requires

  * RubyGems
- * Hpricot 0.6+
- * HTMLEntities 4.0.0+
+ * nokogiri

  == Usage

@@ -100,6 +99,14 @@ attributes in lowercase.
      'img' => ['alt', 'src', 'title']
    }

+ If you'd like to allow certain attributes on all elements, use the symbol
+ <code>:all</code> instead of an element name.
+
+   :attributes => {
+     :all => ['class'],
+     'a' => ['href', 'title']
+   }
+
  ==== :add_attributes

  Attributes to add to specific elements. If the attribute already exists, it will
@@ -122,12 +129,25 @@ protocol at all), it will be removed.
    }

  If you'd like to allow the use of relative URLs which don't have a protocol,
- include the special value <code>:relative</code> in the protocol array:
+ include the symbol <code>:relative</code> in the protocol array:

    :protocols => {
      'a' => {'href' => ['http', 'https', :relative]}
    }

+
+ == Contributors
+
+ The following lovely people have contributed to Sanitize in the form of patches
+ or ideas that later became code:
+
+ * Peter Cooper <git@peterc.org>
+ * Ryan Grove <ryan@wonko.com>
+ * Adam Hooper <adam@adamhooper.com>
+ * Mutwin Kraus <mutle@blogage.de>
+ * Dev Purkayastha <dev.purkayastha@gmail.com>
+ * Ben Wanicur <bwanicur@verticalresponse.com>
+
  == License

  Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
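
Note on the new <code>:all</code> key: in the lib/sanitize.rb part of this diff, the per-element attribute list and the <code>:all</code> list are concatenated and de-duplicated before attributes are filtered. The sketch below reproduces just that expression in plain Ruby, without Nokogiri. The helper name attr_whitelist_for and the sample config are assumptions for illustration only.

```ruby
# Sample config in the shape the README describes (values are made up).
config = {
  :attributes => {
    :all => ['class'],
    'a'  => ['href', 'title']
  }
}

# Hypothetical helper mirroring the attr_whitelist expression added to
# Sanitize#clean!: element-specific attributes plus the :all attributes.
def attr_whitelist_for(config, element_name)
  attrs = config[:attributes]
  ((attrs[element_name] || []) + (attrs[:all] || [])).uniq
end

attr_whitelist_for(config, 'a')  # => ["href", "title", "class"]
attr_whitelist_for(config, 'p')  # => ["class"]
```

An element with no entry of its own still inherits the <code>:all</code> list, and an empty result corresponds to the branch that strips every attribute.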
data/lib/sanitize.rb CHANGED
@@ -26,11 +26,9 @@ $:.uniq!

  require 'rubygems'

- gem 'why-hpricot', '~> 0.7'
- gem 'htmlentities', '~> 4.0.0'
+ gem 'nokogiri', '~> 1.3.3'

- require 'hpricot'
- require 'htmlentities'
+ require 'nokogiri'
  require 'sanitize/config'
  require 'sanitize/config/restricted'
  require 'sanitize/config/basic'
@@ -38,30 +36,24 @@ require 'sanitize/config/relaxed'

  class Sanitize

+   # Characters that should be replaced with entities in text nodes.
+   ENTITY_MAP = {
+     '<' => '&lt;',
+     '>' => '&gt;',
+     '"' => '&quot;',
+     "'" => '&#39;'
+   }
+
+   # Matches an unencoded ampersand that is not part of a valid character entity
+   # reference.
+   REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i
+
    # Matches an attribute value that could be treated by a browser as a URL
-   # with a protocol prefix, such as "http:" or "javascript:". Any string of one
+   # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
    # or more characters followed by a colon is considered a match, even if the
    # colon is encoded as an entity and even if it's an incomplete entity (which
    # IE6 and Opera will still parse).
-   REGEX_PROTOCOL = /^([^:]+)(?:\:|&#0*58|&#x0*3a)(?:[^0-9a-f]|$)/i
-
-   #--
-   # Class Methods
-   #++
-
-   # Returns a sanitized copy of _html_, using the settings in _config_ if
-   # specified.
-   def self.clean(html, config = {})
-     sanitize = Sanitize.new(config)
-     sanitize.clean(html)
-   end
-
-   # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
-   # were made.
-   def self.clean!(html, config = {})
-     sanitize = Sanitize.new(config)
-     sanitize.clean!(html)
-   end
+   REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

    #--
    # Instance Methods
@@ -81,77 +73,98 @@ class Sanitize
    # Performs clean in place, returning _html_, or +nil+ if no changes were
    # made.
    def clean!(html)
-     fragment = Hpricot(html)
-
-     fragment.search('*') do |node|
-       if node.bogusetag? || node.doctype? || node.procins? || node.xmldecl?
-         node.parent.altered!
-         node.parent.children[node.parent.children.index(node), 1] = []
-         next
-       end
+     fragment = Nokogiri::HTML::DocumentFragment.parse(html)

+     fragment.traverse do |node|
        if node.comment?
-         unless @config[:allow_comments]
-           node.parent.altered!
-           node.parent.children[node.parent.children.index(node), 1] = []
-         end
-       elsif node.elem?
+         node.unlink unless @config[:allow_comments]
+       elsif node.element?
          name = node.name.to_s.downcase

          # Delete any element that isn't in the whitelist.
          unless @config[:elements].include?(name)
-           node.parent.replace_child(node, node.children || [])
+           node.children.each { |n| node.add_previous_sibling(n) }
+           node.unlink
            next
          end

-         node.raw_attributes ||= {}
-         if @config[:attributes].has_key?(name)
+         attr_whitelist = ((@config[:attributes][name] || []) +
+             (@config[:attributes][:all] || [])).uniq
+
+         if attr_whitelist.empty?
+           # Delete all attributes from elements with no whitelisted
+           # attributes.
+           node.attribute_nodes.each { |attr| attr.remove }
+         else
            # Delete any attribute that isn't in the whitelist for this element.
-           node.raw_attributes.delete_if do |key, value|
-             !@config[:attributes][name].include?(key.to_s.downcase)
+           node.attribute_nodes.each do |attr|
+             attr.unlink unless attr_whitelist.include?(attr.name.downcase)
            end

            # Delete remaining attributes that use unacceptable protocols.
            if @config[:protocols].has_key?(name)
              protocol = @config[:protocols][name]

-             node.raw_attributes.delete_if do |key, value|
-               next false unless protocol.has_key?(key)
-               next true if value.nil?
+             node.attribute_nodes.each do |attr|
+               attr_name = attr.name.downcase
+               next false unless protocol.has_key?(attr_name)

-               if value.to_s.downcase =~ REGEX_PROTOCOL
-                 !protocol[key].include?($1.downcase)
+               del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
+                 !protocol[attr_name].include?($1.downcase)
                else
-                 !protocol[key].include?(:relative)
+                 !protocol[attr_name].include?(:relative)
                end
+
+               attr.unlink if del
              end
            end
-         else
-           # Delete all attributes from elements with no whitelisted
-           # attributes.
-           node.raw_attributes = {}
          end

          # Add required attributes.
          if @config[:add_attributes].has_key?(name)
-           node.raw_attributes.merge!(@config[:add_attributes][name])
+           @config[:add_attributes][name].each do |key, val|
+             node[key] = val
+           end
          end
+       elsif node.cdata?
+         node.replace(Nokogiri::XML::Text.new(node.text, node.document))
        end
      end

-     # Make one last pass through the fragment and encode all special HTML chars
-     # and non-ASCII chars as entities. This eliminates certain types of
-     # maliciously-malformed nested tags and also compensates for Hpricot's
-     # burning desire to decode all entities.
-     coder = HTMLEntities.new
+     result = fragment.to_xhtml(:encoding => 'UTF-8', :indent => 0).gsub(/>\n/, '>')
+     return result == html ? nil : html[0, html.length] = result
+   end

-     fragment.traverse_element do |node|
-       if node.text?
-         node.swap(coder.encode(node.inner_text, :named))
-       end
+   #--
+   # Class Methods
+   #++
+
+   class << self
+     # Returns a sanitized copy of _html_, using the settings in _config_ if
+     # specified.
+     def clean(html, config = {})
+       sanitize = Sanitize.new(config)
+       sanitize.clean(html)
      end

-     result = fragment.to_s
-     return result == html ? nil : html[0, html.length] = result
+     # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
+     # were made.
+     def clean!(html, config = {})
+       sanitize = Sanitize.new(config)
+       sanitize.clean!(html)
+     end
+
+     # Encodes special HTML characters (<, >, ", ', and &) in _html_ as entity
+     # references and returns the encoded string.
+     def encode_html(html)
+       str = html.dup
+
+       # Encode special chars.
+       ENTITY_MAP.each {|char, entity| str.gsub!(char, entity) }
+
+       # Convert unencoded ampersands to entity references.
+       str.gsub(REGEX_AMPERSAND, '&amp;')
+     end
    end
+
  end
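
Note on the new encode_html helper: it depends only on ENTITY_MAP and REGEX_AMPERSAND, so its behaviour can be tried without Nokogiri. The sketch below copies both constants and the method body from the diff above, but defines encode_html as a top-level method rather than on Sanitize's singleton class. That placement is an assumption made purely so the example is self-contained.

```ruby
# Constants copied from the added lines in lib/sanitize.rb.
ENTITY_MAP = {
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&#39;'
}

REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i

# Same body as Sanitize.encode_html in the diff, defined at top level here
# only for illustration.
def encode_html(html)
  str = html.dup

  # Encode <, >, " and ' first. None of the replacement entities contain
  # those characters, so the order of the map does not matter.
  ENTITY_MAP.each { |char, entity| str.gsub!(char, entity) }

  # Then encode only those ampersands that do not already start a
  # well-formed entity reference, so existing entities are preserved.
  str.gsub(REGEX_AMPERSAND, '&amp;')
end

encode_html("Tom & Jerry")     # => "Tom &amp; Jerry"
encode_html("x &sup2; y")      # => "x &sup2; y"
encode_html("<b>'hi'</b>")     # => "&lt;b&gt;&#39;hi&#39;&lt;/b&gt;"
```

Running the character map before the ampersand pass is what keeps the freshly inserted entities such as &lt; from being double-encoded: the negative lookahead in REGEX_AMPERSAND recognises them as valid references and leaves them alone.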
metadata CHANGED
@@ -1,7 +1,7 @@
  --- !ruby/object:Gem::Specification
  name: adamh-sanitize
  version: !ruby/object:Gem::Version
-   version: 1.0.4.4
+   version: 1.1.0
  platform: ruby
  authors:
  - Ryan Grove
@@ -13,24 +13,14 @@ date: 2009-05-16 00:00:00 -07:00
  default_executable:
  dependencies:
  - !ruby/object:Gem::Dependency
-   name: adamh-hpricot
+   name: nokogiri
    type: :runtime
    version_requirement:
    version_requirements: !ruby/object:Gem::Requirement
      requirements:
      - - ~>
        - !ruby/object:Gem::Version
-         version: "0.6"
-     version:
- - !ruby/object:Gem::Dependency
-   name: htmlentities
-   type: :runtime
-   version_requirement:
-   version_requirements: !ruby/object:Gem::Requirement
-     requirements:
-     - - ~>
-       - !ruby/object:Gem::Version
-         version: 4.0.0
+         version: 1.3.3
      version:
  description:
  email: ryan@wonko.com