adamh-sanitize 1.0.4.4 → 1.1.0

Files changed (4)
  1. data/HISTORY +36 -0
  2. data/README.rdoc +25 -5
  3. data/lib/sanitize.rb +77 -64
  4. metadata +3 -13
data/HISTORY CHANGED
@@ -1,6 +1,42 @@
1
1
  Sanitize History
2
2
  ================================================================================
3
3
 
4
+ Version 1.1.0
5
+ * Migrated from Hpricot to Nokogiri. Requires libxml2 >= 2.7.2 [Adam Hooper]
6
+
7
+ Version 1.0.8.1 (git)
8
+ * Changed protocol regex to ensure Sanitize doesn't kill URLs with colons in
9
+ path segments. [Peter Cooper]
10
+
11
+ Version 1.0.8 (2009-04-23)
12
+ * Added a workaround for an Hpricot bug that prevents attribute names from
13
+ being downcased in recent versions of Hpricot. This was exploitable to
14
+ prevent non-whitelisted protocols from being cleaned. [Reported by Ben
15
+ Wanicur]
16
+
17
+ Version 1.0.7 (2009-04-11)
18
+ * Requires Hpricot 0.8.1+, which is finally compatible with Ruby 1.9.1.
19
+ * Fixed a bug that caused named character entities containing digits (like
20
+ &sup2;) to be escaped when they shouldn't have been. [Reported by Sebastian
21
+ Steinmetz]
22
+
23
+ Version 1.0.6 (2009-02-23)
24
+ * Removed htmlentities gem dependency.
25
+ * Existing well-formed character entity references in the input string are now
26
+ preserved rather than being decoded and re-encoded.
27
+ * The ' character is now encoded as &#39; instead of &apos; to prevent
28
+ problems in IE6.
29
+ * You can now specify the symbol :all in place of an element name in the
30
+ attributes config hash to allow certain attributes on all elements. [Thanks
31
+ to Mutwin Kraus]
32
+
33
+ Version 1.0.5 (2009-02-05)
34
+ * Fixed a bug introduced in version 1.0.3 that prevented non-whitelisted
35
+ protocols from being cleaned when relative URLs were allowed. [Reported by
36
+ Dev Purkayastha]
37
+ * Fixed "undefined method `parent='" exceptions caused by parser changes in
38
+ edge Hpricot.
39
+
4
40
  Version 1.0.4 (2009-01-16)
5
41
  * Fixed a bug that made it possible to sneak a non-whitelisted element through
6
42
  by repeating it several times in a row. All versions of Sanitize prior to
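
The 1.0.8.1 entry in this hunk mentions a protocol regex change for URLs with colons in path segments. The sketch below compares the two REGEX_PROTOCOL patterns as they appear in the lib/sanitize.rb part of this diff. The constant names OLD_REGEX_PROTOCOL and NEW_REGEX_PROTOCOL and the helper protocol_of are hypothetical, introduced only for illustration; the lowercasing mirrors what clean! does before matching.

```ruby
# Patterns copied from the lib/sanitize.rb hunk of this diff.
# The OLD_/NEW_ constant names are assumptions made for this sketch.
OLD_REGEX_PROTOCOL = /^([^:]+)(?:\:|&#0*58|&#x0*3a)(?:[^0-9a-f]|$)/i
NEW_REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i

# Hypothetical helper: returns the captured protocol prefix, or nil when the
# value has no recognizable prefix (i.e. it would be treated as relative).
def protocol_of(value, regex)
  match = regex.match(value.to_s.downcase)
  match && match[1]
end

protocol_of('http://example.com/', NEW_REGEX_PROTOCOL)      # "http"
protocol_of('javascript&#58;alert(1)', NEW_REGEX_PROTOCOL)  # "javascript"

# A relative URL with a colon in a path segment:
protocol_of('/wiki/Talk:Ruby', OLD_REGEX_PROTOCOL)  # "/wiki/talk"
protocol_of('/wiki/Talk:Ruby', NEW_REGEX_PROTOCOL)  # nil
```

With the old pattern, the path prefix is captured as if it were a protocol, so the attribute would be dropped unless that string happened to be whitelisted. The new pattern only accepts a restricted character set before the colon, so a leading slash prevents a match and the value falls through to the :relative check.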
data/README.rdoc CHANGED
@@ -9,13 +9,13 @@ elements, certain attributes within those elements, and even certain URL
9
9
  protocols within attributes that contain URLs. Any HTML elements or attributes
10
10
  that you don't explicitly allow will be removed.
11
11
 
12
- Because it's based on Hpricot, a full-fledged HTML parser, rather than a bunch
12
+ Because it's based on nokogiri, a full-fledged HTML parser, rather than a bunch
13
13
  of fragile regular expressions, Sanitize has no trouble dealing with malformed
14
14
  or maliciously-formed HTML. When in doubt, Sanitize always errs on the side of
15
15
  caution.
16
16
 
17
17
  *Author*:: Ryan Grove (mailto:ryan@wonko.com)
18
- *Version*:: 1.0.4 (2009-01-16)
18
+ *Version*:: 1.0.8 (2009-04-23)
19
19
  *Copyright*:: Copyright (c) 2009 Ryan Grove. All rights reserved.
20
20
  *License*:: MIT License (http://opensource.org/licenses/mit-license.php)
21
21
  *Website*:: http://github.com/rgrove/sanitize
@@ -23,8 +23,7 @@ caution.
23
23
  == Requires
24
24
 
25
25
  * RubyGems
26
- * Hpricot 0.6+
27
- * HTMLEntities 4.0.0+
26
+ * nokogiri
28
27
 
29
28
  == Usage
30
29
 
@@ -100,6 +99,14 @@ attributes in lowercase.
100
99
  'img' => ['alt', 'src', 'title']
101
100
  }
102
101
 
102
+ If you'd like to allow certain attributes on all elements, use the symbol
103
+ <code>:all</code> instead of an element name.
104
+
105
+ :attributes => {
106
+ :all => ['class'],
107
+ 'a' => ['href', 'title']
108
+ }
109
+
103
110
  ==== :add_attributes
104
111
 
105
112
  Attributes to add to specific elements. If the attribute already exists, it will
@@ -122,12 +129,25 @@ protocol at all), it will be removed.
122
129
  }
123
130
 
124
131
  If you'd like to allow the use of relative URLs which don't have a protocol,
125
- include the special value <code>:relative</code> in the protocol array:
132
+ include the symbol <code>:relative</code> in the protocol array:
126
133
 
127
134
  :protocols => {
128
135
  'a' => {'href' => ['http', 'https', :relative]}
129
136
  }
130
137
 
138
+
139
+ == Contributors
140
+
141
+ The following lovely people have contributed to Sanitize in the form of patches
142
+ or ideas that later became code:
143
+
144
+ * Peter Cooper <git@peterc.org>
145
+ * Ryan Grove <ryan@wonko.com>
146
+ * Adam Hooper <adam@adamhooper.com>
147
+ * Mutwin Kraus <mutle@blogage.de>
148
+ * Dev Purkayastha <dev.purkayastha@gmail.com>
149
+ * Ben Wanicur <bwanicur@verticalresponse.com>
150
+
131
151
  == License
132
152
 
133
153
  Copyright (c) 2009 Ryan Grove <ryan@wonko.com>
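
The :all key documented in this README hunk is combined with the per-element list inside clean!. A minimal sketch of that merge, using the same expression as the attr_whitelist assignment in the lib/sanitize.rb hunk of this diff; the CONFIG constant and the standalone attr_whitelist helper are hypothetical and exist only for this illustration:

```ruby
# Hypothetical config fragment, mirroring the README example.
CONFIG = {
  :attributes => {
    :all => ['class'],
    'a'  => ['href', 'title']
  }
}

# Hypothetical helper wrapping the expression used in clean!: the
# element-specific list comes first, then the :all list, without duplicates.
def attr_whitelist(config, element_name)
  ((config[:attributes][element_name] || []) +
   (config[:attributes][:all] || [])).uniq
end

attr_whitelist(CONFIG, 'a')  # ["href", "title", "class"]
attr_whitelist(CONFIG, 'p')  # ["class"]
```

An element with no entry of its own still picks up the :all attributes, and an empty merged list is what triggers the "delete all attributes" branch in clean!.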
data/lib/sanitize.rb CHANGED
@@ -26,11 +26,9 @@ $:.uniq!
26
26
 
27
27
  require 'rubygems'
28
28
 
29
- gem 'why-hpricot', '~> 0.7'
30
- gem 'htmlentities', '~> 4.0.0'
29
+ gem 'nokogiri', '~> 1.3.3'
31
30
 
32
- require 'hpricot'
33
- require 'htmlentities'
31
+ require 'nokogiri'
34
32
  require 'sanitize/config'
35
33
  require 'sanitize/config/restricted'
36
34
  require 'sanitize/config/basic'
@@ -38,30 +36,24 @@ require 'sanitize/config/relaxed'
38
36
 
39
37
  class Sanitize
40
38
 
39
+ # Characters that should be replaced with entities in text nodes.
40
+ ENTITY_MAP = {
41
+ '<' => '&lt;',
42
+ '>' => '&gt;',
43
+ '"' => '&quot;',
44
+ "'" => '&#39;'
45
+ }
46
+
47
+ # Matches an unencoded ampersand that is not part of a valid character entity
48
+ # reference.
49
+ REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i
50
+
41
51
  # Matches an attribute value that could be treated by a browser as a URL
42
- # with a protocol prefix, such as "http:" or "javascript:". Any string of one
52
+ # with a protocol prefix, such as "http:" or "javascript:". Any string of zero
43
53
  # or more characters followed by a colon is considered a match, even if the
44
54
  # colon is encoded as an entity and even if it's an incomplete entity (which
45
55
  # IE6 and Opera will still parse).
46
- REGEX_PROTOCOL = /^([^:]+)(?:\:|&#0*58|&#x0*3a)(?:[^0-9a-f]|$)/i
47
-
48
- #--
49
- # Class Methods
50
- #++
51
-
52
- # Returns a sanitized copy of _html_, using the settings in _config_ if
53
- # specified.
54
- def self.clean(html, config = {})
55
- sanitize = Sanitize.new(config)
56
- sanitize.clean(html)
57
- end
58
-
59
- # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
60
- # were made.
61
- def self.clean!(html, config = {})
62
- sanitize = Sanitize.new(config)
63
- sanitize.clean!(html)
64
- end
56
+ REGEX_PROTOCOL = /^([A-Za-z0-9\+\-\.\&\;\#\s]*?)(?:\:|&#0*58|&#x0*3a)/i
65
57
 
66
58
  #--
67
59
  # Instance Methods
@@ -81,77 +73,98 @@ class Sanitize
81
73
  # Performs clean in place, returning _html_, or +nil+ if no changes were
82
74
  # made.
83
75
  def clean!(html)
84
- fragment = Hpricot(html)
85
-
86
- fragment.search('*') do |node|
87
- if node.bogusetag? || node.doctype? || node.procins? || node.xmldecl?
88
- node.parent.altered!
89
- node.parent.children[node.parent.children.index(node), 1] = []
90
- next
91
- end
76
+ fragment = Nokogiri::HTML::DocumentFragment.parse(html)
92
77
 
78
+ fragment.traverse do |node|
93
79
  if node.comment?
94
- unless @config[:allow_comments]
95
- node.parent.altered!
96
- node.parent.children[node.parent.children.index(node), 1] = []
97
- end
98
- elsif node.elem?
80
+ node.unlink unless @config[:allow_comments]
81
+ elsif node.element?
99
82
  name = node.name.to_s.downcase
100
83
 
101
84
  # Delete any element that isn't in the whitelist.
102
85
  unless @config[:elements].include?(name)
103
- node.parent.replace_child(node, node.children || [])
86
+ node.children.each { |n| node.add_previous_sibling(n) }
87
+ node.unlink
104
88
  next
105
89
  end
106
90
 
107
- node.raw_attributes ||= {}
108
- if @config[:attributes].has_key?(name)
91
+ attr_whitelist = ((@config[:attributes][name] || []) +
92
+ (@config[:attributes][:all] || [])).uniq
93
+
94
+ if attr_whitelist.empty?
95
+ # Delete all attributes from elements with no whitelisted
96
+ # attributes.
97
+ node.attribute_nodes.each { |attr| attr.remove }
98
+ else
109
99
  # Delete any attribute that isn't in the whitelist for this element.
110
- node.raw_attributes.delete_if do |key, value|
111
- !@config[:attributes][name].include?(key.to_s.downcase)
100
+ node.attribute_nodes.each do |attr|
101
+ attr.unlink unless attr_whitelist.include?(attr.name.downcase)
112
102
  end
113
103
 
114
104
  # Delete remaining attributes that use unacceptable protocols.
115
105
  if @config[:protocols].has_key?(name)
116
106
  protocol = @config[:protocols][name]
117
107
 
118
- node.raw_attributes.delete_if do |key, value|
119
- next false unless protocol.has_key?(key)
120
- next true if value.nil?
108
+ node.attribute_nodes.each do |attr|
109
+ attr_name = attr.name.downcase
110
+ next false unless protocol.has_key?(attr_name)
121
111
 
122
- if value.to_s.downcase =~ REGEX_PROTOCOL
123
- !protocol[key].include?($1.downcase)
112
+ del = if attr.value.to_s.downcase =~ REGEX_PROTOCOL
113
+ !protocol[attr_name].include?($1.downcase)
124
114
  else
125
- !protocol[key].include?(:relative)
115
+ !protocol[attr_name].include?(:relative)
126
116
  end
117
+
118
+ attr.unlink if del
127
119
  end
128
120
  end
129
- else
130
- # Delete all attributes from elements with no whitelisted
131
- # attributes.
132
- node.raw_attributes = {}
133
121
  end
134
122
 
135
123
  # Add required attributes.
136
124
  if @config[:add_attributes].has_key?(name)
137
- node.raw_attributes.merge!(@config[:add_attributes][name])
125
+ @config[:add_attributes][name].each do |key, val|
126
+ node[key] = val
127
+ end
138
128
  end
129
+ elsif node.cdata?
130
+ node.replace(Nokogiri::XML::Text.new(node.text, node.document))
139
131
  end
140
132
  end
141
133
 
142
- # Make one last pass through the fragment and encode all special HTML chars
143
- # and non-ASCII chars as entities. This eliminates certain types of
144
- # maliciously-malformed nested tags and also compensates for Hpricot's
145
- # burning desire to decode all entities.
146
- coder = HTMLEntities.new
134
+ result = fragment.to_xhtml(:encoding => 'UTF-8', :indent => 0).gsub(/>\n/, '>')
135
+ return result == html ? nil : html[0, html.length] = result
136
+ end
147
137
 
148
- fragment.traverse_element do |node|
149
- if node.text?
150
- node.swap(coder.encode(node.inner_text, :named))
151
- end
138
+ #--
139
+ # Class Methods
140
+ #++
141
+
142
+ class << self
143
+ # Returns a sanitized copy of _html_, using the settings in _config_ if
144
+ # specified.
145
+ def clean(html, config = {})
146
+ sanitize = Sanitize.new(config)
147
+ sanitize.clean(html)
152
148
  end
153
149
 
154
- result = fragment.to_s
155
- return result == html ? nil : html[0, html.length] = result
150
+ # Performs Sanitize#clean in place, returning _html_, or +nil+ if no changes
151
+ # were made.
152
+ def clean!(html, config = {})
153
+ sanitize = Sanitize.new(config)
154
+ sanitize.clean!(html)
155
+ end
156
+
157
+ # Encodes special HTML characters (<, >, ", ', and &) in _html_ as entity
158
+ # references and returns the encoded string.
159
+ def encode_html(html)
160
+ str = html.dup
161
+
162
+ # Encode special chars.
163
+ ENTITY_MAP.each {|char, entity| str.gsub!(char, entity) }
164
+
165
+ # Convert unencoded ampersands to entity references.
166
+ str.gsub(REGEX_AMPERSAND, '&amp;')
167
+ end
156
168
  end
169
+
157
170
  end
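
The new encode_html method can be tried in isolation. The sketch below copies ENTITY_MAP, REGEX_AMPERSAND and the method body from the hunk above, but defines encode_html as a top-level method rather than on the Sanitize class; that placement is an assumption made so the example runs without the gem or nokogiri installed.

```ruby
# Copied from the lib/sanitize.rb hunk above.
ENTITY_MAP = {
  '<' => '&lt;',
  '>' => '&gt;',
  '"' => '&quot;',
  "'" => '&#39;'
}

# Matches an ampersand that is not the start of a well-formed entity reference.
REGEX_AMPERSAND = /&(?!(?:[a-z]+[0-9]{0,2}|#[0-9]+|#x[0-9a-f]+);)/i

# Top-level stand-in for Sanitize.encode_html (assumption: same body,
# different receiver).
def encode_html(html)
  str = html.dup

  # Encode <, >, " and ' first; the entities produced here are well-formed,
  # so the ampersand pass below leaves them alone.
  ENTITY_MAP.each { |char, entity| str.gsub!(char, entity) }

  # Encode only the ampersands that are not part of an entity reference.
  str.gsub(REGEX_AMPERSAND, '&amp;')
end

encode_html('Tom & Jerry <b>')  # "Tom &amp; Jerry &lt;b&gt;"
encode_html('x&sup2; &copy')    # "x&sup2; &amp;copy"
encode_html("it's")             # "it&#39;s"
```

The second call shows the two behaviors noted in HISTORY: a named entity containing a digit (&sup2;) is preserved, while an entity name without a terminating semicolon is treated as a bare ampersand and encoded.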
metadata CHANGED
@@ -1,7 +1,7 @@
1
1
  --- !ruby/object:Gem::Specification
2
2
  name: adamh-sanitize
3
3
  version: !ruby/object:Gem::Version
4
- version: 1.0.4.4
4
+ version: 1.1.0
5
5
  platform: ruby
6
6
  authors:
7
7
  - Ryan Grove
@@ -13,24 +13,14 @@ date: 2009-05-16 00:00:00 -07:00
13
13
  default_executable:
14
14
  dependencies:
15
15
  - !ruby/object:Gem::Dependency
16
- name: adamh-hpricot
16
+ name: nokogiri
17
17
  type: :runtime
18
18
  version_requirement:
19
19
  version_requirements: !ruby/object:Gem::Requirement
20
20
  requirements:
21
21
  - - ~>
22
22
  - !ruby/object:Gem::Version
23
- version: "0.6"
24
- version:
25
- - !ruby/object:Gem::Dependency
26
- name: htmlentities
27
- type: :runtime
28
- version_requirement:
29
- version_requirements: !ruby/object:Gem::Requirement
30
- requirements:
31
- - - ~>
32
- - !ruby/object:Gem::Version
33
- version: 4.0.0
23
+ version: 1.3.3
34
24
  version:
35
25
  description:
36
26
  email: ryan@wonko.com